MobileRead Forums - View Single Post - lit2oeb -- calibre LIT extraction/conversion without ConvertLIT

llasram · 07-25-2008, 11:10 AM

Kovid pushed out a new version of calibre last night (0.4.80) which packs an old feature in new clothes: I've ported (most of) ConvertLIT to Python, and calibre is now able to extract the contents of LIT files directly, without having a copy of ConvertLIT installed. Edit: As of version 0.4.83, the calibre-native code is the default, and may be accessed on the command line as 'lit2oeb' (for extraction only) or as part of LRF conversion with 'lit2lrf'.

The calibre-native code fixes the following bugs in ConvertLIT:

Footnote and other internal hyperlinks should all be correct. ConvertLIT would frequently create a hyperlink to the wrong file when two filenames shared a common prefix.
There should be no extraneous spaces. ConvertLIT attempts to pretty-print HTML as it extracts it, but frequently inserts whitespace where it doesn't belong.
Technically malformed books from Penguin should extract properly. At least some books from Penguin are broken in a way that causes ConvertLIT to fail, even though Microsoft Reader handles them gracefully.
LIT files containing files with very long filenames are handled correctly. ConvertLIT reports a confusing UTF-8 decode error in these situations. (This bug was just fixed; the fix will be in calibre 0.4.81.)
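
As a toy illustration of the hyperlink bug class described above, the following sketch shows how resolving a link target by comparing only a filename prefix can pick the wrong file. This is not ConvertLIT's or calibre's actual code; the function names, the manifest contents, and the 8-character cutoff are all made up for the example.

```python
def resolve_by_prefix(href, manifest, n=8):
    """Buggy lookup (hypothetical): match on the first n characters only."""
    for name in manifest:
        if name[:n] == href[:n]:
            return name  # the first entry sharing the prefix wins, right or wrong
    return None


def resolve_exact(href, manifest):
    """Safer lookup: accept only an exact filename match."""
    return href if href in manifest else None


# Hypothetical book contents: two files whose names share the prefix "chapter0".
manifest = ["chapter01.html", "chapter01_notes.html"]
footnote_target = "chapter01_notes.html"

wrong = resolve_by_prefix(footnote_target, manifest)
right = resolve_exact(footnote_target, manifest)
print(wrong)  # chapter01.html
print(right)  # chapter01_notes.html
```

With the prefix comparison, a footnote link meant for the notes file lands on the chapter file instead, which is the kind of symptom the first item in the list describes.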

"Ah!" you ask, "but what bugs does your new code introduce, other than being rather slow right now?"

Well that's where you, the savvy early-adopter, come in: we need to find them! If you (a) have a fair number of LIT e-books and (b) can run a command from the command-line, please download the attached Python script and run it against your library. The arguments are the filename of a logfile to write out to and the directory to search for LIT files in. For example:

Code:

python stress-lit2oeb.py log.txt library/

If the script reports interesting results (i.e., bugs) please e-mail me the log-file.
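
For readers curious what a stress run of this kind involves, here is a minimal sketch. It is not the attached stress-lit2oeb.py, whose contents are not reproduced here. It assumes that running 'lit2oeb' with a single LIT filename is a valid invocation and that a non-zero exit status means failure; both are assumptions, so adjust the command to the tool's real options. The names find_lit_files and stress are hypothetical.

```python
import subprocess
from pathlib import Path


def find_lit_files(root):
    """Yield every file under root whose extension is .lit (any case)."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() == ".lit":
            yield path


def stress(logfile, root, command=("lit2oeb",)):
    """Run `command <file>` on each LIT file and log the ones that fail.

    ASSUMPTION: the command accepts the LIT path as its only argument and
    signals failure through a non-zero exit status.
    Returns the number of failing files.
    """
    failures = 0
    with open(logfile, "w", encoding="utf-8") as log:
        for lit in find_lit_files(root):
            result = subprocess.run(
                [*command, str(lit)],
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,  # keep tracebacks together with normal output
                text=True,
                errors="replace",  # do not crash on undecodable output bytes
            )
            if result.returncode != 0:
                failures += 1
                log.write(f"FAILED {lit} (exit {result.returncode})\n")
                log.write(result.stdout + "\n")
    return failures
```

Calling stress("log.txt", "library/") mirrors the argument order of the command above: logfile first, then the directory to search. Only failures are logged, so an empty log means every file was processed without a reported error.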

If you instead (or afterwards) just use 'lit2oeb' or 'lit2lrf --lit2oeb' on individual files and find individual bugs, please use the calibre issue tracker as usual: check whether anyone else has already posted the same bug, and if not, post a new defect issue.

Thanks, and I hope you find this useful!

-Marshall

P.S. In case it isn't obvious, the calibre LIT code does not include DRM removal. You'll still need ConvertLIT for that if you want to do such things, but there are no known bugs there.