MobileRead Forums

lit2oeb -- calibre LIT extraction/conversion without ConvertLIT (https://www.mobileread.com/forums/showthread.php?t=26857)

llasram 07-25-2008 11:10 AM

lit2oeb -- calibre LIT extraction/conversion without ConvertLIT
 
1 Attachment(s)
Kovid pushed out a new version of calibre last night (0.4.80) which packs an old feature in new clothes: I've ported (most of) ConvertLIT to Python and calibre is now able to extract the contents of LIT files directly, without having a copy of ConvertLIT installed. Edit: As of version 0.4.83, the calibre-native code is the default, and may be accessed on the command-line as 'lit2oeb' (for just explosion) or as part of LRF conversion with 'lit2lrf'.

The calibre-native code fixes the following bugs in ConvertLIT:
  • Footnote and other internal hyperlinks should all be correct. ConvertLIT would frequently point a hyperlink at the wrong file when two filenames shared a common prefix.
  • There should be no extraneous spaces. ConvertLIT attempts to pretty-print HTML as it extracts it, but frequently inserts whitespace where it doesn't belong.
  • Technically malformed books from Penguin should extract properly. At least some books from Penguin are broken in a way which causes ConvertLIT to fail even though Microsoft Reader handles them gracefully.
  • LIT files containing files with very long filenames should be handled correctly. ConvertLIT reports a confusing UTF-8 decode error in these situations. (This bug was just fixed; the fix will be in calibre 0.4.81.)
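
As an illustration of the first bullet, here is a hypothetical sketch of how a prefix-based lookup can resolve a link to the wrong file, and how exact matching avoids it. This is not ConvertLIT's or calibre's actual code; the manifest contents and function names are made up for the example.

```python
# Hypothetical list of files inside an extracted book.
MANIFEST = ["chapter10.html", "chapter1.html"]


def resolve_by_prefix(target, manifest):
    """Buggy lookup: accept the first entry that starts with the target's stem."""
    stem = target.rsplit(".", 1)[0]
    for name in manifest:
        if name.startswith(stem):
            return name
    return None


def resolve_exact(target, manifest):
    """Safer lookup: accept an entry only when the whole filename matches."""
    return target if target in manifest else None


if __name__ == "__main__":
    # A link to chapter1.html lands on chapter10.html with the prefix lookup.
    print(resolve_by_prefix("chapter1.html", MANIFEST))
    print(resolve_exact("chapter1.html", MANIFEST))
```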

"Ah!" you ask, "but what bugs does your new code introduce, other than being rather slow right now?"

Well that's where you, the savvy early-adopter, come in: we need to find them! If you (a) have a fair number of LIT e-books and (b) can run a command from the command-line, please download the attached Python script and run it against your library. The arguments are the filename of a logfile to write out to and the directory to search for LIT files in. For example:
Code:

python stress-lit2oeb.py log.txt library/
If the script reports interesting results (i.e., bugs) please e-mail me the log-file.
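
The attached script is not reproduced in this thread. As a rough sketch of what a stress run of this kind involves, the following walks a directory for LIT files, tries each one, and logs tracebacks for failures. The `extract` argument is a hypothetical stand-in for the real extraction call, and `find_lit_files` and `stress` are names invented for this example.

```python
import os
import traceback


def find_lit_files(root):
    """Yield paths of all .lit files under root (extension matched case-insensitively)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(".lit"):
                yield os.path.join(dirpath, name)


def stress(log_path, root, extract):
    """Call extract(path) on every LIT file under root and log any failures.

    `extract` is a placeholder for whatever performs the real extraction.
    Returns a tuple (files tried, files failed).
    """
    tried = failed = 0
    with open(log_path, "w", encoding="utf-8") as log:
        for path in find_lit_files(root):
            tried += 1
            try:
                extract(path)
            except Exception:
                failed += 1
                log.write("FAILED: %s\n%s\n" % (path, traceback.format_exc()))
    return tried, failed
```

Catching every exception per file lets one bad book be recorded without stopping the run over the rest of the library.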

If you instead (or afterwards) just use 'lit2oeb' or 'lit2lrf --lit2oeb' on individual files and find individual bugs, please use the calibre issue tracker as usual: check whether anyone else has already reported the same bug and, if not, open a new defect issue.

Thanks, and I hope you find this useful!

-Marshall

P.S. In case it isn't obvious, the calibre LIT code does not include DRM removal. You'll still need ConvertLIT for that if you want to do such things, but there are no known bugs there.

jmurphy 07-29-2008 01:44 PM

Which version of ConvertLIT is the Python code based on?

Is it possible for you to back-port your fixes into ConvertLIT?
Granted, getting it into the "official" version might be difficult, but what about posting (here) a diff against the latest sources?

jmurphy

llasram 07-29-2008 03:16 PM

Quote:

Originally Posted by jmurphy (Post 224195)
Which version of ConvertLIT is the Python code based on?

ConvertLIT 1.8, the most recent version available from the official site.

Quote:

Is it possible for you to back-port your fixes into ConvertLIT?
Granted, getting it into the "official" version might be difficult, but what about posting (here) a diff against the latest sources?
It would certainly be possible, and I did post a patch for the hyperlink bug when I found it, but I'm not sure of the benefit. Getting the changes into the "official" version would seem at this point to be less difficult than impossible -- there haven't been any updates to the ConvertLIT site in 4 years and the maintainer hasn't been responding to e-mail. Someone else could take over the project, but with the official site still up and the maintainer MIA, it would be a competing project anyway.

Is there something stopping you from being able to just migrate to calibre for all your LIT-extraction needs?

wallcraft 07-29-2008 03:53 PM

Quote:

Originally Posted by llasram (Post 224240)
Someone else could take over the project, but with the official site still up and the maintainer MIA, it would be a competing project anyway.

Not to mention the risk of going straight to jail if the new maintainer ever visits the US. There is a similar risk in posting a diff against the original source code. The changes are not DRM-related, but they update a DRM-cracking program and so risk falling foul of the DMCA.

IceHand 08-03-2008 10:12 AM

Quote:

Originally Posted by llasram (Post 221624)
  • There should be no extraneous spaces. ConvertLIT attempts to pretty-print HTML as it extracts it, but frequently inserts whitespace where it doesn't belong.

Nice! I had a LIT file where ConvertLIT had this problem.
However, the downside of your change is that the resulting HTML file often has very long lines and is hard to read. Two suggestions:

1. Automatically replace "> <" with ">\n<". Notice the space between > and <. (\n = line break) I suggested this for mobi2oeb too and it has been accepted.

2. Make line breaks where it's safe to do them, e.g. after "</p>" and "</h1>" ...
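
A minimal sketch of the two suggestions above, assuming plain string processing on the extracted markup. The function names and the choice of tags are illustrative only and are not taken from mobi2oeb or lit2oeb.

```python
import re


def break_between_tags(html):
    """Suggestion 1: turn a single space between two tags into a line break."""
    return html.replace("> <", ">\n<")


# Closing tags after which a line break is assumed safe (an illustrative subset).
_BLOCK_CLOSE = re.compile(r"(</(?:p|h[1-6]|div)>)(?!\n)", re.IGNORECASE)


def break_after_blocks(html):
    """Suggestion 2: add a line break after selected block-level closing tags."""
    return _BLOCK_CLOSE.sub(r"\1\n", html)
```

The first function only swaps one whitespace character for another, so it cannot add whitespace where there was none. The second does add whitespace, which is harmless only where the element really renders as a block.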

This is true for the resulting OPF as well, by the way.

Nice work so far, I'll use your script to hunt down bugs.

jmurphy 08-03-2008 03:49 PM

Quote:

Originally Posted by llasram (Post 221624)
If you (a) have a fair number of LIT e-books and (b) can run a command from the command-line, please download the attached Python script and run it against your library. The arguments are the filename of a logfile to write out to and the directory to search for LIT files in. For example:
Code:

python stress-lit2oeb.py log.txt library/
If the script reports interesting results (i.e., bugs) please e-mail me the log-file.

I've got 4,000 lit files.
How do you run this on Windows? I've got Python installed. When I run the script I get:

Code:

Traceback (most recent call last):
  File "stress-lit2oeb.py", line 8, in <module>
    from calibre.ebooks.lit.reader import LitReader
ImportError: No module named calibre.ebooks.lit.reader

I know, it's probably obvious, but....

llasram 08-03-2008 09:12 PM

Quote:

Originally Posted by IceHand (Post 227320)
However, the downside of your change is that the resulting HTML file often has very long lines and is hard to read.

The problem with the '> <' to '>\n<' trick is that most LIT files don't actually contain such whitespace. (In fact, I was pretty surprised when Mobipocket books did -- I think it must be due to a quirk of their rendering engine.) Inserting a newline after block-level elements like <h1/> and <p/> will probably usually be safe, but it's possible (if crazy) to have CSS like 'h1 { display: inline; }' which would make it no longer safe.
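
To make the inline case concrete, here is a crude model of how inline content renders: text from adjacent elements is joined and runs of whitespace collapse to one space. It is only an approximation written for this illustration (`TextCollector` and `visible_text` are made-up names), not how any reader actually lays out text.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect character data and ignore the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html):
    """Approximate inline rendering: join all text, collapse whitespace runs."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


if __name__ == "__main__":
    # With 'h1 { display: inline; }' these two headings would run together...
    print(visible_text("<h1>Chapter</h1><h1>One</h1>"))
    # ...and an inserted newline would show up as a space between them.
    print(visible_text("<h1>Chapter</h1>\n<h1>One</h1>"))
```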

How would you feel about an option to run the markup through a pretty-printer on output?

llasram 08-03-2008 09:22 PM

Quote:

Originally Posted by jmurphy (Post 227476)
I've got 4,000 lit files.
How do you run this on Windows? I've got Python installed. When I run the script I get:

Code:

Traceback (most recent call last):
  File "stress-lit2oeb.py", line 8, in <module>
    from calibre.ebooks.lit.reader import LitReader
ImportError: No module named calibre.ebooks.lit.reader

I know, it's probably obvious, but....

Actually, not so obvious :). You need to run this with your PYTHONPATH/sys.path including your calibre install... Try running this script using 'calibre-debug' instead of 'python'? In the meantime I'll be getting this to work under Windows and will report back if something else is necessary. (Or maybe Kovid will chime in?)
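
For anyone unsure what the sys.path remark means in practice, here is a small sketch that checks whether a top-level module can be found, optionally after adding a directory to sys.path. The helper name `module_available` is invented for this example, and the directory to pass depends entirely on where a given install keeps its Python packages; nothing here is specific to calibre.

```python
import importlib.util
import sys


def module_available(name, extra_dir=None):
    """Return True if the top-level module `name` can be located.

    If extra_dir is given, it is first put at the front of sys.path,
    which is the same effect as listing it in PYTHONPATH.
    """
    if extra_dir is not None and extra_dir not in sys.path:
        sys.path.insert(0, extra_dir)
        importlib.invalidate_caches()
    return importlib.util.find_spec(name) is not None
```

An ImportError like the one quoted above is what you get when this check would return False for the package being imported.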

kovidgoyal 08-03-2008 10:33 PM

On Windows, you can try something like:

Code:

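calibre-debug
# The remaining lines are typed at the Python prompt that calibre-debug opens.
# Giving __name__ a value other than '__main__' keeps any
# "if __name__ == '__main__':" block in the script from running on execfile.
__name__ = 'int'
execfile('stress-lit2oeb.py', globals())
main(['stress', 'log.txt', 'path to directory with lit files'])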
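
Note that `execfile` exists only in Python 2. A rough Python 3 equivalent of the same idea is sketched below, assuming (as the snippet above does) that the script defines a `main(argv)` function. The helper name `run_script_main` is invented for this example, and it does nothing about making calibre importable.

```python
def run_script_main(script_path, argv, run_name="int"):
    """Execute a script under a module name other than '__main__',
    then call the main(argv) function it is assumed to define."""
    namespace = {"__name__": run_name, "__file__": script_path}
    with open(script_path, "rb") as f:
        code = compile(f.read(), script_path, "exec")
    # Runs the script's top-level code; a '__main__' guard will not fire.
    exec(code, namespace)
    return namespace["main"](argv)
```

It would be called as `run_script_main('stress-lit2oeb.py', ['stress', 'log.txt', 'path to directory with lit files'])`.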


IceHand 08-04-2008 07:24 AM

Quote:

Originally Posted by llasram (Post 227623)
How would you feel about an option to run the markup through a pretty-printer on output?

You mean something like HTML Tidy? I just tried it and it works great with the options "tidy -utf8 -w -asxhtml -m '$1'". So yes, I think that would be a good idea.

llasram 08-04-2008 10:49 AM

Quote:

Originally Posted by IceHand (Post 227822)
You mean something like HTML Tidy? I just tried it and it works great with the options "tidy -utf8 -w -asxhtml -m '$1'". So yes, I think that would be a good idea.

The re-formatting part of 'tidy', yep, just not the markup-cleaning part. Which is probably obvious. Just being pedantic over here. Mmm.... Pedantic.

kovidgoyal 08-07-2008 11:16 PM

As of version 0.4.83, lit2oeb powers lit2lrf.

junkml 08-08-2008 10:23 PM

The only downside of using lit2oeb instead of ConvertLIT is that with ConvertLIT you didn't have to go through multiple steps to load a .lit format book. ConvertLIT would work with Calibre to do everything in one step. (For those people who wanted to buy DRM'ed ebooks to load - strictly in theory, of course)

kovidgoyal 08-09-2008 02:53 PM

Calibre has a policy of not removing DRM. And if it didn't, adding DRM stripping to lit2oeb would be trivial.

junkml 08-09-2008 05:15 PM

Quote:

Originally Posted by kovidgoyal (Post 231576)
Calibre has a policy of not removing DRM. And if it didn't, adding DRM stripping to lit2oeb would be trivial.

Didn't mean to come across as wanting you to code DRM stripping into Calibre, Kovid.

The last thing anyone wants is for anything to cause Calibre to run into anything that might cause it to be shut down. That certainly means that DRM stripping can't be a direct part of the application. Your application is WAY too useful to put at risk!


All times are GMT -4.