Old 07-25-2008, 10:10 AM   #1
llasram
Reticulator of Tharn
Posts: 622
Karma: 400000
Join Date: Jan 2007
Location: EST
Device: Sony PRS-505
lit2oeb -- calibre LIT extraction/conversion without ConvertLIT

Kovid pushed out a new version of calibre last night (0.4.80) which packs an old feature in new clothes: I've ported (most of) ConvertLIT to Python and calibre is now able to extract the contents of LIT files directly, without having a copy of ConvertLIT installed. Edit: As of version 0.4.83, the calibre-native code is the default, and may be accessed on the command-line as 'lit2oeb' (for just explosion) or as part of LRF conversion with 'lit2lrf'.

The calibre-native code fixes the following bugs in ConvertLIT:
  • All hyperlinks (footnotes, etc.) should be correct. ConvertLIT would frequently link to an incorrect file that shared a common filename prefix with the intended target.
  • There should be no extraneous spaces. ConvertLIT attempts to pretty-print HTML as it extracts it, but frequently inserts whitespace where it doesn't belong.
  • Technically malformed books from Penguin should extract properly. At least some books from Penguin are broken in a way which causes ConvertLIT to fail even though Microsoft Reader handles them gracefully.
  • LIT files containing files with very long filenames should be handled correctly. ConvertLIT reports a confusing UTF-8 decode error in these situations. (This bug was just fixed; the fix will be in calibre 0.4.81.)

"Ah!" you ask, "but what bugs does your new code introduce, other than being rather slow right now?"

Well that's where you, the savvy early-adopter, come in: we need to find them! If you (a) have a fair number of LIT e-books and (b) can run a command from the command-line, please download the attached Python script and run it against your library. The arguments are the filename of a logfile to write out to and the directory to search for LIT files in. For example:
Code:
python stress-lit2oeb.py log.txt library/
If the script reports interesting results (i.e., bugs) please e-mail me the log-file.
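
For anyone curious what such a run involves, here is a minimal sketch of this kind of stress harness. It is not the attached script: `find_lit_files`, `stress`, and the `extract` callable are hypothetical names used only for illustration, the real script calls calibre's LIT reader instead, and the sketch is written for Python 3.

```python
import os
import traceback


def find_lit_files(root):
    """Yield the path of every .lit file under root (extension is case-insensitive)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(".lit"):
                yield os.path.join(dirpath, name)


def stress(root, log_path, extract):
    """Call extract(path) on every LIT file and log the traceback of each failure.

    `extract` is a hypothetical stand-in for the real extraction routine.
    Returns a (files_tried, files_failed) tuple.
    """
    tried = failed = 0
    with open(log_path, "w", encoding="utf-8") as log:
        for path in find_lit_files(root):
            tried += 1
            try:
                extract(path)
            except Exception:
                # Keep going after a failure so one bad book does not stop the run.
                failed += 1
                log.write("FAILED: %s\n%s\n" % (path, traceback.format_exc()))
    return tried, failed
```

Passing the extraction routine in as a callable keeps the sketch independent of any particular calibre API; the log then holds one traceback per failing book.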

If you instead (or afterwards) just use 'lit2oeb' or 'lit2lrf --lit2oeb' on individual files and find bugs that way, please use the calibre issue tracker as usual: check whether anyone else has already reported the same bug and, if not, post a new defect issue.

Thanks, and I hope you find this useful!

-Marshall

P.S. In case it isn't obvious, the calibre LIT code does not include DRM removal. You'll still need ConvertLIT for that if you want to do such things, but there are no known bugs there.
Attached Files
File Type: zip stress-lit2oeb.zip (924 Bytes, 297 views)

Last edited by llasram; 08-09-2008 at 09:51 PM. Reason: Updated status of code in 'lit2lrf'
Old 07-29-2008, 12:44 PM   #2
jmurphy
Junior Member
Posts: 9
Karma: 10
Join Date: Sep 2007
Device: ipaq
Which version of ConvertLIT is the Python code based on?

Is it possible for you to back-port your fixes back into ConvertLIT?
Granted, getting it into the "official" version might be difficult, but what about posting (here) a diff against the latest sources?

jmurphy
Old 07-29-2008, 02:16 PM   #3
llasram
Reticulator of Tharn
Quote:
Originally Posted by jmurphy View Post
Which version of ConvertLIT is the Python code based on?
ConvertLIT 1.8, the most recent version available from the official site.

Quote:
Is it possible for you to back-port your fixes back into ConvertLIT?
Granted, getting it into the "official" version might be difficult, but what about posting (here) a diff against the latest sources?
It would certainly be possible, and I did post a patch for the hyperlink bug when I found it, but I'm not sure of the benefit. Getting the changes into the "official" version would seem at this point to be less difficult than impossible -- there haven't been any updates to the ConvertLIT site in 4 years and the maintainer hasn't been responding to e-mail. Someone else could take over the project, but with the official site still up and the maintainer MIA, it would be a competing project anyway.

Is there something stopping you from being able to just migrate to calibre for all your LIT-extraction needs?
Old 07-29-2008, 02:53 PM   #4
wallcraft
reader
Posts: 6,979
Karma: 5183568
Join Date: Mar 2006
Location: Mississippi, USA
Device: Kindle 3 and Fire
Quote:
Originally Posted by llasram View Post
Someone else could take over the project, but with the official site still up and the maintainer MIA, it would be a competing project anyway.
Not to mention the risk of going straight to jail if the new maintainer ever visits the US. There is a similar risk in posting a diff against the original source code. The changes are not DRM-related, but they would be updating a DRM-cracking program and so risk falling foul of the DMCA.
Old 08-03-2008, 09:12 AM   #5
IceHand
Linux User
Posts: 309
Karma: 1082
Join Date: Aug 2007
Location: Germany
Device: Kindle 3
Quote:
Originally Posted by llasram View Post
  • There should be no extraneous spaces. ConvertLIT attempts to pretty-print HTML as it extracts it, but frequently inserts whitespace where it doesn't belong.
Nice! I had a LIT file where ConvertLIT had this problem.
However, the downside of your change is that the resulting HTML file often has very long lines and is hard to read. Two suggestions:

1. Automatically replace "> <" with ">\n<". Notice the space between > and <. (\n = line break) I suggested this for mobi2oeb too and it has been accepted.

2. Make line breaks where it's safe to do them, e.g. after "</p>" and "</h1>" ...

This is true for the resulting OPF as well, by the way.
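
The two suggestions above can be sketched as a small post-processing step. This is only an illustration in Python 3, not code from calibre or mobi2oeb: `add_line_breaks` and `BLOCK_CLOSE` are made-up names, and the list of block-level tags is an assumption that a stylesheet could override.

```python
import re

# Closing tags of elements that are block-level by default. This list is an
# assumption; CSS can change how any of these elements are displayed.
BLOCK_CLOSE = re.compile(
    r"(</(?:p|div|h[1-6]|li|ul|ol|blockquote|table|tr)>)(?!\n)",
    re.IGNORECASE,
)


def add_line_breaks(markup):
    """Insert line breaks where they are unlikely to change the rendering."""
    # Suggestion 1: a space already sits between the two tags, so swapping it
    # for a newline neither adds nor removes whitespace.
    markup = markup.replace("> <", ">\n<")
    # Suggestion 2: add a newline after closing block-level tags, unless one
    # is already there.
    return BLOCK_CLOSE.sub(r"\1\n", markup)


print(add_line_breaks("<h1>Title</h1><p>one</p> <p>two</p>"))
```

Markup between inline elements such as `<b>x</b><i>y</i>` is left untouched. Note that the plain string replacement also applies inside `<pre>` blocks, where changing a space to a newline is visible, so a real implementation would need to skip those.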

Nice work so far, I'll use your script to hunt down bugs.
Old 08-03-2008, 02:49 PM   #6
jmurphy
Junior Member
Quote:
Originally Posted by llasram View Post
If you (a) have a fair number of LIT e-books and (b) can run a command from the command-line, please download the attached Python script and run it against your library. The arguments are the filename of a logfile to write out to and the directory to search for LIT files in. For example:
Code:
python stress-lit2oeb.py log.txt library/
If the script reports interesting results (i.e., bugs) please e-mail me the log-file.
I've got 4,000 lit files.
How do you run this on Windows? I've got Python installed. When I run the script I get:

Code:
Traceback (most recent call last):
  File "stress-lit2oeb.py", line 8, in <module>
    from calibre.ebooks.lit.reader import LitReader
ImportError: No module named calibre.ebooks.lit.reader
I know, it's probably obvious, but....
Old 08-03-2008, 08:12 PM   #7
llasram
Reticulator of Tharn
Quote:
Originally Posted by IceHand View Post
However, the downside of your change is that the resulting HTML file often has very long lines and is hard to read.
The problem with the '> <' to '>\n<' trick is that most LIT files don't actually contain such whitespace. (In fact, I was pretty surprised when Mobipocket books did -- I think it must be due to a quirk of their rendering engine.) Inserting a newline after block-level elements like <h1/> and <p/> will probably usually be safe, but it's possible (if crazy) to have CSS like 'h1 { display: inline; }' which would make it no longer safe.
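
To see why a newline is only safe between block-level elements: inline content collapses any run of whitespace into a single space, so a newline inserted between two inline elements shows up as a space in the rendered text. The sketch below approximates that collapsing with Python 3's standard `html.parser`. `TextOnly` and `rendered_text` are illustrative names, and the whitespace handling is a simplification of what a real rendering engine does.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect only the character data of a document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def rendered_text(markup):
    """Approximate inline rendering: drop tags, collapse whitespace runs."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.chunks).split())


print(rendered_text("<b>foot</b><i>note</i>"))    # footnote
print(rendered_text("<b>foot</b>\n<i>note</i>"))  # foot note
```

With 'h1 { display: inline; }' in the stylesheet, a newline added after `</h1>` would behave like the second case.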

How would you feel about an option to run the markup through a pretty-printer on output?
Old 08-03-2008, 08:22 PM   #8
llasram
Reticulator of Tharn
Quote:
Originally Posted by jmurphy View Post
I've got 4,000 lit files.
How do you run this on Windows? I've got Python installed. When I run the script I get:

Code:
Traceback (most recent call last):
  File "stress-lit2oeb.py", line 8, in <module>
    from calibre.ebooks.lit.reader import LitReader
ImportError: No module named calibre.ebooks.lit.reader
I know, it's probably obvious, but....
Actually, not so obvious. You need to run this with your PYTHONPATH/sys.path including your calibre install. Try running the script using 'calibre-debug' instead of 'python'. In the meantime I'll be getting this to work under Windows and will report back if something else is necessary. (Or maybe Kovid will pipe in?)
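
As a quick check of the sys.path idea, something like the following Python 3 sketch reports whether a module can be found once extra directories are added to the search path. `can_import` is a hypothetical helper, the calibre path in the usage note is a placeholder, and whether a plain Python interpreter can use the modules bundled with a binary calibre install is an assumption that may not hold on every platform.

```python
import importlib.util
import sys


def can_import(module_name, extra_paths=()):
    """Return True if module_name can be located after adding extra_paths."""
    for path in extra_paths:
        if path not in sys.path:
            # Prepend so these directories win over other installed copies.
            sys.path.insert(0, path)
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec raises if a parent package of module_name is missing.
        return False
```

For example, `can_import("calibre.ebooks.lit.reader", ["/path/to/calibre/src"])` would show whether the import from the traceback above is likely to succeed with that (placeholder) directory on the path.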
Old 08-03-2008, 09:33 PM   #9
kovidgoyal
creator of calibre
Posts: 25,422
Karma: 4961459
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
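On Windows, you can try something like:

Code:
calibre-debug
# Type the remaining lines at the Python prompt that calibre-debug opens.
# A __name__ other than '__main__' presumably keeps the script from running itself.
__name__ = 'int'
execfile('stress-lit2oeb.py', globals())
# Arguments: program name, logfile, directory containing the LIT files.
main(['stress', 'log.txt', 'path to directory with lit files'])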
Old 08-04-2008, 06:24 AM   #10
IceHand
Linux User
Quote:
Originally Posted by llasram View Post
How would you feel about an option to run the markup through a pretty-printer on output?
You mean something like HTML Tidy? I just tried it and it works great with the options "tidy -utf8 -w -asxhtml -m '$1'". So yes, I think that would be a good idea.
Old 08-04-2008, 09:49 AM   #11
llasram
Reticulator of Tharn
Quote:
Originally Posted by IceHand View Post
You mean something like HTML Tidy? I just tried it and it works great with the options "tidy -utf8 -w -asxhtml -m '$1'". So yes, I think that would be a good idea.
The re-formatting part of 'tidy', yep, just not the markup-cleaning part. Which is probably obvious. Just being pedantic over here. Mmm.... Pedantic.
Old 08-07-2008, 10:16 PM   #12
kovidgoyal
creator of calibre
As of version 0.4.83, lit2oeb powers lit2lrf
Old 08-08-2008, 09:23 PM   #13
junkml
Addict
Posts: 277
Karma: 1004969
Join Date: Mar 2007
Device: Sony Reader
The only downside of using lit2oeb instead of convertlit is that with convertlit you didn't have to go through multiple steps to load a .lit format book. Convertlit would work with Calibre to do everything in one step. (For those people who wanted to buy DRM'ed ebooks to load - strictly in theory, of course)
Old 08-09-2008, 01:53 PM   #14
kovidgoyal
creator of calibre
Calibre has a policy of not removing DRM. And if it didn't, adding DRM stripping to lit2oeb would be trivial.
Old 08-09-2008, 04:15 PM   #15
junkml
Addict
Quote:
Originally Posted by kovidgoyal View Post
Calibre has a policy of not removing DRM. And if it didn't, adding DRM stripping to lit2oeb would be trivial.
Didn't mean to come across as wanting you to code DRM stripping into Calibre, Kovid.

The last thing anyone wants is for anything to cause Calibre to run into trouble that might get it shut down. That certainly means that DRM stripping can't be a direct part of the application. Your application is WAY too useful to put at risk!