10-07-2010, 08:46 AM | #1 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
epub - force a 2nd pass to improve structure detection ?
I have a book, now in epub, which probably came originally from PDF, as the text flow is poor in some places.
I'd like to force calibre to try to improve matters, but I read somewhere that structure detection is not applied when the source is epub. So do I need to do two more passes, i.e. go from epub to some intermediate format and then back to epub? If so, what should the intermediate format be? Or is my whole concept simply wrong or unworkable?
I decided to experiment with "The God Delusion", which was converted from PDF to epub by a program called pdftoepub (from the web site pdftoepub.com). Calibre will happily convert that epub into mobi, but when asked to convert it into RTF it fails partway through, and the error details say something about mismatched brackets. So is there such a thing as a non-standard epub, and is that pdftoepub program guilty of producing one? Last edited by cybmole; 10-07-2010 at 09:09 AM. |
10-07-2010, 09:09 AM | #2 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
|
|
10-07-2010, 09:17 AM | #3 | |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
Quote:
More puzzling errors, though. This time I took the .mobi version as my source (as made by calibre from the epub) and converted that to RTF with no problem. But when I tell calibre to start with the RTF it has just made and convert it back into mobi, it throws up a bunch of errors and quits. Why does it have trouble using its own output as a source file? Details:
InputFormatPlugin: RTF Input running on C:\Users\dad\Documents\Calibre Library\Richard Dawkins\The God Delusion (470)\The God Delusion - Richard Dawkins.rtf
Converting RTF to XML...
Preprocessing to convert unicode characters
Failed to preprocess RTF to convert unicode sequences, ignoring...
Traceback (most recent call last):
  File "site-packages\calibre\ebooks\rtf\input.py", line 173, in preprocess
  File "site-packages\calibre\ebooks\rtf\preprocess.py", line 124, in __init__
  File "site-packages\calibre\ebooks\rtf\preprocess.py", line 198, in processUnicode
Exception: Error: incorect utf replacement.
C:\Users\dad\Documents\Calibre Library\Richard Dawkins\The God Delusion (470)\The God Delusion - Richard Dawkins.rtf
Python function terminated unexpectedly
  Invalid RTF: document does not have matching brackets. (Error Code: 1)
Traceback (most recent call last):
  File "site.py", line 103, in main
  File "site.py", line 85, in run_entry_point
  File "site-packages\calibre\utils\ipc\worker.py", line 107, in main
  File "site-packages\calibre\gui2\convert\gui_conversion.py", line 24, in gui_convert
  File "site-packages\calibre\ebooks\conversion\plumber.py", line 832, in run
  File "site-packages\calibre\customize\conversion.py", line 211, in __call__
  File "site-packages\calibre\ebooks\rtf\input.py", line 194, in convert
  File "site-packages\calibre\ebooks\rtf\input.py", line 89, in generate_xml
  File "site-packages\calibre\ebooks\rtf2xml\ParseRtf.py", line 238, in parse_rtf
calibre.ebooks.rtf2xml.ParseRtf.InvalidRtfException: Invalid RTF: document does not have matching brackets. |
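For anyone who wants to check an RTF file independently of calibre, here is a rough sketch of what "matching brackets" means for RTF: every unescaped `{` that opens a group needs a matching `}`. This is not calibre's own check, and the function name `rtf_brace_balance` is made up for illustration. It also ignores raw binary (`\bin`) data, which could contain stray braces.

```python
def rtf_brace_balance(rtf_text):
    """Check whether RTF group braces are balanced.

    Returns (balanced, offset):
      (True, None)   - every '{' has a matching '}'
      (False, n)     - a '}' at character offset n has no opening '{'
      (False, None)  - at least one '{' is never closed
    """
    depth = 0
    i = 0
    n = len(rtf_text)
    while i < n:
        ch = rtf_text[i]
        if ch == "\\" and i + 1 < n:
            # A backslash escapes the next character (\{ \} \\), or starts
            # a control word; either way the next character is not a brace
            # that opens or closes a group, so skip both characters.
            i += 2
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False, i
        i += 1
    return depth == 0, None
```

Reading the file with `open(path, encoding="latin-1").read()` and passing the text to this function would at least show whether the file itself is unbalanced or whether the problem is introduced earlier, for example by the failed unicode preprocessing step in the log above.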
|
10-07-2010, 09:22 AM | #4 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
|
|
10-07-2010, 09:23 AM | #5 | ||
US Navy, Retired
Posts: 9,865
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
|
Quote:
I suppose converting to RTF and then back to epub with "Preprocess input file to possibly improve structure detection" checked might help reconstruct the paragraphs and improve text flow. The only way to know for sure is for him to try it and see what happens. Update: I'm just too slow in the morning. |
||
10-07-2010, 09:59 AM | #6 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
"Preprocess input file to possibly improve structure detection" is sort of a magic button, without a lot of explanation or documentation of what it does. Still, in my limited testing, I've seen it add <h2> tags around various types of chapter separators, particularly with .txt input. Given its name, "possibly improve structure detection", I've never used it for basic problems with paragraphs or text flow, except near structure breaks of various types.
|
10-07-2010, 12:12 PM | #7 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Preprocess won't work for epub, but if you rename the epub from .epub to .zip and add the zip version back to the book record Calibre treats it identically to compressed html, which means preprocessing will work. You shouldn't have to go from epub to rtf and back.
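If you'd rather not rename by hand, the copy can be scripted. A minimal sketch, assuming you just want a `.zip` copy sitting next to the original; the helper name `copy_epub_as_zip` is made up here, and you still have to add the resulting file to the book record in calibre yourself.

```python
import shutil
from pathlib import Path


def copy_epub_as_zip(epub_path):
    """Copy book.epub to book.zip in the same folder and return the new path.

    The original EPUB is left untouched; only the extension of the copy differs.
    """
    src = Path(epub_path)
    if src.suffix.lower() != ".epub":
        raise ValueError(f"expected an .epub file, got {src.name}")
    dst = src.with_suffix(".zip")
    shutil.copyfile(src, dst)
    return dst
```

Copying rather than renaming keeps the original epub around in case the preprocessed result turns out worse.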
Aside from looking for common chapter headings, preprocessing also tries to remove hard line breaks in the document. The default settings will only fix hard line breaks if the entire doc consists of hard line breaks. That's partly because of the line-unwrap factor: when only some lines are broken, the average/median line length is much larger than the point at which the hard line breaks occur. If you have a doc with only some hard line breaks you need to set the unwrap factor much lower, possibly down to 0.2 or less. All of that depends on the actual book formatting, though. Preprocessing covers the most typical cases in Lit/html files, but if you're trying to convert something that went through some weird conversions it may not match the doc format. Last edited by ldolse; 10-07-2010 at 12:14 PM. |
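To make the unwrap-factor point concrete, here is a toy sketch of the idea. It is not calibre's actual code: the function name `unwrap_lines`, the 0.4 default, and the punctuation rule are all assumptions for illustration. A line is joined to the next one only if it is at least `unwrap_factor` times the median line length and does not end in sentence punctuation.

```python
import statistics


def unwrap_lines(text, unwrap_factor=0.4):
    """Join hard-wrapped lines back into paragraphs (illustrative only).

    A non-blank line is treated as the end of a paragraph when it is shorter
    than unwrap_factor * median line length, or ends in sentence punctuation.
    Otherwise it is assumed to be hard-wrapped and is joined to the next line.
    """
    lines = text.splitlines()
    lengths = [len(line.strip()) for line in lines if line.strip()]
    if not lengths:
        return text
    threshold = statistics.median(lengths) * unwrap_factor

    paragraphs = []
    buf = ""
    for line in lines:
        stripped = line.strip()
        if not stripped:
            # A blank line always closes the current paragraph.
            if buf:
                paragraphs.append(buf)
                buf = ""
            continue
        buf = f"{buf} {stripped}" if buf else stripped
        ends_sentence = stripped.endswith((".", "!", "?", ":", '"'))
        if len(stripped) < threshold or ends_sentence:
            paragraphs.append(buf)
            buf = ""
    if buf:
        paragraphs.append(buf)
    return "\n".join(paragraphs)
```

In a book where most paragraphs are single lines of around 250 characters and only a few lines are hard-broken at around 60, a factor of 0.4 puts the threshold near 100, so the 60-character lines look like short paragraphs and are left alone. Dropping the factor to 0.2 puts the threshold near 50, and those lines get joined.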
10-07-2010, 07:02 PM | #8 |
US Navy, Retired
Posts: 9,865
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
|
I used it on a LIT file that didn't show any obvious transition between paragraphs that would allow me to reconstruct the paragraphs by hand, my usual method. I was pleasantly surprised at how well the paragraphs were put back together and how well it used the unwrapping factor to mark the end of the paragraphs.
|
10-08-2010, 12:12 AM | #9 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Glad to hear it worked for you. That kind of lit file can never be converted perfectly, but punctuation can give good clues for most cases. Still doing more tweaking on the function, so there should be improvements going forward, better unwrapping, more formats covered, etc.
I've been debating whether to make it a little less black box and give the user some control options, but that would probably require a whole separate preprocessing panel in addition to structure detection, so I'm not sure if the extra complexity would be worth it. |
10-08-2010, 12:43 AM | #10 | ||
US Navy, Retired
Posts: 9,865
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
|
Quote:
I may only need preprocessing once in a blue moon, but when I do I'm glad it's built in now. Quote:
|
||
10-08-2010, 01:00 AM | #11 | |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Quote:
You make a very good point... |
|
|