epub - force a 2nd pass to improve structure detection?

cybmole · 10-07-2010, 08:46 AM

I have a book, now in epub, which probably came originally from pdf, as the text flow is poor in some places.

I'd like to force calibre to try to improve matters, but I read somewhere that structure detection is not applied when the source is epub.

So do I need to do 2 more passes, i.e. go from epub to some intermediate format and then back to epub? If so, what format should the intermediate be, please?

Or is my whole concept simply wrong or unworkable?

I decided to experiment with "The God Delusion", which has been converted from pdf to epub by a program called pdftoepub (from the web site pdftoepub.com).

Now calibre will happily convert that epub into mobi, but when asked to convert it into rtf it fails partway through, and the error details say something about mismatched brackets.

So is there such a thing as a non-standard epub, and is that pdftoepub program guilty of producing such a thing?

Starson17 · 10-07-2010, 09:09 AM

Quote:

Originally Posted by cybmole

I have a book, now in epub, which probably came originally from pdf, as the text flow is poor in some places.

I'd like to force calibre to try to improve matters, but I read somewhere that structure detection is not applied when the source is epub.

So do I need to do 2 more passes, i.e. go from epub to some intermediate format and then back to epub? If so, what format should the intermediate be, please?

Or is my whole concept simply wrong or unworkable?

You mention bad text flow and a desire for improved structure detection. It sounds to me like you just want to edit the EPUB to fix some earlier conversion problems? I'd open it up in Sigil or explode it with Tweak Epub and change what you want. (If you're really looking for structure detection, you should be able to use the XPath capabilities built into Calibre.)

cybmole · 10-07-2010, 09:17 AM

Quote:

Originally Posted by Starson17

You mention bad text flow and a desire for improved structure detection. It sounds to me like you just want to edit the EPUB to fix some earlier conversion problems?

Yep - really I am exploring calibre's capabilities. The book is readable as is, but I'm curious as to whether further automated improvements are possible.

More puzzling errors though - this time I took the .mobi version as my source (as made by calibre from the epub) and converted that to rtf - no problem.

But then when I tell calibre to start with the rtf that it has just made and convert back into mobi, it throws up a bunch of errors and quits.

Why does it have trouble using its own output as a source file?
Details:
InputFormatPlugin: RTF Input running
on C:\Users\dad\Documents\Calibre Library\Richard Dawkins\The God Delusion (470)\The God Delusion - Richard Dawkins.rtf
Converting RTF to XML...
Preprocessing to convert unicode characters
Failed to preprocess RTF to convert unicode sequences, ignoring...
Traceback (most recent call last):
File "site-packages\calibre\ebooks\rtf\input.py", line 173, in preprocess
File "site-packages\calibre\ebooks\rtf\preprocess.py", line 124, in __init__
File "site-packages\calibre\ebooks\rtf\preprocess.py", line 198, in processUnicode
Exception: Error: incorect utf replacement.

C:\Users\dad\Documents\Calibre Library\Richard Dawkins\The God Delusion (470)\The God Delusion - Richard Dawkins.rtfPython function terminated unexpectedly
Invalid RTF: document does not have matching brackets.
(Error Code: 1)
Traceback (most recent call last):
File "site.py", line 103, in main
File "site.py", line 85, in run_entry_point
File "site-packages\calibre\utils\ipc\worker.py", line 107, in main
File "site-packages\calibre\gui2\convert\gui_conversion.py", line 24, in gui_convert
File "site-packages\calibre\ebooks\conversion\plumber.py", line 832, in run
File "site-packages\calibre\customize\conversion.py", line 211, in __call__
File "site-packages\calibre\ebooks\rtf\input.py", line 194, in convert
File "site-packages\calibre\ebooks\rtf\input.py", line 89, in generate_xml
File "site-packages\calibre\ebooks\rtf2xml\ParseRtf.py", line 238, in parse_rtf
calibre.ebooks.rtf2xml.ParseRtf.InvalidRtfException: Invalid RTF: document does not have matching brackets.
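
For anyone wanting to check a file against the "matching brackets" complaint before filing a bug, here is a rough sketch of a brace-balance check. It is not calibre's own rtf2xml check, just an illustration of the idea: RTF groups open with { and close with }, and a backslash escapes the character that follows it. The function name is made up for this example, and binary \bin data is not handled.

```python
def rtf_brace_balance(text):
    """Return (depth, offset) describing how well the RTF group braces match.

    depth is 0 when every opening brace has a closing brace, positive when
    groups are left open, and negative when a closing brace has no opener.
    offset is the position of the first unmatched closing brace, or None.
    Assumption: no binary data blocks are present in the file.
    """
    depth = 0
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            # Skip the backslash and the character it escapes, so that
            # escaped braces and escaped backslashes are not counted.
            i += 2
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return depth, i
        i += 1
    return depth, None
```

A result other than (0, None) on the rtf that calibre wrote would at least confirm that the output itself is unbalanced, which is useful detail for a bug report.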

Starson17 · 10-07-2010, 09:22 AM

Quote:

Originally Posted by cybmole

Why does it have trouble using its own output as a source file?

It looks like a bug, but I couldn't say without analyzing the file. If you want to pursue it further, report it - http://bugs.calibre-ebook.com

DoctorOhh · 10-07-2010, 09:23 AM

Quote:

Originally Posted by Starson17

Quote:

Originally Posted by cybmole

I have a book, now in epub, which probably came originally from pdf, as the text flow is poor in some places.

I'd like to force calibre to try to improve matters, but I read somewhere that structure detection is not applied when the source is epub.

So do I need to do 2 more passes, i.e. go from epub to some intermediate format and then back to epub? If so, what format should the intermediate be, please?

You mention bad text flow and a desire for improved structure detection.

I think he is wondering how to take advantage of the "Preprocess input file to possibly improve structure detection" option. This preprocessing does a great job of fixing paragraphs and text flow, but it isn't available when ePub is the input source.

I suppose converting to rtf and then back to epub with "Preprocess input file to possibly improve structure detection" checked might help reconstruct the paragraphs and improve text flow. The only way to know for sure is for him to try it and see what happens.

Update: I'm just too slow in the morning.

Starson17 · 10-07-2010, 09:59 AM

Quote:

Originally Posted by dwanthny

I think he is wondering how to take advantage of the "Preprocess input file to possibly improve structure detection" option. This preprocessing does a great job of fixing paragraphs and text flow, but it isn't available when ePub is the input source.

"Preprocess input file to possibly improve structure detection" is sort of a magic button, without a lot of explanation or documentation of what it does. Still, in my limited testing, I've seen it add <h2> tags around various types of chapter separators, particularly with .txt input. Given that its name says "possibly improve structure detection", I've never used it for basic problems with paragraphs or text flow, except near structure breaks of various types.
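
To make the <h2> behaviour a bit more concrete, here is a minimal sketch of that kind of heading markup on plain text. It is not calibre's preprocessing code; the pattern, the 60-character limit and the function name are all assumptions chosen for illustration.

```python
import re
from html import escape

# Hypothetical pattern: a short line starting with a chapter-like word
# followed by an arabic or roman numeral.
CHAPTER_RE = re.compile(
    r"^\s*(chapter|part|book|section)\s+(\d+|[ivxlc]+)\b.*$", re.IGNORECASE
)


def mark_chapters(text, max_heading_len=60):
    """Wrap chapter-like lines in <h2> and every other non-blank line in <p>."""
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines carry no content of their own here
        if len(stripped) <= max_heading_len and CHAPTER_RE.match(stripped):
            out.append(f"<h2>{escape(stripped)}</h2>")
        else:
            out.append(f"<p>{escape(stripped)}</p>")
    return "\n".join(out)
```

A heuristic like this will sometimes tag an ordinary sentence that happens to begin with "Part 1", which is one reason the real option is described only as "possibly" improving detection.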

ldolse · 10-07-2010, 12:12 PM

Preprocess won't work for epub, but if you rename the epub from .epub to .zip and add the zip version back to the book record, Calibre treats it identically to compressed html, which means preprocessing will work. You shouldn't have to go from epub to rtf and back.
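
If you'd rather not rename the original, the renaming step can be done on a copy. This is only a small sketch; the function name is made up for this example, and adding the resulting .zip to the book record is still done in calibre itself.

```python
import shutil
import zipfile
from pathlib import Path


def epub_to_zip_copy(epub_path):
    """Copy book.epub to book.zip in the same folder and return the new path."""
    src = Path(epub_path)
    if src.suffix.lower() != ".epub":
        raise ValueError(f"expected an .epub file, got {src.name}")
    if not zipfile.is_zipfile(src):
        # An epub is a zip container; anything else will not work as zipped html.
        raise ValueError(f"{src.name} is not a valid zip container")
    dst = src.with_suffix(".zip")
    shutil.copyfile(src, dst)  # the original epub is left untouched
    return dst
```

Working on a copy means the book record keeps its original epub alongside the new zip format.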

Aside from looking for common chapter headings, preprocessing also tries to remove hard line breaks in the document. The default settings will only fix hard line breaks if the entire doc consists of hard line breaks. That's partly because of the line-unwrap factor: with only some lines broken, the average/median line length is much larger than the actual point where the hard line breaks occur. If you have a doc which has only some hard line breaks, you need to set the unwrap factor much lower, possibly down to 0.2 or less.
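
The effect of the unwrap factor can be illustrated with a simplified sketch. This is not the actual calibre implementation; it assumes the cutoff is simply the median line length times the factor, and that a line is joined to the next one only when it is at least that long and does not end in closing punctuation. The function name and the default of 0.4 are assumptions for this example.

```python
import statistics


def unwrap_lines(text, unwrap_factor=0.4):
    """Join hard-wrapped lines into paragraphs using a length cutoff.

    A line is treated as the end of a paragraph when it is shorter than
    median_length * unwrap_factor or when it ends in closing punctuation.
    """
    lines = text.split("\n")
    lengths = [len(line.strip()) for line in lines if line.strip()]
    if not lengths:
        return text
    threshold = statistics.median(lengths) * unwrap_factor
    out = []
    buffer = ""
    for line in lines:
        stripped = line.strip()
        if not stripped:
            # A blank line always closes the paragraph being built.
            if buffer:
                out.append(buffer)
                buffer = ""
            out.append("")
            continue
        buffer = f"{buffer} {stripped}" if buffer else stripped
        ends_sentence = stripped.endswith((".", "!", "?", '"', ":"))
        if len(stripped) < threshold or ends_sentence:
            out.append(buffer)
            buffer = ""
    if buffer:
        out.append(buffer)
    return "\n".join(out)
```

In a doc that is mostly long, unbroken paragraphs, the median is large, so a handful of lines wrapped at 70 characters or so fall below the cutoff and are left alone. Lowering the factor pulls the cutoff under the wrap width, and those lines then get joined.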

All of that depends on the actual book formatting, though. Preprocessing covers the most typical cases in Lit/html files, but if you're trying to convert something that went through some weird conversions, it may not match the doc format.

DoctorOhh · 10-07-2010, 07:02 PM

Quote:

Originally Posted by ldolse

All of that depends on the actual book formatting, though. Preprocessing covers the most typical cases in Lit/html files

I used it on a LIT file that didn't show any obvious transition between paragraphs that would allow me to reconstruct the paragraphs by hand, my usual method. I was pleasantly surprised at how well the paragraphs were put back together and how well it used the unwrapping factor to mark the end of the paragraphs.

ldolse · 10-08-2010, 12:12 AM

Glad to hear it worked for you. That kind of lit file can never be converted perfectly, but punctuation can give good clues for most cases. Still doing more tweaking on the function, so there should be improvements going forward, better unwrapping, more formats covered, etc.

I've been debating whether to make it a little less black box and give the user some control options, but that would probably require a whole separate preprocessing panel in addition to structure detection, so I'm not sure if the extra complexity would be worth it.

DoctorOhh · 10-08-2010, 12:43 AM

Quote:

Originally Posted by ldolse

Glad to hear it worked for you. That kind of lit file can never be converted perfectly, but punctuation can give good clues for most cases. Still doing more tweaking on the function, so there should be improvements going forward, better unwrapping, more formats covered, etc.

Usually I would open the resultant epub in Sigil and find a tag of some sort between paragraphs. It would take a few find and replace actions to put the paragraphs back together. A couple of minutes at most. The lit file I referred to, though, converted without any html tag hinting at the change of paragraphs.

I may only need preprocessing once in a blue moon, but when I do I'm glad it's built in now.

Quote:

Originally Posted by ldolse

I've been debating whether to make it a little less black box and give the user some control options, but that would probably require a whole separate preprocessing panel in addition to structure detection, so I'm not sure if the extra complexity would be worth it.

Leaving it as is, is probably best. Supporting a multiple option page will most likely drive you nuts.

ldolse · 10-08-2010, 01:00 AM

Quote:

Originally Posted by dwanthny

Usually I would open the resultant epub in Sigil and find a tag of some sort between paragraphs. It would take a few find and replace actions to put the paragraphs back together. A couple of minutes at most. The lit file I referred to, though, converted without any html tag hinting at the change of paragraphs.

Some lit files are more or less text files wrapped in html with <pre> tags - sounds like this may be one of those. These come out exactly as you describe with Calibre's default lit conversion pipeline. Preprocessing looks for those as a special case and runs them through the text input process before applying normal preprocessing. Probably still some more tweaking I could do there...
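
As a rough illustration of that special case (not the code calibre uses), the sketch below checks whether nearly all the text in an html file sits inside <pre> tags and, if so, returns it as plain text that could then be handled like a .txt input. The class name, function name and the 0.9 ratio are assumptions for this example.

```python
from html.parser import HTMLParser


class PreTextCollector(HTMLParser):
    """Count how much of a document's text lives inside <pre> elements."""

    def __init__(self):
        super().__init__()
        self.pre_depth = 0
        self.pre_chars = 0
        self.total_chars = 0
        self.pre_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.pre_depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.pre_depth:
            self.pre_depth -= 1

    def handle_data(self, data):
        count = len(data.strip())
        self.total_chars += count
        if self.pre_depth:
            self.pre_chars += count
            self.pre_text.append(data)


def extract_pre_text(html, min_ratio=0.9):
    """Return the <pre> text if it makes up most of the document, else None."""
    parser = PreTextCollector()
    parser.feed(html)
    parser.close()
    if parser.total_chars and parser.pre_chars / parser.total_chars >= min_ratio:
        return "".join(parser.pre_text)
    return None
```

Text recovered this way has no paragraph tags at all, so any paragraph reconstruction has to come from line lengths and punctuation.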

Quote:

Originally Posted by dwanthny

Leaving it as is, is probably best. Supporting a multiple option page will most likely drive you nuts.

You make a very good point...


