Old 10-07-2010, 08:46 AM   #1
cybmole
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
epub - force a 2nd pass to improve structure detection ?

I have a book, now in EPUB, which probably came originally from PDF, as the text flow is poor in some places.

I'd like to force calibre to try to improve matters, but I read somewhere that structure detection is not applied when the source is EPUB.

So do I need to do two more passes, i.e. go from EPUB to some intermediate format and then back to EPUB? If so, what should the intermediate format be, please?

Or is my whole concept simply wrong or unworkable?

I decided to experiment with "The God Delusion", which has been converted from PDF to EPUB by a program called pdftoepub (from the web site pdftoepub.com).

Now calibre will happily convert that EPUB into MOBI, but when asked to convert it into RTF it fails partway through, and the error details say something about mismatched brackets.

So is there such a thing as a non-standard EPUB, and is that pdftoepub program guilty of producing such a thing?

Last edited by cybmole; 10-07-2010 at 09:09 AM.
Old 10-07-2010, 09:09 AM   #2
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by cybmole View Post
I have a book, now in EPUB, which probably came originally from PDF, as the text flow is poor in some places.

I'd like to force calibre to try to improve matters, but I read somewhere that structure detection is not applied when the source is EPUB.

So do I need to do two more passes, i.e. go from EPUB to some intermediate format and then back to EPUB? If so, what should the intermediate format be, please?

Or is my whole concept simply wrong or unworkable?
You mention bad text flow and a desire for improved structure detection. It sounds to me like you just want to edit the EPUB to fix some earlier conversion problems? I'd open it up in Sigil or explode it with Tweak Epub and change what you want. (If you're really looking for structure detection, you should be able to use the XPath capabilities built into Calibre.)
Old 10-07-2010, 09:17 AM   #3
cybmole
Quote:
Originally Posted by Starson17 View Post
You mention bad text flow and a desire for improved structure detection. It sounds to me like you just want to edit the EPUB to fix some earlier conversion problems?
Yep, really I am exploring calibre's capabilities. The book is readable as is, but I'm curious as to whether further automated improvements are possible.

More puzzling errors, though. This time I took the .mobi version as my source (as made by calibre from the EPUB) and converted that to RTF: no problem.

But then when I tell calibre to start with the RTF that it has just made and convert it back into MOBI, it throws up a bunch of errors and quits.

Why does it have trouble using its own output as a source file?
details:
InputFormatPlugin: RTF Input running
on C:\Users\dad\Documents\Calibre Library\Richard Dawkins\The God Delusion (470)\The God Delusion - Richard Dawkins.rtf
Converting RTF to XML...
Preprocessing to convert unicode characters
Failed to preprocess RTF to convert unicode sequences, ignoring...
Traceback (most recent call last):
File "site-packages\calibre\ebooks\rtf\input.py", line 173, in preprocess
File "site-packages\calibre\ebooks\rtf\preprocess.py", line 124, in __init__
File "site-packages\calibre\ebooks\rtf\preprocess.py", line 198, in processUnicode
Exception: Error: incorect utf replacement.

C:\Users\dad\Documents\Calibre Library\Richard Dawkins\The God Delusion (470)\The God Delusion - Richard Dawkins.rtf
Python function terminated unexpectedly
Invalid RTF: document does not have matching brackets.
(Error Code: 1)
Traceback (most recent call last):
File "site.py", line 103, in main
File "site.py", line 85, in run_entry_point
File "site-packages\calibre\utils\ipc\worker.py", line 107, in main
File "site-packages\calibre\gui2\convert\gui_conversion.py", line 24, in gui_convert
File "site-packages\calibre\ebooks\conversion\plumber.py", line 832, in run
File "site-packages\calibre\customize\conversion.py", line 211, in __call__
File "site-packages\calibre\ebooks\rtf\input.py", line 194, in convert
File "site-packages\calibre\ebooks\rtf\input.py", line 89, in generate_xml
File "site-packages\calibre\ebooks\rtf2xml\ParseRtf.py", line 238, in parse_rtf
calibre.ebooks.rtf2xml.ParseRtf.InvalidRtfException: Invalid RTF: document does not have matching brackets.
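
For anyone who wants to check the "matching brackets" complaint on their own file before filing a bug, here is a rough sketch of a brace counter. It is not calibre's parser; `rtf_brace_depths` and `braces_match` are made-up names for illustration, and the sketch assumes the file contains no `\bin` binary data (which could legally hold stray braces).

```python
def rtf_brace_depths(data: bytes):
    """Return (final_depth, went_negative) for the group braces in RTF data."""
    depth = 0
    went_negative = False
    i = 0
    while i < len(data):
        ch = data[i:i + 1]
        if ch == b"\\":
            # A backslash starts a control word or escapes the next byte
            # (\{, \}, \\), so the byte after it is never a group brace.
            i += 2
            continue
        if ch == b"{":
            depth += 1
        elif ch == b"}":
            depth -= 1
            if depth < 0:
                went_negative = True
        i += 1
    return depth, went_negative


def braces_match(data: bytes) -> bool:
    """True if every opening brace has a close and no close comes too early."""
    depth, went_negative = rtf_brace_depths(data)
    return depth == 0 and not went_negative
```

Reading the file with `open(path, "rb").read()` and passing the bytes to `braces_match` tells you whether the RTF itself is unbalanced or whether the problem is more likely in the input plugin.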
Old 10-07-2010, 09:22 AM   #4
Starson17
Quote:
Originally Posted by cybmole View Post
Why does it have trouble using its own output as a source file?
It looks like a bug, but I couldn't say without analyzing the file. If you want to pursue it further, report it - http://bugs.calibre-ebook.com
Old 10-07-2010, 09:23 AM   #5
DoctorOhh
US Navy, Retired
Posts: 9,864
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
Quote:
Originally Posted by Starson17 View Post
Quote:
Originally Posted by cybmole View Post
I have a book, now in EPUB, which probably came originally from PDF, as the text flow is poor in some places.

I'd like to force calibre to try to improve matters, but I read somewhere that structure detection is not applied when the source is EPUB.

So do I need to do two more passes, i.e. go from EPUB to some intermediate format and then back to EPUB? If so, what should the intermediate format be, please?
You mention bad text flow and a desire for improved structure detection.
I think he is wondering how to take advantage of the "Preprocess input file to possibly improve structure detection" option. This preprocessing does a great job of fixing paragraphs and text flow, but it isn't available when EPUB is the input format.

I suppose converting to RTF and then back to EPUB with "Preprocess input file to possibly improve structure detection" checked might help reconstruct the paragraphs and improve text flow. The only way to know for sure is for him to try it and see what happens.

Update: I'm just too slow in the morning.
Old 10-07-2010, 09:59 AM   #6
Starson17
Quote:
Originally Posted by dwanthny View Post
I think he is wondering how to take advantage of the "Preprocess input file to possibly improve structure detection" option. This preprocessing does a great job of fixing paragraphs and text flow, but it isn't available when EPUB is the input format.
"Preprocess input file to possibly improve structure detection" is sort of a magic button, without much explanation or documentation of what it does. Still, in my limited testing, I've seen it add <h2> tags around various types of chapter separators, particularly with .txt input. Given the "possibly improve structure detection" in its name, I've never used it for basic problems with paragraphs or text flow, except near structure breaks of various types.
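
To give a feel for the kind of thing described here, below is a minimal sketch of wrapping chapter separators in <h2> tags for plain-text input. It is only an illustration under assumptions: the `HEADING` pattern and the `mark_headings` function are hypothetical, and calibre's real preprocessing almost certainly uses different and more thorough rules.

```python
import re
from html import escape

# Hypothetical pattern: a line such as "Chapter 1", "CHAPTER IV" or "Part Two"
# standing on its own. It will also catch false positives like "Book review".
HEADING = re.compile(r"^(chapter|part|book)\s+\w+$", re.IGNORECASE)


def mark_headings(text):
    """Turn plain text into simple HTML, tagging likely chapter lines as <h2>."""
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines carry no content in this sketch
        if HEADING.match(stripped):
            out.append(f"<h2>{escape(stripped)}</h2>")
        else:
            out.append(f"<p>{escape(stripped)}</p>")
    return "\n".join(out)
```

Once headings are real <h2> elements, an XPath expression that selects h2 elements can pick them up for chapter detection.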
Old 10-07-2010, 12:12 PM   #7
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Preprocess won't work for epub, but if you rename the epub from .epub to .zip and add the zip version back to the book record Calibre treats it identically to compressed html, which means preprocessing will work. You shouldn't have to go from epub to rtf and back.
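
If you'd rather script the rename than do it by hand, something like the sketch below should do. It copies rather than renames, so the original EPUB stays in place; `epub_to_zip_copy` is just an illustrative name, and adding the resulting .zip to the book record is still done in calibre itself.

```python
import shutil
from pathlib import Path


def epub_to_zip_copy(epub_path):
    """Copy book.epub to book.zip in the same folder and return the new path."""
    src = Path(epub_path)
    if src.suffix.lower() != ".epub":
        raise ValueError(f"expected an .epub file, got {src.name}")
    dst = src.with_suffix(".zip")
    shutil.copyfile(src, dst)  # the original EPUB is left untouched
    return dst
```

This works because an EPUB is already a zip container, so no repacking is needed, only a different extension.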

Aside from looking for common chapter headings, preprocessing also tries to remove hard line breaks in the document. The default settings will only fix hard line breaks if the entire doc consists of hard line breaks. That's partly because of the line-unwrap factor: when only some lines are broken, the average/median line length is much larger than the actual point where the hard line breaks occur. If you have a doc in which only some lines are hard-wrapped, you need to set the unwrap factor much lower, possibly down to 0.2 or less.
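
A rough sketch of why the factor matters, under stated assumptions: it treats the cut-off as median line length times the factor, and joins any line at least that long that doesn't end in terminal punctuation. This is not the actual calibre code, `unwrap_lines` is a made-up name, and the 0.4 default is simply the value assumed for this example.

```python
from statistics import median


def unwrap_lines(lines, factor=0.4):
    """Join lines that look hard-wrapped and return a list of paragraphs."""
    lengths = [len(line.strip()) for line in lines if line.strip()]
    if not lengths:
        return []
    threshold = median(lengths) * factor
    paragraphs = []
    current = ""
    for line in lines:
        text = line.strip()
        if not text:
            # A blank line always ends the current paragraph.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        current = f"{current} {text}" if current else text
        ends_sentence = text.endswith((".", "!", "?", '"', ":"))
        if len(text) < threshold or ends_sentence:
            # Short line or terminal punctuation: treat as a paragraph end.
            paragraphs.append(current)
            current = ""
    if current:
        paragraphs.append(current)
    return paragraphs
```

In a doc where a couple of lines are wrapped at about 25 characters but most paragraphs are single 100-character lines, the median is 100, so a factor of 0.4 puts the cut-off at 40 and the wrapped lines are left alone. Dropping the factor to 0.2 lowers the cut-off to 20 and the wrapped lines get joined.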

All that is dependent on the actual book formatting though, preprocessing covers the most typical cases in Lit/html files, but if you're trying to convert something that went through some weird conversions it may not match the doc format.

Last edited by ldolse; 10-07-2010 at 12:14 PM.
Old 10-07-2010, 07:02 PM   #8
DoctorOhh
Quote:
Originally Posted by ldolse View Post
All that is dependent on the actual book formatting though, preprocessing covers the most typical cases in Lit/html files
I used it on a LIT file that didn't show any obvious transition between paragraphs that would allow me to reconstruct the paragraphs by hand, my usual method. I was pleasantly surprised at how well the paragraphs were put back together and how well it used the unwrapping factor to mark the end of the paragraphs.
Old 10-08-2010, 12:12 AM   #9
ldolse
Glad to hear it worked for you. That kind of lit file can never be converted perfectly, but punctuation can give good clues for most cases. Still doing more tweaking on the function, so there should be improvements going forward, better unwrapping, more formats covered, etc.

I've been debating whether to make it a little less black box and give the user some control options, but that would probably require a whole separate preprocessing panel in addition to structure detection, so I'm not sure if the extra complexity would be worth it.
Old 10-08-2010, 12:43 AM   #10
DoctorOhh
Quote:
Originally Posted by ldolse View Post
Glad to hear it worked for you. That kind of lit file can never be converted perfectly, but punctuation can give good clues for most cases. Still doing more tweaking on the function, so there should be improvements going forward, better unwrapping, more formats covered, etc.
Usually I would open the resultant EPUB in Sigil and find a tag of some sort between paragraphs. It would take a few find-and-replace actions to put the paragraphs back together, a couple of minutes at most. The LIT file I referred to, though, converted without any HTML tag hinting at the paragraph breaks.

I may only need preprocessing once in a blue moon, but when I do I'm glad it's built in now.

Quote:
Originally Posted by ldolse View Post
I've been debating whether to make it a little less black box and give the user some control options, but that would probably require a whole separate preprocessing panel in addition to structure detection, so I'm not sure if the extra complexity would be worth it.
Leaving it as is, is probably best. Supporting a multiple option page will most likely drive you nuts.
Old 10-08-2010, 01:00 AM   #11
ldolse
Quote:
Originally Posted by dwanthny View Post
Usually I would open the resultant EPUB in Sigil and find a tag of some sort between paragraphs. It would take a few find-and-replace actions to put the paragraphs back together, a couple of minutes at most. The LIT file I referred to, though, converted without any HTML tag hinting at the paragraph breaks.
Some LIT files are more or less text files wrapped in HTML with <pre> tags, and it sounds like this may be one of those. These come out exactly as you describe with calibre's default LIT conversion pipeline. Preprocessing looks for those as a special case and runs them through the text input process before applying normal preprocessing. Probably still some more tweaking I could do there...
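
As an illustration of that special case, here is a minimal sketch of spotting a document that is mostly one big <pre> block and pulling the raw text out so it can be handled like a text file. The class and function names and the 0.9 cut-off are assumptions made up for this example, not what the preprocessing code actually does.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Count how much of a document's text sits inside <pre> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # current nesting level of <pre>
        self.pre_chars = 0    # non-whitespace-trimmed text length inside <pre>
        self.total_chars = 0  # same measure for the whole document
        self.pre_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.depth:
            self.pre_chars += n
            self.pre_text.append(data)


def extract_pre_text(html, min_share=0.9):
    """Return the <pre> text if it makes up most of the document, else None."""
    parser = PreTextExtractor()
    parser.feed(html)
    parser.close()
    if parser.total_chars and parser.pre_chars / parser.total_chars >= min_share:
        return "".join(parser.pre_text)
    return None
```

When `extract_pre_text` returns text, that text can then go through the same paragraph reconstruction used for .txt input; when it returns None, the document is treated as ordinary HTML.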

Quote:
Originally Posted by dwanthny View Post
Leaving it as is, is probably best. Supporting a multiple option page will most likely drive you nuts.
You make a very good point...