Old 01-27-2011, 11:03 PM   #2
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Quote:
Originally Posted by Tudor Hulubei
1. There are a lot of instances where words that were split at the end of a line are inserted with the dash. For instance, if the word "mother" is split at the end of a line and appears as "... mo-" at the end of one line and "ther" on the next line, it shows up as "mo-ther" in the epub file. Using a dictionary to check if the word without the dash is a valid word could eliminate this issue.
Conversion already does this automatically for PDF input and whenever line unwrapping is enabled. If your book was already converted from another source and still has those errors, enable the 'remove hyphens' option under Heuristics. If it's not working on one of your files, it's probably a bug - open a bug report with an example book at bugs.calibre-ebook.com.
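To make the dictionary idea from the quote concrete, here is a minimal Python sketch. It is not Calibre's actual 'remove hyphens' code; the function name and the tiny word list are made up for illustration, and a real run would need a proper dictionary for the book's language.

```python
import re

# Hypothetical word list; a real one would be loaded from a dictionary file.
KNOWN_WORDS = {"mother", "well", "known"}

def remove_soft_hyphens(text, known_words):
    """Join 'mo-ther' into 'mother' only when the joined form is a known word."""
    def join(match):
        joined = match.group(1) + match.group(2)
        # Keep the original text (hyphen included) if the joined form is unknown.
        return joined if joined.lower() in known_words else match.group(0)
    # \s* also covers the case where the line break is still present after the dash.
    return re.sub(r"\b([A-Za-z]+)-\s*([A-Za-z]+)\b", join, text)

print(remove_soft_hyphens("her mo-ther is well-known", KNOWN_WORDS))
```

Genuine compounds such as "well-known" survive because "wellknown" is not in the word list, which is the whole point of checking the dictionary first.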


Quote:
Originally Posted by Tudor Hulubei
2. The heading/footer of PDF pages appears in the middle of paragraphs. This could be eliminated by noticing various characteristics. Sometimes the text is all caps "JUNGLE BOOK", sometimes the text has extra spaces in between letters, i.e. "J U N G L E B O O K". A statistical analysis of the text could reveal that this is a string that occurs very often, and/or occurs at equal distances in the text. Such occurrences are often preceded by the page number.
That type of formatting is also used for titles, chapter headers and lots of other types of content - it can't reliably be used for header/footer detection. PDF header/footer removal will work based on page position when the new PDF engine is ready.
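For what it's worth, the frequency idea from the quote is easy to sketch in Python, and the sketch also shows the limitation: it can only flag candidates. This is an illustration with made-up function names, not anything Calibre does, and the threshold of 3 is an arbitrary assumption.

```python
from collections import Counter

def normalize(line):
    # Drop all whitespace and case so "J U N G L E B O O K" matches "Jungle Book".
    return "".join(line.split()).upper()

def repeated_line_candidates(lines, min_count=3):
    """Return normalized lines that occur at least min_count times."""
    counts = Counter(normalize(line) for line in lines if line.strip())
    return {key for key, n in counts.items() if n >= min_count}

page_lines = [
    "JUNGLE BOOK", "first page text",
    "J U N G L E B O O K", "second page text",
    "Jungle Book", "third page text",
]
print(repeated_line_candidates(page_lines))
```

A repeated chapter title or a recurring line of dialogue would be flagged in exactly the same way, so text statistics alone can't tell a running header from real content.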


Quote:
Originally Posted by Tudor Hulubei
3. Paragraphs are often broken incorrectly, probably due to some idiosyncrasies of the OCR mechanism used. This seems like a very easy problem to solve - most paragraphs that start with a lowercase letter should probably be conflated with the paragraph before them.
Have you actually looked at the Heuristics section of Calibre's conversion options? From your post title I thought you had, but perhaps not. There's already an option for this there - it's called 'unwrap lines'.
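The rule as stated in the quote (join a paragraph that starts with a lowercase letter onto the previous one) looks roughly like this in Python. This is only a sketch of the quoted rule with a hypothetical function name; it is not a description of how the 'unwrap lines' option is implemented.

```python
def merge_broken_paragraphs(paragraphs):
    """Append any paragraph that starts with a lowercase letter to the one before it."""
    merged = []
    for para in paragraphs:
        text = para.strip()
        if not text:
            continue  # skip empty paragraphs
        if merged and text[0].islower():
            merged[-1] = merged[-1] + " " + text
        else:
            merged.append(text)
    return merged

print(merge_broken_paragraphs(["The wolf looked", "at the cub.", "Then he left."]))
```

Note that this misses breaks that fall right before a capitalized word such as a name, so on its own it only fixes part of the problem.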

Quote:
Originally Posted by Tudor Hulubei
4. The text resulting from the OCR phase should be spell-checked and the closest suggestion should be used to replace invalid words. That would eliminate many of the problems that I see now.
Spell check would be extremely difficult to do without a WYSIWYG editor, which Calibre is not. This would be a much better feature request for Sigil.
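To show what "closest suggestion" would mean in practice, here is a small Python sketch using the standard library's difflib. The word list, function name and the 0.8 cutoff are assumptions for illustration only; nothing like this is part of Calibre's conversion.

```python
import difflib

# Hypothetical dictionary; a real spell checker would use a full word list.
WORDS = ["mother", "father", "brother"]

def closest_suggestion(word, words, cutoff=0.8):
    """Return the most similar known word, or None if nothing is close enough."""
    matches = difflib.get_close_matches(word.lower(), words, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(closest_suggestion("mothcr", WORDS))
print(closest_suggestion("xyzzy", WORDS))
```

Blindly applying the top suggestion would also "correct" names, foreign words and invented terms, which is why this kind of fix really wants a human looking at each change in an editor.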

Last edited by ldolse; 01-27-2011 at 11:05 PM.