Old 01-27-2011, 10:36 PM   #1
Tudor Hulubei
Heuristics for cleaning up text

Hi there,

I have a few suggestions for some heuristics that could be applied to the text when converting from PDF to free-flowing text formats, such as epub. I've noticed the same problems again and again in many of the PDF->epub conversions I performed, so I figured it's worth suggesting them here.

1. In many cases, a word that was hyphenated across a line break keeps its hyphen after conversion. For instance, if the word "mother" is split as "... mo-" at the end of one line and "ther" at the start of the next, it shows up as "mo-ther" in the epub file. Checking a dictionary to see whether the word without the hyphen is valid could eliminate this issue.

2. The header/footer of PDF pages appears in the middle of paragraphs. It could be removed by detecting certain characteristics. Sometimes the text is all caps, e.g. "JUNGLE BOOK"; sometimes it has extra spaces between letters, e.g. "J U N G L E B O O K". A statistical analysis of the text could reveal that the string occurs very often and/or at regular intervals. Such occurrences are often preceded by the page number.

3. Paragraphs are often broken incorrectly, probably due to some idiosyncrasies of the OCR mechanism used. This seems like a very easy problem to solve - most paragraphs that start with a lowercase letter should probably be conflated with the paragraph before them.

4. The text resulting from the OCR phase should be spell-checked, and the closest suggestion should be used to replace invalid words. That would eliminate many of the problems that I see now.

Hope this helps!

Regards,
Tudor
Old 01-27-2011, 11:03 PM   #2
ldolse
Quote:
Originally Posted by Tudor Hulubei View Post
1. In many cases, a word that was hyphenated across a line break keeps its hyphen after conversion. For instance, if the word "mother" is split as "... mo-" at the end of one line and "ther" at the start of the next, it shows up as "mo-ther" in the epub file. Checking a dictionary to see whether the word without the hyphen is valid could eliminate this issue.
Conversion already does this automatically for PDF, and whenever line unwrapping is enabled. If your book was already converted from another source and has those errors, enable the 'remove hyphens' option under heuristics. If you have a file where it's not working, it's probably a bug - open a bug report with an example book at bugs.calibre-ebook.com.
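
The dictionary check discussed above can be sketched in a few lines of Python. This is only an illustration of the idea, not calibre's actual 'remove hyphens' code; the function name `dehyphenate` and the tiny `KNOWN_WORDS` set are hypothetical stand-ins for a real word list.

```python
import re

# Hypothetical stand-in for a real dictionary; not calibre's word list.
KNOWN_WORDS = {"mother", "well-known"}

def dehyphenate(text, known_words=KNOWN_WORDS):
    """Join 'mo-ther' into 'mother' only when the joined form is a known word."""
    def fix(match):
        joined = match.group(1) + match.group(2)
        # Keep the hyphen unless the dictionary confirms the joined word.
        return joined if joined.lower() in known_words else match.group(0)
    # Match a hyphen sitting between two runs of letters.
    return re.sub(r"\b([A-Za-z]+)-([a-z]+)\b", fix, text)

print(dehyphenate("my mo-ther is well-known"))
```

With this word list, "mo-ther" is joined because "mother" is known, while "well-known" keeps its hyphen because "wellknown" is not.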


Quote:
Originally Posted by Tudor Hulubei View Post
2. The header/footer of PDF pages appears in the middle of paragraphs. It could be removed by detecting certain characteristics. Sometimes the text is all caps, e.g. "JUNGLE BOOK"; sometimes it has extra spaces between letters, e.g. "J U N G L E B O O K". A statistical analysis of the text could reveal that the string occurs very often and/or at regular intervals. Such occurrences are often preceded by the page number.
That type of formatting is also used for titles, chapter headers and lots of other types of content - it can't reliably be used for header/footer detection. PDF header/footer removal will work based on page position when the new PDF engine is ready.
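
For reference, the frequency idea from the quoted suggestion could look roughly like the Python sketch below. It only flags candidates: as noted above, the same formatting is used for titles and chapter headers, so the result still needs human review. The names `normalize` and `header_candidates`, and the thresholds `min_count` and `max_len`, are assumptions made for this example and are not part of calibre.

```python
import re
from collections import Counter

def normalize(line):
    """Reduce '12 JUNGLE BOOK' and 'J U N G L E B O O K 13' to 'JUNGLEBOOK'."""
    # Drop digits (likely page numbers) and all whitespace, then uppercase.
    return re.sub(r"[\d\s]+", "", line).upper()

def header_candidates(lines, min_count=3, max_len=40):
    """Return normalized short lines that repeat at least min_count times."""
    counts = Counter(
        normalize(line) for line in lines
        if len(line.strip()) <= max_len
    )
    # Skip the empty key produced by blank or digits-only lines.
    return {key for key, n in counts.items() if key and n >= min_count}

lines = [
    "12 JUNGLE BOOK",
    "some text about Mowgli.",
    "J U N G L E B O O K 13",
    "more text",
    "Jungle Book",
    "14",
]
print(header_candidates(lines))
```

A real book title would also repeat on the title page and possibly in chapter openings, which is exactly why a count alone cannot decide what is safe to delete.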


Quote:
Originally Posted by Tudor Hulubei View Post
3. Paragraphs are often broken incorrectly, probably due to some idiosyncrasies of the OCR mechanism used. This seems like a very easy problem to solve - most paragraphs that start with a lowercase letter should probably be conflated with the paragraph before them.
Have you actually looked at the Heuristics section of Calibre's conversion options - by your post title I thought you had, but perhaps not? There's also an option for this there already - it's called 'unwrap lines'.
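
The lowercase-start rule from the quoted suggestion can be sketched as follows. This is an illustration only and does not describe how calibre's 'unwrap lines' option is implemented; `merge_broken_paragraphs` is a hypothetical name.

```python
def merge_broken_paragraphs(paragraphs):
    """Append any paragraph that starts with a lowercase letter to the previous one."""
    merged = []
    for para in paragraphs:
        text = para.strip()
        if not text:
            continue  # ignore empty paragraphs
        if merged and text[0].islower():
            # A lowercase start suggests the previous paragraph was cut short.
            merged[-1] = merged[-1] + " " + text
        else:
            merged.append(text)
    return merged

print(merge_broken_paragraphs([
    "The wolf looked at",
    "the man-cub and smiled.",
    "Then he left.",
]))
```

The rule is cheap but not safe on its own: a paragraph that legitimately opens with a lowercase word would be merged as well.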

Quote:
Originally Posted by Tudor Hulubei View Post
4. The text resulting from the OCR phase should be spell-checked, and the closest suggestion should be used to replace invalid words. That would eliminate many of the problems that I see now.
Spell check would be extremely difficult to do without a WYSIWYG editor, which Calibre is not. This is a much better feature request for Sigil.

Last edited by ldolse; 01-27-2011 at 11:05 PM.
Old 01-28-2011, 08:21 AM   #3
cybmole
Quote:
Originally Posted by Tudor Hulubei View Post
Hi there,


4. The text resulting from the OCR phase should be spell-checked, and the closest suggestion should be used to replace invalid words. That would eliminate many of the problems that I see now.

Hope this helps!

Regards,
Tudor
That would ruin many novels - authors deliberately misspell or mis-hyphenate in many cases, e.g. Flowers for Algernon.

Plus there's the proper names issue - it's impossible to spell check character names.

get better sources + use Sigil + use Microspell
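
To make the risk concrete, here is a minimal Python sketch of "replace with the closest suggestion", using the standard library's `difflib.get_close_matches`. The word list, the function name `autocorrect` and the 0.7 cutoff are assumptions chosen for illustration; a real spell checker would use a full dictionary and a better distance measure.

```python
import difflib

# Hypothetical, tiny dictionary used only for this illustration.
DICTIONARY = ["mother", "jungle", "book", "progress"]

def autocorrect(word, dictionary=DICTIONARY, cutoff=0.7):
    """Replace an unknown word with its closest dictionary match, if any."""
    if word.lower() in dictionary:
        return word
    matches = difflib.get_close_matches(word.lower(), dictionary, n=1, cutoff=cutoff)
    # Leave the word alone when nothing in the dictionary is close enough.
    return matches[0] if matches else word

print(autocorrect("rnother"))  # OCR error ("rn" read for "m"): mother
print(autocorrect("progris"))  # deliberate misspelling gets "fixed": progress
print(autocorrect("Mowgli"))   # no close match, left unchanged: Mowgli
```

The same rule that repairs the OCR error also rewrites a misspelling the author intended, and the code has no way to tell the two apart.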
Old 01-28-2011, 08:27 AM   #4
DoctorOhh
Quote:
Originally Posted by Tudor Hulubei View Post
3. Paragraphs are often broken incorrectly, probably due to some idiosyncrasies of the OCR mechanism used.

4. The text resulting from the OCR phase
Could someone enlighten or correct me? I don't think there is currently an OCR mechanism in any calibre conversion.

If there is an OCR phase in calibre, could you explain where or when it occurs?
Old 01-28-2011, 09:42 AM   #5
ldolse
There is no OCR phase in Calibre, but some of the source documents people use are rtf/txt/html files generated directly from OCR conversion software. Depending on the quality of the OCR software there can be a variety of issues.

I've actually been scanning some favorite paper books lately that aren't available electronically. I think I'm going to add a special Heuristics function just for cleaning up ABBYY-generated HTML - it's not fun going through it by hand, that's for sure.
Old 01-28-2011, 09:48 AM   #6
DoctorOhh
Quote:
Originally Posted by ldolse View Post
There is no OCR phase in Calibre, but some of the source documents people use are rtf/txt/html files generated directly from OCR conversion software. Depending on the quality of the OCR software there can be a variety of issues.
Thanks for the clarification.

I was confused because the OP sounded as if he thought the OCR part was built into calibre, and nobody said otherwise.