Heuristics for cleaning up text - MobileRead Forums

01-27-2011, 10:36 PM	#1
Tudor Hulubei Junior Member Posts: 1 Karma: 10 Join Date: Jan 2011 Device: iPad	Heuristics for cleaning up text

Hi there,

I have a few suggestions for heuristics that could be applied to the text when converting from PDF to free-flowing text formats such as epub. I've noticed the same problems again and again in many of the PDF->epub conversions I performed, so I figured it's worth suggesting them here.

1. In a lot of cases, words that were split at the end of a line keep their dash. For instance, if the word "mother" is split at the end of a line and appears as "... mo-" at the end of one line and "ther" on the next line, it shows up as "mo-ther" in the epub file. Using a dictionary to check whether the word without the dash is a valid word could eliminate this issue.

2. The header/footer of PDF pages appears in the middle of paragraphs. This could be eliminated by noticing various characteristics. Sometimes the text is all caps, "JUNGLE BOOK"; sometimes the text has extra spaces between letters, e.g. "J U N G L E B O O K". A statistical analysis of the text could reveal that this is a string that occurs very often, and/or occurs at equal distances in the text. Such occurrences are often preceded by the page number.

3. Paragraphs are often broken incorrectly, probably due to some idiosyncrasies of the OCR mechanism used. This seems like a very easy problem to solve: most paragraphs that start with a lowercase letter should probably be merged with the paragraph before them.

4. The text resulting from the OCR phase should be spell-checked, and the closest suggestion should be used to replace invalid words. That would eliminate many of the problems that I see now.

Hope this helps!

Regards,
Tudor

01-27-2011, 11:03 PM	#2
ldolse Wizard Posts: 1,337 Karma: 123455 Join Date: Apr 2009 Location: Malaysia Device: PRS-650, iPhone

Quote:

Originally Posted by Tudor Hulubei

1. In a lot of cases, words that were split at the end of a line keep their dash. For instance, if the word "mother" is split at the end of a line and appears as "... mo-" at the end of one line and "ther" on the next line, it shows up as "mo-ther" in the epub file. Using a dictionary to check whether the word without the dash is a valid word could eliminate this issue.

Conversion already does this automatically for PDF input and whenever line unwrapping is enabled. If your book was already converted from another source and has those errors, enable the 'remove hyphens' option under heuristics. If you have a file where it's not working, it's probably a bug: open a bug report with an example book at bugs.calibre-ebook.com.
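
For readers curious what the dictionary check suggested in the quoted post might look like, here is a minimal Python sketch. It is not Calibre's implementation; `KNOWN_WORDS` and `dehyphenate` are hypothetical names, and the tiny word set stands in for a real dictionary.

```python
import re

# Hypothetical stand-in for a real dictionary; an actual word list would be far larger.
KNOWN_WORDS = {"mother", "the", "said", "author"}

def dehyphenate(text, known_words=KNOWN_WORDS):
    """Remove a hyphen only when the joined word is found in the dictionary."""
    def fix(match):
        joined = match.group(1) + match.group(2)
        if joined.lower() in known_words:
            return joined          # "mo-ther" -> "mother"
        return match.group(0)      # unknown when joined: keep the hyphen
    return re.sub(r"\b([A-Za-z]+)-([A-Za-z]+)\b", fix, text)

print(dehyphenate("her mo-ther said"))      # her mother said
print(dehyphenate("a well-known author"))   # a well-known author
```

Genuinely hyphenated words survive only because their joined form is absent from the word list, so the quality of the result depends heavily on the dictionary used.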

Quote:

Originally Posted by Tudor Hulubei

2. The header/footer of PDF pages appears in the middle of paragraphs. This could be eliminated by noticing various characteristics. Sometimes the text is all caps, "JUNGLE BOOK"; sometimes the text has extra spaces between letters, e.g. "J U N G L E B O O K". A statistical analysis of the text could reveal that this is a string that occurs very often, and/or occurs at equal distances in the text. Such occurrences are often preceded by the page number.

That type of formatting is also used for titles, chapter headers and lots of other types of content, so it can't reliably be used for header/footer detection. PDF header/footer removal will work based on page position when the new PDF engine is ready.
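
As a rough illustration of the frequency idea from the quoted post (not anything Calibre does), the Python sketch below normalizes each line and drops lines whose normalized form repeats often. All function names and the `min_repeats` threshold are assumptions made for this example.

```python
import re
from collections import Counter

def normalize(line):
    """Strip a leading/trailing page number and all spacing, then uppercase."""
    line = re.sub(r"^\s*\d+\s*|\s*\d+\s*$", "", line)
    return re.sub(r"\s+", "", line).upper()

def find_running_headers(lines, min_repeats=3):
    """Return normalized strings that occur at least min_repeats times."""
    counts = Counter(normalize(l) for l in lines if l.strip())
    return {key for key, n in counts.items() if key and n >= min_repeats}

def strip_running_headers(lines, min_repeats=3):
    """Drop every line whose normalized form is a suspected running header."""
    headers = find_running_headers(lines, min_repeats)
    return [l for l in lines if normalize(l) not in headers]

page_text = [
    "12 JUNGLE BOOK", "Mowgli ran",
    "J U N G L E  B O O K 13", "through the trees",
    "JUNGLE BOOK", "and away.",
]
print(strip_running_headers(page_text))
# ['Mowgli ran', 'through the trees', 'and away.']
```

The caveat in the reply applies directly: any short line that legitimately repeats, such as a recurring chapter or section title, would be removed as well, which is why position on the page is a safer signal.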

Quote:

Originally Posted by Tudor Hulubei

3. Paragraphs are often broken incorrectly, probably due to some idiosyncrasies of the OCR mechanism used. This seems like a very easy problem to solve - most paragraphs that start with a lowercase letter should probably be conflated with the paragraph before them.

Have you actually looked at the Heuristics section of Calibre's conversion options - by your post title I thought you had, but perhaps not? There's also an option for this there already - it's called 'unwrap lines'.
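
The lowercase rule from the quoted post is simple enough to sketch in a few lines of Python. This only illustrates that suggestion and is not a description of how the 'unwrap lines' option works internally; `merge_broken_paragraphs` is a hypothetical helper.

```python
def merge_broken_paragraphs(paragraphs):
    """Join any paragraph that starts with a lowercase letter onto the previous one."""
    merged = []
    for para in paragraphs:
        text = para.strip()
        if not text:
            continue  # skip empty paragraphs
        if merged and text[0].islower():
            merged[-1] = merged[-1].rstrip() + " " + text
        else:
            merged.append(text)
    return merged

print(merge_broken_paragraphs(["The wolf looked", "at the boy.", "Then it left."]))
# ['The wolf looked at the boy.', 'Then it left.']
```

The rule misses breaks that fall before a capitalized word, such as a proper name in the middle of a sentence, so on its own it fixes only part of the problem.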

Quote:

Originally Posted by Tudor Hulubei

4. The text resulting from the OCR phase should be spell-checked, and the closest suggestion should be used to replace invalid words. That would eliminate many of the problems that I see now.

Spell check would be extremely difficult to do without a WYSIWYG editor, which Calibre is not. This is a much better feature request for Sigil.

Last edited by ldolse; 01-27-2011 at 11:05 PM.

01-28-2011, 08:21 AM	#3
cybmole Wizard Posts: 3,720 Karma: 1759970 Join Date: Sep 2010 Device: none

Quote:

Originally Posted by Tudor Hulubei

Hi there,

4. The text resulting from the OCR phase should be spell-checked, and the closest suggestion should be used to replace invalid words. That would eliminate many of the problems that I see now.

Hope this helps!

Regards,
Tudor

That would ruin many novels: authors deliberately misspell or mis-hyphenate in many cases, e.g. Flowers for Algernon.

Plus there's the proper names issue: it's impossible to spell-check character names.

Get better sources, use Sigil, and use Microspell.
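
The risk with unattended "closest suggestion" replacement can be shown with a small Python sketch built on the standard library's `difflib`. The three-word `WORDS` list and the `autocorrect` helper are hypothetical, and the input imitates the kind of deliberate misspelling found in Flowers for Algernon.

```python
import difflib

# Hypothetical tiny dictionary, purely for illustration.
WORDS = ["progress", "report", "mother"]

def autocorrect(text, words=WORDS):
    """Replace each unknown token with its closest dictionary word, if any."""
    out = []
    for token in text.split():
        if token.lower() in words:
            out.append(token)
            continue
        close = difflib.get_close_matches(token.lower(), words, n=1)
        out.append(close[0] if close else token)
    return " ".join(out)

print(autocorrect("progris riport"))   # progress report
```

The replacement is plausible from a dictionary's point of view, but it silently erases spelling the author chose on purpose, which is why a correction like this needs a human to confirm it.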

01-28-2011, 08:27 AM	#4
DoctorOhh US Navy, Retired Posts: 9,864 Karma: 13806776 Join Date: Feb 2009 Location: North Carolina Device: Icarus Illumina XL HD, Nexus 7

Quote:

Originally Posted by Tudor Hulubei

3. Paragraphs are often broken incorrectly, probably due to some idiosyncrasies of the OCR mechanism used.

4. The text resulting from the OCR phase

Could someone enlighten or correct me? I don't think there is currently an OCR mechanism in any calibre conversion.

If there is an OCR phase in calibre, could you explain where or when this process occurs?

01-28-2011, 09:42 AM	#5
ldolse Wizard Posts: 1,337 Karma: 123455 Join Date: Apr 2009 Location: Malaysia Device: PRS-650, iPhone	There is no OCR phase in Calibre, but some of the source documents people use are rtf/txt/html files generated directly by OCR conversion software. Depending on the quality of the OCR software, there can be a variety of issues. I've actually been scanning some favorite paper books lately that aren't available electronically. I think I'm going to add a special Heuristics function just for cleaning up ABBYY-generated html; it's not fun going through it by hand, that's for sure.

01-28-2011, 09:48 AM	#6
DoctorOhh US Navy, Retired Posts: 9,864 Karma: 13806776 Join Date: Feb 2009 Location: North Carolina Device: Icarus Illumina XL HD, Nexus 7

Quote:

Originally Posted by ldolse

There is no OCR phase in Calibre, but some of the source documents people use are rtf/txt/html files generated directly from OCR conversion software. Depending on the quality of the OCR software there can be a variety of issues.

Thanks for the clarification.

I was confused because the OP sounded as if he thought the OCR part was built into calibre and nobody said different.

All times are GMT -4.