#1
Junior Member
Posts: 1
Karma: 10
Join Date: Jan 2011
Device: iPad
Heuristics for cleaning up text
Hi there,
I have a few suggestions for heuristics that could be applied to the text when converting from PDF to free-flowing text formats such as epub. I've noticed the same problems again and again in many of the PDF->epub conversions I've performed, so I figured it's worth suggesting them here.

1. Words that were split at the end of a line often keep the dash. For instance, if the word "mother" is split and appears as "... mo-" at the end of one line and "ther" on the next line, it shows up as "mo-ther" in the epub file. Using a dictionary to check whether the word without the dash is a valid word could eliminate this issue.
2. The header/footer of PDF pages appears in the middle of paragraphs. This could be eliminated by noticing various characteristics. Sometimes the text is all caps ("JUNGLE BOOK"), sometimes it has extra spaces between letters, e.g. "J U N G L E B O O K". A statistical analysis of the text could reveal that this is a string that occurs very often and/or at regular intervals in the text. Such occurrences are often preceded by the page number.
3. Paragraphs are often broken incorrectly, probably due to some idiosyncrasies of the OCR mechanism used. This seems like a very easy problem to solve: most paragraphs that start with a lowercase letter should probably be merged with the paragraph before them.
4. The text resulting from the OCR phase should be spell-checked, and the closest suggestion should be used to replace invalid words. That would eliminate many of the problems that I see now.

Hope this helps!
Regards,
Tudor
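For illustration, here is a rough Python sketch of how heuristics 1 to 3 from the post above might look. It is not calibre's implementation. The function names, the tiny `KNOWN_WORDS` set (a stand-in for a real dictionary), and the `min_repeats` threshold are all assumptions made up for this example.

```python
import re
from collections import Counter

# Hypothetical stand-in for a real dictionary; a real tool would load a word list.
KNOWN_WORDS = {"mother", "jungle", "book"}


def rejoin_hyphenated(text, known_words=KNOWN_WORDS):
    """Heuristic 1: turn 'mo-ther' into 'mother' if the joined form is a known word."""
    def fix(match):
        joined = match.group(1) + match.group(2)
        # Keep the dash when the joined form is not in the dictionary.
        return joined if joined.lower() in known_words else match.group(0)

    return re.sub(r"\b([A-Za-z]+)-([A-Za-z]+)\b", fix, text)


def header_key(line):
    """Normalize a line so 'JUNGLE BOOK' and 'J U N G L E B O O K 13' compare equal."""
    # Drop a leading or trailing page number, then all whitespace, then case.
    line = re.sub(r"^\d+\s*|\s*\d+$", "", line.strip())
    return re.sub(r"\s+", "", line).upper()


def strip_running_headers(lines, min_repeats=3):
    """Heuristic 2: drop lines whose normalized form recurs at least min_repeats times."""
    keys = [header_key(line) for line in lines]
    # Only short lines are candidates; blank lines and bare page numbers are ignored.
    counts = Counter(key for key in keys if key and len(key) <= 40)
    return [line for line, key in zip(lines, keys) if counts[key] < min_repeats]


def merge_broken_paragraphs(paragraphs):
    """Heuristic 3: join a paragraph that starts with a lowercase letter to the previous one."""
    merged = []
    for para in paragraphs:
        text = para.strip()
        if merged and text[:1].islower():
            merged[-1] = merged[-1].rstrip() + " " + text
        else:
            merged.append(para)
    return merged


if __name__ == "__main__":
    print(rejoin_hyphenated("his mo-ther owned a well-known book"))
    print(merge_broken_paragraphs(["He walked to the", "river and sat down.", "Then he slept."]))
```

A frequency count alone will also catch short lines that legitimately repeat (a recurring line of dialogue, for example), so a real version would probably combine it with the other signals mentioned in the post, such as all-caps text, letter spacing, and regular spacing between occurrences.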
#2
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Spell check would be extremely difficult to do without a WYSIWYG editor, which Calibre is not. This is a much better feature request for Sigil.
Last edited by ldolse; 01-27-2011 at 11:05 PM.
#3
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
Plus there's the proper names issue: it's impossible to spell-check character names. Get better sources, use Sigil, and use Microspell.
#4
US Navy, Retired
Posts: 9,889
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Kindle PaperWhite SE 11th Gen
If there is an OCR phase in calibre, could you explain where or when this process occurs?
#5
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
There is no OCR phase in Calibre, but some of the source documents people use are rtf/txt/html files generated directly from OCR conversion software. Depending on the quality of the OCR software there can be a variety of issues.
I've actually been scanning some favorite paper books lately that aren't available electronically. I think I'm going to add a special Heuristics function just for cleaning up ABBYY-generated HTML - it's not fun going through it by hand, that's for sure.
#6
US Navy, Retired
Posts: 9,889
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Kindle PaperWhite SE 11th Gen
I was confused because the OP sounded as if he thought the OCR part was built into calibre and nobody said different.