Heuristics for cleaning up text
Hi there,
I have a few suggestions for some heuristics that could be applied to the text when converting from PDF to free-flowing text formats, such as epub. I've noticed the same problems again and again in many of the PDF->epub conversions I performed, so I figured it's worth suggesting them here.
1. There are many instances where words split across a line break are carried into the output with the hyphen intact. For instance, if the word "mother" is split at the end of a line and appears as "mo-" at the end of one line and "ther" on the next, it shows up as "mo-ther" in the epub file. Checking against a dictionary whether the word without the hyphen is valid could eliminate this issue.
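A minimal sketch of this heuristic in Python. The dictionary here is a tiny stand-in; a real implementation would load a full word list from a spell-checking package:

```python
import re

# Stand-in dictionary for illustration only; a real converter would load a
# complete word list.
DICTIONARY = {"mother", "jungle", "book", "paragraph"}

def dehyphenate(text, dictionary=DICTIONARY):
    """Join pairs like 'mo-ther' when the form without the dash is a
    valid dictionary word; leave genuine hyphenations alone."""
    def repl(match):
        joined = match.group(1) + match.group(2)
        if joined.lower() in dictionary:
            return joined          # valid word: drop the spurious dash
        return match.group(0)      # keep legitimate hyphens, e.g. "well-known"
    return re.sub(r"(\w+)-(\w+)", repl, text)
```

For example, `dehyphenate("her mo-ther said")` yields `"her mother said"`, while `"well-known"` is left untouched because `"wellknown"` is not in the dictionary.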
2. The header/footer of PDF pages appears in the middle of paragraphs. This could be detected from several characteristics: sometimes the text is in all caps ("JUNGLE BOOK"), and sometimes it has extra spaces between the letters ("J U N G L E B O O K"). A statistical analysis of the text could reveal that such a string occurs very often, and/or at roughly equal intervals. These occurrences are often preceded by the page number.
3. Paragraphs are often broken incorrectly, probably due to idiosyncrasies of the OCR mechanism used. This seems like an easy problem to solve: most paragraphs that start with a lowercase letter should probably be merged with the paragraph before them.
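The merge rule can be sketched in a few lines, assuming the text has already been split into a list of paragraphs:

```python
def merge_broken_paragraphs(paragraphs):
    """Merge a paragraph into the previous one when it starts with a
    lowercase letter -- a strong hint the break was spurious."""
    merged = []
    for para in paragraphs:
        stripped = para.lstrip()
        if merged and stripped and stripped[0].islower():
            merged[-1] = merged[-1].rstrip() + " " + stripped
        else:
            merged.append(para)
    return merged
```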
4. The text resulting from the OCR phase should be spell-checked, and invalid words should be replaced with the closest suggestion. That would eliminate many of the problems I see now.
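One way to sketch "closest suggestion" with only the standard library is fuzzy matching against a word list via `difflib`; the word list and similarity cutoff below are illustrative assumptions, and a real converter would use a proper spell-checking backend:

```python
import difflib
import re

# Stand-in word list for illustration; a real implementation would load a
# full dictionary.
WORDS = {"mother", "jungle", "wolf", "hunting", "the", "went"}

def autocorrect(text, words=WORDS):
    """Replace words not in the word list with the closest match, when a
    sufficiently similar candidate exists (cutoff chosen arbitrarily)."""
    def fix(match):
        w = match.group(0)
        if w.lower() in words:
            return w
        close = difflib.get_close_matches(w.lower(), words, n=1, cutoff=0.8)
        return close[0] if close else w
    return re.sub(r"[A-Za-z]+", fix, text)
```

For instance, the common OCR misread "huntinq" (g read as q) would be corrected back to "hunting". The cutoff matters: too low and rare-but-valid words get rewritten, too high and genuine OCR errors slip through.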
Hope this helps!
Regards,
Tudor