Quote:
Originally Posted by cybmole
The rationale for scrapping such sources is that they often have other hard-to-fix problems, e.g. paragraph breaks that occur mid-sentence, a missing or incomplete TOC, messed-up punctuation, annoying OCR errors...
I treat the presence of hard-coded "page numbers" as a warning sign:
"beware: crap conversion ahead" 
Hard-coded page numbers ALSO come from PDF conversions. They are just one of many issues; most can be fixed with a bunch of regex search-and-replace passes, but only after conversion, NOT during it. OCR errors are always going to need careful manual proofing to remove.
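For anyone wondering what that kind of regex cleanup might look like, here is a rough Python sketch. The patterns and function names are my own assumptions for illustration, not something any particular conversion tool ships with: it assumes page numbers sit alone on their own line (optionally as "Page 12" or "- 12 -") and that a paragraph break is bogus when the text before it ends in a lowercase letter, comma, or semicolon and the text after it starts with a lowercase letter. Real books will need the patterns tuned, and a chapter heading that is just a year would get eaten too.

```python
import re

# Assumed layout: a line holding nothing but a 1-4 digit page number,
# optionally written as "Page 12" or "- 12 -". Hypothetical pattern; tune per book.
PAGE_NUMBER_LINE = re.compile(
    r"^[ \t]*(?:page[ \t]+)?-?[ \t]*\d{1,4}[ \t]*-?[ \t]*$\n?",
    re.IGNORECASE | re.MULTILINE,
)

# Assumed heuristic: a blank-line break is spurious when the previous text
# ends in a lowercase letter, comma, or semicolon and the next starts lowercase.
MID_SENTENCE_BREAK = re.compile(r"(?<=[a-z,;])\n{2,}(?=[a-z])")


def strip_page_numbers(text: str) -> str:
    """Delete lines that contain only a page number."""
    return PAGE_NUMBER_LINE.sub("", text)


def join_broken_paragraphs(text: str) -> str:
    """Replace paragraph breaks that fall mid-sentence with a single space."""
    return MID_SENTENCE_BREAK.sub(" ", text)


def clean(text: str) -> str:
    """Strip page numbers first, then rejoin the sentences they split."""
    return join_broken_paragraphs(strip_page_numbers(text))


if __name__ == "__main__":
    sample = (
        "The quick brown fox\n\n12\n\njumped over the dog.\n\nA new paragraph."
    )
    print(clean(sample))
```

The order matters: the page number has to go first, otherwise it sits between the two halves of the sentence and the join pattern never sees them as neighbours. None of this touches OCR errors, which still need proofing by eye.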