Quote:
Originally Posted by msshain
The heuristic processing worked great to unify paragraphs converting PDF to ePub. I am getting numbers at various intervals though. Please see example (24 & 25) below:
nonrealistic view suggested by quantum theory. 24 Einstein protested: “I cannot seriously believe in [the quantum theory] because it cannot be reconciled with the idea that physics should represent a reality in time and space, free from spooky actions at a distance.” 25 It was in a discussion of the EPR paper that Erwin Schrödinger first coined the term “entanglement.”
Any ideas how to omit these, thanks.
Your example shows page numbers embedded within normal text (the result of a very bad OCR).
This is a slightly tedious EDITOR job, not a conversion job.
A REGEX in a conversion expects the page number to appear in a FIXED pattern, for example:
Long Winded 56
57 Short Story
Long Winded 103
104 Short Story
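For the fixed case, a minimal Python sketch of the idea follows. It assumes the page numbers sit in running header lines exactly like the ones above; the titles "Long Winded" and "Short Story" are only placeholders, and `strip_headers` is a made-up helper name, not part of any conversion tool.

```python
import re

# Hypothetical running headers, modelled on the sample lines above.
# Each pattern matches one whole line, including its trailing newline.
HEADER_PATTERNS = [
    re.compile(r"^Long Winded \d+[ \t]*(?:\n|\Z)", re.MULTILINE),
    re.compile(r"^\d+ Short Story[ \t]*(?:\n|\Z)", re.MULTILINE),
]

def strip_headers(text):
    """Remove every line that is nothing but a running header plus page number."""
    for pattern in HEADER_PATTERNS:
        text = pattern.sub("", text)
    return text

page_text = "end of one page\nLong Winded 56\n57 Short Story\nstart of the next\n"
print(strip_headers(page_text))
```

Because the header text is anchored to the start and end of a line, ordinary sentences that merely contain a number are left alone.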
When it is (semi) random, you need to step through each find. There will be many patterns to find, and you create a unique REGEX for each pattern you discover.
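Stepping through the finds might look roughly like the sketch below. It covers just ONE assumed pattern (a 1 to 3 digit number stranded between the end of a sentence and the capital letter that starts the next one, as in the quoted example); `STRAY_NUMBER`, `find_candidates` and `remove_all` are illustrative names. Other patterns in the same book would each need their own expression.

```python
import re

# One hypothetical pattern: a short number between sentence-ending
# punctuation (or a closing quote) and a capitalised word.
STRAY_NUMBER = re.compile(r'(?<=[.!?"”]) \d{1,3}(?= [A-Z])')

def find_candidates(text, pattern=STRAY_NUMBER, context=20):
    """Return (number, snippet) pairs so each hit can be checked by eye."""
    hits = []
    for m in pattern.finditer(text):
        start = max(0, m.start() - context)
        end = min(len(text), m.end() + context)
        hits.append((m.group().strip(), text[start:end]))
    return hits

def remove_all(text, pattern=STRAY_NUMBER):
    """Delete every match; meant for use only after the hits were reviewed."""
    return pattern.sub("", text)

sample = ('view suggested by quantum theory. 24 Einstein protested: '
          '“... spooky actions at a distance.” 25 It was in a discussion')

for number, snippet in find_candidates(sample):
    print(number, "|", snippet)
```

The review step matters because a pattern like this can also hit a genuine number that happens to follow a full stop, so it is safer to look at each snippet before replacing anything.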
BTW, this is probably a case to NOT have Heuristics clean up. The page pattern might have been easier to discover before the attempt to join lines. Every PDF is unique in the issues it presents (see the sticky about PDF).