MobileRead Forums - View Single Post - Erase Page? Numbers after heuristic corrections

theducks · 05-06-2015, 05:23 PM

Quote:

Originally Posted by msshain

The heuristic processing worked great to unify paragraphs converting PDF to ePub. I am getting numbers at various intervals though. Please see example (24 & 25) below:

nonrealistic view suggested by quantum theory. 24 Einstein protested: “I cannot seriously believe in [the quantum theory] because it cannot be reconciled with the idea that physics should represent a reality in time and space, free from spooky actions at a distance.” 25 It was in a discussion of the EPR paper that Erwin Schrödinger first coined the term “entanglement.”

Any ideas how to omit these, thanks.

Your example shows page numbers embedded within the normal text (a sign of very bad OCR).

This is a slightly tedious EDITOR job, not a conversion job.

A REGEX in a conversion expects the page number to appear in a FIXED pattern, for example as part of a running header or footer:
Long Winded 56
57 Short Story
Long Winded 103
104 Short Story
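
A fixed pattern like the one above can usually be caught by a single expression. Here is a minimal Python sketch, assuming each header sits alone on its own line; the titles are copied from the example, and the function name and sample text are made up for illustration:

```python
import re

# Titles come from the example above; a real book has its own running headers.
# Matches a whole line that is only "Long Winded <n>" or "<n> Short Story".
HEADER_PATTERN = re.compile(
    r"^(?:Long Winded \d+|\d+ Short Story)[ \t]*$\n?",
    re.MULTILINE,
)

def strip_running_headers(text: str) -> str:
    """Delete lines that hold nothing but a running header and a page number."""
    return HEADER_PATTERN.sub("", text)

# Hypothetical sample text, not taken from any real book.
sample = (
    "end of the left-hand page\n"
    "Long Winded 56\n"
    "57 Short Story\n"
    "start of the right-hand page\n"
)
cleaned = strip_running_headers(sample)
print(cleaned)
```

This only works while the headers are still on lines of their own. Once the lines have been joined into paragraphs, the `^` and `$` anchors have nothing to hold on to.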

When the placement is (semi) random, you need to step through each find. There will be many patterns to find, and you create a unique REGEX for each pattern you discover.
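
Stepping through the finds can be sketched in Python roughly as follows. The pattern is only one guess at a rule (a 1 to 4 digit number between sentence-ending punctuation and a capitalised word), and the helper names are hypothetical. The point is to list each hit with its context for review, then remove only the ones you have confirmed:

```python
import re

# Assumed pattern: " <number> " right after sentence-ending punctuation
# and right before a capital letter or an opening quote.
CANDIDATE = re.compile(r'(?<=[.!?”"]) (\d{1,4}) (?=[A-Z“"])')

def find_candidates(text, context=25):
    """Yield (number, surrounding snippet) for each suspicious number."""
    for m in CANDIDATE.finditer(text):
        start = max(m.start() - context, 0)
        end = min(m.end() + context, len(text))
        yield m.group(1), text[start:end]

def remove_confirmed(text, confirmed):
    """Remove only the reviewed numbers, leaving one space in their place."""
    return CANDIDATE.sub(
        lambda m: " " if m.group(1) in confirmed else m.group(0), text
    )

# Shortened version of the text quoted in the question.
sample = (
    "suggested by quantum theory. 24 Einstein protested: "
    "“spooky actions at a distance.” 25 It was in a discussion"
)
for number, snippet in find_candidates(sample):
    print(number, "->", repr(snippet))

cleaned = remove_confirmed(sample, {"24", "25"})
print(cleaned)
```

A rule this loose will also flag legitimate numbers that happen to start a sentence, which is why each hit gets looked at before it is removed.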

BTW, this is probably a case where you should NOT let Heuristics clean up. The page-number pattern might have been easier to discover before the attempt to join lines. Every PDF is unique in the issues it presents (see the sticky about PDF).