05-06-2015, 04:23 PM   #3
theducks
Well trained by Cats
Quote:
Originally Posted by msshain
The heuristic processing worked great to unify paragraphs converting PDF to ePub. I am getting numbers at various intervals though. Please see example (24 & 25) below:

nonrealistic view suggested by quantum theory. 24 Einstein protested: “I cannot seriously believe in [the quantum theory] because it cannot be reconciled with the idea that physics should represent a reality in time and space, free from spooky actions at a distance.” 25 It was in a discussion of the EPR paper that Erwin Schrödinger first coined the term “entanglement.”

Any ideas how to omit these? Thanks.
Your example shows page numbers embedded within normal text (a very bad OCR).

This is a slightly tedious EDITOR job, not a conversion job.

A REGEX in a conversion expects the Page # to appear in a FIXED pattern, such as:
Long Winded 56
57 Short Story
Long Winded 103
104 Short Story
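
For the fixed case, a rough Python sketch of what such a REGEX could look like. It assumes the header sits on a line of its own and that the titles really are "Long Winded" and "Short Story" (those come from the example lines above; the function name is made up for illustration). Treat it as a starting point, not a tested conversion setting.

```python
import re

# Hypothetical running headers, copied from the example lines above:
#   "<title> <page number>"   e.g. "Long Winded 56"
#   "<page number> <title>"   e.g. "57 Short Story"
# Only whole lines are matched, so the same words inside a sentence are left alone.
HEADER_RE = re.compile(
    r"^(?:Long Winded \d+|\d+ Short Story)[ \t]*$\n?",
    flags=re.MULTILINE,
)

def strip_fixed_headers(text: str) -> str:
    """Remove every line that consists only of one of the two fixed header patterns."""
    return HEADER_RE.sub("", text)

sample = (
    "end of one page\n"
    "Long Winded 56\n"
    "57 Short Story\n"
    "start of the next page\n"
)
cleaned = strip_fixed_headers(sample)
print(cleaned)
```

Anchoring on the whole line is the point: the pattern only works because the header always looks the same and always stands alone.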

When it is (semi) random, you need to step through each find. There will be many patterns to find, and you create a unique REGEX for each pattern you discover.
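
To illustrate the step-through idea, here is a small Python sketch that only LISTS candidates with some context so you can judge each one before deleting anything. The pattern is just one guess (a 1 to 3 digit number sitting between the end of a sentence and a capital letter), and the sample is an abridged excerpt of the quoted text; the names are hypothetical. You would write a different pattern for each case you discover.

```python
import re

# One hypothetical pattern: a lone 1-3 digit number between sentence-ending
# punctuation (or a closing quote) and the capital letter starting the next sentence.
STRAY_NUMBER_RE = re.compile(r'(?<=[.!?”"]) (\d{1,3}) (?=[A-Z])')

def list_candidates(text, width=20):
    """Return (number, surrounding context) pairs for manual review."""
    hits = []
    for m in STRAY_NUMBER_RE.finditer(text):
        start = max(0, m.start() - width)
        end = min(len(text), m.end() + width)
        hits.append((m.group(1), text[start:end]))
    return hits

sample = (
    "nonrealistic view suggested by quantum theory. 24 Einstein protested: "
    "“free from spooky actions at a distance.” 25 It was in a discussion "
    "of the EPR paper"
)
for number, context in list_candidates(sample):
    print(number, "|", context)
```

Reviewing the hits first matters because a blind replace would also eat legitimate numbers that happen to start a sentence.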

BTW, this is probably a case where you should NOT have Heuristics clean up. The page pattern might have been easier to discover before the attempt to join lines. Every PDF is unique in the issues it presents (see the sticky about PDF).