MobileRead Forums - View Single Post

Tex2002ans · 05-03-2016, 09:31 PM

Quote:

Originally Posted by notaguru

First, these are books and documents scanned by others pre-retirement, and I do not have the ability to revisit the original paper. I don't know what software was used with the OCR hardware, but am clearly stuck with the problem.

Indeed indeed... whoever did the initial job did a poor job.

A lot of the time, really crappy conversions cause you more headache and wasted time than just scrapping them and starting over from a better source!

Quote:

Originally Posted by notaguru

Second, the page numbers appear with a space between the digits.

Hmmm, well, without seeing the exact code, all we can do is take a stab in the dark. For example, I came up with this one:

Search: \s[0-9]\s[0-9]*\s*[0-9]*\s*
Replace: *INSERT A SINGLE SPACE HERE*

Broken down into its individual pieces, the regex does the following:

\s
- Look for a single whitespace character (space, tab, or line break)
[0-9]
- Look for a single digit
\s
- Look for a single whitespace character
[0-9]*
- Look for 0 or more digits
\s*
- Look for 0 or more whitespace characters
[0-9]*
- Look for 0 or more digits
\s*
- Look for 0 or more whitespace characters

This should catch one, two, or three single digits separated by spaces (I assume this will not be run on books with 1000 or more pages?).

Before:

Code:

<p>test 1 2 3 test</p>
<p>test 1 2 test</p>
<p>test 1 test</p>
<p>test 12 test</p>
<p>test 1989 test</p>

After:

Code:

<p>test test</p>
<p>test test</p>
<p>test test</p>
<p>test 12 test</p>
<p>test 1989 test</p>
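
If you want to try the same Search/Replace outside of an editor, here is a minimal sketch in Python. It assumes Python's built-in re module behaves the same as your editor's regex engine for this pattern, which should hold for something this simple but is worth checking. The function name strip_spaced_digits is made up for illustration.

```python
import re

# The Search pattern from above, unchanged.
PAGE_NUMBER = re.compile(r"\s[0-9]\s[0-9]*\s*[0-9]*\s*")

# The same five "Before" lines.
samples = [
    "<p>test 1 2 3 test</p>",
    "<p>test 1 2 test</p>",
    "<p>test 1 test</p>",
    "<p>test 12 test</p>",
    "<p>test 1989 test</p>",
]

def strip_spaced_digits(line):
    """Replace every match with a single space, like the Replace field."""
    return PAGE_NUMBER.sub(" ", line)

results = [strip_spaced_digits(line) for line in samples]
for before, after in zip(samples, results):
    print(before, "->", after)
```

The first three lines should lose their digits, while "12" and "1989" are left alone because the pattern needs whitespace right after the first digit.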

Note: Never use "Replace All" in these situations. Check each match individually as you replace it. For example, the regex above would turn "the 5 year old child" into "the year old child".
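
One way to do that checking is to list every match with a bit of surrounding text before replacing anything. This is only a rough sketch in Python: the helper find_candidates and the sample sentence are made up for illustration, and your real markup will look different.

```python
import re

# Same Search pattern as above.
PAGE_NUMBER = re.compile(r"\s[0-9]\s[0-9]*\s*[0-9]*\s*")

def find_candidates(text, context=15):
    """Return (position, matched text, surrounding snippet) for each match."""
    hits = []
    for m in PAGE_NUMBER.finditer(text):
        lo = max(0, m.start() - context)
        hi = min(len(text), m.end() + context)
        hits.append((m.start(), m.group(), text[lo:hi]))
    return hits

# Invented sample: one false positive ("5 year old") and one real page number.
text = "<p>the 5 year old child turned to page 1 2 3 of the book</p>"
candidates = find_candidates(text)
for start, matched, snippet in candidates:
    print(start, repr(matched), "...", snippet)
```

Both " 5 " and " 1 2 3 " show up as matches, which is exactly why each one needs a human decision before it gets replaced.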

Quote:

Originally Posted by notaguru

They're sometimes lower case, sometimes bold.

A number can be "lower case"?

But again... we can't really know what to do unless we can see the exact code from your books. Each book is like a unique fingerprint. There could be a lot of problems introduced that throw a wrench into the works:

bold page numbers might be wrapped in <b>s
maybe some might be wrapped in <span>s
maybe some don't have spaces between each individual number
some might be in superscripts
maybe some might be wrapped in different "calibre##" classes
[...]
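
Since so much depends on the actual markup, it can help to survey how digit runs are wrapped before writing any Search/Replace. The following is a rough Python sketch only: the tag list (b, span, sup), the helper survey_wrappers, and the sample HTML (including the class name "calibre7") are all assumptions for illustration, not taken from any real book.

```python
import re
from collections import Counter

# Matches <b>, <span ...>, or <sup> elements whose whole content is digits,
# optionally separated by whitespace. The tag list is a guess.
WRAPPED_DIGITS = re.compile(
    r"<(b|span|sup)\b([^>]*)>\s*([0-9](?:\s*[0-9])*)\s*</\1>"
)

def survey_wrappers(html):
    """Count how often each (tag, attributes) pair wraps a run of digits."""
    counts = Counter()
    for m in WRAPPED_DIGITS.finditer(html):
        tag, attrs = m.group(1), m.group(2).strip()
        counts[(tag, attrs)] += 1
    return counts

# Invented sample markup.
sample = (
    '<p>text <b>1 2</b> more text</p>\n'
    '<p>text <span class="calibre7">4 5</span> text</p>\n'
    '<p>text <sup>12</sup> text</p>\n'
    '<p>text <b>3 1</b> text</p>\n'
)
report = survey_wrappers(sample)
for (tag, attrs), n in report.most_common():
    print(n, tag, attrs)
```

A report like this only tells you which wrappers exist and how common they are; it does not decide which ones are page numbers, so each case still has to be looked at by hand.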