Quote:
Originally Posted by notaguru
First, these are books and documents scanned by others pre-retirement, and I do not have the ability to revisit the original paper. I don't know what software was used with the OCR hardware, but am clearly stuck with the problem.
|
Indeed indeed... whoever did the initial job did a poor job.
A lot of the time, a really crappy conversion costs you more headaches and time than just scrapping it and starting over from a better source!
Quote:
Originally Posted by notaguru
Second, the page numbers appear with a space between the digits.
|
Hmmm, well without having exact code, all we can do is take a stab in the dark. For example, I came up with this one:
Search: \s[0-9]\s[0-9]*\s*[0-9]*\s*
Replace: *INSERT A SINGLE SPACE HERE*
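Broken down into its small pieces, the regex does this:
- \s
  - Look for exactly 1 whitespace character (\s matches a space, a tab, or a line break)
- [0-9]
  - Look for exactly 1 digit
- \s
  - Look for exactly 1 whitespace character
- [0-9]*
  - Look for 0 or more digits
- \s*
  - Look for 0 or more whitespace characters
- [0-9]*
  - Look for 0 or more digits
- \s*
  - Look for 0 or more whitespace characters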
This should catch 1, 2, or 3 digits with spaces in between them (I assume this will not be run on books with 1000 or more pages?).
Before:
Code:
<p>test 1 2 3 test</p>
<p>test 1 2 test</p>
<p>test 1 test</p>
<p>test 12 test</p>
<p>test 1989 test</p>
After:
Code:
<p>test test</p>
<p>test test</p>
<p>test test</p>
<p>test 12 test</p>
<p>test 1989 test</p>
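If you want to double-check the Before/After outside of your editor, here is a minimal sketch in Python. It assumes Python's re module treats this pattern the same way your editor's regex engine does (it should for something this simple), and the names PAGE_NUMBER, before, and after are just made up for the example.

Code:
```python
import re

# The Search pattern from above, unchanged.
PAGE_NUMBER = re.compile(r"\s[0-9]\s[0-9]*\s*[0-9]*\s*")

before = [
    "<p>test 1 2 3 test</p>",
    "<p>test 1 2 test</p>",
    "<p>test 1 test</p>",
    "<p>test 12 test</p>",
    "<p>test 1989 test</p>",
]

# Replace every match with a single space, like the Replace field does.
after = [PAGE_NUMBER.sub(" ", line) for line in before]

for old, new in zip(before, after):
    print(old, "->", new)
```

The first three lines lose their spaced-out digits, while "12" and "1989" are left alone because the pattern requires whitespace right after the first digit.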
Note: Never use "Replace All" in these situations. Each match should be checked individually as you replace it. For example, that regex above would turn "the 5 year old child" into "the year old child".
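To illustrate why each hit needs a look, here is a rough Python sketch that only lists the matches with a bit of surrounding text and changes nothing. The helper find_candidates, its context parameter, and the sample sentence are all invented for this example; your editor's "Find Next" does the same job.

Code:
```python
import re

# Same Search pattern as above.
PAGE_NUMBER = re.compile(r"\s[0-9]\s[0-9]*\s*[0-9]*\s*")

def find_candidates(text, context=12):
    """Return (matched_text, surrounding_snippet) for every hit; text is not modified."""
    hits = []
    for m in PAGE_NUMBER.finditer(text):
        start = max(m.start() - context, 0)
        end = min(m.end() + context, len(text))
        hits.append((m.group(), text[start:end]))
    return hits

# Made-up sample: one real sentence digit and one spaced-out page number.
sample = "<p>the 5 year old child turned 1 2 3 the page</p>"
hits = find_candidates(sample)
for matched, snippet in hits:
    print(repr(matched), "in", repr(snippet))
```

Both the "5" in "5 year old" and the "1 2 3" show up as candidates, and only a human can tell which one is the stray page number.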
Quote:
Originally Posted by notaguru
They're sometimes lower case, sometimes bold.
|
A number can be "lower case"?
But again... we can't really know what to do unless we can see the exact code from your books. Each book is like a unique fingerprint. There could be a lot of problems introduced that throw a wrench into the works:
- bold page numbers might be wrapped in <b>s
- maybe some might be wrapped in <span>s
- maybe some don't have spaces between each individual number
- some might be in superscripts
- maybe some might be wrapped in different "calibre##" classes
- [...]
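Just to show how the regex might have to grow once tags get involved, here is a purely hypothetical Python sketch. It guesses that the digits are wrapped in a single <b>, <span>, or <sup> tag, and that there is whitespace on both sides. None of that is known about your actual books, so treat OPEN_TAG, CLOSE_TAG, and WRAPPED as placeholders until you can post the real code.

Code:
```python
import re

# Assumption: at most one optional wrapper tag (b, span, or sup) around the digits.
OPEN_TAG = r"(?:<(?:b|span|sup)\b[^>]*>)?"
CLOSE_TAG = r"(?:</(?:b|span|sup)>)?"

# One digit, then up to two more digits that are each preceded by one whitespace character.
WRAPPED = re.compile(r"\s" + OPEN_TAG + r"[0-9](?:\s[0-9]){0,2}" + CLOSE_TAG + r"\s")

samples = [
    "<p>test <b>1 2 3</b> test</p>",
    '<p>test <span class="calibre5">4 7</span> test</p>',
    "<p>test 12 test</p>",
]

# Replace each hit with a single space, as before.
cleaned = [WRAPPED.sub(" ", line) for line in samples]

for old, new in zip(samples, cleaned):
    print(old, "->", new)
```

The same warning applies here: step through the matches one at a time instead of using "Replace All".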