Old 05-03-2016, 09:31 PM   #5
Tex2002ans
Wizard
Quote:
Originally Posted by notaguru
First, these are books and documents scanned by others pre-retirement, and I do not have the ability to revisit the original paper. I don't know what software was used with the OCR hardware, but am clearly stuck with the problem.
Indeed... whoever did the initial conversion did a poor job.

A lot of the time, really crappy conversions cost you more headaches and time than scrapping them and starting over from a better source!

Quote:
Originally Posted by notaguru
Second, the page numbers appear with a space between the digits.
Hmmm, well, without seeing the exact code, all we can do is take a stab in the dark. For example, I came up with this one:

Search: \s[0-9]\s[0-9]*\s*[0-9]*\s*
Replace: *INSERT A SINGLE SPACE HERE*

Broken down into its small pieces, the regex does the following:
  • \s
    • Look for a single whitespace character (space, tab, or line break)
  • [0-9]
    • Look for a single digit
  • \s
    • Look for a single whitespace character
  • [0-9]*
    • Look for 0 or more digits
  • \s*
    • Look for 0 or more whitespace characters
  • [0-9]*
    • Look for 0 or more digits
  • \s*
    • Look for 0 or more whitespace characters

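This should catch one, two, or three single digits separated by spaces (I assume this will not be run on books with 1000 or more pages?). Be aware that [0-9]* matches a whole run of digits, so once the first lone digit is found, the rest of the pattern can also swallow a multi-digit number that follows it, e.g. " 1 1989 ".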

Before:

Code:
<p>test 1 2 3 test</p>
<p>test 1 2 test</p>
<p>test 1 test</p>
<p>test 12 test</p>
<p>test 1989 test</p>
After:

Code:
<p>test test</p>
<p>test test</p>
<p>test test</p>
<p>test 12 test</p>
<p>test 1989 test</p>
Note: Never use "Replace All" in these situations. Each Search/Replace should be checked individually as you replace them. For example, that regex above would delete "the 5 year old child" -> "the year old child".
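
If you would rather preview the matches outside the editor first, here is a minimal Python sketch of the same idea. It uses the regex above unchanged; the `review` helper and the `sample` list are just names I made up for illustration, and Python's `re` module is assumed to treat `\s` and `[0-9]` the same way the editor's regex engine does. It only prints what would change, so each replacement can be checked by eye.

```python
import re

# The pattern from the post, unchanged.
PAGE_NUM = re.compile(r"\s[0-9]\s[0-9]*\s*[0-9]*\s*")


def review(lines):
    """Yield (original line, first matched text, proposed line) for lines with a match."""
    for line in lines:
        m = PAGE_NUM.search(line)
        if m:
            # Replace every match in the line with a single space.
            yield line, m.group(0), PAGE_NUM.sub(" ", line)


# Sample lines from the post, plus the false positive from the note.
sample = [
    "<p>test 1 2 3 test</p>",
    "<p>test 1 2 test</p>",
    "<p>test 1 test</p>",
    "<p>test 12 test</p>",
    "<p>test 1989 test</p>",
    "<p>the 5 year old child</p>",
]

for before, matched, after in review(sample):
    # Nothing is written back; this is only a list of candidates to inspect.
    print(f"{before!r}: matched {matched!r} -> {after!r}")
```

The "5 year old child" line shows up in that list along with the real page numbers, which is exactly why each hit needs a human look before replacing.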

Quote:
Originally Posted by notaguru
They're sometimes lower case, sometimes bold.
A number can be "lower case"?

But again... we can't really know what to do unless we can see the exact code from your books. Each book is like a unique fingerprint. There could be a lot of problems introduced that throw a wrench into the works:
  • bold page numbers might be wrapped in <b>s
  • some might be wrapped in <span>s
  • some might not have spaces between the individual digits
  • some might be in superscripts
  • some might be wrapped in different "calibre##" classes
  • [...]
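
Since the real markup is unknown, one way to find out which of these cases actually occur is to survey the file before writing any Search/Replace. The following is only a rough sketch under my own assumptions: the `WRAPPED` pattern, the `survey` function, and the `sample_html` string are all hypothetical, and it only looks at elements whose entire text is one to three digits (optionally spaced). Real books will likely need the pattern adjusted.

```python
import re
from collections import Counter

# Hypothetical survey pattern: an element whose whole content is 1-3 digits,
# with optional whitespace around and between them.
WRAPPED = re.compile(r"<(\w+)([^>]*)>\s*(?:[0-9]\s*){1,3}</\1>")


def survey(html):
    """Count the opening tags (name plus attributes) that directly wrap a short digit run."""
    return Counter(f"<{m.group(1)}{m.group(2)}>" for m in WRAPPED.finditer(html))


# Made-up markup for illustration, not taken from any real book.
sample_html = (
    '<p>text <b>1 2</b> more <span class="calibre7">4 5</span> '
    "and <sup>3</sup> and <b>7</b></p>"
)

for tag, count in survey(sample_html).most_common():
    print(count, tag)
```

A tally like this tells you which wrappers to account for, but it is no substitute for looking at the actual code from the book.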

Last edited by Tex2002ans; 05-04-2016 at 01:28 AM.