Removing page numbers from OCR result

notaguru · 05-03-2016, 11:17 AM

At 78, I'm struggling with a painful transition from paper books to digital files, enabling me to manipulate the size of the typeface and also to carry a library with me. Old dog, new tricks!

Many of my books were digitized by scanning. The result often includes page numbers, which randomly appear within the text because of differences in page size. Too often, the original book designer put a space between the digits. After a frustrating hour or two with regex, I decided to ask for definitive help. But consider my age: please give me the fish, not the pole.

My goal is to use Calibre's editor to change something like this:

...the cathode must be biased 1 5 at about 1.5V to shift the output into a linear portion of the transconductance curve...

Into this:

...the cathode must be biased at about 1.5V to shift the output into a linear portion of the transconductance curve...

So, is there a set of regex lines that will do this?

This is my family account here: my son will help, but regex is alien to him as well.

Thanks!

BetterRed · 05-03-2016, 08:09 PM

@notaguru - what software are you using to do the scanning? I'm not an expert, but my understanding is that the better scanning software will remove the page numbers as part of the scanning process - ABBYY FineReader is most often mentioned as being the best.

And if you have existing skills with a word processor such as Word or Writer, then perhaps you should consider 'fixing' the bulk of the errors that result from scanning in the word processor - there are a number of alternatives for converting the word processing format to EPUB.

BR - another septuagenarian amateur

Tex2002ans · 05-03-2016, 08:43 PM

Quote:

Originally Posted by notaguru

Many of my books were digitized by scanning. The result often includes page numbers, which randomly appear within the text because of differences in page size. Too often, the original book designer put a space between the digits.

There was just a thread posted about a week ago called "Delete paragraphs in scanned books (S & R with regexes)" that covered a few examples of page number removal using regex:

https://www.mobileread.com/forums/sho...d.php?t=273495

Page numbers mixed into the text are pretty much impossible to distinguish from normal numbers... unless you can find some sort of pattern.

You mentioned a space between numbers... does a long page number look the same?
- For example, page 201 -> "example 2 0 1 text"
Is there always spacing between the individual numbers?
If you look at the code itself using Calibre's Editor, is there any sort of pattern you can see?
- For example, is it small and bold while the surrounding text is normal size?
- Is the page number on its own line?
- Is the page number in its own span?
[...]

Can't really help unless more examples are given as well.

If the numbers all appear in the middle of the text, this might be a very hard problem... because it would be impossible to distinguish between normal numbers + page numbers.

Quote:

Originally Posted by BetterRed

@notaguru - what software are you using to do the scanning? I'm not an expert, but my understanding is that the better scanning software will remove the page numbers as part of the scanning process - ABBYY FineReader is most often mentioned as being the best.

Indeed. Did you scan/OCR this yourself? Maybe you could take care of the page number problem before/during the OCR step instead!

notaguru · 05-03-2016, 09:03 PM

Thanks!

First, these are books and documents scanned by others pre-retirement, and I do not have the ability to revisit the original paper. I don't know what software was used with the OCR hardware, but am clearly stuck with the problem.

Second, the page numbers appear with a space between the digits. They're sometimes lower case, sometimes bold.

??

Tex2002ans · 05-03-2016, 09:31 PM

Quote:

Originally Posted by notaguru

First, these are books and documents scanned by others pre-retirement, and I do not have the ability to revisit the original paper. I don't know what software was used with the OCR hardware, but am clearly stuck with the problem.

Indeed indeed... whoever did the initial job did a poor job.

A lot of the time, really crappy conversions cost you more headache and time than just scrapping them and starting over from a better source!

Quote:

Originally Posted by notaguru

Second, the page numbers appear with a space between the digits.

Hmmm, well without having exact code, all we can do is take a stab in the dark. For example, I came up with this one:

Search: \s[0-9]\s[0-9]*\s*[0-9]*\s*
Replace: *INSERT A SINGLE SPACE HERE*

Broken down into its small pieces, each part of the regex does the following:

\s
- Look for one whitespace character (a space, tab, or line break)
[0-9]
- Look for a single digit
\s
- Look for one whitespace character
[0-9]*
- Look for 0 or more digits
\s*
- Look for 0 or more whitespace characters
[0-9]*
- Look for 0 or more digits
\s*
- Look for 0 or more whitespace characters

This should be able to catch 1, 2, or 3 single digits with spaces in between them (I assume this will not be run on books with more than 1000 pages?).

Before:

Code:

<p>test 1 2 3 test</p>
<p>test 1 2 test</p>
<p>test 1 test</p>
<p>test 12 test</p>
<p>test 1989 test</p>

After:

Code:

<p>test test</p>
<p>test test</p>
<p>test test</p>
<p>test 12 test</p>
<p>test 1989 test</p>
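The Before/After lines above can also be checked outside Calibre. Below is a minimal sketch using Python's re module, on the assumption that it treats this particular pattern the same way Calibre's editor does. The function name strip_spaced_page_numbers is made up for illustration. Note that re.sub replaces every match in one go, so this is only meant for testing the pattern on sample lines, not for running unattended over a whole book.

```python
import re

# Search pattern quoted from the post; \s matches any whitespace character.
PAGE_NUMBER = re.compile(r"\s[0-9]\s[0-9]*\s*[0-9]*\s*")


def strip_spaced_page_numbers(text: str) -> str:
    """Replace every match with one space, like the Replace field above."""
    return PAGE_NUMBER.sub(" ", text)


before = [
    "<p>test 1 2 3 test</p>",
    "<p>test 1 2 test</p>",
    "<p>test 1 test</p>",
    "<p>test 12 test</p>",
    "<p>test 1989 test</p>",
]
after = [strip_spaced_page_numbers(line) for line in before]

for line in after:
    print(line)
```

The first three lines lose their spaced digits, while "12" and "1989" survive because the pattern insists on whitespace right after the first digit.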

Note: Never use "Replace All" in these situations. Each Search/Replace should be checked individually as you replace them. For example, that regex above would delete "the 5 year old child" -> "the year old child".
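One way to follow that advice when working on an exported text or HTML file is to list every candidate with some surrounding text before touching anything. This is only a sketch: the names Candidate and find_candidates and the sample sentence are invented for illustration and are not part of Calibre.

```python
import re
from typing import Iterator, NamedTuple

PAGE_NUMBER = re.compile(r"\s[0-9]\s[0-9]*\s*[0-9]*\s*")


class Candidate(NamedTuple):
    start: int      # offset of the match in the text
    matched: str    # exactly what would be replaced
    context: str    # surrounding text, for a human to judge


def find_candidates(text: str, width: int = 15) -> Iterator[Candidate]:
    """Yield each match with some context instead of replacing it."""
    for m in PAGE_NUMBER.finditer(text):
        lo = max(0, m.start() - width)
        hi = min(len(text), m.end() + width)
        yield Candidate(m.start(), m.group(), text[lo:hi])


# Made-up sentence: one real number ("5 year old") and one page number ("1 2").
sample = "the 5 year old child read page 1 2 of the manual"
candidates = list(find_candidates(sample))
for c in candidates:
    print(f"{c.start:>4}  {c.matched!r:<10} ...{c.context}...")
```

Both hits are reported, and it is left to the reader to decide that only the second one is a page number.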

Quote:

Originally Posted by notaguru

They're sometimes lower case, sometimes bold.

A number can be "lower case"?

But again... we can't really know what to do unless we can see the exact code from your books. Each book is like a unique fingerprint. There could be a lot of problems introduced that throw a wrench into the works:

bold page numbers might be wrapped in <b>s
maybe some might be wrapped in <span>s
maybe some don't have spaces between each individual number
some might be in superscripts
maybe some might be wrapped in different "calibre##" classes
[...]
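If the page numbers do turn out to be wrapped in tags, the pattern would need to allow for that. The following is a rough sketch of one possibility, not something taken from the books in question: it assumes the spaced digits are the entire content of a single <b>, <span>, or <sup> element. The name WRAPPED_PAGE_NUMBER and the sample lines in the usage note are hypothetical.

```python
import re

# Hypothetical extension, not from the thread: a spaced page number of one
# to three digits that is the entire content of a <b>, <span>, or <sup>.
WRAPPED_PAGE_NUMBER = re.compile(
    r"\s*<(b|span|sup)\b[^>]*>"      # opening tag, attributes allowed
    r"\s*[0-9](?:\s[0-9]){0,2}\s*"   # 1-3 single digits separated by whitespace
    r"</\1>\s*"                      # closing tag of the same name
)


def strip_wrapped_page_numbers(html: str) -> str:
    """Replace a tag-wrapped spaced page number with a single space."""
    return WRAPPED_PAGE_NUMBER.sub(" ", html)


print(strip_wrapped_page_numbers("<p>biased <b>1 5</b> at about 1.5V</p>"))
print(strip_wrapped_page_numbers("<p>see <b>Figure 3</b> now</p>"))
```

The first line loses the bold "1 5"; the second is left alone because the bold text is not made up of digits only. A genuine bold single digit would still be removed, so each hit needs checking by eye.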

notaguru · 05-03-2016, 10:09 PM

The space between digits can be upper or lower case. All characters and spaces can be bold, italic, etc.

The proposed string - \s[0-9]\s[0-9]*\s*[0-9]*\s* - worked properly.

My problem is.... SOLVED!!

Thanks very much.

notaguru · 05-04-2016, 10:52 AM

...and the solution to Problem A illuminated Problem B - which is at a lower level of importance, but I'm greedy.

B includes spurious bold and large single-digit chapter headings. They're spurious because they don't really define chapters but produce unwanted breaks in the text. I don't know why it happened, but the solution might be as simple as that shown below, though I welcome a way to further focus upon the offending digit by identifying it as BOLD.

\s[0-9]\s

But there may be an even neater solution. The HTML shows

<h3 class="calibre7">9 </h3><p class="calibre2">

How can I delete those lines, replacing the digit "9" with any single digit?

theducks · 05-04-2016, 11:10 AM

Quote:

Originally Posted by notaguru

...and the solution to Problem A illuminated Problem B - which is at a lower level of importance, but I'm greedy.

B includes spurious bold and large single-digit chapter headings. They're spurious because they don't really define chapters but produce unwanted breaks in the text. I don't know why it happened, but the solution might be as simple as that shown below, though I welcome a way to further focus upon the offending digit by identifying it as BOLD.

\s[0-9]\s

But there may be an even neater solution. The HTML shows

<h3 class="calibre7">9 </h3><p class="calibre2">

How can I delete those lines, replacing the digit "9" with any single digit?

Code:

<h3 class="calibre7">\d\s</h3>\s*<p class="calibre2">

\d is any digit
\s is any whitespace character, such as a space (I prefer to see spaces in my pattern). There was a space in your example.
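To see what this pattern does and does not pick up, it can be tried on a few lines in Python first. A small sketch follows; the sample HTML is invented, modelled on the one line quoted in the question, and the name SPURIOUS_HEADING is made up.

```python
import re

# Pattern quoted from the post above.
SPURIOUS_HEADING = re.compile(
    r'<h3 class="calibre7">\d\s</h3>\s*<p class="calibre2">'
)

# Invented sample: two single-digit headings, one two-digit, one with text.
html = (
    '<h3 class="calibre7">9 </h3><p class="calibre2">first</p>\n'
    '<h3 class="calibre7">3 </h3>\n<p class="calibre2">second</p>\n'
    '<h3 class="calibre7">12 </h3><p class="calibre2">third</p>\n'
    '<h3 class="calibre7">Preface</h3><p class="calibre2">fourth</p>\n'
)

matches = SPURIOUS_HEADING.findall(html)
for m in matches:
    print(repr(m))
```

Only the "9 " and "3 " headings are found, whichever digit they contain. The "12 " and "Preface" headings do not match, since \d\s wants exactly one digit followed by one whitespace character.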

notaguru · 05-04-2016, 11:13 AM

My error...

I want to replace that single digit, which is "9" in the example, with nothingness. Also, it's possible that in HTML the H3 automatically introduces breaks, so I'd like to get rid of that as well.

theducks · 05-04-2016, 11:51 AM

Quote:

Originally Posted by notaguru

My error...

I want to replace that single digit, which is "9" in the example, with nothingness. Also, it's possible that in HTML the H3 automatically introduces breaks, so I'd like to get rid of that as well.

ALL of what you see in Search (there are codes that just look ahead/behind, but we are NOT going there) will be replaced with what you see in Replace, even if Replace is blank.

Tex2002ans · 05-04-2016, 02:53 PM

Quote:

Originally Posted by notaguru

B includes spurious bold and large single-digit chapter headings. They're spurious because they don't really define chapters but produce unwanted breaks in the text.

Are you sure these numbers aren't subchapters?

notaguru · 05-04-2016, 02:56 PM

No, I'm a stranger in a strange land - not sure of anything except the value of the information in those screwed up documents.

I entered this <h3 class="calibre7">9 </h3><p class="calibre2"> and did repetitive search/replace while changing that single digit, replacing the whole string with a space. Looks pretty good, but I might have to do this many times so would welcome a more efficient solution.
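A single pattern with \d in place of the literal digit would cover all ten digits in one search. One caveat: removing the opening <p class="calibre2"> on its own would leave the matching </p> further on without a partner. The sketch below assumes that the text before each spurious heading ends with </p>, which has not been confirmed for these books, and removes that closing tag together with the heading so that the two halves become one paragraph. The names HEADING_BREAK and join_across_heading and the sample text are invented for illustration.

```python
import re

# Assumption: the paragraph before the heading is closed with </p>, so the
# closing tag, the heading, and the next opening tag are removed together.
HEADING_BREAK = re.compile(
    r'</p>\s*<h3 class="calibre7">\d\s*</h3>\s*<p class="calibre2">'
)


def join_across_heading(html: str) -> str:
    """Merge the two paragraphs split by a spurious single-digit heading."""
    return HEADING_BREAK.sub(" ", html)


sample = (
    '<p class="calibre2">the cathode must be</p>\n'
    '<h3 class="calibre7">9 </h3><p class="calibre2">biased at about 1.5V</p>'
)
joined = join_across_heading(sample)
print(joined)
```

If the headings turn out to be real subchapters after all, the same pattern can be used for searching only, stepping through the hits one at a time rather than replacing them all.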
