Old 05-03-2016, 11:17 AM   #1
notaguru
Member
notaguru began at the beginning.
 
Posts: 11
Karma: 10
Join Date: May 2010
Device: Android tablet
Removing page numbers from OCR result

At 78, I'm struggling with a painful transition from paper books to digital files, enabling me to manipulate the size of the typeface and also to carry a library with me. Old dog, new tricks!

Many of my books were digitized by scanning. The result often includes page numbers, which randomly appear within the text because of differences in page size. Too often, the original book designer put a space between the digits. After a frustrating hour or two with regex, I decided to ask for definitive help. But consider my age: please give me the fish, not the pole.

My goal is to use Calibre's editor to change something like this:

...the cathode must be biased 1 5 at about 1.5V to shift the output into a linear portion of the transconductance curve...

Into this:

...the cathode must be biased at about 1.5V to shift the output into a linear portion of the transconductance curve...

So, is there a set of regex lines that will do this?

This is my family account here: my son will help, but regex is alien to him as well.

Thanks!

Last edited by notaguru; 05-03-2016 at 11:20 AM.
Old 05-03-2016, 08:09 PM   #2
BetterRed
null operator (he/him)
 
Posts: 20,570
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
@notaguru - what software are you using to do the scanning? I'm not an expert, but my understanding is that the better scanning software will remove the page numbers as part of the scanning process. ABBYY FineReader is most often mentioned as being the best.

And if you have existing skills with a word processor such as Word or Writer, then perhaps you should consider 'fixing' the bulk of the errors that result from scanning in the word processor - there are a number of alternatives for converting the word processing format to EPUB.

BR - another septuagenarian amateur
Old 05-03-2016, 08:43 PM   #3
Tex2002ans
Wizard
 
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by notaguru View Post
Many of my books were digitized by scanning. The result often includes page numbers, which randomly appear within the text because of differences in page size. Too often, the original book designer put a space between the digits.
There was just a thread posted about a week ago called "Delete paragraphs in scanned books (S & R with regexes)" that covered a few examples of page number removal using regex:

https://www.mobileread.com/forums/sho...d.php?t=273495

Page numbers mixed into the text are nearly impossible to distinguish from normal numbers... unless you can find some sort of pattern.
  • You mentioned a space between numbers... does a long page number look the same?
    • For example, page 201 -> "example 2 0 1 text"
  • Is there always spacing between the individual numbers?
  • If you look at the code itself using Calibre's Editor, is there any sort of pattern you can see?
    • For example, is it small and bold while the surrounding text is normal size?
    • Is the page number on its own line?
    • Is the page number in its own span?
  • [...]

I can't really help unless you give more examples as well.

If the numbers all appear in the middle of the text, this might be a very hard problem... because it would be impossible to distinguish between normal numbers + page numbers.

Quote:
Originally Posted by BetterRed View Post
@notaguru - what software are you using to do the scanning? I'm not an expert, but my understanding is that the better scanning software will remove the page numbers as part of the scanning process. ABBYY FineReader is most often mentioned as being the best.
Indeed. Did you scan/OCR this yourself? Maybe you could take care of the page number problem before/during the OCR step instead!

Last edited by Tex2002ans; 05-03-2016 at 08:50 PM.
Old 05-03-2016, 09:03 PM   #4
notaguru
Member
notaguru began at the beginning.
 
Posts: 11
Karma: 10
Join Date: May 2010
Device: Android tablet
Thanks!

First, these are books and documents scanned by others pre-retirement, and I do not have the ability to revisit the original paper. I don't know what software was used with the OCR hardware, but am clearly stuck with the problem.

Second, the page numbers appear with a space between the digits. They're sometimes lower case, sometimes bold.

??
Old 05-03-2016, 09:31 PM   #5
Tex2002ans
Wizard
 
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by notaguru View Post
First, these are books and documents scanned by others pre-retirement, and I do not have the ability to revisit the original paper. I don't know what software was used with the OCR hardware, but am clearly stuck with the problem.
Indeed indeed... whoever did the initial job did a poor job.

A lot of the time, really crappy conversions cause you more headaches and cost more time than scrapping them and starting over from a better source!

Quote:
Originally Posted by notaguru View Post
Second, the page numbers appear with a space between the digits.
Hmmm, well without having exact code, all we can do is take a stab in the dark. For example, I came up with this one:

Search: \s[0-9]\s[0-9]*\s*[0-9]*\s*
Replace: *INSERT A SINGLE SPACE HERE*

If I break the regex down into its small pieces, each part does this:
  • \s
    • Look for one whitespace character (a space, tab, or line break)
  • [0-9]
    • Look for a single digit
  • \s
    • Look for one whitespace character
  • [0-9]*
    • Look for 0 or more digits
  • \s*
    • Look for 0 or more whitespace characters
  • [0-9]*
    • Look for 0 or more digits
  • \s*
    • Look for 0 or more whitespace characters

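This should be able to catch 1, 2, or 3 single digits with spaces in between them (I assume this will not be run on books with 1,000 or more pages?). Strictly speaking, [0-9]* allows any number of digits in the second and third positions, so it would also match something like "1 23".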

Before:

Code:
<p>test 1 2 3 test</p>
<p>test 1 2 test</p>
<p>test 1 test</p>
<p>test 12 test</p>
<p>test 1989 test</p>
After:

Code:
<p>test test</p>
<p>test test</p>
<p>test test</p>
<p>test 12 test</p>
<p>test 1989 test</p>
Note: Never use "Replace All" in these situations. Each Search/Replace should be checked individually as you replace them. For example, that regex above would delete "the 5 year old child" -> "the year old child".
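For anyone who wants to try the search outside the editor first, here is a minimal sketch in Python. It assumes Python's `re` module treats this simple pattern the same way Calibre's regex mode does, and the function name `strip_spaced_digits` is made up for this illustration. It runs the pattern over the "Before" lines and over the "5 year old" false positive mentioned in the note.

```python
import re

# The search pattern from the post, unchanged.
PAGE_NUMBER = re.compile(r"\s[0-9]\s[0-9]*\s*[0-9]*\s*")


def strip_spaced_digits(text):
    """Replace every match with a single space, like the Search/Replace above."""
    return PAGE_NUMBER.sub(" ", text)


before = [
    "<p>test 1 2 3 test</p>",
    "<p>test 1 2 test</p>",
    "<p>test 1 test</p>",
    "<p>test 12 test</p>",
    "<p>test 1989 test</p>",
]

after = [strip_spaced_digits(line) for line in before]
for line in after:
    print(line)

# The false positive from the note: a legitimate single digit is removed too.
print(strip_spaced_digits("the 5 year old child"))
```

Because the last line shows a real number being deleted, this kind of script is only good for experimenting; stepping through each match by hand in the editor is still the safer route.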

Quote:
Originally Posted by notaguru View Post
They're sometimes lower case, sometimes bold.
A number can be "lower case"?

But again... we can't really know what to do unless we can see the exact code from your books. Each book is like a unique fingerprint. There could be a lot of problems introduced that throw a wrench into the works:
  • bold page numbers might be wrapped in <b>s
  • maybe some are wrapped in <span>s
  • maybe some don't have spaces between each individual number
  • some might be in superscripts
  • maybe some are wrapped in different "calibre##" classes
  • [...]
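As a rough illustration of how the first two cases in that list might be handled, here is a sketch in Python. Everything about the markup is a guess: it assumes the whole page number sits inside a single <b> or <span> element, and the class name "calibre9" and the function name are invented for the example. The real code in the book may look quite different.

```python
import re

# One to three single digits, each separated by one whitespace character.
SPACED_DIGITS = r"[0-9](?:\s[0-9]){0,2}"

# Hypothetical case: the whole page number is inside one <b> or <span>.
# \1 makes the closing tag match whichever opening tag was found.
WRAPPED_PAGE_NUMBER = re.compile(
    r"\s*<(b|span)\b[^>]*>\s*" + SPACED_DIGITS + r"\s*</\1>\s*"
)


def strip_wrapped_page_numbers(html):
    """Replace a wrapped page number, tags included, with one space."""
    return WRAPPED_PAGE_NUMBER.sub(" ", html)


print(strip_wrapped_page_numbers("<p>must be biased <b>1 5</b> at about 1.5V</p>"))
print(strip_wrapped_page_numbers('<p>example <span class="calibre9">2 0 1</span> text</p>'))
# A normal bold number such as a year has no spaces between digits, so it stays.
print(strip_wrapped_page_numbers("<p>published in <b>1989</b> by</p>"))
```

This does nothing for superscripts or for numbers without spaces between the digits; those would need their own patterns once the actual code is known.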

Last edited by Tex2002ans; 05-04-2016 at 01:28 AM.
Old 05-03-2016, 10:09 PM   #6
notaguru
Member
notaguru began at the beginning.
 
Posts: 11
Karma: 10
Join Date: May 2010
Device: Android tablet
The space between digits can be upper or lower case. All characters and spaces can be bold, italic, etc.

The proposed string - \s[0-9]\s[0-9]*\s*[0-9]*\s* - worked properly.

My problem is.... SOLVED!!

Thanks very much.
Old 05-04-2016, 10:52 AM   #7
notaguru
Member
notaguru began at the beginning.
 
Posts: 11
Karma: 10
Join Date: May 2010
Device: Android tablet
...and the solution to Problem A illuminated Problem B - which is at a lower level of importance, but I'm greedy.

B includes spurious bold and large single-digit chapter headings. They're spurious because they don't really define chapters but produce unwanted breaks in the text. I don't know why it happened, but the solution might be as simple as that shown below, though I welcome a way to further focus upon the offending digit by identifying it as BOLD.

\s[0-9]\s

But there may be an even neater solution. The HTML shows

<h3 class="calibre7">9 </h3><p class="calibre2">

How can I delete those lines, replacing the digit "9" with any single digit?

Last edited by notaguru; 05-04-2016 at 10:59 AM.
Old 05-04-2016, 11:10 AM   #8
theducks
Well trained by Cats
 
Posts: 29,803
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Quote:
Originally Posted by notaguru View Post
...and the solution to Problem A illuminated Problem B - which is at a lower level of importance, but I'm greedy.

B includes spurious bold and large single-digit chapter headings. They're spurious because they don't really define chapters but produce unwanted breaks in the text. I don't know why it happened, but the solution might be as simple as that shown below, though I welcome a way to further focus upon the offending digit by identifying it as BOLD.

\s[0-9]\s

But there may be an even neater solution. The HTML shows

<h3 class="calibre7">9 </h3><p class="calibre2">

How can I delete those lines, replacing the digit "9" with any single digit?
Code:
<h3 class="calibre7">\d\s</h3>\s*<p class="calibre2">
\d is any digit
\s is a space or other whitespace (I prefer to see spaces in my pattern). There was a space in your example.
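To see what that pattern does and does not pick up, here is a small Python sketch. The sample text is invented, and it assumes Python's `re` module behaves like the editor's regex mode for this pattern. The third sample heading has two digits, so the single \d should skip it.

```python
import re

# The pattern from the post, unchanged.
HEADING = re.compile(r'<h3 class="calibre7">\d\s</h3>\s*<p class="calibre2">')

# Made-up sample: two single-digit headings and one two-digit heading.
sample = (
    '<h3 class="calibre7">9 </h3><p class="calibre2">first</p>\n'
    '<h3 class="calibre7">4 </h3>\n<p class="calibre2">second</p>\n'
    '<h3 class="calibre7">12 </h3><p class="calibre2">third</p>\n'
)

matches = HEADING.findall(sample)
for found in matches:
    # repr() makes the line break inside the second match visible.
    print(repr(found))
```

The \s* between the closing </h3> and the <p> is what lets the second heading match even though the paragraph starts on the next line.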

Last edited by theducks; 05-04-2016 at 11:12 AM. Reason: If prettified, it needs line end help
Old 05-04-2016, 11:13 AM   #9
notaguru
Member
notaguru began at the beginning.
 
Posts: 11
Karma: 10
Join Date: May 2010
Device: Android tablet
My error...

I want to replace that single digit, which is "9" in the example, with nothingness. Also, it's possible that in HTML the H3 automatically introduces breaks, so I'd like to get rid of that as well.
Old 05-04-2016, 11:51 AM   #10
theducks
Well trained by Cats
 
Posts: 29,803
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Quote:
Originally Posted by notaguru View Post
My error...

I want to replace that single digit, which is "9" in the example, with nothingness. Also, it's possible that in HTML the H3 automatically introduces breaks, so I'd like to get rid of that as well.
ALL of what you see in Search (there are codes that just look ahead/behind, but we are NOT going there) will be replaced with what you see in Replace, even if Replace is blank (in which case the match is simply deleted).
Old 05-04-2016, 02:53 PM   #11
Tex2002ans
Wizard
 
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by notaguru View Post
B includes spurious bold and large single-digit chapter headings. They're spurious because they don't really define chapters but produce unwanted breaks in the text.
Are you sure these numbers aren't subchapters?
Old 05-04-2016, 02:56 PM   #12
notaguru
Member
notaguru began at the beginning.
 
Posts: 11
Karma: 10
Join Date: May 2010
Device: Android tablet
No, I'm a stranger in a strange land - not sure of anything except the value of the information in those screwed up documents.

I entered this <h3 class="calibre7">9 </h3><p class="calibre2"> and did repetitive search/replace while changing that single digit, replacing the whole string with a space. Looks pretty good, but I might have to do this many times so would welcome a more efficient solution.
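One possible way to avoid repeating the search for each digit is to let \d stand in for the digit, so a single search covers 0 through 9. The Python sketch below goes a step further under a stated assumption: that each spurious heading sits between two paragraphs, so the closing </p> just before it is removed as well and the text rejoins into one well-formed paragraph. The sample text and the function name are invented for illustration.

```python
import re

# \d covers any single digit, so one search replaces ten manual ones.
# Assumption: a closing </p> comes right before the heading; it is consumed
# too, so no stray </p> is left behind when the opening <p> is removed.
SPURIOUS_HEADING = re.compile(
    r'</p>\s*<h3 class="calibre7">\d\s*</h3>\s*<p class="calibre2">'
)


def join_across_headings(html):
    """Return the cleaned text and how many headings were removed."""
    return SPURIOUS_HEADING.subn(" ", html)


sample = (
    '<p class="calibre2">the cathode must be</p>\n'
    '<h3 class="calibre7">9 </h3><p class="calibre2">biased at about 1.5V</p>'
)

cleaned, count = join_across_headings(sample)
print(cleaned)
print(count, "heading(s) removed")
```

If the books really do follow that layout, the same pattern could probably be pasted into the editor's Search box in regex mode with a single space in Replace, still stepping through the matches one at a time in case some of those headings turn out to be real subchapters.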


All times are GMT -4. The time now is 10:24 PM.

