05-03-2016, 11:17 AM | #1 |
Member
Posts: 11
Karma: 10
Join Date: May 2010
Device: Android tablet
|
Removing page numbers from OCR result
At 78, I'm struggling with a painful transition from paper books to digital files, enabling me to manipulate the size of the typeface and also to carry a library with me. Old dog, new tricks!
Many of my books were digitized by scanning. The result often includes page numbers, which randomly appear within the text because of differences in page size. Too often, the original book designer put a space between the digits. After a frustrating hour or two with regex, I decided to ask for definitive help. But consider my age: please give me the fish, not the pole.
My goal is to use Calibre's editor to change something like this:
...the cathode must be biased 1 5 at about 1.5V to shift the output into a linear portion of the transconductance curve...
Into this:
...the cathode must be biased at about 1.5V to shift the output into a linear portion of the transconductance curve...
So, is there a set of regex lines that will do this? This is my family account here: my son will help, but regex is alien to him as well. Thanks!
Last edited by notaguru; 05-03-2016 at 11:20 AM. |
05-03-2016, 08:09 PM | #2 |
null operator (he/him)
Posts: 20,570
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
@notaguru - What software are you using to do the scanning? I'm not an expert, but my understanding is that the better scanning software will remove the page numbers as part of the scanning process. ABBYY FineReader is most often mentioned as being the best.
And if you have existing skills with a word processor such as Word or Writer, then perhaps you should consider using the word processor to 'fix' the bulk of the errors that result from scanning. There are a number of alternatives for converting the word processing format to EPUB.
BR - another septuagenarian amateur |
05-03-2016, 08:43 PM | #3 | |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
https://www.mobileread.com/forums/sho...d.php?t=273495
Page numbers mixed into the text are nearly impossible to distinguish from normal numbers... unless you can find some sort of pattern.
Can't really help unless more examples are given. If the numbers all appear in the middle of the text, this might be a very hard problem, because nothing would separate a page number from a number that belongs in the sentence.
Indeed. Did you scan/OCR this yourself? Maybe you could take care of the page number problem before/during the OCR step instead!
Last edited by Tex2002ans; 05-03-2016 at 08:50 PM. |
|
05-03-2016, 09:03 PM | #4 |
Member
Posts: 11
Karma: 10
Join Date: May 2010
Device: Android tablet
|
Thanks!
First, these are books and documents scanned by others pre-retirement, and I do not have the ability to revisit the original paper. I don't know what software was used with the OCR hardware, but am clearly stuck with the problem. Second, the page numbers appear with a space between the digits. They're sometimes lower case, sometimes bold. ?? |
05-03-2016, 09:31 PM | #5 | ||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
A lot of the time, really crappy conversions cause you more headaches and lost time than just scrapping the file and starting from a better source! Quote:
Search: \s[0-9]\s[0-9]*\s*[0-9]*\s*
Replace: *INSERT A SINGLE SPACE HERE*
If I broke the regex down into its own small pieces, each part does:
\s = exactly one whitespace character
[0-9] = exactly one digit
[0-9]* = zero or more digits
\s* = zero or more whitespace characters
This should be able to catch 1, 2, or 3 single digits with spaces in between them (I assume this will not be run on books with more than 1000 pages?).
Before:
Code:
<p>test 1 2 3 test</p>
<p>test 1 2 test</p>
<p>test 1 test</p>
<p>test 12 test</p>
<p>test 1989 test</p>
After:
Code:
<p>test test</p>
<p>test test</p>
<p>test test</p>
<p>test 12 test</p>
<p>test 1989 test</p>
A number can be "lower case"?
But again... we can't really know what to do unless we can see the exact code from your books. Each book is like a unique fingerprint. There could be a lot of problems introduced that throw a wrench into the works.
Last edited by Tex2002ans; 05-04-2016 at 01:28 AM. |
||
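For anyone who wants to try the search expression from post #5 outside Calibre, here is a minimal Python sketch. It assumes Calibre's editor follows Python-style regular expression rules, which I believe is the case but have not checked against every version. The function name strip_spaced_digits and the STRICT pattern are my own illustrations, not something from the thread. The sketch also shows a caveat: the posted expression removes any lone single digit that has whitespace on both sides, so a genuine number such as the "3" in "pin 3" disappears as well.

```python
import re

# The search expression exactly as posted in the thread.
POSTED = re.compile(r"\s[0-9]\s[0-9]*\s*[0-9]*\s*")

# Hypothetical stricter variant (my assumption, not from the thread):
# requires two or three single digits separated by whitespace, so a
# lone digit inside a sentence is left alone.
STRICT = re.compile(r"\s[0-9](?:\s[0-9]){1,2}\s")


def strip_spaced_digits(text, pattern=POSTED):
    """Replace every match of `pattern` with a single space."""
    return pattern.sub(" ", text)


if __name__ == "__main__":
    samples = [
        "biased 1 5 at about 1.5V",  # spaced page number "1 5"
        "<p>test 12 test</p>",       # ordinary two-digit number
        "pin 3 of the tube",         # genuine single digit
    ]
    for line in samples:
        print(repr(line))
        print("  posted:", repr(strip_spaced_digits(line)))
        print("  strict:", repr(strip_spaced_digits(line, STRICT)))
```

The stricter variant trades one risk for another: it will no longer catch a one-digit page number, so pages 1 to 9 would need a manual pass.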
05-03-2016, 10:09 PM | #6 |
Member
Posts: 11
Karma: 10
Join Date: May 2010
Device: Android tablet
|
The space between digits can be upper or lower case. All characters and spaces can be bold, italic, etc.
The proposed string - \s[0-9]\s[0-9]*\s*[0-9]*\s* - worked properly. My problem is.... SOLVED!! Thanks very much. |
05-04-2016, 10:52 AM | #7 |
Member
Posts: 11
Karma: 10
Join Date: May 2010
Device: Android tablet
|
...and the solution to Problem A illuminated Problem B - which is at a lower level of importance, but I'm greedy.
B includes spurious bold and large single-digit chapter headings. They're spurious because they don't really define chapters but produce unwanted breaks in the text. I don't know why it happened, but the solution might be as simple as that shown below, though I welcome a way to further focus upon the offending digit by identifying it as BOLD.
\s[0-9]\s
But there may be an even neater solution. The HTML shows
<h3 class="calibre7">9 </h3><p class="calibre2">
How can I delete those lines, replacing the digit "9" with any single digit?
Last edited by notaguru; 05-04-2016 at 10:59 AM. |
05-04-2016, 11:10 AM | #8 | |
Well trained by Cats
Posts: 29,803
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
Code:
<h3 class="calibre7">\d\s</h3>\s*<p class="calibre2">
\d matches any single digit. \s matches a space or other whitespace character (I prefer to see spaces in my pattern). There was a space in your example.
Last edited by theducks; 05-04-2016 at 11:12 AM. Reason: If prettified, it needs line end help |
|
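To check that the pattern from post #8 covers every digit and not only the "9" in the example, a small Python sketch can run it against made-up headings. Only the pattern comes from the thread; the function name and the sample strings are hypothetical.

```python
import re

# Pattern as posted: \d is any one digit, \s any one whitespace character.
SPURIOUS_HEADING = re.compile(
    r'<h3 class="calibre7">\d\s</h3>\s*<p class="calibre2">'
)


def is_spurious_heading(html):
    """Return True if the snippet holds a one-digit h3 followed by a paragraph tag."""
    return SPURIOUS_HEADING.search(html) is not None


# Every single digit is caught, with or without a line break after </h3>.
for digit in "0123456789":
    assert is_spurious_heading(f'<h3 class="calibre7">{digit} </h3><p class="calibre2">')
    assert is_spurious_heading(f'<h3 class="calibre7">{digit} </h3>\n<p class="calibre2">')

# A two-digit heading, or one without the space after the digit, is not matched.
assert not is_spurious_heading('<h3 class="calibre7">12 </h3><p class="calibre2">')
assert not is_spurious_heading('<h3 class="calibre7">9</h3><p class="calibre2">')
```

The \s* between the two tags is what lets the pattern survive a file where the heading and the paragraph have been placed on separate lines.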
05-04-2016, 11:13 AM | #9 |
Member
Posts: 11
Karma: 10
Join Date: May 2010
Device: Android tablet
|
My error...
I want to replace that single digit, which is "9" in the example, with nothingness. Also, it's possible that in HTML the H3 automatically introduces breaks, so I'd like to get rid of that as well. |
05-04-2016, 11:51 AM | #10 | |
Well trained by Cats
Posts: 29,803
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
Whatever the search matches will be replaced with what you see in the Replace box. If Replace is blank, the match is simply deleted. |
|
05-04-2016, 02:53 PM | #11 |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
|
05-04-2016, 02:56 PM | #12 |
Member
Posts: 11
Karma: 10
Join Date: May 2010
Device: Android tablet
|
No, I'm a stranger in a strange land - not sure of anything except the value of the information in those screwed up documents.
I entered this <h3 class="calibre7">9 </h3><p class="calibre2"> and did repetitive search/replace while changing that single digit, replacing the whole string with a space. Looks pretty good, but I might have to do this many times so would welcome a more efficient solution. |
|
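A single pass can handle all ten digits at once. The sketch below is in Python and assumes the headings look exactly like the one quoted in this thread; the function name and the sample page are made up. It differs from the manual search/replace described in post #12 in one respect: it removes only the h3 element and puts the opening <p class="calibre2"> tag back, since deleting that tag would leave the paragraph's closing </p> without a partner.

```python
import re

# \d\s* tolerates a missing space after the digit (my assumption; the
# thread's example always has one). Group 1 captures the paragraph's
# opening tag so the replacement can restore it.
HEADING_THEN_PARAGRAPH = re.compile(
    r'<h3 class="calibre7">\d\s*</h3>\s*(<p class="calibre2">)'
)


def drop_digit_headings(html):
    """Remove one-digit h3 headings, keeping the paragraph tag that follows."""
    return HEADING_THEN_PARAGRAPH.sub(r"\1", html)


if __name__ == "__main__":
    page = (
        '<p class="calibre2">first part</p>'
        '<h3 class="calibre7">9 </h3><p class="calibre2">second part</p>'
    )
    print(drop_digit_headings(page))
```

In Calibre's editor the same idea should work by using the search expression above in regex mode with \1 in the Replace box, though I have not verified that on the poster's files, so test it on a copy of the book first.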