05-03-2016, 08:43 PM   #3
Tex2002ans
Wizard
Quote:
Originally Posted by notaguru
Many of my books were digitized by scanning. The result often includes page numbers, which randomly appear within the text because of differences in page size. Too often, the original book designer put a space between the digits.
A thread posted about a week ago, "Delete paragraphs in scanned books (S & R with regexes)", covered a few examples of removing page numbers with regex:

https://www.mobileread.com/forums/sho...d.php?t=273495

Page numbers mixed into the text are nearly impossible to distinguish from ordinary numbers... unless you can find some sort of pattern.
  • You mentioned a space between the digits... does a long page number look the same?
    • For example, page 201 -> "example 2 0 1 text"
  • Is there always spacing between the individual digits?
  • If you look at the code itself using Calibre's Editor, is there any sort of pattern you can see?
    • For example, is it small and bold while the surrounding text is normal size?
    • Is the page number on its own line?
    • Is the page number in its own span?
  • [...]
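
If the answers turn out to be "yes, it sits in its own paragraph, with optional spaces between the digits", a rough sketch of the cleanup could look like the Python below. The `<p>` wrapper, the 1 to 4 digit limit, and the function name are all my assumptions for illustration, not something taken from your book, so adjust the pattern to whatever the code actually shows.

```python
import re

# Assumption: the page number is the ONLY content of its paragraph,
# e.g. <p>2 0 1</p> or <p class="x">45</p>, with 1 to 4 digits that may
# be separated by single spaces.
PAGE_NUMBER_PARAGRAPH = re.compile(r"<p\b[^>]*>\s*\d(?: ?\d){0,3}\s*</p>\s*")


def strip_page_number_paragraphs(html):
    """Remove paragraphs that hold nothing but a (possibly spaced) number."""
    return PAGE_NUMBER_PARAGRAPH.sub("", html)


sample = (
    "<p>end of the sentence.</p>\n"
    "<p>2 0 1</p>\n"
    "<p>In 1984 he left.</p>\n"
)
print(strip_page_number_paragraphs(sample))
```

Numbers inside a normal sentence (like the "1984" above) are left alone, because the pattern only matches when the digits fill the whole paragraph.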

I can't really help further unless you post more examples.

If the numbers all appear in the middle of the text, this might be a very hard problem... because there would be no reliable way to tell normal numbers apart from page numbers.
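
In that mid-text case, the safest thing I can suggest is to list the suspicious spots and check them by hand instead of deleting anything automatically. Here is a minimal sketch, assuming the stray page numbers show up as 2 to 4 single digits separated by single spaces (like "2 0 1"). The pattern and the helper name are hypothetical, and ordinary text can match too, which is exactly why it only reports.

```python
import re

# Assumption: an inline page number is 2 to 4 single digits separated by
# single spaces ("2 0 1"). The lookarounds skip digits that belong to a
# longer run, so "1984" or "1 2 3 4 5" are not reported.
SPACED_DIGITS = re.compile(r"(?<!\d)(?<!\d )\d(?: \d){1,3}(?! ?\d)")


def find_candidates(text, context=20):
    """Return (match, surrounding snippet) pairs for manual review."""
    hits = []
    for m in SPACED_DIGITS.finditer(text):
        start = max(0, m.start() - context)
        end = min(len(text), m.end() + context)
        hits.append((m.group(), text[start:end]))
    return hits


for number, snippet in find_candidates("example 2 0 1 text"):
    print(repr(number), "in", repr(snippet))
```

Something like "he scored 3 4 times" would be flagged as well, so treat every hit as a candidate to look at, not as a confirmed page number.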

Quote:
Originally Posted by BetterRed
@notaguru - what software are you using to do the scanning? I'm not an expert, but my understanding is that the better scanning software will remove the page numbers as part of the scanning process. ABBYY FineReader is most often mentioned as being the best.
Indeed. Did you scan/OCR this yourself? Maybe you could take care of the page number problem before/during the OCR step instead!

Last edited by Tex2002ans; 05-03-2016 at 08:50 PM.