04-27-2016, 03:12 PM   #7
theducks
Well trained by Cats
Quote:
Originally Posted by Tex2002ans View Post
I don't believe there is a collection like that. Once you really learn the basics (by reading regular-expressions.info), you could come up with all the Regex yourself. That is what I do for the most part: I just create them on the fly as I need them, because each book's code comes with its own problems.

As you can see, a book might have:
  • a bold page number
  • an italic page number
  • a page number on its own line
  • a page number in the middle of text
  • a bold+italic page number
  • [###]
  • (###)
  • <b class="calibre#">###</b>
  • <b class="block#">###</b>
  • <span class="pagenumber">###</span>
  • <sup>###</sup>
  • [...]

It would make no sense to create a giant list of Regex for each of those... because they all follow the same basic rules!
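
To illustrate the "same basic rules" point, here is a minimal Python sketch of one pattern that covers several of the forms listed above. The pattern, the 1-4 digit limit, and the function name `strip_page_numbers` are assumptions made up for this example, not something from a real book; a pattern this loose would also eat things like a year in parentheses, so check matches before replacing.

```python
import re

# Hypothetical pattern: a 1-4 digit page number wrapped in brackets,
# parentheses, or one of the simple tags from the list above.
PAGE_NUMBER = re.compile(
    r"""
      \[\d{1,4}\]                               # [###]
    | \(\d{1,4}\)                               # (###)
    | <(b|i|sup|span)(?:\s+class="[^"]*")?>     # opening tag, optional class
      \d{1,4}                                   # the page number itself
      </\1>                                     # closing tag must match the opening one
    """,
    re.VERBOSE,
)


def strip_page_numbers(html: str) -> str:
    """Remove every page-number form that PAGE_NUMBER recognises."""
    return PAGE_NUMBER.sub("", html)
```

The wrapper changes from book to book, but the core (`\d{1,4}` plus whatever surrounds it) stays the same, which is why building the expression on the fly is usually quicker than hunting through a big list.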

I would just visit regular-expressions.info and follow along with the tutorials. It has lots of examples to learn from!
And you can get all of the above in the same book (an OCR of a scan).

IMHO what is also important is the ORDER in which you fix them. If you don't get it right, the next fix (or join) will be more difficult.

I remove all page-header lines (section/title or author) that carry a page number first. This takes more than one template, as there are left- and right-side variations.
I believe the text paragraphs that include a page number are among the last things I fix.
(I just look and write the needed REGEX now.)
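
A rough Python sketch of that ordering, under stated assumptions: the header text ("THE BOOK TITLE", "AUTHOR NAME"), the three patterns, and the `clean` helper are all made up for illustration, and a real book would need its own templates.

```python
import re

# Hypothetical running-header templates (title and author are placeholders).
# Left-hand pages: page number, then title. Right-hand pages: author, then page number.
HEADER_LEFT = re.compile(
    r"^[ \t]*\[?\d{1,4}\]?[ \t]+THE BOOK TITLE[ \t]*$\n?", re.MULTILINE
)
HEADER_RIGHT = re.compile(
    r"^[ \t]*AUTHOR NAME[ \t]+\[?\d{1,4}\]?[ \t]*$\n?", re.MULTILINE
)
# A bracketed page number sitting inside a paragraph of text.
INLINE_PAGE = re.compile(r"[ \t]*\[\d{1,4}\]")

# Headers first, page numbers inside text paragraphs last.
ORDERED_PASSES = [HEADER_LEFT, HEADER_RIGHT, INLINE_PAGE]


def clean(text, passes=ORDERED_PASSES):
    """Apply each pattern in turn, deleting whatever it matches."""
    for pattern in passes:
        text = pattern.sub("", text)
    return text
```

If the inline pass ran first here, it would strip the bracketed number out of a header line, the header template would no longer match, and the leftover title line would have to be found and removed by hand.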

Learn basic REGEX,