Old 06-21-2022, 07:20 PM   #16
Tex2002ans
Quote:
Originally Posted by jordy1955
I have some eBooks that were clearly produced by less than spectacular OCR software.

[...]

One of the main problems is line breaks in the wrong places (eg in the middle of a sentence), making the text very difficult to follow.
I've written about this many times over the years. Here's 2 of the topics:

Also, you may be interested in this thread:

where I broke down 5 different Regexes + color-coordinated them + explained them step-by-step.

Quote:
Originally Posted by jordy1955
Awesome stuff guys. Just ran it on a book and - once I got my head around it properly - I completed the editing and re-formatting in about 1hr - about 4 hours less than it usually takes me.
I'll get much quicker with practice but this is great.
Regular Expressions are amazing.

When you learn to search (and replace) via patterns, you can save SO MUCH TIME compared to the old way of doing searches one-by-one.

A few helpful ones I've used are:

Regex #1 (Full Month + Day)

Search: (January|February|March|April|May|June|July|August|September|October|November|December) (\d{1,2}),

It looks for:
  • "January" OR "February" OR "March" OR [...] "December"
    • Tosses it into Group 1.
  • + a space
  • + 1 or 2 digits in a row
    • Tosses it into Group 2.
  • + a comma

which matches:
  • January 17,
  • February 20,
  • December 15,
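
If you want to try this outside of an editor's Search box, here is a rough sketch using Python's re module. The sample sentence and the variable names are made up for illustration, and the pattern is written with no space after "August":

```python
import re

# Regex #1: full month name, a space, 1 or 2 digits, then a comma.
MONTH_DAY = re.compile(
    r"(January|February|March|April|May|June|July|August|September"
    r"|October|November|December) (\d{1,2}),"
)

# Hypothetical sample text, only for illustration.
sample = "Letters dated January 17, 1803 and December 15, 1804."

# With two groups, findall returns (Group 1, Group 2) = (month, day) pairs.
matches = MONTH_DAY.findall(sample)
print(matches)  # [('January', '17'), ('December', '15')]
```

Note the years are not part of the match. The pattern stops at the comma.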

* * * * *

Side Note #1: You could easily replace that with a:

Replace: \2 \1

to change it into a "flip the date from American -> British" regex:
  • March 16, 1999 -> 16 March 1999
  • October 1, 1776 -> 1 October 1776
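
A minimal sketch of that flip in Python, assuming the Search pattern from Regex #1 with no stray spaces inside it. The function name flip_dates is just a placeholder; Python's re.sub happens to accept the same \2 \1 replacement syntax:

```python
import re

# Same Search pattern as Regex #1.
MONTH_DAY = re.compile(
    r"(January|February|March|April|May|June|July|August|September"
    r"|October|November|December) (\d{1,2}),"
)

def flip_dates(text: str) -> str:
    """Rewrite 'Month Day,' as 'Day Month'.

    The comma is part of the match, so it disappears in the replacement.
    """
    return MONTH_DAY.sub(r"\2 \1", text)

print(flip_dates("March 16, 1999"))   # 16 March 1999
print(flip_dates("October 1, 1776"))  # 1 October 1776
```

Because the comma is consumed by the match and not put back, the space that followed it is what ends up separating the month from the year.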

* * * * *

Regex #2 (Shortened Month + Comma) (Typo)

Search: (Jan|Feb|Mar|Apr|Aug|Sept|Oct|Nov|Dec),
Replace: \1.

It looks for:
  • "Jan" OR "Feb" OR [...] "Dec"
    • Captures it in Group 1.
  • + a comma

and Replaces with:
  • Whatever month got captured in Group 1.
  • + a period.

which changes:
  • Jan, 17 -> Jan. 17
  • Feb, 20 -> Feb. 20
  • Dec, 15 -> Dec. 15

This typo is quite common in OCR, where a speck of dust can easily change a period into a comma, and it's even a common error found in tables/footnotes.

(One of the books I worked on was a multi-volume Thomas Jefferson book which cited dates of every written letter... SO many references had that typo in there!)
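
Regex #2 can be sketched the same way in Python. This is only an illustration: the function name fix_month_commas is hypothetical, and the month list is copied exactly as written in the Search above (so it has no Jun, Jul, or Sep):

```python
import re

# Regex #2: an abbreviated month followed directly by a comma.
ABBREV_COMMA = re.compile(r"(Jan|Feb|Mar|Apr|Aug|Sept|Oct|Nov|Dec),")

def fix_month_commas(text: str) -> str:
    """Replace the comma after an abbreviated month with a period."""
    return ABBREV_COMMA.sub(r"\1.", text)

print(fix_month_commas("Jan, 17"))  # Jan. 17
print(fix_month_commas("Dec, 15"))  # Dec. 15
```

The pattern can't tell an OCR typo from a comma that is supposed to be there, so it's safer to step through the matches one-by-one than to hit Replace All blindly.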

Last edited by Tex2002ans; 06-21-2022 at 07:34 PM.