Quote:
Originally Posted by jordy1955
I have some eBooks that were clearly produced by less than spectacular OCR software.
[...]
One of the main problems is line breaks in the wrong places (eg in the middle of a sentence), making the text very difficult to follow.
I've written about this many times over the years. Here are 2 of the topics:
Also, you may be interested in this thread:
where I broke down 5 different Regexes + color-coordinated them + explained them step-by-step.
Quote:
Originally Posted by jordy1955
Awesome stuff guys. Just ran it on a book and - once I got my head around it properly - I completed the editing and re-formatting in about 1hr - about 4 hours less than it usually takes me.
I'll get much quicker with practice but this is great.
Regular Expressions are amazing.
When you learn to search (and replace) via patterns, you can save SO MUCH TIME compared to the old way of doing searches one-by-one.
A few helpful ones I've used are:
Regex #1 (Full Month + Day)
Search: (January|February|March|April|May|June|July|August|September|October|November|December) (\d{1,2}),
It looks for:
- "January" OR "February" OR "March" OR [...] "December"
- + a space
- + 1 or 2 digits in a row
- + a comma
which matches:
- January 17,
- February 20,
- December 15,
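If you'd rather try Regex #1 in a script than in an editor's Find box, here is a minimal Python sketch using the standard re module. The name MONTH_DAY and the sample sentence are made up for illustration, and the pattern is written with no stray spaces inside the list of months (a space there would become part of what has to match).

```python
import re

# Regex #1: full month name, a space, 1 or 2 digits, then a comma.
# MONTH_DAY is just an illustrative name.
MONTH_DAY = re.compile(
    r"(January|February|March|April|May|June|July|August"
    r"|September|October|November|December) (\d{1,2}),"
)

# Made-up sample text, only for demonstration.
sample = "Letters dated January 17, 1801 and December 15, 1802 survive."

# Collect the full text of every match.
matches = [m.group(0) for m in MONTH_DAY.finditer(sample)]
print(matches)  # ['January 17,', 'December 15,']
```

Group 1 holds the month and Group 2 holds the day, which is what makes the replace tricks possible.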
* * * * *
Side Note #1: You could easily replace that with a:
Replace: \2 \1
to change it into a "flip the date from American -> British" regex:
- March 16, 1999 -> 16 March 1999
- October 1, 1776 -> 1 October 1776
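The date flip can be sketched the same way in Python, where re.sub accepts the same \2 \1 backreferences in the replacement string. The function name flip_date is hypothetical, and this assumes the dates in the book are always written "Month Day, Year".

```python
import re

# Same search as Regex #1: Group 1 = month, Group 2 = day, then a comma.
MONTH_DAY = re.compile(
    r"(January|February|March|April|May|June|July|August"
    r"|September|October|November|December) (\d{1,2}),"
)

def flip_date(text: str) -> str:
    """Swap 'Month Day,' into 'Day Month' (the comma is dropped)."""
    return MONTH_DAY.sub(r"\2 \1", text)

print(flip_date("March 16, 1999"))   # 16 March 1999
print(flip_date("October 1, 1776"))  # 1 October 1776
```

Because the comma is part of the match but not part of the replacement, it disappears, and the year that follows is left untouched.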
* * * * *
Regex #2 (Shortened Month + Comma) (Typo)
Search: (Jan|Feb|Mar|Apr|Aug|Sept|Oct|Nov|Dec),
Replace: \1.
It looks for:
- "Jan" OR "Feb" OR [...] "Dec"
- + a comma
and Replaces with:
- Whatever month got captured in Group 1.
- + a period.
which changes:
- Jan, 17 -> Jan. 17
- Feb, 20 -> Feb. 20
- Dec, 15 -> Dec. 15
Quite common in OCR, where a speck of dust can easily change a period into a comma, and it's even a common error found in tables/footnotes.
(One of the books I worked on was a multi-volume Thomas Jefferson book which cited dates of every written letter... SO many references had that typo in there!)
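Regex #2 as a rough Python sketch, for anyone batch-fixing files outside an editor. The names SHORT_MONTH_COMMA and fix_month_comma are made up for illustration. The pattern is copied as-is, so it only knows the abbreviations listed in the Search above.

```python
import re

# Regex #2: a shortened month (captured in Group 1) followed by a comma.
SHORT_MONTH_COMMA = re.compile(r"(Jan|Feb|Mar|Apr|Aug|Sept|Oct|Nov|Dec),")

def fix_month_comma(text: str) -> str:
    """Replace the comma after a shortened month with a period."""
    return SHORT_MONTH_COMMA.sub(r"\1.", text)

print(fix_month_comma("Jan, 17"))  # Jan. 17
print(fix_month_comma("Dec, 15"))  # Dec. 15
```

The pattern has no word boundary, so it would also hit a capitalized "Mar," or "Dec," at the end of a longer word. Putting \b in front of the group is one possible tightening, but as always with OCR cleanup, it's safer to step through the matches than to blindly Replace All.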