12-07-2023, 11:19 PM   #3
Tex2002ans
Quote:
Originally Posted by democrite
I was converting from PDF and the resulting EPUB has numerous end of line word breaks such as “In - ternally” or “red-and- white”.
See my ultimate summary of this "split paragraphs" problem:

But here's a general breakdown of the stages/passes you should do:

Step 1: Split Paragraphs and Hyphens At End of Lines

When dealing with split paragraphs, I use the 3 regexes I explained in:

First one takes care of hyphens at the end of lines. (And I explain all sorts of edge-cases you will run across and need to pay attention to.)
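
As a rough illustration of that first pass, here is a minimal Python sketch. The pattern `EOL_HYPHEN` and the function `join_eol_hyphens` are hypothetical names made up for this example, and this is not one of the 3 regexes referred to above. It assumes a lowercase fragment, a hyphen, a line break, then a lowercase continuation:

```python
import re

# Hypothetical pattern (NOT the regex from the linked post):
# lowercase fragment + hyphen + line break + lowercase continuation.
EOL_HYPHEN = re.compile(r"([a-z]+)-\n([a-z]+)")

def join_eol_hyphens(text: str) -> str:
    """Join words that were split by a hyphen at the end of a line."""
    return EOL_HYPHEN.sub(r"\1\2", text)

sample = "The system is in-\nternally consistent.\nA red-and-\nwhite flag."
print(join_eol_hyphens(sample))
```

Note the edge case in the sample: "in-\nternally" is joined correctly, but a genuine compound split at the line end ("red-and-\nwhite") loses a hyphen it should have kept. That is the kind of case you need to watch for, and why the later passes exist.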

Step 2: Spellcheck Lists (All Hyphenated Words)

I still make heavy use of the trick I wrote about in:

which is...

1. Use "Spellcheck Lists":
  • Sigil = Tools > Spellcheck > Spellcheck (Alt+Q)
  • Calibre = Tools > Check Spelling (Alt+F7)

2. Type in a single HYPHEN into the "Search" box.

This will give you a fully searchable/sortable list of every single hyphenated word in the book.
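
If you are working outside Sigil or Calibre, a similar list can be approximated with a short script. This is only a sketch, not how either tool's Spellcheck list works internally. It assumes a "hyphenated word" means letters joined by one or more hyphens, and `hyphenated_words` is a name invented for the example:

```python
import re
from collections import Counter

# Assumption: a hyphenated word is letters joined by one or more hyphens.
HYPHENATED = re.compile(r"\b[A-Za-z]+(?:-[A-Za-z]+)+\b")

def hyphenated_words(text: str) -> Counter:
    """Count every hyphenated word in the text (case-insensitive)."""
    return Counter(m.group(0).lower() for m in HYPHENATED.finditer(text))

sample = "A red-and-white flag. The In-ternally split word. Another red-and-white one."
for word, count in sorted(hyphenated_words(sample).items()):
    print(f"{count:3d}  {word}")
```

Sorting the output makes the odd ones (like "in-ternally") easier to spot next to the legitimate compounds.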

Step 3: Search for HYPHEN + SPACE

And Replace with HYPHEN (or NOTHING).

You'll have to go through the entire book on a case-by-case basis. There shouldn't be many left. This will catch your leftover examples like:

Code:
red-and- white
case-by- case
In- ternally
(For the 1st and 2nd, replace with HYPHEN, which drops only the space. For the 3rd, replace with NOTHING.)
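
To review those leftovers outside the editor, the same search can be sketched in Python. It only lists the hits with some surrounding context and replaces nothing, since each one needs a manual decision. The pattern and the `find_hyphen_space` helper are assumptions for this example:

```python
import re

# A hyphen followed by one or more spaces, sitting between two letters.
HYPHEN_SPACE = re.compile(r"(?<=[A-Za-z])- +(?=[A-Za-z])")

def find_hyphen_space(text: str, width: int = 15):
    """Return (offset, context) pairs for every HYPHEN + SPACE hit."""
    hits = []
    for m in HYPHEN_SPACE.finditer(text):
        start = max(0, m.start() - width)
        end = min(len(text), m.end() + width)
        hits.append((m.start(), text[start:end]))
    return hits

sample = "A red-and- white flag, judged case-by- case, was In- ternally stored."
for offset, context in find_hyphen_space(sample):
    print(offset, repr(context))
```

Legitimate suspended hyphens (for example "pre- and post-war") would show up as hits too, which is another reason to decide each one by eye.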

Quote:
Originally Posted by democrite
It seems calibre has a function mode search and replace example.
If I remember correctly:
  • Calibre will look for a hyphenated word.
  • If the unhyphenated version exists in the dictionary, it will auto-delete the hyphen.

But doing such a mass correction can accidentally remove many correct ones too.
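
To show why, here is a toy sketch of that dictionary-lookup idea. It is not Calibre's actual code. The tiny `DICTIONARY` set stands in for a real spelling dictionary, and `dehyphenate` is a made-up helper:

```python
import re

# Toy word list standing in for a real spelling dictionary (assumption).
DICTIONARY = {"internally", "recover", "resign"}

HYPHENATED_PAIR = re.compile(r"\b([A-Za-z]+)-([A-Za-z]+)\b")

def dehyphenate(text: str, dictionary: set) -> str:
    """Drop the hyphen whenever the joined word exists in the dictionary."""
    def fix(m):
        joined = m.group(1) + m.group(2)
        return joined if joined.lower() in dictionary else m.group(0)
    return HYPHENATED_PAIR.sub(fix, text)

# Intended fix: the OCR break is repaired.
print(dehyphenate("It was in-ternally damaged.", DICTIONARY))
# Unintended: "re-cover" and "re-sign" are real words with different meanings.
print(dehyphenate("She had to re-cover the sofa and re-sign the lease.", DICTIONARY))
```

The second sentence comes out as "recover the sofa and resign the lease", which is valid English with the wrong meaning, so no spellchecker will flag it afterwards.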

I personally do everything one-by-one, on a case-by-case basis. Doesn't take very long using the methods above.

The first 3 Regexes should take care of the VAST majority of OCR errors very quickly + in a few button presses.

The rest require manual looking/comparing, and I would never trust a "Replace All".

Quote:
Originally Posted by democrite
If there’s any other editor or tool that someone knows of, that’d be too helpful.
I've written about this a bajillion times over the years.

In your favorite search engine, type:

Code:
hyphens regex Tex2002ans site:mobileread.com
Recently, I even wrote a ton of methods on how to do this in LibreOffice too:

That can also be found by typing this into your favorite search engine:

Code:
regex Tex2002ans site:reddit.com
newspapers Tex2002ans site:reddit.com
(Hyphens become MUCH worse in skinny columns, so I often explain correcting linebreak examples by using "newspapers".)

Last edited by Tex2002ans; 12-07-2023 at 11:29 PM.