MobileRead Forums - View Single Post - Fixing hyphenation or word breaks from PDF conversion

Tex2002ans · 12-07-2023, 11:19 PM

Quote:

Originally Posted by democrite

I was converting from PDF and the resulting EPUB has numerous end of line word breaks such as “In - ternally” or “red-and- white”.

See my ultimate summary of this "split paragraphs" problem:

2022: "False paragraph breaks & RegEx"

But here's a general breakdown of the stages/passes you should do:

Step 1: Split Paragraphs and Hyphens At End of Lines

When dealing with split paragraphs, I use the 3 regexes I explained in:

2021: "Regex examples" (Post #689)

First one takes care of hyphens at the end of lines. (And I explain all sorts of edge-cases you will run across and need to pay attention to.)
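For anyone who wants to see the idea outside the editor, here is a minimal Python sketch of an end-of-line hyphen join. It is not the regex from the linked post. The pattern `EOL_HYPHEN` and the function `join_eol_hyphens` are hypothetical names, and the sketch assumes plain text with one line break after each trailing hyphen.

```python
import re

# Assumed pattern (not the one from the linked post):
# a letter, a hyphen, a line break, then a lowercase letter.
EOL_HYPHEN = re.compile(r"([A-Za-z])-\n([a-z])")

def join_eol_hyphens(text):
    """Join words that were split by a hyphen at the end of a line."""
    return EOL_HYPHEN.sub(r"\1\2", text)

sample = (
    "The device was tested in-\n"
    "ternally before release.\n"
    "It came in red-and-\n"
    "white packaging."
)

result = join_eol_hyphens(sample)
print(result)
# "in-\nternally" is correctly joined into "internally",
# but "red-and-\nwhite" is wrongly joined into "red-andwhite".
```

The second case is exactly the kind of edge case that needs attention: a blind join also removes hyphens that belonged in the word.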

Step 2: Spellcheck Lists (All Hyphenated Words)

I still make heavy use of the trick I wrote about in:

2013: "How do you deal with soft hyphens in OCR texts?"

which is...

1. Use "Spellcheck Lists":

Sigil = Tools > Spellcheck > Spellcheck (Alt+Q)
Calibre = Tools > Check Spelling (Alt+F7)

2. Type in a single HYPHEN into the "Search" box.

This will give you a fully searchable/sortable list of every single hyphenated word in the book.
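The same kind of list can be roughly approximated outside Sigil or Calibre. This is only a sketch, assuming the book text has already been extracted to a plain string; `HYPHENATED` and `hyphenated_words` are hypothetical names, and it does not reproduce either editor's spellcheck behavior.

```python
import re
from collections import Counter

# Assumed definition of a "hyphenated word": letters joined by one or more hyphens.
HYPHENATED = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)+")

def hyphenated_words(text):
    """Count every hyphenated word in the text, ignoring case."""
    return Counter(m.group(0).lower() for m in HYPHENATED.finditer(text))

sample = "A case-by-case review. The red-and-white flag. Another Case-by-case pass."

counts = hyphenated_words(sample)
# Print a sorted list with occurrence counts for manual review.
for word, count in sorted(counts.items()):
    print(f"{count:3d}  {word}")
```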

Step 3: Search for HYPHEN + SPACE

And Replace with HYPHEN (or NOTHING).

You'll have to go through the entire book on a case-by-case basis. There shouldn't be many left. This will catch your leftover examples like:

Code:

red-and- white
case-by- case
In- ternally

(1st and 2nd would still require HYPHEN. The 3rd would require NOTHING.)
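To show why this step stays manual, here is a small Python sketch that lists each HYPHEN + SPACE hit together with both possible fixes, and leaves the choice to the person reviewing. The pattern `HYPHEN_SPACE` and the function `candidates` are hypothetical names for illustration, not anything built into Sigil or Calibre.

```python
import re

# Assumed pattern: a word (possibly already hyphenated), a hyphen,
# one or more spaces, then the rest of the word.
HYPHEN_SPACE = re.compile(r"(\w+(?:-\w+)*)- +(\w+)")

def candidates(text):
    """Yield (match, fix keeping the hyphen, fix with nothing) for each hit."""
    for m in HYPHEN_SPACE.finditer(text):
        left, right = m.group(1), m.group(2)
        yield m.group(0), f"{left}-{right}", f"{left}{right}"

sample = "red-and- white\ncase-by- case\nIn- ternally"

hits = list(candidates(sample))
for found, with_hyphen, with_nothing in hits:
    print(f"{found!r}: {with_hyphen!r} or {with_nothing!r}")
```

Nothing in the code can tell which of the two fixes is right, which is why each hit needs a human decision.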

Quote:

Originally Posted by democrite

It seems calibre has a function mode search and replace example.

If I remember correctly:

Calibre will look for a hyphenated word.
If the unhyphenated version exists in the dictionary, it will auto-delete the hyphen.

But doing such a mass correction can accidentally remove many correct ones too.
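A toy Python sketch of that kind of dictionary check shows the risk. This is not Calibre's actual implementation, only an illustration of the behavior described above; `auto_dehyphenate` and the tiny `WORDS` set are made up for the example.

```python
def auto_dehyphenate(word, dictionary):
    """Drop the hyphens if the joined word is in the dictionary, else keep the word."""
    joined = word.replace("-", "")
    return joined if joined.lower() in dictionary else word

# Hypothetical mini dictionary, standing in for a real spellcheck word list.
WORDS = {"internally", "resign"}

fixed = {w: auto_dehyphenate(w, WORDS) for w in ["In-ternally", "re-sign", "red-and-white"]}
for before, after in fixed.items():
    print(before, "->", after)
# "In-ternally" becomes "Internally" (good),
# "re-sign" becomes "resign" (a different word, so a correct hyphen is lost),
# "red-and-white" is left alone because "redandwhite" is not in the dictionary.
```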

I personally do everything one-by-one, on a case-by-case basis. Doesn't take very long using the methods above.

The first 3 Regexes should take care of the VAST majority of OCR errors very quickly + in a few button presses.

The rest require manual looking/comparing, and I would never trust a "Replace All".

Quote:

Originally Posted by democrite

If there’s any other editor or tool that someone knows of, that’d be too helpful.

I've written about this a bajillion times over the years.

In your favorite search engine, type:

Code:

hyphens regex Tex2002ans site:mobileread.com

Recently, I even wrote up a ton of methods for doing this in LibreOffice too. Those can be found by typing this into your favorite search engine:

Code:

regex Tex2002ans site:reddit.com
newspapers Tex2002ans site:reddit.com

(Hyphens become MUCH worse in skinny columns, so I often explain correcting linebreak examples by using "newspapers".)