Quote:
Originally Posted by democrite
I was converting from PDF and the resulting EPUB has numerous end of line word breaks such as “In - ternally” or “red-and- white”.
|
See my ultimate summary of this "split paragraphs" problem:
But here's a general breakdown of the stages/passes you should do:
Step 1: Split Paragraphs and Hyphens At End of Lines
When dealing with split paragraphs, I use the 3 regexes I explained in:
First one takes care of hyphens at the end of lines. (And I explain all sorts of edge-cases you will run across and need to pay attention to.)
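As a rough idea of what that first pass does (this is NOT the exact regex from the linked post, just a minimal sketch in Python), assume plain text where every PDF line ended up on its own line. The function name and the sample text are made up for illustration:

```python
import re

# Hypothetical pattern: lowercase letter + HYPHEN + newline + lowercase letter
# is treated as one word that was broken at the end of a line.
EOL_HYPHEN = re.compile(r"([a-z])-\n([a-z])")

def join_eol_hyphens(text):
    """Remove the hyphen and the line break, gluing the two halves together."""
    return EOL_HYPHEN.sub(r"\1\2", text)

sample = "It was handled in-\nternally by the red-and-\nwhite team."
print(join_eol_hyphens(sample))
# -> It was handled internally by the red-andwhite team.
```

Note the second result: "red-and-white" lost a hyphen it should have kept. That is one of the edge cases you need to pay attention to, and why this pass should be reviewed instead of blindly applied.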
Step 2: Spellcheck Lists (All Hyphenated Words)
I still make heavy use of the trick I wrote about in:
which is...
1. Use "Spellcheck Lists":
- Sigil = Tools > Spellcheck > Spellcheck (Alt+Q)
- Calibre = Tools > Check Spelling (Alt+F7)
2. Type in a single HYPHEN into the "Search" box.
This will give you a fully searchable/sortable list of every single hyphenated word in the book.
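If you want the same kind of list outside of Sigil/Calibre, here is a small sketch in Python. It assumes you already have the book's text as a plain string; the function name and the sample sentence are only for illustration:

```python
import re
from collections import Counter

# A "hyphenated word" here = runs of letters joined by single hyphens,
# e.g. "case-by-case". Digits and apostrophes are ignored in this sketch.
HYPHENATED = re.compile(r"\b[A-Za-z]+(?:-[A-Za-z]+)+\b")

def hyphenated_words(text):
    """Return (word, count) pairs, sorted alphabetically, ignoring case."""
    counts = Counter(HYPHENATED.findall(text))
    return sorted(counts.items(), key=lambda item: item[0].lower())

sample = "A case-by-case review of red-and-white and In-ternally, case-by-case."
for word, count in hyphenated_words(sample):
    print(f"{count:3d}  {word}")
```

Skimming a sorted list like this makes broken words such as "In-ternally" stand out next to the legitimate ones.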
Step 3: Search for HYPHEN + SPACE
And Replace with HYPHEN (or NOTHING).
You'll have to go through the entire book on a case-by-case basis. There shouldn't be many left. This will catch your leftover examples like:
Code:
red-and- white
case-by- case
In- ternally
(1st and 2nd would still require HYPHEN. The 3rd would require NOTHING.)
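The same HYPHEN + SPACE search can be sketched in Python if you are working on plain text. This only LISTS the hits with a bit of context, so each one can still be decided by hand. The function name and the `width` parameter are assumptions for this example:

```python
import re

# Letter + HYPHEN + one or more spaces + letter: "and- white", "In- ternally".
HYPHEN_SPACE = re.compile(r"[A-Za-z]-\s+[A-Za-z]")

def hyphen_space_hits(text, width=15):
    """Return (position, surrounding text) for every HYPHEN + SPACE hit."""
    hits = []
    for m in HYPHEN_SPACE.finditer(text):
        start = max(0, m.start() - width)
        end = min(len(text), m.end() + width)
        hits.append((m.start(), text[start:end]))
    return hits

sample = "red-and- white, case-by- case, In- ternally"
for position, context in hyphen_space_hits(sample):
    print(position, repr(context))
```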
Quote:
Originally Posted by democrite
It seems calibre has a function mode search and replace example.
|
If I remember correctly:
- Calibre will look for a hyphenated word.
- If the unhyphenated version exists in the dictionary, it will auto-delete the hyphen.
But doing such a mass correction can accidentally remove many correct ones too.
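To show why, here is a toy version of that kind of dictionary check in Python. This is NOT Calibre's actual code, only a sketch of the behavior as described above, and the three-word "dictionary" is made up for the example:

```python
import re

# Toy dictionary. A real tool would load a full spelling dictionary.
DICTIONARY = {"internally", "recover", "cooperate"}

# Two runs of letters joined by a single hyphen.
HYPHENATED = re.compile(r"\b([A-Za-z]+)-([A-Za-z]+)\b")

def auto_dehyphenate(text):
    """Drop the hyphen whenever the joined word exists in the dictionary."""
    def fix(m):
        joined = m.group(1) + m.group(2)
        return joined if joined.lower() in DICTIONARY else m.group(0)
    return HYPHENATED.sub(fix, text)

print(auto_dehyphenate("In-ternally, we had to re-cover the sofa."))
# -> Internally, we had to recover the sofa.
```

"In-ternally" gets fixed, but "re-cover" (to cover again) silently turns into "recover", which is a different word. The dictionary has no way to know which one the author meant.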
I personally do everything one-by-one, on a case-by-case basis. Doesn't take very long using the methods above.
The first 3 Regexes should take care of the VAST majority of OCR errors very quickly + in a few button presses.
The rest require manual looking/comparing, and I would never trust a "Replace All".
Quote:
Originally Posted by democrite
If there’s any other editor or tool that someone knows of, that’d be too helpful.
|
I've written about this about a bajillion times over the years.
In your favorite search engine, type:
Code:
hyphens regex Tex2002ans site:mobileread.com
Recently, I even wrote a ton of methods on how to do this in LibreOffice too:
That can also be found by typing this into your favorite search engine:
Code:
regex Tex2002ans site:reddit.com
newspapers Tex2002ans site:reddit.com
(Hyphens become MUCH worse in skinny columns, so I often explain correcting linebreak examples by using "newspapers".)