Fixing hyphenation or word breaks from PDF conversion

democrite · 12-07-2023, 10:44 PM

Hello,

I was converting from PDF and the resulting EPUB has numerous end of line word breaks such as “In - ternally” or “red-and- white”. It seems calibre has a function mode search and replace example. I’ll try that soon, but I was wondering if there is a plugin for this, or whether someone could help. I’m hesitant to use something dictionary-based, but if a plugin could scan the EPUB for words to form a dictionary and then fix the breaks from that, I think that’d be preferred. If there’s any other editor or tool that someone knows of, that’d be very helpful.

Thanks.

DNSB · 12-07-2023, 11:06 PM

You might try using a regex search/replace to correct the issue.

Tex2002ans · 12-07-2023, 11:19 PM

Quote:

Originally Posted by democrite

I was converting from PDF and the resulting EPUB has numerous end of line word breaks such as “In - ternally” or “red-and- white”.

See my ultimate summary of this "split paragraphs" problem:

2022: "False paragraph breaks & RegEx"

But here's a general breakdown of the stages/passes you should do:

Step 1: Split Paragraphs and Hyphens At End of Lines

When dealing with split paragraphs, I use the 3 regexes I explained in:

2021: "Regex examples" (Post #689)

First one takes care of hyphens at the end of lines. (And I explain all sorts of edge-cases you will run across and need to pay attention to.)

Step 2: Spellcheck Lists (All Hyphenated Words)

I still make heavy use of the trick I wrote about in:

2013: "How do you deal with soft hyphens in OCR texts?"

which is...

1. Use "Spellcheck Lists":

Sigil = Tools > Spellcheck > Spellcheck (Alt+Q)
Calibre = Tools > Check Spelling (Alt+F7)

2. Type in a single HYPHEN into the "Search" box.

This will give you a fully searchable/sortable list of every single hyphenated word in the book.

Step 3: Search for HYPHEN + SPACE

And Replace with HYPHEN (or NOTHING).

You'll have to go through the entire book on a case-by-case basis. There shouldn't be many left. This will catch your leftover examples like:

Code:

red-and- white
case-by- case
In- ternally

(1st and 2nd would still require HYPHEN. The 3rd would require NOTHING.)
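A minimal Python sketch of this pass, for anyone who would rather build the review list outside the editor. It assumes the book text has already been extracted to a plain string; the pattern and the function name `hyphen_space_candidates` are illustrative, not taken from any of the linked posts. It only lists both possible repairs for each hit and changes nothing, since the choice between HYPHEN and NOTHING is still a human one.

```python
import re

# A word, an optional space, a hyphen, one or more spaces, then another word.
# Catches "and- white" as well as "In - ternally".
HYPHEN_SPACE = re.compile(r"(\w+) ?- +(\w+)")

def hyphen_space_candidates(text):
    """Return (found, with_hyphen, without_hyphen) for every HYPHEN + SPACE hit."""
    hits = []
    for m in HYPHEN_SPACE.finditer(text):
        left, right = m.group(1), m.group(2)
        hits.append((m.group(0), left + "-" + right, left + right))
    return hits

sample = "a red-and- white flag, case-by- case, In- ternally"
for found, with_hyphen, without_hyphen in hyphen_space_candidates(sample):
    # Print both options so each one can be decided by eye.
    print(f"{found!r}: {with_hyphen!r} or {without_hyphen!r}?")
```

Note that a legitimate spaced dash ("word - word") also matches this pattern, which is one more reason to review the list instead of replacing blindly.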

Quote:

Originally Posted by democrite

It seems calibre has a function mode search and replace example.

If I remember correctly:

Calibre will look for a hyphenated word.
If the unhyphenated version exists in the dictionary, it will auto-delete the hyphen.

But doing such a mass correction can accidentally remove many correct ones too.
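To make the risk concrete, here is a small standalone Python sketch of the dictionary idea. It is not calibre's code: the dictionary is just a hypothetical set of lowercase words and `dehyphenate_by_dictionary` is a made-up name. The second word in the sample shows how a correct hyphen gets removed simply because the joined form happens to be a real word.

```python
import re

# Two word fragments joined by a hyphen, with no spaces around it.
HYPHENATED = re.compile(r"(\w+)-(\w+)")

def dehyphenate_by_dictionary(text, known_words):
    """Drop the hyphen whenever the joined form is in known_words (lowercase set)."""
    def fix(m):
        joined = m.group(1) + m.group(2)
        # Keep the original text unless the dictionary recognizes the joined word.
        return joined if joined.lower() in known_words else m.group(0)
    return HYPHENATED.sub(fix, text)

known = {"internally", "resign"}  # stand-in for a real dictionary
print(dehyphenate_by_dictionary("In-ternally, she chose to re-sign the contract.", known))
# prints: Internally, she chose to resign the contract.
```

"In-ternally" is repaired correctly, but "re-sign" silently becomes "resign", which changes the meaning of the sentence.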

I personally do everything one-by-one, on a case-by-case basis. Doesn't take very long using the methods above.

The first 3 Regexes should take care of the VAST majority of OCR errors very quickly + in a few button presses.

The rest require manual looking/comparing, and I would never trust a "Replace All".

Quote:

Originally Posted by democrite

If there’s any other editor or tool that someone knows of, that’d be very helpful.

I've written about this about a bajillion times over the years.

In your favorite search engine, type:

Code:

hyphens regex Tex2002ans site:mobileread.com

Recently, I even wrote a ton of methods on how to do this in LibreOffice too:

That can also be found by typing this into your favorite search engine:

Code:

regex Tex2002ans site:reddit.com
newspapers Tex2002ans site:reddit.com

(Hyphens become MUCH worse in skinny columns, so I often explain correcting linebreak examples by using "newspapers".)

DiapDealer · 12-07-2023, 11:37 PM

I'm sure people get sick of hearing it, but the bottom line is that PDF is a destination format and not a source format. There is no magic formula to make converting from PDF easy, quick or accurate. I myself won't touch it with a ten-foot pole.

democrite · 12-08-2023, 12:48 AM

Thank you Tex2002ans. I'll take a look more at what you mentioned.

As there are something like 6000+ occurrences, regular regex doesn't help too much I think, particularly when wanting to verify any changes are indeed wanted, e.g. maybe there should be a hyphen and merely remove the space, or mis-match of terms against a dictionary.

As for PDF, there are works for which there is no eBook or only PDF and they are important enough that it's worth the trouble, as I do not read but study them for years.

With variations in the regex, the calibre method works well:

https://manual.calibre-ebook.com/fun...phenated-words

I then diff compare the changes, as I should be doing anyway. The results are decent. I haven't checked, but I think it uses the calibre dictionary rather than a dictionary formed from terms in the EPUB, so there's a bit more to do, but it's OK for now.

Tex2002ans · 12-08-2023, 04:45 AM

Quote:

Originally Posted by DiapDealer

I'm sure people get sick of hearing it, but the bottom line is that PDF is a destination format and not a source format. There is no magic formula to make converting from PDF easy, quick or accurate.

Yep, exactly. I just wrote about it again a few days ago too in:

/r/LibreOffice: "Hello, I am a translator and need to convert PDFs into editable documents to be able to translate."

Quote:

Originally Posted by democrite

As there are something like 6000+ occurrences, regular regex doesn't help too much I think, particularly when wanting to verify any changes are indeed wanted, e.g. maybe there should be a hyphen and merely remove the space, or mis-match of terms against a dictionary.

Sounds like the OCR tool you are using isn't the greatest. (Or isn't tuned properly.)

A lot of those "soft hyphens" should've been detected/squished at that level instead. That would've made your life a hell of a lot easier at this later stage.

As always, the higher the quality of the first stage (the OCR/text/formatting layer), the more time you'll save on all higher stages. Imagine it like a pyramid. If you have a crappy foundation, it'll take MUCH longer to clean up all the mess later.

Quote:

Originally Posted by democrite

I then diff compare the changes, as I should be doing anyway.

Yes, that is one way.

These things should be compared per book though, not just dictionary.

(One of the tools I came up with years ago compares the book against itself. All hyphenated words get unhyphenated. If it appears elsewhere in the book, report the words to me, then I could take a closer look + quickly correct.)
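The general idea of checking a book against itself can be sketched in a few lines of Python. This is not the tool mentioned above, only an illustration under simple assumptions: the book is one plain-text string, words are compared case-insensitively, and the names `WORD` and `self_check_hyphens` are hypothetical.

```python
import re
from collections import Counter

# A word, optionally containing internal hyphens ("co-op", "red-and-white").
WORD = re.compile(r"\w+(?:-\w+)*")

def self_check_hyphens(text):
    """Report hyphenated words whose unhyphenated form also occurs in the text.

    Returns {hyphenated: (its count, joined form, joined form's count)}.
    """
    counts = Counter(w.lower() for w in WORD.findall(text))
    report = {}
    for word, n in counts.items():
        if "-" in word:
            joined = word.replace("-", "")
            if joined in counts:
                report[word] = (n, joined, counts[joined])
    return report

book = "They work inter-nally. Internally, the co-op voted. The co-op grew."
print(self_check_hyphens(book))
# prints: {'inter-nally': (1, 'internally', 1)}
```

Here "inter-nally" is flagged because "Internally" appears elsewhere, while "co-op" is left alone because "coop" never occurs. The report is only a list of suspects to look at; it makes no changes.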

Personally, I err on the side of:

Correcting it with Spellcheck Lists.
Then check all remaining ones.

instead of:

"Correct" everything with dictionaries.
Spend time comparing/readding hyphens in a swarm of diffs.

To do a mass search/replace by dictionary... a lot of otherwise correct hyphens would get changed by accident.

Doing it the "slower way" allows me to catch lots of other PDF issues too (like bad pagebreaks, footnotes-in-the-middle-of-text, etc.) + see more patterns in the book itself.

- - -

Side Note #1: For example, last month I worked on a book written by a British author.

They insisted on non-hyphenated versions of "co-op" words:

coopt
cooption
coopted

I recommended a normalization to hyphenated:

co-opt
co-option
co-opted

(See Google N-grams comparing hyphenated vs. non-hyphenated ones.)

While 14/15 cases would've worked fine using my way... then there was an extremely awkward:

Pharma-coopted

which looked EXTREMELY odd with:

Pharma-co-opted

This meant I had to apply the same rule to ALL "co-" words throughout the book! Not just that single word/location.

If you had that change, buried within 6000 other ones, you probably would've never noticed this issue. :P

Because I was treating all "coop"/"co-op" words in the same pass, I was able to see all 15 at once in the Spellcheck Lists, then take a much closer look at each case.

- - -

Side Note #1.1: If you want more on hyphenation dropping out of popular words over time ("cooperation" vs. "co-operation" / "coöperation") or extremely rare "to-" words that don't exist anymore... see my posts in:

2022: "The end of "THE END"?"

One of the common ones people complain about from old books is "to-day" and "to-morrow".

democrite · 12-09-2023, 05:40 PM

Thanks for the continued help. It wasn't OCR; it was a text export from a commercial PDF eBook. It's multilanguage, with one language in particular that I didn't think would OCR well, so I'm stuck with figuring out how to automate this, or scripting fixes for the other issues in some language.

JSWolf · 12-09-2023, 05:43 PM

Quote:

Originally Posted by democrite

Hello,

I was converting from PDF and the resulting EPUB has numerous end of line word breaks such as “In - ternally” or “red-and- white”. It seems calibre has a function mode search and replace example. I’ll try that soon, but I was wondering if there is a plugin for this, or whether someone could help. I’m hesitant to use something dictionary-based, but if a plugin could scan the EPUB for words to form a dictionary and then fix the breaks from that, I think that’d be preferred. If there’s any other editor or tool that someone knows of, that’d be very helpful.

Thanks.

You really should A/B compare the PDF to the output. Even if you use all the tricks posted in this thread, there will still be errors. That means A/B comparing every character, every space, every punctuation mark, everything.

Tex2002ans · 12-09-2023, 09:19 PM

Quote:

Originally Posted by democrite

It wasn't OCR; it was a text export from a commercial PDF eBook. It's multilanguage, with one language in particular that I didn't think would OCR well, so I'm stuck with figuring out how to automate this, or scripting fixes for the other issues in some language.

What language? Share a sample.

democrite · 12-10-2023, 07:31 AM

A different language plus specialized scientific terms for which perhaps I couldn't find a dictionary.

Exported PDF to EPUB with one of the numerous cheapo apps. Results not bad though took a few days to get into decent shape and could take months to fix. A work that is worth it to me as I'll study the subject as I fix.

I was able to take the EPUB, convert to txt in calibre, make a calibre custom dictionary, and use the calibre function from above. That takes care of most terms. There are still word breaks between paragraphs, e.g. "some-" then next paragraph "day". I haven't been able to figure out a regex for that.
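One possible starting point for the "some-" / "day" case, as a hedged Python sketch. It assumes the break sits between two plain `<p>` elements with no inline tags (such as `<span>` or `<i>`) wrapped around the split word; real conversions are often messier, so the pattern would need adjusting. The names `SPLIT_PARA` and `join_split_paragraphs` are made up for the example.

```python
import re

# A word ending in a hyphen at the very end of one paragraph,
# followed directly by the start of the next paragraph.
SPLIT_PARA = re.compile(r"(\w+)-\s*</p>\s*<p\b[^>]*>\s*(\w+)")

def join_split_paragraphs(xhtml, keep_hyphen=False):
    """Merge the two paragraphs and rejoin the word, with or without the hyphen."""
    sep = "-" if keep_hyphen else ""
    return SPLIT_PARA.sub(lambda m: m.group(1) + sep + m.group(2), xhtml)

sample = "<p>We will meet some-</p>\n<p>day soon.</p>"
print(join_split_paragraphs(sample))
# prints: <p>We will meet someday soon.</p>
```

Whether the hyphen should stay is still a per-word decision, so it is safer to step through the matches one at a time than to apply this to the whole book at once.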

Possibly I'll do more PDF conversions and there are professional apps that publishers use to import to a desktop publishing app, e.g. perhaps if they are left with only some print, old proof, and need to reprint to revise. I haven't tried such yet I imagine that they'd significantly reduce effort and perhaps if publishers rely on them, they might be not bad.

KevinH · 12-10-2023, 08:38 AM

This entire thread really belongs in the epub forum not the Sigil forum. There are many tools including regex and the ability to create a table of before and after potential replacements that can be easily scrolled over and just the replacements you do not want removed. There are also regex python replacement functions (built-in or via plugin) that can be used as well.

Tex2002ans · 12-10-2023, 01:48 PM

Quote:

Originally Posted by KevinH

This entire thread really belongs in the epub forum not the Sigil forum.

Yes, I'd say move it to the EPUB (or Workshop) section.

Quote:

Originally Posted by KevinH

There are many tools including regex and the ability to create a table of before and after potential replacements that can be easily scrolled over and just the replacements you do not want removed. There are also regex python replacement functions (built-in or via plugin) that can be used as well.

Can you describe some of your ideas?

Quote:

Originally Posted by democrite

A different language plus specialized scientific terms for which perhaps I couldn't find a dictionary.

Which language? Which words? It's like we're pulling teeth here! The more you can share, the easier it'll be to tackle your specific issues.

Quote:

Originally Posted by democrite

Exported PDF to EPUB with one of the numerous cheapo apps.

And there's the root cause of a lot of this. Soft hyphen detection is key.

So many of the crappy PDF->something apps just treat all "line-ending hyphens" as "hard hyphens", so they'll appear in the EPUB. And as you can see, that produces THOUSANDS of them that you'll have to correct.

In the case of Finereader, it narrows it down to a handful.

democrite · 12-10-2023, 06:24 PM

I was doing what I could with what I had. I've created many EPUBs and, as they say, this is not my first rodeo. I usually use ABBYY but wanted a text export and the best way to do that. I am decently aware of alternatives but was asking for just this. I'll deal with any issues. Next time perhaps I need to get one of those pricey Quark or InDesign plugins, as they've been around for years and perhaps deal decently with PDF conversion.

Turtle91 · 12-10-2023, 06:36 PM

Quote:

Originally Posted by Tex2002ans

Can you describe some of your ideas?

I think he is referring to the existing capabilities in Sigil.


