Old 12-07-2023, 10:44 PM   #1
democrite
Evangelist
Posts: 425
Karma: 77256
Join Date: Sep 2011
Device: none
Fixing hyphenation or word breaks from PDF conversion

Hello,

I was converting from PDF and the resulting EPUB has numerous end of line word breaks such as “In - ternally” or “red-and- white”. It seems calibre has a function mode search and replace example. I’ll try such soon yet I was wondering if there was some plugin or perhaps someone could help. I’m kind of hesitant to use something dictionary based yet if a plugin could scan the EPUB for words to form a dictionary and then fix such from that, I think that’d be preferred. If there’s any other editor or tool that someone knows of, that’d be too helpful.

Thanks.
Old 12-07-2023, 11:06 PM   #2
DNSB
Bibliophagist
Posts: 35,464
Karma: 145525534
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Forma, Clara HD, Lenovo M8 FHD, Paperwhite 4, Tolino epos
You might try using a regex search/replace to correct the issue.
Old 12-07-2023, 11:19 PM   #3
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by democrite View Post
I was converting from PDF and the resulting EPUB has numerous end of line word breaks such as “In - ternally” or “red-and- white”.
See my ultimate summary of this "split paragraphs" problem:

But here's a general breakdown of the stages/passes you should do:

Step 1: Split Paragraphs and Hyphens At End of Lines

When dealing with split paragraphs, I use the 3 regexes I explained in:

First one takes care of hyphens at the end of lines. (And I explain all sorts of edge-cases you will run across and need to pay attention to.)

Step 2: Spellcheck Lists (All Hyphenated Words)

I still make heavy use of the trick I wrote about in:

which is...

1. Use "Spellcheck Lists":
  • Sigil = Tools > Spellcheck > Spellcheck (Alt+Q)
  • Calibre = Tools > Check Spelling (Alt+F7)

2. Type in a single HYPHEN into the "Search" box.

This will give you a fully searchable/sortable list of every single hyphenated word in the book.

Step 3: Search for HYPHEN + SPACE

And Replace with HYPHEN (or NOTHING).

You'll have to go through the entire book on a case-by-case basis. There shouldn't be many left. This will catch your leftover examples like:

Code:
red-and- white
case-by- case
In- ternally
(1st and 2nd would still require HYPHEN. The 3rd would require NOTHING.)
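
A minimal sketch of this HYPHEN + SPACE pass in plain Python, assuming the book text has already been read into a string. The function name `hyphen_space_candidates` is made up for illustration. It only lists each match with both possible fixes (keep the hyphen, or drop it) so each one can be decided by hand; it does not change anything.

```python
import re

# A hyphen followed by whitespace between two word fragments, e.g. "by- case".
# Whitespace before the hyphen is allowed too, e.g. "In - ternally".
HYPHEN_SPACE = re.compile(r"(\w+)(\s*-\s+)(\w+)")

def hyphen_space_candidates(text):
    """Return (original, keep_hyphen, drop_hyphen) for every match, for manual review."""
    rows = []
    for m in HYPHEN_SPACE.finditer(text):
        left, right = m.group(1), m.group(3)
        rows.append((m.group(0), f"{left}-{right}", f"{left}{right}"))
    return rows

sample = "red-and- white, case-by- case, In- ternally"
for row in hyphen_space_candidates(sample):
    print(row)
```

Only the fragment touching the broken hyphen is reported (for example `and- white` rather than the whole `red-and- white`), which is enough to locate and judge each case.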

Quote:
Originally Posted by democrite View Post
It seems calibre has a function mode search and replace example.
If I remember correctly:
  • Calibre will look for a hyphenated word.
  • If the unhyphenated version exists in the dictionary, it will auto-delete the hyphen.

But doing such a mass correction can accidentally remove many correct ones too.
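
To illustrate that risk, here is a rough sketch of dictionary-driven dehyphenation in plain Python. It is not calibre's implementation: `dehyphenate_by_dictionary` and the tiny `known` word set are hypothetical stand-ins for a real dictionary lookup.

```python
import re

# Two word fragments joined by a hyphen, with optional stray whitespace.
HYPHENATED = re.compile(r"(\w+)\s*-\s*(\w+)")

def dehyphenate_by_dictionary(text, known_words):
    """Drop the hyphen whenever the joined word is found in known_words (a set)."""
    def fix(m):
        joined = m.group(1) + m.group(2)
        # Blindly trust the dictionary: this is exactly the risky part.
        return joined if joined.lower() in known_words else m.group(0)
    return HYPHENATED.sub(fix, text)

known = {"internally", "resign"}
print(dehyphenate_by_dictionary("In- ternally, we asked her to re-sign.", known))
# prints: Internally, we asked her to resign.
```

"In- ternally" is repaired correctly, but "re-sign" silently becomes "resign" because both spellings are valid words with different meanings.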

I personally do everything one-by-one, on a case-by-case basis. Doesn't take very long using the methods above.

The first 3 Regexes should take care of the VAST majority of OCR errors very quickly + in a few button presses.

The rest require manual looking/comparing, and I would never trust a "Replace All".

Quote:
Originally Posted by democrite View Post
If there’s any other editor or tool that someone knows of, that’d be too helpful.
I've written about this about a bajillion times over the years.

In your favorite search engine, type:

Code:
hyphens regex Tex2002ans site:mobileread.com
Recently, I even wrote a ton of methods on how to do this in LibreOffice too:

That can also be found by typing this into your favorite search engine:

Code:
regex Tex2002ans site:reddit.com
newspapers Tex2002ans site:reddit.com
(Hyphens become MUCH worse in skinny columns, so I often explain correcting linebreak examples by using "newspapers".)

Last edited by Tex2002ans; 12-07-2023 at 11:29 PM.
Old 12-07-2023, 11:37 PM   #4
DiapDealer
Grand Sorcerer
Posts: 27,552
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
I'm sure people get sick of hearing it, but the bottom line is that PDF is a destination format and not a source format. There is no magic formula to make converting from PDF easy, quick or accurate. I myself won't touch it with a ten-foot pole.
Old 12-08-2023, 12:48 AM   #5
democrite
Evangelist
Thank you Tex2002ans. I'll take a look more at what you mentioned.

As there are something like 6000+ occurrences, regular regex doesn't help too much, I think, particularly when wanting to verify that any changes are indeed wanted, e.g. maybe the hyphen should stay and only the space be removed, or a term doesn't match the dictionary.

As for PDF, there are works for which there is no eBook or only PDF and they are important enough that it's worth the trouble, as I do not read but study them for years.

With variations in the regex, the calibre method works well:

https://manual.calibre-ebook.com/fun...phenated-words

I then diff compare the changes, as I should be doing anyway. The results are decent. I haven't checked, but I think it uses the calibre dictionary rather than a dictionary formed from terms in the EPUB, so there's a bit more to do, but it's OK for now.
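
A minimal sketch of that diff step using Python's standard `difflib`, assuming the text before and after the bulk replace is available as two strings. The function name `show_changes` and the file name `chapter.xhtml` are placeholders, not anything calibre provides.

```python
import difflib

def show_changes(before, after, name="chapter.xhtml"):
    """Return a unified diff of the text before and after a bulk replace."""
    diff = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile=f"{name} (before)",
        tofile=f"{name} (after)",
        lineterm="",
    )
    return "\n".join(diff)

before = "<p>In- ternally, the red-and- white flag.</p>"
after = "<p>Internally, the red-and-white flag.</p>"
print(show_changes(before, after))
```

Lines starting with `-` are the original text and lines starting with `+` are the changed text, so each automatic fix can be accepted or reverted by eye.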
Old 12-08-2023, 04:45 AM   #6
Tex2002ans
Wizard
Quote:
Originally Posted by DiapDealer View Post
I'm sure people get sick of hearing it, but the bottom line is that PDF is a destination format and not a source format. There is no magic formula to make converting from PDF easy, quick or accurate.
Yep, exactly. I just wrote about it again a few days ago too in:

Quote:
Originally Posted by democrite View Post
As there are something like 6000+ occurrences, regular regex doesn't help too much, I think, particularly when wanting to verify that any changes are indeed wanted, e.g. maybe the hyphen should stay and only the space be removed, or a term doesn't match the dictionary.
Sounds like the OCR tool you are using isn't the greatest. (Or isn't tuned properly.)

A lot of those "soft hyphens" should've been detected/squished at that level instead. That would've made your life a hell of a lot easier at this later stage.

As always, the higher quality the first stage—the OCR/text/formatting layer—the more time you'll save on all higher stages. Imagine it like a pyramid. If you have crappy foundation, it'll take MUCH longer to clean up all the mess later.

Quote:
Originally Posted by democrite View Post
I then diff compare the changes, as I should be doing anyway.
Yes, that is one way.

These things should be compared per book though, not just dictionary.

(One of the tools I came up with years ago compares the book against itself. All hyphenated words get unhyphenated. If it appears elsewhere in the book, report the words to me, then I could take a closer look + quickly correct.)
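
The idea of comparing the book against itself can be sketched roughly as follows. This is not the actual tool mentioned above, just a guess at the technique in plain Python; `self_check_hyphens` is a hypothetical name, and the text is assumed to be plain text with the markup already stripped.

```python
import re
from collections import Counter

# Runs of letters, optionally joined by single hyphens ("co-operation").
WORD = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")

def self_check_hyphens(text):
    """Map each hyphenated word to how often its joined form occurs in the same text."""
    counts = Counter(w.lower() for w in WORD.findall(text))
    report = {}
    for word in counts:
        if "-" in word:
            joined = word.replace("-", "")
            hits = counts.get(joined, 0)
            if hits:
                report[word] = hits
    return report

sample = "The co-operation was good. Cooperation matters. A well-known case."
print(self_check_hyphens(sample))
# prints: {'co-operation': 1}
```

Only hyphenated words whose joined form also appears in the book are reported, so the list to inspect by hand stays short and is specific to the book's own vocabulary.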

Personally, I err on the side of:
  • Correcting it with Spellcheck Lists.
  • Then check all remaining ones.

instead of:
  • "Correct" everything with dictionaries.
  • Spend time comparing/readding hyphens in a swarm of diffs.

To do a mass search/replace by dictionary... a lot of otherwise correct hyphens would get changed by accident.

Doing it the "slower way" allows me to catch lots of other PDF issues too (like bad pagebreaks, footnotes-in-the-middle-of-text, etc.) + see more patterns in the book itself.

- - -

Side Note #1: For example, last month I worked on a book written by a British author.

They insisted on non-hyphenated versions of "co-op" words:
  • coopt
  • cooption
  • coopted

I recommended a normalization to hyphenated:
  • co-opt
  • co-option
  • co-opted

(See Google N-grams comparing hyphenated vs. non-hyphenated ones.)

While 14/15 cases would've worked fine using my way... then there was an extremely awkward:
  • Pharma-coopted

which looked EXTREMELY odd with:
  • Pharma-co-opted

This meant I had to apply the same rule to ALL "co-" words throughout the book! Not just that single word/location.

If you had that change, buried within 6000 other ones, you probably would've never noticed this issue. :P

Because I was treating all "coop"/"co-op" words in the same pass, I was able to see all 15 at once in the Spellcheck Lists, then take a much closer look at each case.

- - -

Side Note #1.1: If you want more on hyphenation dropping out of popular words over time ("cooperation" vs. "co-operation" / "coöperation") or extremely rare "to-" words that don't exist anymore... see my posts in:

One of the common ones people complain about from old books is "to-day" and "to-morrow".

Last edited by Tex2002ans; 12-08-2023 at 04:51 AM.
Old 12-09-2023, 05:40 PM   #7
democrite
Evangelist
Thanks for the continued help. It wasn't OCR; it was a text export from a commercial PDF eBook. It's multilanguage, with one language in particular that I didn't think would OCR well, so I'm stuck with figuring out how to automate this or how to script fixes for the other issues in some language.
Old 12-09-2023, 05:43 PM   #8
JSWolf
Resident Curmudgeon
Posts: 74,015
Karma: 129333114
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
Originally Posted by democrite View Post
Hello,

I was converting from PDF and the resulting EPUB has numerous end of line word breaks such as “In - ternally” or “red-and- white”. It seems calibre has a function mode search and replace example. I’ll try such soon yet I was wondering if there was some plugin or perhaps someone could help. I’m kind of hesitant to use something dictionary based yet if a plugin could scan the EPUB for words to form a dictionary and then fix such from that, I think that’d be preferred. If there’s any other editor or tool that someone knows of, that’d be too helpful.

Thanks.
You really should A/B compare the PDF to the output. Even if you use all the tricks posted in this thread, there will still be errors. That means A/B comparing every character, every space, every punctuation mark, everything.
Old 12-09-2023, 09:19 PM   #9
Tex2002ans
Wizard
Quote:
Originally Posted by democrite View Post
It wasn't OCR; it was a text export from a commercial PDF eBook. It's multilanguage, with one language in particular that I didn't think would OCR well, so I'm stuck with figuring out how to automate this or how to script fixes for the other issues in some language.
What language? Share a sample.
Old 12-10-2023, 07:31 AM   #10
democrite
Evangelist
A different language plus specialized scientific terms for which perhaps I couldn't find a dictionary.

Exported PDF to EPUB with one of the numerous cheapo apps. Results not bad though took a few days to get into decent shape and could take months to fix. A work that is worth it to me as I'll study the subject as I fix.

I was able to take the EPUB, convert to txt in calibre, make a calibre custom dictionary, and use the calibre function from above. That takes care of most terms. There's still word breaks between paragraphs e.g. "some-" next paragraph "day". I haven't been able to figure out a regex for that.
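
One possible starting point for the paragraph-spanning case, as a sketch in plain Python rather than a tested recipe. It assumes the break sits directly between a closing `</p>` and the next `<p ...>` with no inline tags (such as `<span>` or `<i>`) at the boundary; the names `find_split_words` and `join_split_paragraphs` are made up for illustration.

```python
import re

# A word fragment ending in a hyphen right before </p>, followed by the
# fragment that opens the next <p ...> element.
SPLIT_ACROSS_P = re.compile(r"(\w+)-\s*</p>\s*<p\b[^>]*>\s*(\w+)")

def find_split_words(xhtml):
    """List (left, right) fragments so each candidate can be reviewed first."""
    return [(m.group(1), m.group(2)) for m in SPLIT_ACROSS_P.finditer(xhtml)]

def join_split_paragraphs(xhtml, keep_hyphen=False):
    """Merge the two paragraphs, dropping (or keeping) the line-ending hyphen."""
    sep = "-" if keep_hyphen else ""
    return SPLIT_ACROSS_P.sub(lambda m: m.group(1) + sep + m.group(2), xhtml)

sample = '<p>We will meet some-</p>\n<p class="body">day soon.</p>'
print(find_split_words(sample))       # prints: [('some', 'day')]
print(join_split_paragraphs(sample))  # prints: <p>We will meet someday soon.</p>
```

The same search pattern should also work in an editor's regex search if you prefer to step through the matches one at a time; whether the hyphen stays still has to be decided per word.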

Possibly I'll do more PDF conversions and there are professional apps that publishers use to import to a desktop publishing app, e.g. perhaps if they are left with only some print, old proof, and need to reprint to revise. I haven't tried such yet I imagine that they'd significantly reduce effort and perhaps if publishers rely on them, they might be not bad.
Old 12-10-2023, 08:38 AM   #11
KevinH
Sigil Developer
Posts: 7,647
Karma: 5433388
Join Date: Nov 2009
Device: many
This entire thread really belongs in the epub forum not the Sigil forum. There are many tools including regex and the ability to create a table of before and after potential replacements that can be easily scrolled over and just the replacements you do not want removed. There are also regex python replacement functions (built-in or via plugin) that can be used as well.
Old 12-10-2023, 01:48 PM   #12
Tex2002ans
Wizard
Quote:
Originally Posted by KevinH View Post
This entire thread really belongs in the epub forum not the Sigil forum.
Yes, I'd say move it to the EPUB (or Workshop) section.

Quote:
Originally Posted by KevinH View Post
There are many tools including regex and the ability to create a table of before and after potential replacements that can be easily scrolled over and just the replacements you do not want removed. There are also regex python replacement functions (built-in or via plugin) that can be used as well.
Can you describe some of your ideas?

Quote:
Originally Posted by democrite View Post
A different language plus specialized scientific terms for which perhaps I couldn't find a dictionary.
Which language? Which words? It's like we're pulling teeth here! The more you can share, the easier it'll be to tackle your specific issues.

Quote:
Originally Posted by democrite View Post
Exported PDF to EPUB with one of the numerous cheapo apps.
And there's the root cause of a lot of this. Soft hyphen detection is key.

So many of the crappy PDF->something apps just treat all "line-ending hyphens" as "hard hyphens", so they'll appear in the EPUB. And as you can see, that produces THOUSANDS of them that you'll have to correct.

In the case of Finereader, it narrows it down to a handful.

Last edited by Tex2002ans; 12-10-2023 at 01:52 PM.
Old 12-10-2023, 06:24 PM   #13
democrite
Evangelist
I was doing what I could with what I had. I've created many EPUBs and, as they say, this is not my first rodeo. I usually use ABBYY, but this time I wanted a text export and the best way to do that. I am decently aware of the alternatives but was asking about just this. I'll deal with any issues. Next time perhaps I need to get one of those pricey Quark or InDesign plugins, as they've been around for years and perhaps deal decently with PDF conversion.
Old 12-10-2023, 06:36 PM   #14
Turtle91
A Hairy Wizard
Posts: 3,097
Karma: 18727053
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
Quote:
Originally Posted by Tex2002ans View Post
Can you describe some of your ideas?
I think he is referring to the existing capabilities in Sigil.
All times are GMT -4.