12-08-2023, 04:45 AM   #6
Tex2002ans
Wizard
Quote:
Originally Posted by DiapDealer
I'm sure people get sick of hearing it, but the bottom line is that PDF is a destination format and not a source format. There is no magic formula to make converting from PDF easy, quick or accurate.
Yep, exactly. I just wrote about it again a few days ago too in:

Quote:
Originally Posted by democrite
As there are something like 6000+ occurrences, regular regex doesn't help too much I think, particularly when wanting to verify any changes are indeed wanted, e.g. maybe there should be a hyphen and merely remove the space, or mis-match of terms against a dictionary.
Sounds like the OCR tool you are using isn't the greatest. (Or isn't tuned properly.)

A lot of those "soft hyphens" should've been detected/squished at that level instead. That would've made your life a hell of a lot easier at this later stage.

As always, the higher the quality of the first stage (the OCR/text/formatting layer), the more time you'll save on all later stages. Imagine it like a pyramid. If you have a crappy foundation, it'll take MUCH longer to clean up all the mess later.

Quote:
Originally Posted by democrite
I then diff compare the changes, as I should be doing anyway.
Yes, that is one way.

These things should be compared per book though, not just against a dictionary.

(One of the tools I came up with years ago compares the book against itself. All hyphenated words get unhyphenated. If it appears elsewhere in the book, report the words to me, then I could take a closer look + quickly correct.)
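A rough idea of what that kind of "compare the book against itself" check can look like. This is only a minimal sketch, not the actual tool: the function name `hyphen_candidates` is made up for illustration, and it assumes plain text with ASCII hyphens only (no real U+00AD soft hyphens, no accented letters).

```python
import re
from collections import Counter

# Words made of ASCII letters, optionally joined by hyphens (assumption:
# the book text has already been extracted to plain text).
WORD_RE = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")


def hyphen_candidates(text):
    """Report hyphenated words whose unhyphenated form also occurs in the text.

    Returns {hyphenated: (hyphenated_count, joined_form, joined_count)}.
    """
    words = Counter(w.lower() for w in WORD_RE.findall(text))
    report = {}
    for word, count in words.items():
        if "-" not in word:
            continue
        joined = word.replace("-", "")
        # Only flag it if the book itself uses the joined spelling somewhere.
        if joined in words:
            report[word] = (count, joined, words[joined])
    return report


if __name__ == "__main__":
    sample = "The to-day paper said today is fine. A well-known co-op."
    for word, (n, joined, m) in hyphen_candidates(sample).items():
        print(f"{word} x{n}  vs  {joined} x{m}")
```

Anything it reports is only a candidate to look at by hand. It doesn't change a single character in the book.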

Personally, I err on the side of:
  • Correcting them with Spellcheck Lists.
  • Then checking all remaining ones.

instead of:
  • "Correct" everything with dictionaries.
  • Spend time comparing/readding hyphens in a swarm of diffs.

To do a mass search/replace by dictionary... a lot of otherwise correct hyphens would get changed by accident.
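To make that risk concrete, here is a small sketch of the "blind" dictionary approach. Everything in it is hypothetical (the tiny `DICTIONARY` set and the `naive_dehyphenate` name are invented for the example), but it shows how a legitimate hyphen disappears right alongside the OCR junk:

```python
import re

# Hypothetical mini-dictionary standing in for a real word list.
DICTIONARY = {"recover", "coop", "example", "today"}

HYPHENATED_RE = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)+")


def naive_dehyphenate(text, dictionary):
    """Join every hyphenated word whose joined form is a dictionary word."""

    def fix(match):
        joined = match.group(0).replace("-", "")
        # No context check at all: if the dictionary knows it, the hyphen goes.
        return joined if joined.lower() in dictionary else match.group(0)

    return HYPHENATED_RE.sub(fix, text)


if __name__ == "__main__":
    line = "We had to re-cover the sofa in ex-ample fabric."
    print(naive_dehyphenate(line, DICTIONARY))
```

"ex-ample" gets fixed, which is what you wanted, but "re-cover" silently becomes "recover", which means something else entirely.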

Doing it the "slower way" allows me to catch lots of other PDF issues too (like bad pagebreaks, footnotes-in-the-middle-of-text, etc.) + see more patterns in the book itself.

- - -

Side Note #1: For example, last month I worked on a book written by a British author.

They insisted on non-hyphenated versions of "co-op" words:
  • coopt
  • cooption
  • coopted

I recommended a normalization to hyphenated:
  • co-opt
  • co-option
  • co-opted

(See Google N-grams comparing hyphenated vs. non-hyphenated ones.)

While 14/15 cases would've worked fine using my way... then there was an extremely awkward:
  • Pharma-coopted

which looked EXTREMELY odd with:
  • Pharma-co-opted

This meant I had to apply the same rule to ALL "co-" words throughout the book! Not just that single word/location.

If you had that change, buried within 6000 other ones, you probably would've never noticed this issue. :P

Because I was treating all "coop"/"co-op" words in the same pass, I was able to see all 15 at once in the Spellcheck Lists, then take a much closer look at each case.
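If you don't have Spellcheck Lists handy, a similar "see every variant in one pass" view can be roughed out in a few lines. This is just a sketch under the same plain-text, ASCII-hyphen assumptions, and `group_variants` and its `stem` parameter are names I made up here:

```python
import re
from collections import Counter, defaultdict

WORD_RE = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")


def group_variants(text, stem):
    """Group spellings by their hyphen-less, lowercased form.

    Only groups whose key contains `stem` are kept, so compounds such as
    "Pharma-coopted" show up next to the plain "coopted"/"co-opted" cases.
    """
    counts = Counter(WORD_RE.findall(text))
    groups = defaultdict(dict)
    for word, n in counts.items():
        key = word.lower().replace("-", "")
        if stem in key:
            groups[key][word] = n
    return dict(groups)


if __name__ == "__main__":
    sample = "They coopted it. A co-opted board. Pharma-coopted research."
    for key, variants in group_variants(sample, "coopt").items():
        print(key, variants)
```

The compound lands in its own group, so the odd case is sitting right there in the list instead of buried in a diff.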

- - -

Side Note #1.1: If you want more on hyphenation dropping out of popular words over time ("cooperation" vs. "co-operation" / "coöperation") or extremely rare "to-" words that don't exist anymore... see my posts in:

Two of the common ones people complain about in old books are "to-day" and "to-morrow".

Last edited by Tex2002ans; 12-08-2023 at 04:51 AM.