Quote:
Originally Posted by DiapDealer
I'm sure people get sick of hearing it, but the bottom line is that PDF is a destination format and not a source format. There is no magic formula to make converting from PDF easy, quick or accurate.
|
Yep, exactly. I just wrote about it again a few days ago too in:
Quote:
Originally Posted by democrite
As there are something like 6000+ occurrences, regular regex doesn't help too much I think, particularly when wanting to verify any changes are indeed wanted, e.g. maybe there should be a hyphen and merely remove the space, or mis-match of terms against a dictionary.
|
Sounds like the OCR tool you are using isn't the greatest. (Or isn't tuned properly.)
A lot of those "soft hyphens" should've been detected/squished at that level instead. That would've made your life a hell of a lot easier at this later stage.
As always, the higher the quality of the first stage (the OCR/text/formatting layer), the more time you'll save on all the later stages. Imagine it like a pyramid: if you have a crappy foundation, it'll take MUCH longer to clean up all the mess later.
Quote:
Originally Posted by democrite
I then diff compare the changes, as I should be doing anyway.
|
Yes, that is one way.
These things should be compared per book though, not just against a dictionary.
(One of the tools I came up with years ago compares the book against itself. All hyphenated words get unhyphenated. If the unhyphenated form appears elsewhere in the book, the tool reports the word to me, so I can take a closer look + quickly correct.)
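A rough sketch of that "compare the book against itself" idea, in Python. This is not the actual tool, just an illustration: the function name `hyphen_suspects`, the sample text, and the ASCII-only word pattern are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: words are plain ASCII letters, optionally joined by hyphens.
WORD_RE = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")

def hyphen_suspects(text):
    """Map each hyphenated word to how often its unhyphenated form
    appears elsewhere in the same text (only when it appears at all)."""
    counts = Counter(w.lower() for w in WORD_RE.findall(text))
    suspects = {}
    for word in counts:
        if "-" not in word:
            continue
        joined = word.replace("-", "")
        if joined in counts:
            # The book itself uses the joined form, so flag this one for review.
            suspects[word] = counts[joined]
    return suspects

sample = (
    "The com-mittee met today. The committee agreed that a "
    "well-known author should co-opt the board."
)
print(hyphen_suspects(sample))
```

Here only "com-mittee" gets reported, because "committee" shows up elsewhere in the sample. "well-known" and "co-opt" are left alone, since the text never uses "wellknown" or "coopt". The report is a list to review by hand, not a list of automatic replacements.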
Personally, I err on the side of:
- Correcting it with Spellcheck Lists.
- Then checking all the remaining ones.
instead of:
- "Correct" everything with dictionaries.
- Spend time comparing/readding hyphens in a swarm of diffs.
If you do a mass search/replace by dictionary, a lot of otherwise correct hyphens will get changed by accident.
Doing it the "slower way" allows me to catch lots of other PDF issues too (like bad pagebreaks, footnotes-in-the-middle-of-text, etc.) + see more patterns in the book itself.
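To make the "changed by accident" risk concrete, here is a deliberately naive sketch in Python of a dictionary-driven mass replace. The tiny `TOY_DICTIONARY`, the function name `naive_dehyphenate`, and the sample sentence are all made up for illustration. A real dictionary would be far larger, which only makes collisions like this more likely.

```python
import re

# Hypothetical toy dictionary; a real word list would be much bigger.
TOY_DICTIONARY = {"resign", "recover", "today"}

# Assumption: only simple two-part ASCII hyphenated words are considered.
HYPHENATED_RE = re.compile(r"\b([A-Za-z]+)-([A-Za-z]+)\b")

def naive_dehyphenate(text, dictionary):
    """Join every hyphenated word whose joined form is in the dictionary."""
    def join_if_known(match):
        joined = match.group(1) + match.group(2)
        # No context check at all: this is exactly the weakness being shown.
        return joined if joined.lower() in dictionary else match.group(0)
    return HYPHENATED_RE.sub(join_if_known, text)

sample = "He will re-sign the contract to-day, not resign."
print(naive_dehyphenate(sample, TOY_DICTIONARY))
# prints: He will resign the contract today, not resign.
```

"to-day" becoming "today" is probably what you wanted, but "re-sign" becoming "resign" flips the meaning of the sentence, and the dictionary has no way of knowing that.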
- - -
Side Note #1: For example, last month I worked on a book written by a British author.
They insisted on non-hyphenated versions of "co-op" words:
I recommended a normalization to hyphenated:
- co-opt
- co-option
- co-opted
(See Google N-grams comparing hyphenated vs. non-hyphenated ones.)
While 14/15 cases would've worked fine using my way... then there was an extremely awkward:
which looked EXTREMELY odd with:
This meant I had to apply the same rule to ALL "co-" words throughout the book! Not just that single word/location.
If you had that change, buried within 6000 other ones, you probably would've never noticed this issue. :P
Because I was treating all "coop"/"co-op" words in the same pass, I was able to see all 15 at once in the Spellcheck Lists, then take a much closer look at each case.
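If you don't have a Spellcheck Lists view handy, a similar "see every variant at once" listing can be approximated with a few lines of Python. This is only a sketch and not how any particular editor does it: `spelling_variants`, the prefix argument, and the sample sentence are assumptions for the example.

```python
import re
from collections import Counter, defaultdict

# Assumption: words are plain ASCII letters, optionally joined by hyphens.
WORD_RE = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")

def spelling_variants(text, prefix):
    """Group words starting with `prefix` by their unhyphenated form,
    counting how often each spelling occurs."""
    groups = defaultdict(Counter)
    for word in WORD_RE.findall(text):
        lower = word.lower()
        if lower.startswith(prefix):
            # "co-opted" and "coopted" share the key "coopted".
            groups[lower.replace("-", "")][lower] += 1
    return {key: dict(forms) for key, forms in groups.items()}

sample = "They co-opted him. Later she was coopted too, and the co-option stood."
for key, forms in spelling_variants(sample, "co").items():
    print(key, forms)
```

Any key that lists more than one spelling is an inconsistency within the book, and seeing the whole "co-" family side by side is what makes the awkward cases stand out.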
- - -
Side Note #1.1: If you want more on hyphenation dropping out of popular words over time ("cooperation" vs. "co-operation" / "coöperation") or extremely rare "to-" words that don't exist anymore... see my posts in:
One of the common ones people complain about from old books is "to-day" and "to-morrow".