#1 |
Evangelist
Posts: 440
Karma: 77256
Join Date: Sep 2011
Device: none
|
Fixing hyphenation or word breaks from PDF conversion
Hello,
I converted a PDF and the resulting EPUB has numerous end-of-line word breaks such as “In - ternally” or “red-and- white”. The calibre manual seems to have a function-mode search-and-replace example for this. I'll try it soon, but I was wondering if there is a plugin, or whether someone could help. I'm hesitant to use something dictionary-based, but if a plugin could scan the EPUB for words to build a dictionary and then fix the breaks from that, I think that would be preferable. If anyone knows of any other editor or tool, that would be very helpful. Thanks.
#2 |
Bibliophagist
Posts: 44,451
Karma: 167726581
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
|
You might try using a regex search/replace to correct the issue.
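As a rough illustration of that suggestion, here is a minimal Python sketch that only lists suspected breaks with some surrounding context, without replacing anything. The pattern and the helper name `find_breaks` are assumptions made for this example; they do not come from calibre or Sigil. Note that a spaced hyphen used as a dash ("yes - no") would be flagged too.

```python
import re

# A word fragment, a hyphen, then whitespace (the leftover line break), then
# the rest of the word. The optional space before the hyphen also catches
# cases like "In - ternally".
BROKEN = re.compile(r"(\w+) ?-\s+(\w+)")

def find_breaks(text, context=20):
    """Yield (matched_text, surrounding_context) for each suspected break."""
    for m in BROKEN.finditer(text):
        start = max(m.start() - context, 0)
        end = min(m.end() + context, len(text))
        yield m.group(0), text[start:end]

sample = "It was In - ternally red-and- white, a well-known case."
for hit, ctx in find_breaks(sample):
    print(repr(hit), "in", repr(ctx))
```

A correctly hyphenated word such as "well-known" is not reported, because no whitespace follows its hyphen.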
|
#3 | |||
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
But here's a general breakdown of the stages/passes you should do:

Step 1: Split Paragraphs and Hyphens At End of Lines

When dealing with split paragraphs, I use the 3 regexes I explained in:

First one takes care of hyphens at the end of lines. (And I explain all sorts of edge-cases you will run across and need to pay attention to.)

Step 2: Spellcheck Lists (All Hyphenated Words)

I still make heavy use of the trick I wrote about in:

which is...

1. Use "Spellcheck Lists":
2. Type in a single HYPHEN into the "Search" box.

This will give you a fully searchable/sortable list of every single hyphenated word in the book.

Step 3: Search for HYPHEN + SPACE And Replace with HYPHEN (or NOTHING).

You'll have to go through the entire book in a case-by-case basis. There shouldn't be many left. This will catch your leftover examples like:

Code:
red-and- white
case-by- case
In- ternally

Quote:
But doing such a mass correction can accidentally remove many correct ones too.

I personally do everything one-by-one, on a case-by-case basis. Doesn't take very long using the methods above.

The first 3 Regexes should take care of the VAST majority of OCR errors very quickly + in a few button presses. The rest require manual looking/comparing, and I would never trust a "Replace All".

Quote:
In your favorite search engine, type:

Code:
hyphens regex Tex2002ans site:mobileread.com

That can also be found by typing this into your favorite search engine:

Code:
regex Tex2002ans site:reddit.com
newspapers Tex2002ans site:reddit.com

Last edited by Tex2002ans; 12-07-2023 at 11:29 PM.
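For readers working outside an editor with a Spellcheck Lists feature, the Step 2 idea (one sortable list of every hyphenated word) can be approximated in a few lines of Python. This is only a sketch under assumptions: the pattern and the function name `hyphenated_words` are invented for illustration, and it works on plain text rather than on the EPUB's XHTML.

```python
import re
from collections import Counter

# A word made of two or more hyphen-joined parts, e.g. "case-by-case".
# A dangling "- " is not part of the match, so "red-and- white" shows up
# as the suspicious entry "red-and".
HYPHENATED = re.compile(r"\b\w+(?:-\w+)+\b")

def hyphenated_words(text):
    """Count every hyphenated word, ignoring case."""
    return Counter(w.lower() for w in HYPHENATED.findall(text))

sample = "A case-by-case review. Case-by-case, red-and- white, well-known."
for word, count in hyphenated_words(sample).most_common():
    print(count, word)
```

Truncated entries such as "red-and" stand out in the list and point at the leftovers that Step 3 is meant to catch.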
|||
#4 |
Grand Sorcerer
Posts: 28,335
Karma: 203719142
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
I'm sure people get sick of hearing it, but the bottom line is that PDF is a destination format and not a source format. There is no magic formula to make converting from PDF easy, quick or accurate. I myself won't touch it with a ten-foot pole.
|
#5 |
Evangelist
|
Thank you, Tex2002ans. I'll take a closer look at what you mentioned.

As there are something like 6000+ occurrences, plain regex doesn't help much, I think, particularly when I want to verify that each change is actually wanted: e.g. maybe the hyphen should stay and only the space should go, or a term doesn't match the dictionary.

As for PDF, there are works for which there is no eBook, or only a PDF, and they are important enough that it's worth the trouble, as I don't just read them but study them for years.

With variations in the regex, the calibre method works well: https://manual.calibre-ebook.com/fun...phenated-words

I then diff the changes, as I should be doing anyway. The results are decent. I haven't checked, but I think it uses the calibre dictionary and not a dictionary formed from terms in the EPUB, so there's a bit more to do, but it's OK for now.
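The idea of a dictionary formed from the book's own terms can be sketched in plain Python. This is a hedged illustration, not the calibre function linked above: the function name `propose_fixes`, the pattern, and the reason strings are all assumptions. It proposes the joined form when that form occurs elsewhere in the same text, proposes the hyphenated form when that occurs instead, and otherwise leaves the case for manual review.

```python
import re

# Left part (possibly already hyphenated, e.g. "red-and"), a hyphen followed
# by whitespace, then the right part.
BROKEN = re.compile(r"((?:\w+-)*\w+) ?-\s+(\w+)")

def propose_fixes(text):
    """Return (original, proposal, reason) for each suspected break."""
    # Vocabulary built only from this text, lowercased.
    vocab = {w.lower() for w in re.findall(r"\w+(?:-\w+)*", text)}
    proposals = []
    for m in BROKEN.finditer(text):
        joined = m.group(1) + m.group(2)
        hyphenated = m.group(1) + "-" + m.group(2)
        if joined.lower() in vocab:
            proposals.append((m.group(0), joined, "joined form seen elsewhere"))
        elif hyphenated.lower() in vocab:
            proposals.append((m.group(0), hyphenated, "hyphenated form seen elsewhere"))
        else:
            proposals.append((m.group(0), None, "needs manual review"))
    return proposals

sample = ("Internally it was red-and-white. "
          "In - ternally the red-and- white flag; a foo- bar.")
for original, proposal, reason in propose_fixes(sample):
    print(repr(original), "->", repr(proposal), f"({reason})")
```

Because nothing is replaced automatically, the output can be diffed or reviewed line by line before any change is applied.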
#6 | ||
Wizard
|
Quote:
Quote:
A lot of those "soft hyphens" should've been detected/squished at that level instead. That would've made your life a hell of a lot easier at this later stage.

As always, the higher quality the first stage (the OCR/text/formatting layer), the more time you'll save on all higher stages. Imagine it like a pyramid. If you have a crappy foundation, it'll take MUCH longer to clean up all the mess later.

Yes, that is one way. These things should be compared per book though, not just against a dictionary.

(One of the tools I came up with years ago compares the book against itself. All hyphenated words get unhyphenated. If it appears elsewhere in the book, report the words to me, then I could take a closer look + quickly correct.)

Personally, I err on the side of:

instead of:

To do a mass search/replace by dictionary... a lot of otherwise correct hyphens would get changed by accident.

Doing it the "slower way" allows me to catch lots of other PDF issues too (like bad pagebreaks, footnotes-in-the-middle-of-text, etc.) + see more patterns in the book itself.

- - -

Side Note #1: For example, last month I worked on a book written by a British author. They insisted on non-hyphenated versions of "co-op" words:

I recommended a normalization to hyphenated:

(See Google N-grams comparing hyphenated vs. non-hyphenated ones.)

While 14/15 cases would've worked fine using my way... then there was an extremely awkward:

which looked EXTREMELY odd with:

This meant I had to apply the same rule to ALL "co-" words throughout the book! Not just that single word/location.

If you had that change, buried within 6000 other ones, you probably would've never noticed this issue. :P

Because I was treating all "coop"/"co-op" words in the same pass, I was able to see all 15 at once in the Spellcheck Lists, then take a much closer look at each case.

- - -

Side Note #1.1: If you want more on hyphenation dropping out of popular words over time ("cooperation" vs. "co-operation" / "coöperation") or extremely rare "to-" words that don't exist anymore... see my posts in:

One of the common ones people complain about from old books is "to-day" and "to-morrow".

Last edited by Tex2002ans; 12-08-2023 at 04:51 AM.
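The "compare the book against itself" idea can be sketched as follows. This is not the tool mentioned in the post, whose implementation is not shown in this thread; the function name `hyphenation_variants` and the approach are assumptions for illustration. It strips the hyphens from every word and reports any word that the same text spells more than one way.

```python
import re
from collections import defaultdict

def hyphenation_variants(text):
    """Group words that differ only by hyphens, e.g. 'co-op' and 'coop'."""
    groups = defaultdict(set)
    for word in re.findall(r"\w+(?:-\w+)*", text):
        key = word.lower().replace("-", "")
        groups[key].add(word.lower())
    # Keep only the words spelled more than one way in this text.
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}

sample = "The co-op met today. Cooperation matters; the coop agreed on co-operation."
print(hyphenation_variants(sample))
```

Seeing all variants of one word side by side is what makes it practical to apply a single rule per word family rather than fixing each location in isolation.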
||
#7 |
Evangelist
|
Thanks for the continued help. It wasn't OCR but a text export from a commercial PDF eBook. The book is multilingual, and I didn't think one language in particular would OCR well, so I'm stuck figuring out how to automate this, or how to script fixes for the other issues in some language.
|
#8 | |
Resident Curmudgeon
Posts: 78,914
Karma: 143098300
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Quote:
|
|
#9 |
Wizard
|
|
#10 |
Evangelist
|
A different language, plus specialized scientific terms for which I probably couldn't find a dictionary.

I exported the PDF to EPUB with one of the numerous cheapo apps. The results are not bad, though it took a few days to get them into decent shape and it could take months to fix everything. The work is worth it to me, as I'll study the subject as I fix it.

I was able to take the EPUB, convert it to TXT in calibre, make a calibre custom dictionary, and use the calibre function from above. That takes care of most terms. There are still word breaks between paragraphs, e.g. "some-" followed by "day" in the next paragraph. I haven't been able to figure out a regex for that.

I'll possibly do more PDF conversions, and there are professional apps that publishers use to import into a desktop publishing app, e.g. when they are left with only a printed copy or an old proof and need to revise and reprint. I haven't tried any, but I imagine they'd significantly reduce the effort, and if publishers rely on them, they're probably not bad.
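One possible starting point for the paragraph-spanning case, sketched in Python. It rests on several assumptions: the paragraphs are plain `<p>` elements with no inline markup at the seam, a lowercase first letter in the next paragraph signals a continuation, and the hyphen should be dropped rather than kept. The names `SPLIT` and `join_split_paragraphs` are invented for this example.

```python
import re

# A paragraph ending in "word-" immediately followed by a paragraph that
# starts with another word.
SPLIT = re.compile(r"(\w+)-\s*</p>\s*<p\b[^>]*>\s*(\w+)")

def _join(m):
    # Merge only when the next paragraph starts in lowercase, which suggests
    # a continuation rather than a new sentence.
    if m.group(2)[0].islower():
        return m.group(1) + m.group(2)
    return m.group(0)

def join_split_paragraphs(html):
    """Merge paragraphs split mid-word: 'some-</p><p>day' becomes 'someday'."""
    return SPLIT.sub(_join, html)

sample = '<p>We will meet some-</p>\n<p class="calibre1">day soon.</p>'
print(join_split_paragraphs(sample))
```

Words that should keep their hyphen (a paragraph split inside "red-and-white", say) would be merged incorrectly by this sketch, so the same pattern is probably safer used as a search with each hit reviewed by hand.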
#11 |
Sigil Developer
Posts: 8,434
Karma: 5702578
Join Date: Nov 2009
Device: many
|
This entire thread really belongs in the epub forum, not the Sigil forum. There are many tools, including regex and the ability to create a table of before-and-after potential replacements that you can easily scroll through, removing just the replacements you do not want. There are also regex Python replacement functions (built-in or via plugin) that can be used as well.
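The review-then-apply workflow described here can be mimicked outside any editor. The following is only a conceptual sketch, not how Sigil implements its table: the function `apply_approved` and the sample data are hypothetical. A list of before/after pairs is built first, the unwanted ones are struck out, and only the rest are applied.

```python
def apply_approved(text, proposals, rejected):
    """Apply each (before, after) pair unless its 'before' text was rejected."""
    for before, after in proposals:
        if before not in rejected:
            text = text.replace(before, after)
    return text

text = "In- ternally red-and- white, then- again."
proposals = [
    ("In- ternally", "Internally"),
    ("red-and- white", "red-and-white"),
    ("then- again", "thenagain"),
]
# The reviewer strikes out the replacement that looks wrong.
rejected = {"then- again"}
print(apply_approved(text, proposals, rejected))
```

Keeping the proposals as data makes it easy to save the list, review it over several sessions, and diff the result afterwards.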
|
#12 | |||
Wizard
|
Quote:
Quote:
Quote:
And there's the root cause of a lot of this. Soft hyphen detection is key.

So many of the crappy PDF->something apps just treat all "line-ending hyphens" as "hard hyphens", so they'll appear in the EPUB. And as you can see, that produces THOUSANDS of them that you'll have to correct.

In the case of Finereader, it narrows it down to a handful.

Last edited by Tex2002ans; 12-10-2023 at 01:52 PM.
|||
#13 |
Evangelist
|
I was doing what I could with what I had. I've created many EPUBs and, as they say, this is not my first rodeo. I usually use ABBYY, but this time I wanted a text export and the best way to clean one up. I am decently aware of the alternatives but was asking about just this. I'll deal with any issues. Next time perhaps I need to get one of those pricey Quark or InDesign plugins, as they've been around for years and perhaps deal decently with PDF conversion.
|
#14 |
A Hairy Wizard
Posts: 3,291
Karma: 20171067
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 15/11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
|
|