#1
Zealot
Posts: 106
Karma: 52102
Join Date: Jun 2010
Device: Samsung Android Tablet w/Moon+ Pro Reader
Heuristic "Remove unnecessary hyphens" not working?
Does this feature actually work?
I converted a retail PDF book to HTMLZ, then fixed all of the broken quotes and paragraphs using my standard regex searches, no issues there. However, in the PDF of this particular book, the publisher fubar'ed it by using the same normal 'dash' character for end-of-line hyphenation as for compound words, so with over 1100 occurrences of the "-" character it's not simple to fix.

For example: "tight-lipped" is a proper compound word that should keep the 'dash', but "im-mersed" was hyphenated at the end of a line in the original PDF and should have the 'dash' removed.

I tried enabling the heuristics "Remove unnecessary hyphens" option when I converted the HTMLZ to EPUB, hoping it would fix this, but it makes no difference: none of the dashes are removed during the conversion.

Any ideas?

Cheers
The REAL Joe
#2
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Edit: if none of the dashes were removed, this sounds like a bug. Open a bug report with your book attached and I can check it out.

Heuristics decides whether a hyphen should be removed by using the book itself as a dictionary. This approach works pretty well, but it doesn't guarantee that every hyphen that needs removing will be removed. It does pretty much guarantee that any hyphens that should be kept will be, which is the more important thing IMHO.

It also does some rudimentary stemming of the words, so in your example 'im-mersed' would be shortened to 'immers', and the document would be checked to see if the text 'immers' exists anywhere in the book. I think if you double-check your book you'll find that 'immers' can't be found anywhere.

The only way to improve on this further is to allow the user to specify an external dictionary/wordlist, but there hasn't been all that much interest in further improving the feature.
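To make the idea concrete, here is a rough Python sketch of the "book as its own dictionary" check described above. It is not calibre's actual code: the names `crude_stem` and `dehyphenate`, the suffix list, and the prefix-matching rule are all assumptions made for illustration.

```python
import re

WORD_RE = re.compile(r"[A-Za-z]+")
HYPHENATED_RE = re.compile(r"\b([A-Za-z]+)-([A-Za-z]+)\b")
# Hypothetical suffix list; real stemming would be more careful.
SUFFIXES = ("ing", "ed", "es", "ly", "s")


def crude_stem(word):
    """Lowercase the word and strip at most one common suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def dehyphenate(text):
    """Drop a hyphen only when the joined word's stem is seen elsewhere."""
    # The book's own words act as the dictionary.
    vocabulary = {w.lower() for w in WORD_RE.findall(text)}

    def fix(match):
        joined = match.group(1) + match.group(2)
        stem = crude_stem(joined)
        # Assumption: a stem counts as "found" if some word starts with it.
        if any(word.startswith(stem) for word in vocabulary):
            return joined
        return match.group(0)  # no evidence, so keep the hyphen

    return HYPHENATED_RE.sub(fix, text)


sample = (
    "She was immersed in thought. He stayed tight-lipped and "
    "im-mersed himself in work, but never felt sub-merged."
)
print(dehyphenate(sample))
```

In this sample "im-mersed" is joined only because "immersed" also appears unhyphenated elsewhere in the text. "tight-lipped" is kept, and so is "sub-merged", since nothing in the text confirms "submerg". That is the same limitation as above: a word that only ever occurs in its broken form can't be fixed without an external wordlist.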
#3
Zealot
Posts: 106
Karma: 52102
Join Date: Jun 2010
Device: Samsung Android Tablet w/Moon+ Pro Reader
OK, I opened a bug report. When looking for the incorrectly hyphenated words in the attached files, start AFTER chapter 13, because I manually corrected them up to there before I got tired and looked for a more automated solution.

Cheers
The REAL Joe