03-05-2012, 06:00 PM   #1
therealjoeblow
Zealot
therealjoeblow reads XML... blindfolded
 
Posts: 106
Karma: 52102
Join Date: Jun 2010
Device: Samsung Android Tablet w/Moon+ Pro Reader
Heuristic "Remove unnecessary hyphens" not working?

Does this feature actually work?

I converted a retail PDF book to .htmlz and then fixed all of the broken quotes and paragraphs using my standard regex searches; no issues there. However, in the PDF of this particular book, the publisher fubar'ed it by using the same plain 'dash' character for end-of-line hyphenation as for compound words, so with over 1100 occurrences of the "-" character it's not simple to fix.

For example: "tight-lipped" is a proper compound word that should have the 'dash', but "im-mersed" was hyphenated at the end of a line in the original .PDF and should have the 'dash' removed.

I tried enabling the heuristics "remove unnecessary hyphens" option when I converted the .htmlz to .epub, hoping it would fix this, but it made no difference: none of the dashes were removed during the conversion.

Any ideas?

Cheers
The REAL Joe
03-05-2012, 08:27 PM   #2
ldolse
Wizard
ldolse is an accomplished Snipe hunter.
 
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Edit: if none of the dashes were removed, this sounds like a bug. Open a bug report with your book attached and I can check it out.

Heuristics decides whether a hyphen should be removed by using the book itself as a dictionary. This approach works pretty well, but it doesn't guarantee that every hyphen that needs removing will be removed. It does pretty much guarantee that any hyphens that should be kept will be kept, which is the more important thing IMHO.

It does do some rudimentary stemming of the words, so in your example 'im-mersed' would be shortened to 'immers', and the document would be checked to see if the text 'immers' exists in the book. I think that if you double-check your book you'll find that 'immers' can't be found anywhere.
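
To make the idea concrete, here is a rough Python sketch of the "book as its own dictionary" check. It is only an illustration, not calibre's actual code: the suffix list, the regex, and the names `crude_stem` and `dehyphenate` are all made up for this example.

```python
import re

# Hypothetical suffix list for a very crude stemmer (an assumption,
# not what calibre really uses).
SUFFIXES = ("ing", "ed", "es", "ly", "s")

# A hyphen sitting between two runs of letters, e.g. "im-mersed".
HYPHENATED = re.compile(r"\b([A-Za-z]+)-([A-Za-z]+)\b")


def crude_stem(word):
    """Lower-case the word and strip one common suffix, if any."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def dehyphenate(text):
    """Join a hyphenated pair only if its stem appears elsewhere in text."""
    # Blank out the hyphenated candidates so they cannot vouch for themselves.
    reference = HYPHENATED.sub(" ", text).lower()

    def fix(match):
        joined = match.group(1) + match.group(2)
        # Loose substring check: is the stem found anywhere in the book?
        if crude_stem(joined) in reference:
            return joined
        return match.group(0)  # no evidence, so keep the hyphen

    return HYPHENATED.sub(fix, text)


book = "He was im-mersed in work. Total immersion helps. She stayed tight-lipped."
print(dehyphenate(book))
```

In this sample 'im-mersed' gets joined because 'immers' turns up inside 'immersion', while 'tight-lipped' is left alone because nothing else in the text backs up 'tightlipp'. If the book never uses the word in unhyphenated form, the hyphen stays, which is the conservative behaviour described above.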

The only way to improve on this further is to allow the user to specify an external dictionary/wordlist, but there hasn't been all that much interest in further improving the feature.
03-06-2012, 10:21 AM   #3
therealjoeblow
Zealot
therealjoeblow reads XML... blindfolded
 
Posts: 106
Karma: 52102
Join Date: Jun 2010
Device: Samsung Android Tablet w/Moon+ Pro Reader
OK, I opened a bug report. When looking for the incorrectly hyphenated words in the attachment files, start AFTER chapter 13, because I manually corrected them up to there before I got tired and looked for a more automated solution.

Cheers
The REAL Joe
All times are GMT -4.