MobileRead Forums > E-Book Software > Calibre > Conversion
Old 02-29-2024, 11:00 AM   #1
quinta@ebf.cz
Connoisseur
Posts: 59
Karma: 10
Join Date: Mar 2019
Device: Kindle 3 Paperwhite
Space character before Soft Hyphen missed in epub>docx conversion

Hello.

I ran into a conversion problem that is probably quite rare: a soft hyphen inserted before a word (i.e., between a space character and the first character of the word) can cause the space character to be dropped in epub > docx conversion.

Naturally, I am aware that such placement is very strange for a soft hyphen, and I don't know how such soft hyphens could get into an epub. In any case, I have seen several documents exported this way, and it is very difficult to fix them without access to the source epub file…
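
For anyone who wants to check whether a text contains this pattern, here is a minimal Python sketch. It only scans a plain string for a soft hyphen (U+00AD) that directly follows whitespace; it does not open an epub or use any Calibre API, and the function name and sample string are made up for illustration.

```python
import re

# A soft hyphen (U+00AD) that directly follows a whitespace character,
# i.e. one that sits at the start of a word instead of inside it.
MISPLACED_SOFT_HYPHEN = re.compile(r"(?<=\s)\u00ad")


def find_misplaced_soft_hyphens(text):
    """Return the string offsets of soft hyphens that follow whitespace."""
    return [match.start() for match in MISPLACED_SOFT_HYPHEN.finditer(text)]


# Hypothetical sample: one misplaced soft hyphen (before "bar") and one
# ordinary in-word soft hyphen (inside "hyphen").
sample = "foo \u00adbar hy\u00adphen"
positions = find_misplaced_soft_hyphens(sample)
print(positions)
```

Only the soft hyphen after the space is reported; the one inside the word is left alone, since that is the normal placement.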



sample epub+docx attached
Calibre version: 6.26
Attached Files
File Type: epub CalibreShyExportErrorDemo.epub (3.1 KB, 24 views)
File Type: docx CalibreShyExportErrorDemo.docx (4.1 KB, 23 views)
Old 02-29-2024, 11:09 AM   #2
JSWolf
Resident Curmudgeon
Posts: 74,015
Karma: 129333114
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
The soft hyphens should be removed as they don't work well in ePub. Most software won't display them properly and searching will not work.
Old 02-29-2024, 11:26 AM   #3
quinta@ebf.cz
OK, I'm also not a fan of using soft hyphens in epub. But that doesn't change the fact that Calibre's conversion behavior appears to be flawed, and that this error is very problematic, especially if you have no control over the epub and only have the conversion output to work with.
Old 02-29-2024, 11:32 AM   #4
Quoth
the rook, bossing Never.
Posts: 11,164
Karma: 85874891
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
Soft hyphens are for websites. Ebooks should leave it to the renderer to hyphenate.
There should never be a space on any side of a soft hyphen as the only reason for them is a hint as to where to break in a word. Unexpected never-to-be encountered formatting can break conversions.

Someone has used a crazy tool or formatting of the epub. Fix the epub by deleting all soft hyphens before conversion to docx using the Calibre editor, which should be easy.

Auto-hyphenation also needs to be off in the word processor source when creating an ebook and on for PDF, because the word processor has no knowledge of the page width of an ebook, but does have the page size set for a PDF.

Last edited by Quoth; 02-29-2024 at 11:36 AM.
Old 02-29-2024, 11:48 AM   #5
quinta@ebf.cz
Quoth: Thank you, I agree, but once again: the source epub file is not always available. In such a case you have to work with the conversion output, and this is quite difficult without space characters.
Old 02-29-2024, 12:12 PM   #6
DNSB
Bibliophagist
Posts: 35,464
Karma: 145525534
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Forma, Clara HD, Lenovo M8 FHD, Paperwhite 4, Tolino epos
Since AFAIR, a soft hyphen is an invisible format character indicating a possible hyphenation location, placing a soft hyphen between a space character and a word is pretty much a garbage in, garbage out situation.

Since you appear to have the epub to be able to convert it, why not simply remove the soft hyphens and reconvert? To me, this would be the more sensible approach and allows those who use soft hyphens properly to continue using them.

If you are only given the converted output, time to punt the garbage back to the originator and tell them to fix it.
Old 02-29-2024, 12:12 PM   #7
theducks
Well trained by Cats
Posts: 29,809
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
The EPUB is the source or you could not convert (DRM infested).

Try using the built-in 'Polish' tool (you may need to add it to a toolbar): remove soft hyphens.

If that does not get them all, you will need to use the editor (and some regex).
Old 02-29-2024, 12:41 PM   #8
quinta@ebf.cz
Quote:
Originally Posted by DNSB
Since you appear to have the epub to be able to convert it, why not simply remove the soft hyphens and reconvert?
The attached epub is just a demo file for testing.

Of course, the problem can _easily_ be solved in the epub.

But conversion should _never_ remove spaces, and the source epub file is not _always_ available. Sorry, but suggestions to solve it by editing the epub are not very useful/relevant.

Last edited by quinta@ebf.cz; 02-29-2024 at 12:44 PM.
Old 02-29-2024, 12:50 PM   #9
Quoth
Quote:
Originally Posted by quinta@ebf.cz
But conversion should _never_ remove spaces. Sorry, but suggestions to solve it by editing epub are not very relevant.
Did you miss where I explained that conversions break when the source has formatting that should not exist?

The source is broken. There should never ever be a space with a soft hyphen. A soft hyphen should only ever be inside a word, and shouldn't be in an epub anyway.

No regex needed. Simply replace every soft hyphen with nothing. It's not like non-breaking spaces, which are needed in ebooks, like between a number and a street or a number and a type of unit etc.
Old 02-29-2024, 01:03 PM   #10
DNSB
Quote:
Originally Posted by quinta@ebf.cz
Attached epub is just demo file for testing.

Of course, problem can _easily_ be solved in epub.

But conversion should _never_ remove spaces, and source epub file is not _always_ available. Sorry, but suggestions to solve it by editing epub are not very useful/relevant.
So what you want is to have a file with a structural error fixed, since a soft hyphen should never occur between a space and a word. If you do not have the original epub to fix it yourself, get the person who generated the flawed ePub to fix their garbage output. Otherwise, it is GIGO: since a soft hyphen should never occur outside of a word, the space in front of the soft hyphen is an obvious error.

Out of a perhaps morbid curiosity, was this a commercially available, public domain or freely distributable epub? If so, you might want to complain to the source.
Old 02-29-2024, 01:08 PM   #11
JSWolf
The only place I know for sure that soft hyphens work and do not break searching is on a Kindle with KF8 format eBooks.
Old 02-29-2024, 01:12 PM   #12
JSWolf
Quote:
Originally Posted by quinta@ebf.cz
Attached epub is just demo file for testing.

Of course, problem can _easily_ be solved in epub.

But conversion should _never_ remove spaces, and source epub file is not _always_ available. Sorry, but suggestions to solve it by editing epub are not very useful/relevant.
The source ePub has to be available. Otherwise, where did the DocX come from?

I know two ways soft hyphens can get into an ePub. One is with calibre's polish and the other is with the Hyphenate This! plugin. But neither puts soft hyphens outside of words as in your example.

If you really do not have the source ePub, remove all of the soft hyphens and do a spell check.
Old 02-29-2024, 01:23 PM   #13
quinta@ebf.cz
OK. I know how to remove soft hyphens (SH) from an epub. This is quite easy; there is no need for more explanation about that.

EDIT:
Key information for this sort of task:
- SH can be searched for/replaced with the regular expression \xad (or \u00ad), because SH is the character U+00AD (decimal 173)
- be aware: the soft hyphen character itself is not visible in the Calibre editor
- removing soft hyphens is also part of Calibre's "Polish ebook" tool
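
The same search/replace can be sketched outside the editor in plain Python. This is only an illustration on a string, assuming you already have the text in hand; it does not read or write epub/docx files, and the function name and sample text are made up. The soft hyphen is U+00AD, so \xad and \u00ad match the same character.

```python
import re

# \xad and \u00ad are two spellings of the same character, U+00AD.
SOFT_HYPHEN_RE = re.compile(r"\xad")


def strip_soft_hyphens(text: str) -> str:
    """Replace every soft hyphen with nothing, leaving all spaces in place."""
    return SOFT_HYPHEN_RE.sub("", text)


# Hypothetical sample: one soft hyphen after a space, two inside a word.
sample = "a \u00adword with hy\u00adphen\u00adation"
cleaned = strip_soft_hyphens(sample)
print(repr(cleaned))
```

Because only the soft hyphen itself is matched, the space in front of a misplaced one survives the replacement.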

Quote:
Originally Posted by Quoth
I explained that conversions break when the source has formatting that should not exist?
In other words, you are saying that dropping the space during the "space+SH" export is intentional. I don't think so.

BTW, an interesting fact: it seems that not every space from the "space+SH" combos is dropped. I don't know what that means. Just interesting. : )

Last edited by quinta@ebf.cz; 03-01-2024 at 04:45 AM.
Old 02-29-2024, 01:39 PM   #14
quinta@ebf.cz
Quote:
Originally Posted by JSWolf
If you really do not have the source ePub, remove all of the soft hyphens and do a spell check.
Thanks, this is a suitable emergency procedure.
Even though it would be much better not to have to use it. So I actually hope the conversion will be fixed, sooner or later.
Old 02-29-2024, 03:01 PM   #15
Quoth
Quote:
Originally Posted by quinta@ebf.cz
In other words, you are saying that dropping of space during the "space+SH" export is intentional. I don't think so.
No, I'm not writing that. If the source for any kind of conversion is broken in an unexpected way, then a conversion may do something unexpected. Maybe it's a bug, but it's a bug that wouldn't show up in years of testing because you do not get a space beside a soft hyphen ever.

Your input file has a serious mistake. Fix broken input. All computer programs are famous for Garbage In gives Garbage Out.

I'm sure there are other stupid things that should never ever be in an epub that will break conversion, and Amazon's conversions break more easily than Calibre's. But Amazon produces perfect mobi, azw3 and KFX from epub uploads to KDP that are correct. This is a broken epub. It's a format error I've never seen in ten years.

Last edited by Quoth; 02-29-2024 at 03:05 PM.