#1 |
Guru
Posts: 695
Karma: 822675
Join Date: May 2010
Device: Kobo Aura, Nokia Lumia 920 (Freda)
|
Stripping out dashes on epub convert?
Running calibre 0.7.12, I'm seeing conversion to epub strip dashes (and sometimes insert a paragraph break where they were). The source was originally a PDF, but I captured the debug output and inspected the HTML, and the dashes were still present in the intermediate steps. I tried importing the "processed" HTML and converting that to epub, and the dashes were still stripped in the resulting epub.
|
#2 |
creator of calibre
Posts: 45,149
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Are these dashes normal hyphens, or a special character such as an em dash or the like?
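One way to answer that question is to count the dash-like code points in the debug HTML. The following is a minimal sketch, not part of calibre; the function name `count_dashes` and the set of characters checked are assumptions chosen for illustration.

```python
import unicodedata

# Code points that are easily mistaken for a plain hyphen (illustrative set).
DASH_LIKE = {
    "\u002d",  # hyphen-minus
    "\u00ad",  # soft hyphen
    "\u2010",  # hyphen
    "\u2011",  # non-breaking hyphen
    "\u2012",  # figure dash
    "\u2013",  # en dash
    "\u2014",  # em dash
    "\u2212",  # minus sign
}


def count_dashes(text):
    """Return a dict mapping 'U+XXXX NAME' to the number of occurrences in text."""
    counts = {}
    for ch in text:
        if ch in DASH_LIKE:
            key = f"U+{ord(ch):04X} {unicodedata.name(ch)}"
            counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    sample = "a well\u00adknown e-book"
    for name, n in sorted(count_dashes(sample).items()):
        print(name, n)
```

Running this over the text of the intermediate HTML would show whether the "dashes" are U+002D or something else.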
|
#3 |
Guru
|
It looks like pdftohtml converted them to 0x00AD (soft hyphen) rather than 0x002D (hyphen-minus). They're then stripped out somewhere between the final debug output and the actual epub creation.
The paragraph breaks at some of these hyphens appear to be caused by bad line unwrapping during PDF conversion. I could play with the line unwrapping setting to get a better PDF conversion and then manually convert the soft hyphens to regular hyphens.
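The manual conversion of soft hyphens to regular hyphens could be scripted along these lines. This is a rough sketch under assumptions: the function name `soft_to_hard_hyphens` is hypothetical, and it treats every soft hyphen (literal or written as an HTML entity) as a real hyphen, which only makes sense when, as here, the converter emitted U+00AD in place of U+002D.

```python
import re

# Literal soft hyphen plus the entity spellings it can take in HTML.
SOFT_HYPHEN = re.compile(r"\u00ad|&shy;|&#0*173;|&#x0*ad;", re.IGNORECASE)


def soft_to_hard_hyphens(html):
    """Replace every soft hyphen in an HTML string with a hyphen-minus."""
    return SOFT_HYPHEN.sub("-", html)


if __name__ == "__main__":
    page = "<p>a well\u00adknown, long&shy;standing problem</p>"
    print(soft_to_hard_hyphens(page))
```

If the source also contains genuine break-hint soft hyphens inside ordinary words, this would turn them into visible hyphens, so the result is worth a quick proofread.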
#4 |
creator of calibre
|
Ah, well, that explains it. IIRC, calibre strips soft hyphens because various readers render them incorrectly as normal hyphens, making the text unreadable.
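To illustrate the effect of that stripping (this is not calibre's actual code, just a sketch with a hypothetical `strip_soft_hyphens` helper): removing U+00AD is harmless when it is only a line-break hint, but it deletes the dash entirely when the converter used it in place of a real hyphen.

```python
SOFT_HYPHEN = "\u00ad"


def strip_soft_hyphens(text):
    """Remove every soft hyphen (U+00AD) from text."""
    return text.replace(SOFT_HYPHEN, "")


# Soft hyphen used as a break hint: the word is unchanged for the reader.
print(strip_soft_hyphens("hyphen\u00adation"))  # hyphenation
# Soft hyphen standing in for a real hyphen: the dash is lost.
print(strip_soft_hyphens("well\u00adknown"))    # wellknown
```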
|
#5 |
Guru
|
I'm not sure why pdftohtml converted those to soft hyphens. I guess that's how the PDF was made?
Manually fixing them worked, but of course it was a pain (turn on debugging, convert the PDF to epub, grab the HTML output, modify it, import it, convert the HTML to epub, then clean up the poor PDF line unwrapping in the epub with Sigil). I know it's my fault for wanting to convert PDF, but I hate hate hate PDF as a format.
#6 |
creator of calibre
|
With PDF, everything is black magic. I've never seen such a messy format. My attitude is that if pdftohtml can't handle it, then I don't care about it.