09-08-2010, 09:28 AM | #1 |
Enthusiast
Posts: 36
Karma: 480532
Join Date: Mar 2010
Location: Chapel Hill, North Carolina, USA
Device: Nexus 7 (2012), Samsung Galaxy Pro 8.4
|
PDF to EPUB - spurious paragraph breaks
When converting from PDF to EPUB, I have noticed that calibre will insert a paragraph break whenever a line in the PDF document ends with a dash or apostrophe/single-quote that is flush with the right margin of the document. This behavior is repeatable.
I cannot think of a case where a dash would end a paragraph. The cases that I see are where dashes are used to indicate parenthetical remarks and appear in the middle of a sentence. If, by chance, a dash winds up flush against the right margin, calibre inserts a paragraph break. The apostrophe/single-quote is more difficult because a single-quote can legitimately end a paragraph. However, I have seen calibre insert a paragraph break where it is not appropriate. A paragraph break should not be generated if the single-quote/apostrophe is preceded by a comma or lower-case character, or if the first character on the following line is a lower-case character. Again, let me stress that this only occurs if the dash/single-quote/apostrophe is flush with the right margin of the PDF document. I don't know if the PDF structure detection can be fine-tuned to detect these cases, but if someone is willing to try, I have a single-page PDF document, extracted from a larger book, that shows both cases. |
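The rule proposed above can be sketched in a few lines of Python. This is only an illustration of the heuristic as described in the post, not calibre's actual code; the function name `is_spurious_break` and the two character sets are made up for the example.

```python
# Hypothetical sketch of the heuristic described in the post above.
# None of these names come from calibre.

DASHES = ("-", "\u2013", "\u2014")   # hyphen-minus, en dash, em dash
QUOTES = ("'", "\u2019")             # ASCII apostrophe, right single quote


def is_spurious_break(line: str, next_line: str) -> bool:
    """Return True if the break after `line` should NOT be a paragraph break."""
    line = line.rstrip()
    next_line = next_line.lstrip()
    if not line:
        return False

    last = line[-1]
    if last in DASHES:
        # A dash is assumed never to end a paragraph.
        return True

    if last in QUOTES:
        before = line[-2] if len(line) > 1 else ""
        # Quote preceded by a comma or a lower-case character.
        if before == "," or before.islower():
            return True
        # Following line starts with a lower-case character.
        if next_line[:1].islower():
            return True

    return False
```

For instance, `is_spurious_break("the boys'", "toys were scattered")` is true because the quote follows a lower-case letter, while a line such as `'I am done.'` followed by `She left.` is left alone as a possible paragraph end.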
09-08-2010, 09:59 AM | #2 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Calibre isn't inserting paragraph breaks; it's removing them. PDF is all hard line breaks. There is a function which removes them, but to avoid false positives it doesn't remove every type of break. False negatives are slightly annoying, while false positives can confuse the meaning of the text.
As far as I recall from the code, lines ending in a dash are already unwrapped, so I'm guessing the dash you're seeing is a different Unicode character from the standard hyphen/dash. Open a bug with the file attached and it can get added. As you noted, the single-quote case can't be reliably unwrapped, so there's not much to be done there. |
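To make the idea concrete, here is a minimal sketch of unwrapping lines that end in a dash. It is an assumption about how such a rule might look, not the function calibre actually uses; `DASH_CLASS` and `unwrap_dash_breaks` are hypothetical names. A dash character missing from the set would leave its hard line break in place, which matches the symptom reported above.

```python
import re

# Hypothetical set of dash characters to unwrap after:
# hyphen-minus, en dash (U+2013), em dash (U+2014).
DASH_CLASS = "-\u2013\u2014"

# A dash, optional trailing spaces, a newline, optional leading spaces.
_DASH_BREAK = re.compile(rf"([{DASH_CLASS}])[ \t]*\n[ \t]*")


def unwrap_dash_breaks(text: str) -> str:
    """Join a line that ends in a dash with the line that follows it.

    The dash itself is kept; spacing around it is not normalized.
    """
    return _DASH_BREAK.sub(r"\1", text)
```

Lines ending in anything else, such as a full stop, are left untouched, so a real paragraph end is not merged by this rule.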
09-08-2010, 11:27 AM | #3 |
Enthusiast
Posts: 36
Karma: 480532
Join Date: Mar 2010
Location: Chapel Hill, North Carolina, USA
Device: Nexus 7 (2012), Samsung Galaxy Pro 8.4
|
Thank you for the reply. It looks like the dash is the Unicode em-dash (U+2014, UTF-8 bytes E2 80 94). That's what is in the converted HTML.
I'll file a bug report. |
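For anyone wanting to identify an unknown dash the same way, a small Python sketch using only the standard library is below. The helper name `describe_char` is made up for this example.

```python
import unicodedata


def describe_char(ch: str) -> tuple[str, str, str]:
    """Return the code point, Unicode name and UTF-8 bytes (hex) of one character."""
    code_point = f"U+{ord(ch):04X}"
    name = unicodedata.name(ch)
    utf8_hex = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
    return code_point, name, utf8_hex


# The em dash, written as an escape so the source stays ASCII.
print(describe_char("\u2014"))
```

Pasting the character copied from the converted HTML into `describe_char` shows whether it is a plain hyphen or one of the longer dashes.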