Old 09-08-2010, 10:28 AM   #1
RichieTheK
Enthusiast
Posts: 33
Karma: 480532
Join Date: Mar 2010
Location: Chapel Hill, North Carolina, USA
Device: Nexus 7 (2012), Samsung Galaxy Pro 8.4
PDF to EPUB - spurious paragraph breaks

When converting from PDF to EPUB, I have noticed that calibre will insert a paragraph break whenever a line in the PDF document ends with a dash or apostrophe/single-quote that is flush with the right margin of the document. This behavior is repeatable.

In the case of the dash, I cannot think of a case where one would end a paragraph. The cases that I see are where dashes are used to indicate parenthetical remarks, and appear in the middle of a sentence. If, by chance, a dash should wind up flush against the right margin, calibre inserts a paragraph break.

The case of the apostrophe/single-quote is more difficult because there are cases where a single-quote can end a paragraph. However, I have seen calibre insert a paragraph break where it is not appropriate. A paragraph break should not be generated if the single-quote/apostrophe is preceded by a comma or lower-case character, or if the first character on the following line is a lower-case character.
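To make the proposed rule concrete, here is a minimal Python sketch of it. This is not calibre's code; the function name `is_spurious_break` and the choice of which quote characters to check are assumptions made purely for illustration.

```python
# Hypothetical helper, not part of calibre: encodes the rule proposed above.
QUOTES = ("'", "\u2019")  # ASCII apostrophe and right single quotation mark


def is_spurious_break(line, next_line):
    """Return True if the break after `line` should NOT start a new paragraph."""
    line = line.rstrip()
    next_line = next_line.lstrip()
    # The rule only applies to lines that end in a single quote/apostrophe.
    if len(line) < 2 or not line.endswith(QUOTES):
        return False
    before_quote = line[-2]
    # Quote preceded by a comma or a lower-case character: keep the paragraph going.
    if before_quote == "," or before_quote.islower():
        return True
    # Otherwise, fall back to checking how the following line starts.
    return bool(next_line) and next_line[0].islower()
```

With this sketch, a line ending in `boys'` is joined to the next line, while a line ending in `'Stop.'` followed by a capitalised line is left as a paragraph end.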

Again, let me stress that this only occurs if the dash/single-quote/apostrophe is flush with the right margin of the PDF document.

I don't know if the PDF structure-detection can be fine-tuned to detect these cases, but if someone is willing to try, I have a single-page PDF document, extracted from a larger book, that shows both cases.
Old 09-08-2010, 10:59 AM   #2
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Calibre isn't inserting paragraph breaks; it's removing them. A PDF is all hard line breaks. There is a function that removes them, but to avoid false positives it doesn't remove every type of break. False negatives are slightly annoying, while false positives can confuse the meaning of the text.

As I recall the code, lines ending in a dash are already unwrapped, so I'm guessing the dash you're seeing is a different Unicode character from the standard hyphen/dash. Open a bug with the file attached and it can be added. As you noted, the single-quote case can't be reliably unwrapped, so there's not much to be done there.
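For anyone curious what conservative unwrapping looks like, here is a rough Python sketch. It is not calibre's actual implementation; the `unwrap` function, the `DASHES` set, and the "ends in lower-case, comma, or dash" test are all simplifying assumptions. It shows how a dash that is missing from the recognised set leaves a hard break in place.

```python
# Illustrative only: a simplified unwrapper, NOT calibre's real code.
DASHES = "-\u2013\u2014"  # hyphen-minus, en dash, em dash (assumed set)


def unwrap(lines, dashes=DASHES):
    """Join hard-wrapped lines into paragraphs, erring on the side of not joining."""
    paragraphs, current = [], ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if not current:
            current = line
        elif current[-1] in dashes:
            current += line          # no space after a trailing dash
        else:
            current += " " + line
        last = current[-1]
        # Only keep the paragraph open when the line clearly continues.
        if not (last.islower() or last == "," or last in dashes):
            paragraphs.append(current)
            current = ""
    if current:
        paragraphs.append(current)
    return paragraphs


page = ["The remark\u2014", "made in passing\u2014was ignored.", "A new paragraph."]
joined = unwrap(page)               # em dash recognised: 2 paragraphs
split = unwrap(page, dashes="-")    # em dash not recognised: 3 paragraphs
```

The second call leaves the break after the em dash untouched, which is a false negative of the kind described above. The sketch does not attempt de-hyphenation of words split across lines.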
Old 09-08-2010, 12:27 PM   #3
RichieTheK
Enthusiast
Thank you for the reply. It looks like the dash is the Unicode em dash (U+2014, UTF-8 bytes E2 80 94). That's what is in the converted HTML.
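A quick way to confirm which character those bytes represent is to decode them with Python's standard library. This is only a verification sketch; the variable names are arbitrary.

```python
import unicodedata

# Bytes as seen in the converted HTML.
raw = bytes.fromhex("E2 80 94")
char = raw.decode("utf-8")

code_point = f"U+{ord(char):04X}"
name = unicodedata.name(char)
print(code_point, name)
```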

I'll file a bug report.
All times are GMT -4.