PDF->EPUB conversion splits paragraph

marekgregor · 01-12-2011, 05:30 AM

PDF->EPUB conversion incorrectly splits a paragraph where a line ends with a character containing diacritics. A look into the debugging folder shows that input/index.html contains the paragraph:

„Já jsem ráda, žes to neudělal, ale jsem ti vděčná, žes mě<br>
před ním chránil.“<br>

which is processed in parsed/index.html as:

<p>„Já jsem ráda, žes to neudělal, ale jsem ti vděčná, žes mě</p>
<p>před ním chránil.“</p>

which is wrong, because it creates two paragraphs from one on account of the character ě.

Do you know how I can fix the paragraph splitting so that it also handles diacritics?

thanks

ldolse · 01-12-2011, 10:04 AM

PDF conversion currently relies on detecting lower-case characters on the initial line. Unfortunately there isn't any library which defines what those lower-case characters are across all languages. ě wasn't in the list of characters, but it will be added for one of the upcoming releases.

At some point in the future a new PDF engine will come out which uses other types of tests to decide when to unwrap a line, but for now the code is sticking with lower-case characters without punctuation.
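
For anyone who wants to see the effect outside of calibre, here is a minimal Python sketch of the heuristic described above. It is not calibre's actual code: the character lists, function names and the `<br>` handling are all assumptions made for illustration. The first function unwraps a line only when it ends in a character from an explicit lower-case list, so a list without ě leaves the sample split. The second swaps in Python's built-in `str.islower()`, which consults Unicode data, instead of a hand-maintained list.

```python
import re

# Hypothetical character lists, for illustration only (not calibre's real list).
LOWER_WITHOUT_E_CARON = "a-zàáâäçèéêëìíîïñòóôöùúûüýž"
LOWER_WITH_E_CARON = LOWER_WITHOUT_E_CARON + "ě"

# The line from input/index.html quoted in the first post.
SAMPLE = ("„Já jsem ráda, žes to neudělal, ale jsem ti vděčná, žes mě<br> "
          "před ním chránil.“<br>")

# A <br> together with any whitespace around it.
BR = re.compile(r"\s*<br>\s*")


def unwrap_with_list(html, lowercase_chars):
    """Replace a <br> with a space when the text before it ends in a listed character."""
    pattern = re.compile(r"(?<=[" + lowercase_chars + r"])\s*<br>\s*")
    return pattern.sub(" ", html)


def unwrap_unicode(html):
    """Same idea, but ask str.islower() instead of consulting a fixed list."""
    def repl(match):
        # Character immediately before the break (and before any whitespace).
        prev = html[match.start() - 1] if match.start() > 0 else ""
        return " " if prev.islower() else match.group(0)
    return BR.sub(repl, html)


if __name__ == "__main__":
    print(unwrap_with_list(SAMPLE, LOWER_WITHOUT_E_CARON))  # break after "mě" kept
    print(unwrap_with_list(SAMPLE, LOWER_WITH_E_CARON))     # break after "mě" removed
    print(unwrap_unicode(SAMPLE))                           # break after "mě" removed
```

In all three cases the final `<br>` after the closing quotation mark is left alone, since the line does not end in a lower-case letter there, which matches the "lower-case characters without punctuation" rule.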



