MobileRead Forums - View Single Post

Divingduck · 04-04-2014, 09:18 AM

Yes it is.

You can use Search&Replace and its wizard to define your chapters. In addition, you need to make some minor adjustments in the conversion dialog. Please follow the attached pictures.

Picture Aufzeichnen.jpg:
This looks a bit complicated, but it really is not (follow the blue path). I start with the tab Search&Replace. From there I go to the wizard, select the PDF file (if there is more than one format), and look at how calibre interprets the PDF file. You can see that each chapter starts on a new page and the coding is
...
<hr/>
<a name=2></a>Jetsam <br>
...

<hr/> is the page break, <a name=2></a> is the page count (name=2), followed by the chapter name, and the line ends with a line break <br>.
For each new chapter, the only things that change are the page number and the chapter name. I build the regex in three parts:
<a name=2></a> --> page number will change, therefore the first part (\1) is
(<a name=\d+></a>)

Jetsam --> the chapter name is Jetsam, followed by some spaces. This is the second part (\2):
([A-Za-z\s]*)
(\s allows spaces and * repeats the character class, because the chapter names in your book have more than one word)

<br> --> this is the line break. For this, I use the third part (\3):
(<br>)

All together, this gives the complete regex:
(<a name=\d+></a>)([A-Za-z\s]*)(<br>)

Use the test button. You will then see the matches highlighted. If everything looks good, save this with OK (follow the red path). You will be back in the main window of Search&Replace, and the regex is carried over into the search expression.
The next step is to define a placeholder for chapter detection. I will call it <chapter-new>. We need this later for the TOC. We defined the regex in three parts, and I like to place <chapter-new> between parts one and two. The complete Replacement Text is:
\1<chapter-new>\2\3
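
If you want to check the search expression and the replacement text outside the wizard, here is a minimal sketch in Python (calibre uses Python regular expressions). The sample HTML and the function name mark_chapters are made up for illustration; only the pattern and the replacement text come from the steps above.

```python
import re

# Pattern and replacement text exactly as entered in Search&Replace.
CHAPTER_RE = re.compile(r'(<a name=\d+></a>)([A-Za-z\s]*)(<br>)')
REPLACEMENT = r'\1<chapter-new>\2\3'


def mark_chapters(html):
    """Insert the <chapter-new> marker in front of every chapter name."""
    return CHAPTER_RE.sub(REPLACEMENT, html)


# Hypothetical snippet, modelled on what the wizard shows for this PDF.
sample = (
    '<hr/>\n'
    '<a name=2></a>Jetsam <br>\n'
    'Some body text.<br>\n'
    '<hr/>\n'
    '<a name=9></a>The Long Way Home<br>\n'
)

marked = mark_chapters(sample)
print(marked)
```

Note that ([A-Za-z\s]*) only matches letters and whitespace. A chapter name containing digits or punctuation would not be found with this pattern and would need a wider character class.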

Please save this with Save. This was the first part, which identifies the chapters of the book. The following steps are much faster, and we start by telling calibre how to find the chapters for the TOC. You already know the added label <chapter-new>.

Picture Aufzeichnen1.jpg
For this, I select the tab Table of Contents. The tag <chapter-new> defined in the first step can now be used to build the TOC with XPath. I will use the Level 1 TOC. It is always a good idea to use the wizard for this (see the blue path). A new window opens; I enter chapter-new as the matching HTML tag name and save it with OK (follow the red path). The XPath expression for this particular book is:
//h:chapter-new
Please delete all other entries, as shown in the picture. Now this step is complete and you have a TOC : ).
The next step is a little bit of fine-tuning.
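
In case the h: prefix in //h:chapter-new looks strange: in calibre's XPath expressions it stands for the XHTML namespace. The sketch below shows what such an expression selects, using Python's standard xml.etree module. The sample document is hypothetical; in particular, how calibre closes the <chapter-new> tag during conversion is my assumption.

```python
import xml.etree.ElementTree as ET

# The h: prefix maps to the XHTML namespace.
XHTML_NS = {'h': 'http://www.w3.org/1999/xhtml'}

# Hypothetical converted markup; the real structure may differ.
doc = '''<html xmlns="http://www.w3.org/1999/xhtml"><body>
<p><a name="2"/><chapter-new>Jetsam </chapter-new></p>
<p>Some body text.</p>
<p><a name="9"/><chapter-new>Flotsam</chapter-new></p>
</body></html>'''

root = ET.fromstring(doc)

# ".//h:chapter-new" is the relative form of "//h:chapter-new":
# every chapter-new element, at any depth.
titles = [
    ''.join(el.itertext()).strip()
    for el in root.findall('.//h:chapter-new', XHTML_NS)
]
print(titles)
```

Each selected element becomes one level 1 TOC entry, with its text as the entry title.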

Picture Aufzeichnen2.jpg
Now please select the tab Structure Detection. I like to have a separate HTML file for each chapter in an EPUB. For this, I need to tell calibre how to split the file. Once more I use the chapter label and XPath (it works the same way as with the TOC). First follow the blue path, define chapter-new as the matching HTML tag name, and save it with OK (now follow the red path). The XPath expression for this particular book is:
//h:chapter-new
Chapter mark is set to pagebreak. Please delete all other entries, as shown in the picture. That is all for this step. The next one is fine-tuning as well.
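
To picture what splitting at the chapter label means, here is a purely conceptual sketch in Python. It is not how calibre does it internally (calibre inserts page breaks and splits the EPUB files there); the function name split_at_chapters and the sample text are made up.

```python
MARKER = '<chapter-new>'


def split_at_chapters(html, marker=MARKER):
    """Cut the text in front of every marker; each piece keeps its marker."""
    head, *rest = html.split(marker)
    # Keep whatever precedes the first chapter only if it is not empty.
    parts = [head] if head.strip() else []
    parts.extend(marker + chunk for chunk in rest)
    return parts


# Hypothetical text with a title page and two marked chapters.
sample = (
    'Title page<br>'
    + MARKER + 'Jetsam <br>text one<br>'
    + MARKER + 'Flotsam<br>text two<br>'
)

parts = split_at_chapters(sample)
for number, part in enumerate(parts, start=1):
    print(number, part)
```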

Picture Aufzeichnen3.jpg
I like to see the chapter titles of the book in bold (this is only an example). Once more I can use the label <chapter-new> for this: during conversion each chapter headline gets this tag, so I can use it in CSS to give it a style definition. Select Look&Feel. There you will find the section Extra CSS; enter the following for the tag chapter-new:

chapter-new{font-weight: bold;}

This is all you need to do.

Before I do the last step, please select the tab Heuristic Processing and disable it by clearing the first checkbox.

Picture Aufzeichnen4.jpg
Now to the last step. I guess this is the point where most people struggle with PDF conversion, because it is not well understood. I am talking about the Line un-wrapping factor. Please select the tab PDF Input. The default value for this factor is 0.45. For most PDF books this is too high. In this example I have set it to 0.20. This helps calibre get better results when deciding where a paragraph ends. The right value varies from book to book, so play with this parameter.
Now you can convert your PDF file and check whether the result matches your expectations. One additional hint: you can run the conversion again after each step. If the result does not meet your expectations, go back, make an adjustment, and try once more. Calibre stores the conversion settings for this book.
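
If you prefer the command line, the same settings can be passed to calibre's ebook-convert tool. The sketch below only builds the argument list in Python. The option names are the ones I know from ebook-convert; that they correspond one to one to the dialog settings above is my assumption, so please compare with ebook-convert --help on your installation. The function name and file names are placeholders.

```python
def build_convert_command(pdf_path, epub_path, unwrap_factor=0.20):
    """Return the ebook-convert argument list for the settings in this post."""
    if not 0 <= unwrap_factor <= 1:
        raise ValueError('unwrap_factor must be between 0 and 1')
    marker_xpath = '//h:chapter-new'
    return [
        'ebook-convert', pdf_path, epub_path,
        # Search&Replace: mark each chapter name with <chapter-new>.
        '--sr1-search', r'(<a name=\d+></a>)([A-Za-z\s]*)(<br>)',
        '--sr1-replace', r'\1<chapter-new>\2\3',
        # Table of Contents: level 1 entries from the marker.
        '--level1-toc', marker_xpath,
        # Structure Detection: chapters start at the marker.
        '--chapter', marker_xpath,
        '--chapter-mark', 'pagebreak',
        # Look&Feel: extra CSS for the marker tag.
        '--extra-css', 'chapter-new{font-weight: bold;}',
        # PDF Input: line un-wrapping factor.
        '--unwrap-factor', str(unwrap_factor),
    ]


cmd = build_convert_command('book.pdf', 'book.epub')
print(' '.join(cmd))
```

To actually run it, pass the list to subprocess.run(cmd, check=True). Heuristic processing is not mentioned in the list because, as far as I know, it is off on the command line unless you enable it.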
Hope this will help you.

Last edited by Divingduck; 04-04-2014 at 11:07 AM.