Old 04-04-2014, 06:11 AM   #1
Julien Pham
Connoisseur
Julien Pham began at the beginning.
 
Posts: 99
Karma: 10
Join Date: Nov 2011
Device: Kobo Touch
Chapter detection from a pdf book

Hi,

I have downloaded a free book in PDF format from the internet, and I wonder whether there is a way for Calibre to detect chapters in a PDF book during conversion.

I doubt the PDF itself has chapters defined, but each new chapter starts on a new page, and the chapter title is right-aligned with a horizontal line below it.

What I would like to achieve is roughly the same presentation in the ePub file as in the PDF file.

Here is the book. It is not a copyrighted book; it is fan fiction about a game and is free to download.

Thanks
Attached Files
File Type: pdf The Retaliator - Vol 1.pdf (368.2 KB, 404 views)
Old 04-04-2014, 09:18 AM   #2
Divingduck
Wizard
Divingduck ought to be getting tired of karma fortunes by now.
 
Posts: 1,166
Karma: 1410083
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
Yes, there is.

You can use Search&Replace and its wizard to define your chapters. In addition, you need to make some minor adjustments in the conversion dialog. Please follow the attached pictures.

Picture Aufzeichnen.jpg:
This looks a bit complicated, but it really is not (follow the blue path). I start with the Search&Replace tab. From there I open the wizard, select the PDF file (if there is more than one format), and look at how Calibre interprets the PDF file. You can see that each chapter starts on a new page and that the markup is
...
<hr/>
<a name=2></a>Jetsam <br>
...

<hr/> is the page break, <a name=2></a> is the page number (name=2), followed by the chapter name, and the line ends with a line break <br>.
For each new chapter, the only things that change are the page number and the chapter name. I build the regex in three parts:
<a name=2></a> --> page number will change, therefore the first part (\1) is
(<a name=\d+></a>)

Jetsam --> the chapter is Jetsam with some spaces. This is the second part (\2):
([A-Za-z\s]*)
(\s and *, because chapter titles in your book can have more than one word)

<br> --> this is the line break. For this, I use the third part (\3):
(<br>)

All together, this will end in a complete regex:
(<a name=\d+></a>)([A-Za-z\s]*)(<br>)
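
The same check the test button performs can be reproduced outside Calibre. The following is a minimal sketch using Python's `re` module; the sample text is a hypothetical excerpt modelled on the markup above, and the second chapter's page number (3) is made up for illustration.

```python
import re

# Hypothetical excerpt of the HTML the wizard shows; only the first
# chapter line is taken from the post, the rest is illustrative.
sample = (
    "<hr/>\n"
    "<a name=2></a>Jetsam <br>\n"
    "It was raining.<br>\n"
    "<hr/>\n"
    "<a name=3></a>Found <br>\n"
)

# The complete regex from the post: page anchor, chapter title, line break.
chapter_re = re.compile(r"(<a name=\d+></a>)([A-Za-z\s]*)(<br>)")

matches = chapter_re.findall(sample)
for anchor, title, line_break in matches:
    print(repr(anchor), repr(title), repr(line_break))
```

Only the two chapter lines match; the body line has no page anchor in front of it, so it is left alone.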

Use the test button; you will then see the highlighted matches. If everything looks good, save with OK (follow the red path). You will be back in the main Search&Replace window, and the regex is carried over into the search expression.
The next step is to define a placeholder for chapter detection; I will call it <chapter-new>. We need it later for the TOC. We defined the regex in three parts, and I place <chapter-new> between parts one and two. The complete replacement text is:
\1<chapter-new>\2\3
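
To see what this replacement produces, here is a small sketch with Python's `re.sub`, which uses the same `\1`-style group references. The input line is the one quoted from the wizard; running it in Python rather than in Calibre is only for illustration.

```python
import re

chapter_re = re.compile(r"(<a name=\d+></a>)([A-Za-z\s]*)(<br>)")

line = "<a name=2></a>Jetsam <br>"

# Insert the placeholder tag between group 1 (anchor) and group 2 (title).
marked = chapter_re.sub(r"\1<chapter-new>\2\3", line)
print(marked)  # <a name=2></a><chapter-new>Jetsam <br>
```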

Please save this with Save. This was the first part, which identifies the chapters of the book. The following steps are much quicker, and we start by telling Calibre how to find the chapters for the TOC, using the <chapter-new> label we just added.

Picture Aufzeichnen1.jpg
For this, I select the Table of Contents tab. The <chapter-new> label defined in the first step can now be used to build the TOC with XPath. I will use the Level 1 TOC. It is always a good idea to use the wizard for this (see the blue path). A new window opens; I enter <chapter-new> as the matching HTML tag name and save with OK (follow the red path). The XPath expression for this particular book is:
//h:chapter-new
Please delete all other entries, as shown in the picture. This step is now complete and you have a TOC : ).
The next step is a little bit fine-tuning.
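
As a rough illustration of what //h:chapter-new selects, the sketch below runs an equivalent query with Python's standard `xml.etree.ElementTree`. The document is a hypothetical, simplified XHTML fragment, the `h` prefix is mapped to the XHTML namespace by hand, and ElementTree needs the relative form `.//` instead of `//`; none of this is Calibre's own code.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical, simplified XHTML containing the placeholder tag.
doc = f"""<html xmlns="{XHTML_NS}"><body>
<p><chapter-new>Jetsam </chapter-new></p>
<p>Body text.</p>
<p><chapter-new>Found </chapter-new></p>
</body></html>"""

root = ET.fromstring(doc)

# Every chapter-new element in the XHTML namespace becomes one TOC entry.
titles = [
    el.text.strip()
    for el in root.findall(".//h:chapter-new", {"h": XHTML_NS})
]
print(titles)  # ['Jetsam', 'Found']
```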

Picture Aufzeichnen2.jpg
Now select the Structure Detection tab. I like each chapter to have its own HTML file in the EPUB, so I need to tell Calibre where to split the file. I again use the chapter label and XPath (it works the same way as for the TOC). First follow the blue path, enter <chapter-new> as the matching HTML tag name, and save with OK (now follow the red path). The XPath expression for this particular book is:
//h:chapter-new
Chapter mark is set to pagebreak. Please delete all other entries, as shown in the picture. That is all for this step. The next one is also fine-tuning.

Picture Aufzeichnen3.jpg
I like to see the chapter titles in bold (this is only an example). Once more I can use the <chapter-new> label: during conversion each chapter headline gets it as a tag, so I can give it a style definition in CSS. Select Look&Feel. There you will find the Extra CSS section; enter this for the chapter-new tag:

chapter-new{font-weight: bold;}

This is all you need to do.

Before I do the last step, please select the Heuristic Processing tab and disable heuristic processing by clearing the first checkbox.

Picture Aufzeichnen4.jpg
Now to the last step. I guess this is where most people struggle with PDF conversion, because this setting is not well understood. I am talking about the Line un-wrapping factor. Please select the PDF Input tab. The default value for this factor is about 0.45, which is too high for most PDF books. I have set it to 0.20 in this example. This helps Calibre get better results when determining paragraphs. The right value varies from book to book, so play with this parameter.
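
Why a lower factor joins more lines can be shown with a toy model. This is not Calibre's code: the rule below (treat the factor as a percentile of all line lengths, and end a paragraph only at lines shorter than that length) is an assumption made purely for illustration, and the `unwrap` helper and sample lines are hypothetical.

```python
def unwrap(lines, factor):
    """Toy model: a line shorter than the length found at the `factor`
    percentile of all line lengths is treated as the end of a paragraph;
    every other line is joined with the line that follows it."""
    lengths = sorted(len(line) for line in lines)
    threshold = lengths[int(factor * (len(lengths) - 1))]
    paragraphs, current = [], []
    for line in lines:
        current.append(line)
        if len(line) < threshold:  # short line: close the paragraph here
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


# Synthetic lines; only their lengths (40, 38, 20, 40, 30, 12) matter.
lines = ["a" * 40, "b" * 38, "c" * 20, "d" * 40, "e" * 30, "f" * 12]

print(len(unwrap(lines, 0.45)))  # 2 paragraphs
print(len(unwrap(lines, 0.20)))  # 1 paragraph
```

In this model a lower factor lowers the length threshold, so fewer lines count as paragraph ends. Whether that is the right direction for a given book is something to find out by experiment.
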
Now you can convert your PDF file and check whether it matches your expectations. One additional hint: you can run the conversion again after each step. If the result does not meet your expectations, go back, make an adjustment, and try once more. Calibre stores the conversion settings for this book.
I hope this helps.
Attached Thumbnails: Aufzeichnen.JPG (174.6 KB), Aufzeichnen1.JPG (145.7 KB), Aufzeichnen2.JPG (135.0 KB), Aufzeichnen3.JPG (109.0 KB), Aufzeichnen4.JPG (82.8 KB)

Last edited by Divingduck; 04-04-2014 at 11:07 AM.
Old 04-04-2014, 10:27 AM   #3
Julien Pham
Connoisseur
Julien Pham began at the beginning.
 
Posts: 99
Karma: 10
Join Date: Nov 2011
Device: Kobo Touch
Wow, thanks. Indeed it seems a bit complicated, but I guess I'll manage after a few readings ^^

By the way, I wonder: is Calibre the best tool for converting books? Also, there are no tools for editing mobi files, so I have to create an ePub file first, right?

Oh, and if I follow your tutorial, I end up with chapters, but the chapter titles appear in the HTML code like this:

<p class="calibre1">Jetsam</p>

This is not very good. I mean, I would like chapters to have their own class so I can decide how they are displayed. And if possible I would like to keep the <hr> so that there is a horizontal line after the chapter name. I guess this is all about fine-tuning the conversion...

Last edited by Julien Pham; 04-04-2014 at 10:41 AM.
Old 04-04-2014, 11:17 AM   #4
Divingduck
Wizard
Divingduck ought to be getting tired of karma fortunes by now.
 
Posts: 1,166
Karma: 1410083
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
Then you have forgotten something in the setup. Make sure that heuristic processing is off and that you have cleaned up all entries the same way I did. The converted file is attached.

Edit: If you do it the same way I did, you will see this:
Attached Thumbnails: Aufzeichnen5.JPG (359.1 KB)
Attached Files
File Type: epub Anderanged, Kazee - Vol 1 eng,.epub (247.0 KB, 253 views)

Last edited by Divingduck; 04-04-2014 at 11:27 AM.
Old 04-04-2014, 12:49 PM   #5
Julien Pham
Connoisseur
Julien Pham began at the beginning.
 
Posts: 99
Karma: 10
Join Date: Nov 2011
Device: Kobo Touch
Thanks.
Do you have any tips, though, on how to:
- align the chapter title to the right
- put an <hr/> tag after the chapter title? Instead of your replacement text I tried:
\1<chapter-new>\2\3<hr/>

But it did not work.

Oh, and by the way, with what you told me, if I open my book with Calibre I see the chapter title in bold, but if I open the book in Sigil, it is not bold...

Oh, and... how come I do not see the "chapter-new" tag when I open the book in Sigil?

In Sigil the chapter title looks like this:
<p class="calibre1">Found</p>

The problem is that the calibre1 class is also used by all the other text in the book...

Last edited by Julien Pham; 04-04-2014 at 01:38 PM.
Old 04-04-2014, 03:19 PM   #6
Divingduck
Wizard
Divingduck ought to be getting tired of karma fortunes by now.
 
Posts: 1,166
Karma: 1410083
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
Sorry for my late answer.

This is turning into a nice project.

First, the line issue:
The PDF conversion only extracts the text, so there is no information about the line in the XHTML. However, we can fix that. To do so, I modify the first S&R: I add a second placeholder for the chapter line, called chapter-line, to the replacement text. Because it has to sit on a separate line below the title, I also add another <br> (\3).

Replace with:
\1<chapter-new>\2\3<chapter-line>\3
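
A quick way to preview this extended replacement is the sketch below, again using Python's `re.sub` on the single line quoted earlier in the thread. It only illustrates the string transformation, not the full conversion.

```python
import re

chapter_re = re.compile(r"(<a name=\d+></a>)([A-Za-z\s]*)(<br>)")

line = "<a name=2></a>Jetsam <br>"

# Title placeholder, then the original <br>, then the line placeholder
# followed by a second <br> (group 3 reused).
marked = chapter_re.sub(r"\1<chapter-new>\2\3<chapter-line>\3", line)
print(marked)  # <a name=2></a><chapter-new>Jetsam <br><chapter-line><br>
```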

See picture Aufzeichnen6 S&R line 1.

Now you can run the conversion a first time to look at the results (you need them to define the regex that inserts the line). The result is this:
Line 1: <p class="calibre1">
Line 2: <chapter-new id="calibre_toc_1" class="calibre3">Jetsam </chapter-new></p>
Line 3: <p class="calibre1"><chapter-line/></p>

The last line is the one to look at first. It needs to be replaced by <hr>. So I add a second S&R:
Search for: <p class="calibre1"><chapter-line/></p>
Replace with: <hr>
or, if you would like a bit of styling:
Replace with: <hr noshade size=1 width=70% align=center>
See picture Aufzeichnen6 S&R line 3. This one and the next S&R will be used later with an EPUB to EPUB conversion.


Second, the <chapter-new> formatting issue:
Well, this tag is only a helper construct, and we need to get rid of it in a second conversion. I do not know another way if you want to do it more or less automatically. The alternative is to do it in the Calibre editor, as it is a simple S&R.

For now I will use the conversion. As you have already seen, the chapter titles carry overlapping definitions: calibre1 and calibre3, plus the first placeholder <chapter-new>. Therefore I split the match into three parts (I want to get rid of the first and the last part; the middle part has to stay).
Attention: this is a little tricky, because you need to select across two lines to capture the hidden line break. Copy and paste from the wizard (see picture Aufzeichnen7.jpeg) and select everything from:

class="calibre1">
<chapter-new id="calibre_toc_1" class="calibre3">Jetsam </chapter-new>

(Only the opening <p and the closing </p> are outside the selection, because they are what will stay) and replace the part

id="calibre_toc_1" class="calibre3">Jetsam

with (.*) and put the rest before and after it in parentheses (). Check with the test button. You should get 19 occurrences. If this is OK, accept it and complete the S&R:

Search for:
(class="calibre1">
<chapter-new)(.*)(</chapter-new>)
Replace with:
\2

Please move this S&R to position 2 (see picture Aufzeichnen6.jpg, S&R line 2).
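
The effect of the two clean-up rules on the markup reported above can be sketched in Python as follows. The hidden line break is assumed to be a plain `\n`, and the literal `<hr>` rule is applied with `str.replace` because its search text contains no regex metacharacters; both are assumptions made for this illustration.

```python
import re

# Markup as reported after the first (PDF to EPUB) conversion.
epub_html = (
    '<p class="calibre1">\n'
    '<chapter-new id="calibre_toc_1" class="calibre3">Jetsam </chapter-new></p>\n'
    '<p class="calibre1"><chapter-line/></p>\n'
)

# S&R at position 2: keep only the middle group, dropping the calibre1
# class, the line break, and the chapter-new tag around it.
step2 = re.sub(
    r'(class="calibre1">\n<chapter-new)(.*)(</chapter-new>)',
    r"\2",
    epub_html,
)

# S&R at position 3: turn the line placeholder paragraph into a rule.
step3 = step2.replace('<p class="calibre1"><chapter-line/></p>', "<hr>")
print(step3)
```

The chapter title ends up in a paragraph that keeps the TOC id and the calibre3 class, followed by the horizontal rule.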

Finally, I adjusted the CSS in Look&Feel:
chapter-new{text-align:center;font-weight:bold;}
and in Structure Detection I enabled Remove first image.

Here we are; everything is prepared. Delete every format except the PDF. Then first run the PDF to EPUB conversion (mind the Line un-wrapping factor), and afterwards run an EPUB to EPUB conversion.
If it looks like this, then everything went fine and you can do your personal fine-tuning:
Attached Thumbnails: Aufzeichnen6.JPG (115.4 KB), Aufzeichnen7.JPG (182.4 KB), Aufzeichnen8.JPG (307.1 KB)
Attached Files
File Type: epub Anderanged, Kazee - Vol 1 eng,.epub (247.5 KB, 252 views)

Last edited by Divingduck; 04-04-2014 at 04:44 PM.
Old 04-04-2014, 03:53 PM   #7
Julien Pham
Connoisseur
Julien Pham began at the beginning.
 
Posts: 99
Karma: 10
Join Date: Nov 2011
Device: Kobo Touch
Is the Calibre editor as good as Sigil when it comes to editing ePub files?
Old 04-04-2014, 04:02 PM   #8
theducks
Well trained by Cats
theducks ought to be getting tired of karma fortunes by now.
 
Posts: 30,939
Karma: 60358908
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Quote:
Originally Posted by Julien Pham View Post
Is the Calibre editor as good as Sigil when it comes to editing ePub files?
Depends
IMHO the check is better

some things are not done yet
some things work differently

I use both
Old 04-04-2014, 06:49 PM   #9
Julien Pham
Connoisseur
Julien Pham began at the beginning.
 
Posts: 99
Karma: 10
Join Date: Nov 2011
Device: Kobo Touch
What I actually did with this eBook is that, instead of inserting chapter-new for new chapters, I used <h1>\2</h1> as the replacement string, and then let Calibre use h1 to find new chapters.
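
For comparison with the placeholder approach, this sketch applies that replacement string to the chapter line quoted earlier, using the regex from post #2. It assumes the search expression was left unchanged, which the post does not state explicitly.

```python
import re

chapter_re = re.compile(r"(<a name=\d+></a>)([A-Za-z\s]*)(<br>)")

line = "<a name=2></a>Jetsam <br>"

# Only group 2 (the title) is kept; the page anchor and <br> are dropped.
heading = chapter_re.sub(r"<h1>\2</h1>", line)
print(heading)  # <h1>Jetsam </h1>
```

Because h1 is a real HTML element, it survives the conversion and can be styled directly in the stylesheet.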

Then I edited the ePub with Calibre and added this to the stylesheet:
h1 {
text-align: right;
border-bottom: medium groove;
}

This way the h1 is right-aligned and has a horizontal line below it.

I have to check a few things again to fine-tune this, but it looks good so far.

Oh, by the way, I noticed that we can convert to AZW3 (as I have a Kindle, this format is good) and that Calibre can edit AZW3 files as well (Sigil cannot).

The funny thing, though, is that Amazon's personal documents service does not accept AZW3 files, even though this is the Kindle format now. I can only send mobi files to my personal documents ^^

Last edited by Julien Pham; 04-04-2014 at 07:14 PM.
Old 04-05-2014, 07:13 AM   #10
Divingduck
Wizard
Divingduck ought to be getting tired of karma fortunes by now.
 
Posts: 1,166
Karma: 1410083
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
I like this variation.

I don't understand why Amazon made that decision. You can convert to mobi: take a look at the MOBI output tab and choose "both" instead of the "old" MOBI format. Maybe this will work for you.