#1 |
Junior Member
Posts: 1
Karma: 10
Join Date: Jul 2010
Device: Aldiko App - Android 2.1 HTC Hero
|
Detecting Chapters in PDF -> ePub conversion
OK, so I've searched the forums for the past couple of hours, and I can't seem to get a good answer or find one that solves my particular problem.
The PDF file I have doesn't have the chapters tagged as h1 or h2, or at least Calibre isn't detecting them as such when I convert it to ePub. So when I just throw the ePub onto my phone, the loading is unbearable, as the ePub ends up split into only 6 XHTML files (and not at any logical points). Is there an XPath expression that somebody could cook up that would detect the chapters and split them properly? I know there are more tedious ways of doing it myself (opening the ePub in Sigil and inserting chapter breaks), but for 100+ chapters that would take a while. From what I can tell, the chapter headings are just bold-faced text, e.g. "Chapter 12". If you need any more information, I'll try to give it.
#2 |
Guru
Posts: 695
Karma: 822675
Join Date: May 2010
Device: Kobo Aura, Nokia Lumia 920 (Freda)
|
Run a conversion from PDF to ePub with debugging turned on and then look at the intermediate HTML output (the conversion goes something like PDF -> very rough HTML -> cleaned-up HTML -> ePub). Looking at the cleaned-up HTML step should help you figure out how to find the chapter breaks.
Or you could take the intermediate HTML, modify it yourself so that the chapter headings are <h1> or <h2> elements, load the HTML into calibre, and convert to ePub from the HTML rather than from the PDF.
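As a rough aid for that inspection step, here is a minimal Python sketch that scans the cleaned-up HTML and counts which tags directly wrap text like "Chapter 12". The "Chapter <number>" wording comes from the first post; the function name `heading_tags` and the sample string are made up for illustration, and real calibre output may nest the headings differently.

```python
import re
from collections import Counter

# Matches an element whose whole content is a heading such as "Chapter 12".
# Assumption: headings follow the "Chapter <number>" pattern from the first post.
HEADING = re.compile(r"<(\w+)[^>]*>\s*(Chapter\s+\d+)\s*</\1>", re.IGNORECASE)

def heading_tags(html: str) -> Counter:
    """Count the tags that directly wrap chapter headings in the HTML."""
    return Counter(m.group(1).lower() for m in HEADING.finditer(html))

# Hypothetical sample standing in for the intermediate HTML file.
sample = (
    '<p><b>Chapter 1</b></p><p>Text</p>'
    '<p><b>Chapter 2</b></p>'
    '<p><span class="x">chapter 3</span></p>'
)
print(heading_tags(sample))
```

If one tag dominates the count, that tag is the likely target for a chapter-detection expression or a search-and-replace.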
#3 |
ePub Maker
Posts: 120
Karma: 16
Join Date: Dec 2009
Location: Mordor
Device: iPad,Kindle 3, Nook 2
|
Maybe you can use a regular expression.
If you have the HTML files converted from the PDF (every PDF has to be converted to HTML first, then to ePub), you can use a regular-expression tool to replace all the chapter names with proper HTML markup.
For example, if your chapter headings have the structure <b>Chapter 12</b>, you can use "<b>(Chapter \d+)</b>" to match them all and replace them with "<h2>$1</h2>". Either way, this takes some working knowledge of regular expressions. And if your chapter names are marked up in several different ways, they can only be caught manually.
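A minimal Python sketch of that replacement, assuming the headings really are plain <b>Chapter N</b> elements. The function name `promote_chapters` is made up for illustration, and Python's `re` module writes the back-reference as `\1` where many editors use `$1`.

```python
import re

# Assumption: every chapter heading looks like <b>Chapter 12</b>,
# possibly with stray whitespace inside the tags.
BOLD_CHAPTER = re.compile(r"<b>\s*(Chapter\s+\d+)\s*</b>", re.IGNORECASE)

def promote_chapters(html: str) -> str:
    """Rewrite bold chapter headings as <h2> elements."""
    return BOLD_CHAPTER.sub(r"<h2>\1</h2>", html)

# Hypothetical snippet standing in for the converted HTML.
before = "<p><b>Chapter 12</b></p><p>Some <b>bold</b> text.</p>"
print(promote_chapters(before))
```

Other bold text is left alone because the pattern only matches "Chapter" followed by a number. If the heading sits inside a <p>, the <h2> ends up nested in that paragraph, so the result may need a little more cleanup.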