|07-15-2010, 12:50 AM||#1|
Join Date: Jul 2010
Device: Aldiko App - Android 2.1 HTC Hero
Detecting Chapters in PDF -> ePub conversion
OK, so I've searched the forums for the past couple of hours and can't find an answer that solves my particular problem.
The PDF file I have doesn't have the chapters tagged as h1 or h2, or at least Calibre isn't detecting them as such when I convert it to ePub. So when I just throw the ePub onto my phone, the loading time is unbearable, because the ePub ends up split into only 6 xhtml files (and not at any logical points).
Is there an XPath expression somebody could cook up that would detect the chapters and split them properly? I know there are more tedious ways of doing it myself (opening the ePub in Sigil and inserting chapter breaks by hand), but for 100+ chapters that would take a while.
From what I can tell, the chapter headings are just bold-faced text, e.g. "Chapter 12".
If you need any more information, I'll try to provide it.
|07-15-2010, 03:37 AM||#2|
Join Date: May 2010
Device: Kobo Aura, Nokia Lumia 920 (Freda)
Run a conversion from PDF to ePub with debugging turned on and then look at the intermediate HTML output (the conversion goes something like PDF -> very rough HTML -> cleaned-up HTML -> ePub). Looking at the cleaned-up HTML should help you figure out how the chapter headings are marked up and how to find the chapter breaks.
Or you could take the intermediate HTML, modify it yourself so the chapter headings are <h1> or <h2> elements, load that HTML into Calibre, and convert to ePub from the HTML rather than from the PDF.
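For a long book it may be easier to let a small script inspect the intermediate HTML than to read it by eye. The sketch below is only an illustration, assuming the chapter headings read like "Chapter 12" as described in the first post; the names `ChapterTagFinder` and `chapter_wrappers` are made up for this example and are not part of Calibre. It counts which tag directly wraps each chapter-looking line, so you know what to target.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# Assumption: headings look like "Chapter 12" (from the first post).
CHAPTER_RE = re.compile(r"^\s*Chapter\s+\d+\s*$", re.IGNORECASE)

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class ChapterTagFinder(HTMLParser):
    """Count which tags directly wrap text that looks like 'Chapter 12'."""

    def __init__(self):
        super().__init__()
        self.stack = []            # currently open tags, innermost last
        self.wrappers = Counter()  # tag name -> number of chapter-like hits

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self.stack and CHAPTER_RE.match(data):
            self.wrappers[self.stack[-1]] += 1


def chapter_wrappers(html):
    """Return a Counter of tags that enclose chapter-like text."""
    finder = ChapterTagFinder()
    finder.feed(html)
    finder.close()
    return finder.wrappers


if __name__ == "__main__":
    sample = "<p><b>Chapter 1</b></p><p>Text<br>more</p><p><b>Chapter 2</b></p>"
    print(chapter_wrappers(sample))
```

If nearly all hits land on one tag (for example `b`), that tag is the one to convert to a heading or to aim a chapter-detection expression at. If the hits are spread over several tags, the headings are marked up inconsistently and will need more than one rule.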
|07-15-2010, 05:53 AM||#3|
Join Date: Dec 2009
Device: iPad,Kindle 3, Nook 2
Maybe you can use a regular expression
If you have the HTML files converted from the PDF (every PDF is converted to HTML first, and then to ePub), you can use a regular expression tool to replace all the chapter names with proper HTML heading markup.
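For example, if each chapter heading in your HTML is just bold text, such as <b>Chapter 12</b>,
you can use "<b>(Chapter \d+)</b>" to match them all, and replace each match with a heading element wrapped around the captured text, for instance an <h2>.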
Either way, this takes a solid working knowledge of regular expressions.
And if your chapter names are marked up inconsistently in the HTML, they can only be fixed by hand.
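As a rough sketch of that search-and-replace, here is one way it could be done in Python with the standard `re` module. It assumes the headings appear exactly as bold "Chapter N" text, which is what the pattern "<b>(Chapter \d+)</b>" above expects. The function name `tag_chapters` and the choice of `<h2>` as the default heading level are assumptions made for this example only.

```python
import re

# Same idea as "<b>(Chapter \d+)</b>", loosened slightly to allow
# stray whitespace and different letter case.
CHAPTER_RE = re.compile(r"<b>\s*(Chapter\s+\d+)\s*</b>", re.IGNORECASE)


def tag_chapters(html, level=2):
    """Turn bold 'Chapter N' text into <hN> headings; other bold text is untouched."""
    # \g<1> inserts the captured chapter title into the new heading.
    return CHAPTER_RE.sub(rf"<h{level}>\g<1></h{level}>", html)


if __name__ == "__main__":
    sample = "<b>Chapter 12</b><p>Some text with <b>bold</b> words.</p>"
    print(tag_chapters(sample))
```

Run it over a copy of the intermediate HTML, check a few chapters by eye, and then convert the edited HTML to ePub. Bold text that does not match the chapter pattern is left alone, so ordinary emphasis in the body should survive.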