Detecting Chapters in PDF -> ePub conversion

jUgGsY · 07-14-2010, 11:50 PM

Ok.. So I've searched the forums for the past couple hours, and I can't seem to get a good answer, or find one that solves my particular problem.

The PDF file I have doesn't have the chapters tagged as h1 or h2, or at least Calibre isn't detecting them as such when I convert it to ePub. So when I just throw the ePub onto my phone, the loading is unbearable, as the ePub is only split into 6 XHTML files in the end (and not at any logical spots).

Is there an XPath expression that somebody could cook up that would detect the chapters and split them properly? I know there are more tedious ways of doing it myself (opening the ePub in Sigil and inserting chapter breaks), but for 100+ chapters, that would take a while.

From what I can tell, the chapter headings are just bold-faced text, e.g. "Chapter 12".

If you need any more information, I'll try and give it.

toddos · 07-15-2010, 02:37 AM

Run a conversion from PDF to epub with debugging turned on and then look at the intermediate HTML output (conversion goes something like PDF -> very rough HTML -> cleaned up HTML -> epub). Looking at the cleaned up HTML step should help you figure out how to find chapter breaks.

Or you could even take the intermediate HTML, modify it yourself to make chapters <h1> or <h2> elements, load the HTML into calibre, and convert to epub from the HTML rather than PDF.
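Before editing the intermediate HTML, it can help to check how the chapter headings are actually marked up. The following is a rough Python sketch, assuming the headings read "Chapter" followed by a number and sit directly inside a single tag; the function name and the sample string are made up for illustration and are not part of calibre.

```python
import re
from collections import Counter

# Matches any single tag wrapped directly around "Chapter <number>".
# Assumption: headings look like "Chapter 12"; adjust the pattern otherwise.
CANDIDATE = re.compile(r"<(\w+)[^>]*>\s*(Chapter\s+\d+)\s*</\1>", re.IGNORECASE)


def survey_chapter_markup(html):
    """Count which tags wrap 'Chapter N' text in the given HTML string."""
    return Counter(m.group(1).lower() for m in CANDIDATE.finditer(html))


# Hypothetical sample standing in for the cleaned-up intermediate HTML.
sample = (
    "<p><b>Chapter 1</b></p><p>text</p>"
    "<p><b>Chapter 2</b></p><span class='x'>Chapter 3</span>"
)
print(survey_chapter_markup(sample))
```

If nearly every hit uses the same tag, a single search-and-replace will probably cover the whole book; a mix of tags suggests some manual cleanup.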

eping · 07-15-2010, 04:53 AM

If you have the HTML files converted from the PDF (every PDF is converted to HTML first, then to ePub), you can use a regular expression tool to replace all the chapter names with proper HTML headings.
For example, if your chapter headings are marked up like this:
<b>Chapter 12</b>
you can use "<b>(Chapter \d+)</b>" to match them all, and replace each match with
"<h2>$1</h2>"
Either way, this takes some working knowledge of regular expressions.

And if your chapter names are marked up inconsistently in the HTML, they can only be fixed by hand.
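The replacement described above could be scripted roughly like this in Python. It is only a sketch: it assumes the headings are wrapped in <b> or <strong> (the <strong> case is my guess at a likely variation, not something from the original file), and the function name is made up. Note that Python's re module writes backreferences in the replacement as \1, \2 rather than $1.

```python
import re

# Accepts <b> or <strong>, optional attributes, and stray whitespace
# around the heading text. Group 2 is the "Chapter N" text itself.
BOLD_CHAPTER = re.compile(
    r"<(b|strong)\b[^>]*>\s*(Chapter\s+\d+)\s*</\1>", re.IGNORECASE
)


def promote_chapters(html):
    """Rewrite bold 'Chapter N' runs as <h2> headings; leave the rest alone."""
    return BOLD_CHAPTER.sub(r"<h2>\2</h2>", html)


print(promote_chapters("<b>Chapter 12</b>\n<p>It was a dark night.</p>"))
```

Other bold text is left untouched because the pattern only matches when the tag contains nothing but "Chapter" and a number.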
