05-30-2010, 05:43 AM | #1 |
Junior Member
Posts: 3
Karma: 10
Join Date: May 2010
Device: Nook
|
Detect chapter headings with capitalized words
I have a file I am trying to convert from PDF to EPUB. I am having trouble with chapter detection, as the chapter headings in the PDF read something like 1 CHAPTER NAME, 2 CHAPTER NAME, etc.
When I export to EPUB I end up with a continuous stream of text, as it doesn't pick those up as chapter breaks. Can anyone suggest a way I can get calibre to detect chapters based on that convention of a number followed by capitalized text? Thanks, Sam |
05-30-2010, 09:13 AM | #2 |
Asha'man
Posts: 335
Karma: 844
Join Date: May 2010
Location: Canada
Device: Kobo
|
Someone might suggest a fancy regular expression that works, but filtering only by numbers is pretty broad and could match a bunch of entries within the body text.
How I've handled this situation is by opening the resulting EPUB file in Sigil, manually going to each chapter, and changing each chapter heading to a 'Heading #' style. Sigil will update the TOC automatically if you do this. |
05-30-2010, 10:21 AM | #3 |
Junior Member
Posts: 3
Karma: 10
Join Date: May 2010
Device: Nook
|
Thanks Stinger. There are 108 of them, so I was hoping there might be an automated way, but I will check out your suggestion in Sigil.
Cheers |
05-30-2010, 10:30 AM | #4 |
Asha'man
Posts: 335
Karma: 844
Join Date: May 2010
Location: Canada
Device: Kobo
|
Yeah, manually changing 108 headings is definitely something I would want to avoid too...
You might be able to do this with the calibre TOC autocreate system using a nice XPath expression, but again, that is easier if you have a book with 'Chapter X' style headings, so you can tell it to look for the word "chapter" within a certain tag. The way you described your book made me think that any XPath you used would yield lots of junk entries. On the other hand, with the large number of chapters you have, it could take less time to go this route and then just delete the junk entries with Sigil, rather than doing what I suggested before. Can you post the EPUB here without violating copyright so I can take a look at the HTML structure? Last edited by Stinger; 05-30-2010 at 10:36 AM. |
05-30-2010, 12:27 PM | #5 |
reader
Posts: 6,975
Karma: 5183568
Join Date: Mar 2006
Location: Mississippi, USA
Device: Kindle 3, Kobo Glo HD
|
I have used a modified version of the standard Chapters and page breaks expression:
Code:
//*[((name()='h1' or name()='h2') and re:test(., 'chapter|book|section|part|0|1|2|3|4|5|6|7|8|9\s+', 'i')) or @class = 'chapter'] |
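One thing worth checking in the expression above is how the alternation groups: the trailing `\s+` attaches only to the final `9`, so any h1/h2 heading containing a digit anywhere will match. The sketch below uses Python's `re` module as a stand-in for the regex engine behind `re:test` (an assumption, so behaviour in calibre may differ in detail). `TIGHTER` is a hypothetical variant, not something posted in this thread, and the sample headings are made up.

```python
import re

# The alternation from the post above, copied as-is. The trailing \s+
# binds only to the final "9", not to the other digits.
ORIGINAL = re.compile(
    r"chapter|book|section|part|0|1|2|3|4|5|6|7|8|9\s+", re.IGNORECASE
)

# Hypothetical tighter variant (not from the thread): one or more digits
# that must be followed by whitespace.
TIGHTER = re.compile(r"chapter|book|section|part|[0-9]+\s+", re.IGNORECASE)


def matches(pattern, text):
    """Return True if the pattern is found anywhere in text."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    # Made-up heading texts, only for comparing the two patterns.
    for heading in ["1 CHAPTER NAME", "Chapter Two", "In 1984", "Epilogue"]:
        print(heading, matches(ORIGINAL, heading), matches(TIGHTER, heading))
```

Since the XPath already restricts matches to h1 and h2 tags, the broad pattern may be perfectly fine in practice; the difference only matters if the converted book has numbered headings that are not chapters.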
05-31-2010, 05:36 AM | #6 |
Junior Member
Posts: 3
Karma: 10
Join Date: May 2010
Device: Nook
|
Thanks
Thanks wallcraft, I will give that a go now.
Cheers, Sam |
05-31-2010, 10:45 AM | #7 |
Reader
Posts: 519
Karma: 24612
Join Date: Aug 2009
Location: Utrecht, NL
Device: Kobo Aura 2, iPhone, iPad
|
You could try something like:
Code:
//*[re:test(., '^\s*[0-9]+\s+[A-Z ]+\s*$')] |
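Before running a full conversion, it may help to try the regular expression from the post above on a few sample lines. This is only a rough sketch: it uses Python's `re` module as a stand-in for whatever engine sits behind `re:test` (an assumption), the helper name `is_chapter_heading` is made up for illustration, and the sample strings are invented.

```python
import re

# The pattern from the post above: optional leading whitespace, a number,
# whitespace, then only capital letters and spaces up to the end.
HEADING = re.compile(r"^\s*[0-9]+\s+[A-Z ]+\s*$")


def is_chapter_heading(text):
    """Return True if text looks like '<number> CAPITALIZED WORDS'."""
    return HEADING.search(text) is not None


if __name__ == "__main__":
    # Invented sample lines, not taken from the book in question.
    samples = [
        "1 CHAPTER NAME",
        "  12 THE LONG ROAD  ",
        "2 Chapter Name",
        "In 1984 there were 3 cats",
        "3 DON'T PANIC",
    ]
    for line in samples:
        print(repr(line), is_chapter_heading(line))
```

Because the character class `[A-Z ]` allows only capital letters and spaces, a heading with an apostrophe, comma, or digit in its title would be skipped, so the class might need widening for some books. Note also that in the XPath, `.` is the whole text of an element, so a wrapper element that contains nothing but the heading could match as well.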