Old 05-30-2010, 05:43 AM   #1
fiendmish
Junior Member
Posts: 3
Karma: 10
Join Date: May 2010
Device: Nook
Detect chapter headings with capitalized words

I have a file I am trying to convert from PDF to EPUB. I am having trouble with chapter detection because the chapter headings in the PDF read something like 1 CHAPTER NAME, 2 CHAPTER NAME, etc.

When I export to EPUB I end up with a continuous stream of text, as it doesn't pick those up as chapter breaks.

Can anyone suggest a way I can get calibre to detect chapters based on that convention of a number followed by capitalized text?

Thanks

Sam
Old 05-30-2010, 09:13 AM   #2
Stinger
Asha'man
Posts: 335
Karma: 844
Join Date: May 2010
Location: Canada
Device: Kobo
Someone might suggest a fancy regex that would work, but filtering only by numbers is pretty broad and could yield a bunch of entries from within the body text.

How I've handled this situation is opening the resulting epub file in Sigil, and manually going to each chapter and changing each chapter heading to 'Heading #' style. Sigil will update the TOC automatically if you do this.
Old 05-30-2010, 10:21 AM   #3
fiendmish
Thanks Stinger. There are 108 of them, so I was hoping there might be an automated way, but I will check out your suggestion in Sigil.

Cheers
Old 05-30-2010, 10:30 AM   #4
Stinger
Yeah, manually changing 108 headings is definitely something I would want to avoid too...

You might be able to do this with calibre's automatic TOC generation using an XPath expression, but again, that is easier if your book uses 'Chapter X' style headings, so you can tell it to look for the word "chapter" within a certain tag. The way you described your book made me think that any XPath you used would yield lots of junk entries.

On the other hand, with the large number of chapters you have, it could take less time to go this route and then just delete the junk entries in Sigil, rather than doing what I suggested before.

Can you post the epub here without violating copyright so I can take a look at the HTML structure?

Last edited by Stinger; 05-30-2010 at 10:36 AM.
Old 05-30-2010, 12:27 PM   #5
wallcraft
reader
Posts: 6,975
Karma: 5183568
Join Date: Mar 2006
Location: Mississippi, USA
Device: Kindle 3, Kobo Glo HD
I have used a modified version of the standard Chapters and page breaks expression:
Code:
//*[((name()='h1' or name()='h2') and re:test(., 'chapter|book|section|part|0|1|2|3|4|5|6|7|8|9\s+', 'i')) or @class = 'chapter']
This just keys on any number in an h1 or h2, and, as Stinger suggests, this may not be best for all ebooks. So I only use the above when the standard version fails. This probably isn't the cleanest way to test for digits, but it works.
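The regular expression inside that XPath can be checked on its own. The sketch below is not part of the original post: it assumes that `re:test(., pattern, 'i')` behaves like Python's `re.search` with `re.IGNORECASE`, and the sample strings and the helper name `matches` are made up for illustration. It shows that the trailing `\s+` belongs only to the last alternative, `9`, so a lone 9 with no whitespace after it is the one digit case the posted pattern misses; a `[0-9]` character class (an assumed alternative, not wallcraft's wording) treats all digits alike.

```python
import re

# Pattern copied as written from the post above.
POSTED = r'chapter|book|section|part|0|1|2|3|4|5|6|7|8|9\s+'
# Assumed alternative: a character class for the digits, no trailing \s+.
DIGIT_CLASS = r'chapter|book|section|part|[0-9]'


def matches(pattern, text):
    """Approximate re:test(., pattern, 'i') with a case-insensitive search."""
    return re.search(pattern, text, re.IGNORECASE) is not None


# Hypothetical heading and body strings, not taken from the book in question.
samples = ["1 CHAPTER NAME", "Chapter Twelve", "Published in 2010",
           "9", "9 LIVES", "Prologue"]

for text in samples:
    print(f"{text!r:22} posted={matches(POSTED, text)!s:6} "
          f"class={matches(DIGIT_CLASS, text)}")
```

In this sketch "Published in 2010" matches both patterns, which illustrates the junk-entry concern raised earlier in the thread, although in the XPath the test is applied only to h1 and h2 elements.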
Old 05-31-2010, 05:36 AM   #6
fiendmish
Thanks

Thanks, wallcraft, will give that a go now.

Cheers
Sam
Old 05-31-2010, 10:45 AM   #7
pietvo
Reader
Posts: 519
Karma: 24612
Join Date: Aug 2009
Location: Utrecht, NL
Device: Kobo Aura 2, iPhone, iPad
You could try something like:
Code:
//*[re:test(., '^\s*[0-9]+\s+[A-Z ]+\s*$')]
but we would need to see the actual HTML around the chapter titles to be sure. Notice the space after the Z.
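Since the real HTML was never posted, the expression can only be tried against made-up markup. The sketch below is an illustration under stated assumptions, not a reproduction of what calibre does: the `SAMPLE` markup, the `heading_candidates` helper and the `NO_SPACE` variant are all hypothetical, and the string value of each element (what `.` means in the XPath) is approximated by joining the element's text nodes.

```python
import re
import xml.etree.ElementTree as ET

# Regex copied from the XPath above.
HEADING = re.compile(r'^\s*[0-9]+\s+[A-Z ]+\s*$')
# Same regex without the space inside the character class, for comparison.
NO_SPACE = re.compile(r'^\s*[0-9]+\s+[A-Z]+\s*$')

# Hypothetical markup; the real structure of the converted book is unknown.
SAMPLE = """<body>
<p><b>1 CHAPTER NAME</b></p>
<p>In 1984 he left the city.</p>
<p>2 ANOTHER</p>
<p>3 Mixed Case Title</p>
</body>"""


def heading_candidates(xhtml):
    """Return (tag, text) for every element whose full text matches HEADING."""
    root = ET.fromstring(xhtml)
    hits = []
    for el in root.iter():               # every element, like //*
        text = "".join(el.itertext())    # string value of the element
        if HEADING.search(text):
            hits.append((el.tag, text.strip()))
    return hits


for tag, text in heading_candidates(SAMPLE):
    print(tag, text)
```

Two things stand out in this sketch. Without the space after the Z, a multi-word title such as "1 CHAPTER NAME" no longer matches. And because `//*` tests every element, both the `p` and the nested `b` match for the same heading, so narrowing the expression to a specific tag may be worthwhile once the actual HTML is known.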