Old 10-16-2010, 09:46 AM   #1
smartmart
Junior Member
smartmart began at the beginning.
 
Posts: 2
Karma: 10
Join Date: Oct 2010
Device: Kindle
Regular Expression Help

Hi, I've converted a PDF to AZW with the Amazon service, and now I want to convert it to MOBI with Calibre (so I can add metadata and a TOC).

I have a problem with chapter recognition. Every chapter starts with "Chapter XXX.", so my chapter detection expression (an XPath with a regex test) is:
//*[re:test(., "chapter", "i")]

It works with the original PDF but not with the AZW.
It matches the word "chapter" in the text (sometimes the word "chapter" appears in the script), but it doesn't match the real chapter headings.

So I saved the debug output and saw that the chapter headings are not inside an HTML tag (the text is still a child of the body tag, of course):
<p> foo foo foo</p>CHAPTER 1<p> foo foo foo </p>

Is this the problem?
How can i fix it?

Thx
Old 10-16-2010, 10:34 AM   #2
desertgrandma
Enjoying the show....
desertgrandma ought to be getting tired of karma fortunes by now.
 
 
Posts: 14,270
Karma: 10462841
Join Date: Jun 2008
Location: Arizona
Device: A K1, Kindle Paperwhite, an Ipod, IPad2, Iphone, an Ipad Mini & macAir
Welcome to MobileRead, smartmart

Help should be arriving soon.
Old 10-16-2010, 11:17 AM   #3
BookGnome
Voracious Reader
BookGnome is on a distinguished road
 
 
Posts: 4
Karma: 62
Join Date: Sep 2010
Device: Kindle
Finding chapters with a simple regex

Quote:
Originally Posted by smartmart
So I saved the debug output and saw that the chapter headings are not inside an HTML tag (the text is still a child of the body tag, of course):
<p> foo foo foo</p>CHAPTER 1<p> foo foo foo </p>
I'm not sure how you need to specify it with Calibre's custom syntax, but your regex itself is flawed. Here's a working regex in Python:

Code:
>>> import re
>>> myString = '<p> foo foo foo</p>CHAPTER 1<p> foo foo foo </p>'
>>> re.findall(r'Chapter \d+', myString, re.I)
['CHAPTER 1']
A lot depends on how consistent the input file is, but this should catch any instance of the word 'chapter' followed by a space and one or more digits, without regard to case. How to wrap that in Calibre's regex DSL is a question for the Calibre gurus.
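To show why the digits matter, here is a small sketch comparing the broad pattern with the narrower one. The sample text is made up for illustration and is not taken from the actual book:

Code:
```python
import re

# Hypothetical sample: a real heading plus two ordinary uses of the word
text = "In the last chapter we met Bob.</p>CHAPTER 2<p>This chapter is short."

# Broad pattern: matches every occurrence of the word
broad = re.findall(r"chapter", text, re.I)
# Narrow pattern: the word must be followed by whitespace and digits
narrow = re.findall(r"chapter\s+\d+", text, re.I)

print(len(broad))  # 3
print(narrow)      # ['CHAPTER 2']
```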
Old 10-16-2010, 12:22 PM   #4
ldolse
Wizard
ldolse is an accomplished Snipe hunter.
 
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
I'm terrible with XPath, but I have a hunch you're screwed trying to search for text that is just free-floating in the body tag throughout the book.
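A rough way to see the problem, using Python's standard library rather than Calibre itself: the sketch below wraps the debug snippet in a <body> tag (my assumption, so that it parses as XML) and lists the elements whose full text content contains "chapter", which is roughly what testing "." in the XPath does. The helper name matching_elements is made up for this example.

Code:
```python
import re
import xml.etree.ElementTree as ET

# Snippet from the debug output, wrapped in <body> so it parses (assumption)
snippet = "<body><p> foo foo foo</p>CHAPTER 1<p> foo foo foo </p></body>"
body = ET.fromstring(snippet)

def matching_elements(root, pattern):
    """Return the tags of elements whose full text content matches pattern."""
    rx = re.compile(pattern, re.I)
    return [el.tag for el in root.iter() if rx.search("".join(el.itertext()))]

print(matching_elements(body, "chapter"))  # ['body']
# The heading is stored as trailing text after the first <p>, not in an element
print(repr(body[0].tail))  # 'CHAPTER 1'
```

Only the body as a whole matches, so there is no small element for chapter detection to latch onto.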

Your best bet is to take the HTML from the debug info and do a find-and-replace in a text editor that supports regex search/replace.

Search for this:
Code:
(Chapter\s+\d+)
and replace it with this:
Code:
<h2>\1</h2>
Depending on the editor you use it might be $(1), or $1, or whatever instead of '\1' as I used above - check the documentation for your editor.

Then import the edited HTML file into Calibre, and have Calibre convert from the zipped HTML source instead of the PDF. Calibre's default chapter detection XPath will automatically pick up the chapters if your search and replace properly wrapped the chapter headings in <h2> tags.
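If you would rather script the search and replace than use an editor, the same idea can be sketched in Python. The function name wrap_chapters is made up for this example, and the IGNORECASE flag is my addition because the sample heading is written "CHAPTER 1":

Code:
```python
import re

# Same search pattern as above; IGNORECASE added because the sample is uppercase
CHAPTER = re.compile(r"(Chapter\s+\d+)", re.IGNORECASE)

def wrap_chapters(html: str) -> str:
    """Wrap every 'Chapter <number>' match in <h2> tags."""
    return CHAPTER.sub(r"<h2>\1</h2>", html)

sample = "<p> foo foo foo</p>CHAPTER 1<p> foo foo foo </p>"
print(wrap_chapters(sample))
# <p> foo foo foo</p><h2>CHAPTER 1</h2><p> foo foo foo </p>
```

Note that a phrase like "chapter 3" inside a normal paragraph would get wrapped too, so check the result before converting.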
Old 10-17-2010, 04:07 AM   #5
smartmart
Quote:
Originally Posted by BookGnome
I'm not sure how you need to specify it with Calibre's custom syntax, but your regex itself is flawed. [...]
I know, I only used such a broad regex for testing purposes.

Thx ldolse, but I'm looking for a solution within Calibre (if that's possible) so I can reuse the setting every time.

PS: I don't use Calibre's PDF to MOBI conversion because it gets the line wrapping wrong.
It seems that every page of the PDF becomes a single paragraph.

Last edited by smartmart; 10-17-2010 at 04:30 AM.
Old 10-17-2010, 05:19 AM   #6
ldolse
There is no solution in Calibre to do it the way you're trying to do it. Amazon is creating a really screwy mobi file, Calibre hasn't been programmed to handle that scenario, and it's unlikely to happen anytime soon.

If you're seeing that Calibre isn't unwrapping the lines when you use it to convert from PDF to MOBI, it means that the line unwrapping factor under PDF input is set incorrectly. It seems to be set to zero on a number of users' systems. Set the line unwrapping factor to 0.45; this is the default and generally gives the best results.
All times are GMT -4.