Old 11-02-2010, 09:13 AM   #1
janvanmaar
Addict
Posts: 219
Karma: 404
Join Date: Nov 2010
Device: Kindle 3G, Samsung SIII
Multiple line regexp

I want to convert a PDF to MOBI, generating the table of contents from a regexp. The format of a chapter header is as follows:
Quote:
2.

Name of the
chapter
Catching the numbers by using simple regexp
Code:
^[0-9]+\.$
works fine. Is it possible to catch the name of the chapter as well, though? As I expected, using \n in the regexp does not work...
Old 11-02-2010, 10:03 AM   #2
Manichean
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
Try this:
Code:
^[0-9]+\.\s+[a-zA-Z ]+\s+[a-zA-Z ]+
That should describe the case you gave.

Last edited by Manichean; 11-02-2010 at 10:05 AM. Reason: Fix regex
Old 11-02-2010, 10:23 AM   #3
janvanmaar
Thanks for the reply. Unfortunately this does not work:
a) it does not catch multiple lines at all (which is what I need)
b) it catches e.g. a year or date in the text
Simply put, I need the opposite of what this regexp does: not catching numbers followed by alphabetical text on the same line, but catching numbers followed by alphabetical text on the next N lines.
Old 11-02-2010, 10:31 AM   #4
Manichean
Huh. You sure? I just tested the expression in Python and it matches the test case you specified in your original post. Did you have a look at the source Calibre sees while applying the regex (use the magic wand symbol)?

Edit to add: Given your test case, I assumed you wanted to write a regex that catches a number followed by a dot, followed again by two groups of strings (including spaces) separated by one or more whitespaces. That's exactly what the regex I wrote matches.
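For reference, a minimal sketch of that kind of test in plain Python. The sample strings are made up from the first post, and re.MULTILINE is an assumption here so that ^ can match at the start of any line. It also shows complaint (b): a body line that starts with a year matches too.

```python
import re

# Pattern from the post above, unchanged.
PATTERN = r"^[0-9]+\.\s+[a-zA-Z ]+\s+[a-zA-Z ]+"

# Hypothetical sample text modelled on the first post.
chapter = "Some text\n2.\n\nName of the\nchapter\n"
# Hypothetical body line that happens to start with a year.
body = "1999. It was a good year\n"


def first_match(text):
    """Return the first match of PATTERN in text, or None."""
    m = re.search(PATTERN, text, re.MULTILINE)
    return m.group(0) if m else None


print(repr(first_match(chapter)))  # the number, the blank line and both title lines
print(repr(first_match(body)))     # false positive: the whole body line
```

This only says something about a single string that contains the newlines; it says nothing about what Calibre actually passes to the expression.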

Last edited by Manichean; 11-02-2010 at 10:33 AM.
Old 11-02-2010, 10:33 AM   #5
chaley
Grand Sorcerer
Posts: 11,741
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
You probably need to use the DOTALL flag on your regexp so that \n is matched by '.' and the character classes.

Without looking at the regexp at all to see if my suggestion makes sense, try
Code:
(?s)^[0-9]+\.\s+[a-zA-Z ]+\s+[a-zA-Z ]+
Old 11-02-2010, 10:35 AM   #6
Manichean
Quote:
Originally Posted by chaley View Post
You probably need to use the DOTALL flag on your regexp so that \n is matched by '.' and the character classes.

Without looking at the regexp at all to see if my suggestion makes sense, try
Code:
(?s)^[0-9]+\.\s+[a-zA-Z ]+\s+[a-zA-Z ]+
Doesn't the DOTALL only work if there's a dot as a quantifier? The only dot in that regex is a literal one.
(Edit: For some funny reason, in the quoted regex above, my browser displays vertical bars instead of backslashes in some cases. I assure you that there are no vertical bars there.)
Old 11-02-2010, 10:45 AM   #7
janvanmaar
I suspect that lines are passed one by one to the TOC creating code? That would explain the behaviour. Note also the trailing $ at the end of the *matching* regexp that I mentioned in my first post...

I am not sure I understand your magic wand source question - I thought the magic wand is just meant to specify the XPath in a semi-automatic way? Anyway, in case you are asking about the complete XPath I used, then it was this (only trying to catch the first part of chapter name to simplify things):
Code:
//*[re:test(., "^[0-9]+\.\s+[a-zA-Z]+")]
I am going to have a look at the XHTML intermediate, perhaps that will tell me something more...

Last edited by janvanmaar; 11-02-2010 at 10:47 AM.
Old 11-02-2010, 10:49 AM   #8
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Manichean's correct - DOTALL needs a dot. That said, \s+ will traverse newlines, so as long as that's in the right place the proposed regexes should be ok.
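A quick check in Python's re module illustrates the point (a sketch with made-up test strings, not taken from the file in question):

```python
import re

# '.' does not match a newline unless DOTALL is set.
dot_default = re.search(r"a.b", "a\nb")
dot_dotall = re.search(r"a.b", "a\nb", re.DOTALL)

# A character class is unaffected by DOTALL: [a-z ] never matches "\n".
class_dotall = re.search(r"x[a-z ]+y", "xab\ncdy", re.DOTALL)

# \s matches newlines with or without DOTALL.
ws_default = re.search(r"2\.\s+Name", "2.\n\nName")

print(dot_default is None, dot_dotall is not None)
print(class_dotall is None, ws_default is not None)
```

So (?s) changes nothing for a pattern whose only dot is the escaped, literal one.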

All that said, I can't tell what you're really trying to do - where are you placing this regex that it's doing something useful for you? PDF has some hard-coded regexes to do basically what you're asking for. One built-in pdf regex - one that has more false positives - only becomes enabled when preprocessing is enabled. I could have sworn that a single number followed by an optional dot (and optionally followed by a title on a second line) was already covered by the default regex....

You could open a bug with your file if you like and we can take a look at it from there.
Old 11-02-2010, 10:56 AM   #9
janvanmaar
I think my suspicion was correct, the XHTML looks like this:
Code:
<p class="P-kapit">2.</p>
<p class="P-P32">Name of chapter</p>
So it seems that I cannot grep over multiple lines directly as they are not passed together to the TOC creation engine.

Based on this, I think it should be simple to create the TOC with either the names of the chapters or their numbers (for this specific file); however, it is not clear to me how to construct an XPath that creates something like a "number: name" structure. Is that possible at all with the current Calibre interface?
Old 11-02-2010, 11:01 AM   #10
janvanmaar
ldolse: I sent the previous message before reading yours. What I was doing is simply using the following XPath in the Level 1 TOC settings in the Conversion GUI:
Code:
//*[re:test(., "^[0-9]+\.\s+[a-zA-Z]+")]
However, as my previous post says, this does not work because the multiline is broken in the XHTML conversion. At least that's my current understanding...
Old 11-02-2010, 11:05 AM   #11
chaley
Quote:
Originally Posted by Manichean View Post
Doesn't the DOTALL only work if there's a dot as a quantifier? The only dot in that regex is a literal one.
(Edit: For some funny reason, in the quoted regex above, my browser displays vertical bars instead of backslashes in some cases. I assure you that there are no vertical bars there.)
DOTALL definitely affects '.'. What I am not sure of is whether or not it affects the character classes. The documentation implies that it does not, but I haven't tried it.

@janvanmaar: I didn't read your first post carefully. My understanding is that XPath produces a string for a given HTML tag. To know what the regexp will do, you must know the tag structure around the text. My guess is that you will find something like
Code:
<p>2.</p><p>Name of the chapter</p>
This input would account for '^[0-9]+\.$' working, because it is the content of an inner <p> tag. The rest will require multi-tag matching, which means regexps of a rather higher order. For example, you probably won't be able to use anchors.
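A rough illustration of why the single-tag pattern works and the combined one does not. This is a sketch using Python's standard-library ElementTree rather than Calibre's own XPath engine, and the fragment is hypothetical (no namespaces, no attributes):

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the tag structure guessed above.
XHTML = "<body><p>2.</p><p>Name of the chapter</p></body>"

NUMBER_ONLY = re.compile(r"^[0-9]+\.$")
NUMBER_THEN_NAME = re.compile(r"^[0-9]+\.\s+[a-zA-Z]+")

root = ET.fromstring(XHTML)
# Each <p> contributes its own separate string.
texts = [p.text or "" for p in root.iter("p")]

number_hits = [t for t in texts if NUMBER_ONLY.search(t)]
combined_hits = [t for t in texts if NUMBER_THEN_NAME.search(t)]

print(texts)          # two independent strings
print(number_hits)    # only the number paragraph matches
print(combined_hits)  # nothing: no single <p> holds number and name together
```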

When faced with this problem, I have done one of three things:
1. convert the PDF to epub, use an editor to enclose the chapter indicators in <h1>...</h1> tags, and convert again.
2. similar, but go through .txt so I can clean up other stuff such as paragraph endings.
3. live without a toc.
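Option 1 could also be scripted instead of done by hand. The following is only a sketch under stated assumptions: the input is well-formed, has no XML namespaces (real Calibre XHTML does, so the tag names would need adjusting), and the number and title paragraphs are adjacent children of the same parent. The function name merge_chapter_headings and the sample markup are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

NUMBER_ONLY = re.compile(r"^[0-9]+\.$")


def merge_chapter_headings(xhtml):
    """Replace each '<p>N.</p><p>Title</p>' pair with '<h1>N. Title</h1>'."""
    root = ET.fromstring(xhtml)
    children = list(root)  # snapshot, so removals do not disturb the indexing
    i = 0
    while i < len(children) - 1:
        number, title = children[i], children[i + 1]
        if (number.tag == "p" and title.tag == "p"
                and NUMBER_ONLY.match((number.text or "").strip())):
            # Turn the number paragraph into the heading and absorb the title.
            number.tag = "h1"
            number.attrib.clear()
            number.text = f"{number.text.strip()} {(title.text or '').strip()}"
            root.remove(title)
            i += 2  # skip the paragraph that was just merged
        else:
            i += 1
    return ET.tostring(root, encoding="unicode")


sample = ('<body><p class="P-kapit">2.</p>'
          '<p class="P-P32">Name of chapter</p><p>Text.</p></body>')
print(merge_chapter_headings(sample))
```

After a pass like this, heading-based chapter detection has a single tag containing both the number and the name to work with.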

EDIT: Everything in this post has been covered above. I should just take a nap.
Old 11-02-2010, 11:15 AM   #12
janvanmaar
Quote:
Everything in this post has been covered above.
Not everything: your last paragraph may be pointing in the right direction. Converting to a different format, modifying by hand/script and then converting to mobi should work... except that it is a bit tedious and I don't like it.
Anyway, thanks for all the comments; at least now I know what the problem is.
Old 11-02-2010, 11:28 AM   #13
janvanmaar
Ok, looking at the problem, I think the main question is:

Is it possible to create the chapter name from multiple tags?
I know that I have something like
Code:
<p>2.</p>
<p>Name of the chapter</p>
in the XHTML and I can catch the first tag, e.g. with the regexp from my first post. Now is there a way to construct the chapter name "2. Name of the chapter" from the two tags above? Or is the interface of Calibre not general enough for this kind of task?
Old 11-02-2010, 11:50 AM   #14
Manichean
Quote:
Originally Posted by janvanmaar View Post
I think my suspicion was correct, the XHTML looks like this:
Code:
<p class="P-kapit">2.</p>
<p class="P-P32">Name of chapter</p>
So it seems that I cannot grep over multiple lines directly as they are not passed together to the TOC creation engine.
I've seen multiline matching working before, so I'm guessing that the creation engine sees the whole source at once. However, the problem here is that the regex (or XPath, which is what you'd have to use for TOC creation) doesn't match the source because of the tags present. I don't know XPath as well as I do regexes, but I'm guessing that
Code:
//h:p[re:test(., "[0-9]+\.")]/following-sibling::h:p[1][re:test(., "[a-z ]+", "i")]
should do the trick.

I don't know, however, if the matching works at all for multiple tags. Might be worth a try, though.
Old 11-02-2010, 11:55 AM   #15
Manichean
Quote:
Originally Posted by chaley View Post
DOTALL definitely affects '.'. What I am not sure of is whether or not it affects the character classes. The documentation implies that it does not, but I haven't tried it.
To test this, you'd need a character class (am I the only one who confuses this with RPGs?) that includes whitespace but by default doesn't include linebreaks, wouldn't you? From what I know, the only character class that includes whitespace is \s, which by default includes the linebreak.
Also, like you said, the documentation implies otherwise