Multiple line regexp

janvanmaar · 11-02-2010, 09:13 AM

I want to convert PDF to MOBI, generating the table of contents based on a regexp. The format of a chapter header is as follows:

Quote:

2.

Name of the
chapter

Catching the numbers by using simple regexp

Code:

^[0-9]+\.$

works fine. Is it possible to catch the name of the chapter as well though? As I expected, using \n in regexp does not work...

Manichean · 11-02-2010, 10:03 AM

Try this:

Code:

^[0-9]+\.\s+[a-zA-Z ]+\s+[a-zA-Z ]+

That should describe the case you gave.

janvanmaar · 11-02-2010, 10:23 AM

Thanks for the reply. Unfortunately this does not work:
a) it does not catch multiple lines at all (which is what I need)
b) it catches e.g. a year or date in the text
Simply put, I need the opposite of what this regexp does: not numbers followed by alphabetical text on the same line, but numbers followed by alphabetical text on the next N lines.

Manichean · 11-02-2010, 10:31 AM

Huh. You sure? I just tested the expression in Python and it matches the test case you specified in your original post. Did you have a look at the source Calibre sees while applying the regex (use the magic wand symbol)?

Edit to add: Given your test case, I assumed you wanted to write a regex that catches a number followed by a dot, followed again by two groups of strings (including spaces) separated by one or more whitespaces. That's exactly what the regex I wrote matches.
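A quick way to reproduce the Python check described here, as a minimal sketch. The sample strings are assumptions built from the first post, not the actual book text. It shows that the expression does span the line breaks when the whole header is in one string, and also why a sentence starting with a year can be caught:

```python
import re

# The expression proposed earlier in the thread.
pattern = re.compile(r"^[0-9]+\.\s+[a-zA-Z ]+\s+[a-zA-Z ]+")

# Assumed plain-text form of the chapter header from the first post.
header = "2.\n\nName of the\nchapter\n"
m = pattern.match(header)
print(repr(m.group(0)))  # '2.\n\nName of the\nchapter'

# A made-up sentence that starts with a year: this also matches,
# which is the false positive reported as point b).
false_positive = pattern.match("2010. It was a cold year")
print(false_positive is not None)  # True
```

The match works only because `\s+` consumes the newlines and the whole header sits in a single string; it says nothing about how the text is split up before the expression is applied.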

chaley · 11-02-2010, 10:33 AM

You probably need to use the DOTALL flag on your regexp so that \n is matched by '.' and the character classes.

Without looking at the regexp at all to see if my suggestion makes sense, try

Code:

(?s)^[0-9]+\.\s+[a-zA-Z ]+\s+[a-zA-Z ]+

Manichean · 11-02-2010, 10:35 AM

Quote:

Originally Posted by chaley

You probably need to use the DOTALL flag on your regexp so that \n is matched by '.' and the character classes.

Without looking at the regexp at all to see if my suggestion makes sense, try

Code:

(?s)^[0-9]+\.\s+[a-zA-Z ]+\s+[a-zA-Z ]+

Doesn't the DOTALL only work if there's a dot as a quantifier? The only dot in that regex is a literal one.
(Edit: For some funny reason, in the quoted regex above, my browser displays vertical bars instead of backslashes in some cases. I assure you that there are no vertical bars there.)

janvanmaar · 11-02-2010, 10:45 AM

I suspect that lines are passed one by one to the TOC creating code? That would explain the behaviour. Note also the trailing $ at the end of the *matching* regexp that I mentioned in my first post...

I am not sure I understand your magic wand source question - I thought the magic wand is just meant to specify the XPath in a semi-automatic way? Anyway, in case you are asking about the complete XPath I used, then it was this (only trying to catch the first part of chapter name to simplify things):

Code:

//*[re:test(., "^[0-9]+\.\s+[a-zA-Z]+")]

I am going to have a look at the XHTML intermediate, perhaps that will tell me something more...

ldolse · 11-02-2010, 10:49 AM

Manichean's correct - DOTALL needs a dot. That said, \s+ will traverse newlines, so as long as that's in the right place the proposed regexes should be ok.
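The DOTALL question can be checked directly in Python's `re` module. The snippet below is a small sketch with a made-up two-line string; it only demonstrates Python's behaviour, which is assumed to be what applies here since the expression was tested in Python earlier in the thread:

```python
import re

text = "a\nb"  # two letters separated by a line break

results = {
    # '.' does not match a newline by default
    "dot, no flag": bool(re.search(r"a.b", text)),
    # with (?s), i.e. DOTALL, '.' matches the newline too
    "dot, DOTALL": bool(re.search(r"(?s)a.b", text)),
    # DOTALL does not change what a character class matches
    "class, DOTALL": bool(re.search(r"(?s)a[a-z ]b", text)),
    # \s matches a newline with or without the flag
    "\\s, no flag": bool(re.search(r"a\sb", text)),
}
for label, matched in results.items():
    print(f"{label}: {matched}")
```

So the flag only matters for an unescaped dot; `[a-zA-Z ]` never crosses a line break, and `\s+` always can.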

All that said, I can't tell what you're really trying to do - where are you placing this regex that it's doing something useful for you? PDF has some hard-coded regexes to do basically what you're asking for. One built-in pdf regex - one that has more false positives - only becomes enabled when preprocessing is enabled. I could have sworn that a single number followed by an optional dot (and optionally followed by a title on a second line) was already covered by the default regex....

You could open a bug with your file if you like and we can take a look at it from there.

janvanmaar · 11-02-2010, 10:56 AM

I think my suspicion was correct, the XHTML looks like this:

Code:

<p class="P-kapit">2.</p>
<p class="P-P32">Name of chapter</p>

So it seems that I cannot grep over multiple lines directly as they are not passed together to the TOC creation engine.
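The effect can be imitated outside Calibre. The sketch below is an illustration only: it uses Python's standard `xml.etree.ElementTree` rather than Calibre's own XPath engine, wraps the two paragraphs in an assumed `<body>` element, and tests each `<p>` separately, the way a predicate on a single element would:

```python
import re
import xml.etree.ElementTree as ET

# The two paragraphs from the intermediate XHTML, in an assumed wrapper.
xhtml = """<body>
<p class="P-kapit">2.</p>
<p class="P-P32">Name of chapter</p>
</body>"""
root = ET.fromstring(xhtml)

number_only = re.compile(r"^[0-9]+\.$")
number_and_title = re.compile(r"^[0-9]+\.\s+[a-zA-Z]+")

def matching_paragraphs(pattern):
    """Return the text of every <p> whose own text matches the pattern."""
    hits = []
    for p in root.iter("p"):
        text = "".join(p.itertext())
        if pattern.search(text):
            hits.append(text)
    return hits

print(matching_paragraphs(number_only))       # ['2.']
print(matching_paragraphs(number_and_title))  # []
```

Each paragraph is tested on its own text, so an expression that needs both the number and the title never sees them together.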

Based on this, I think it should be simple to create the TOC with either names of the chapters or with their numbers (for this specific file), however it is not clear to me how to construct an XPath to create something like "number: name" structure. Is that possible at all with current Calibre interface?

janvanmaar · 11-02-2010, 11:01 AM

ldolse: I sent the previous message before reading yours. So what I was doing is simply using the following XPath in the Level 1 TOC settings in the Conversion GUI:

Code:

//*[re:test(., "^[0-9]+\.\s+[a-zA-Z]+")]

However, as my previous post says, this does not work because the multiline is broken in the XHTML conversion. At least that's my current understanding...

chaley · 11-02-2010, 11:05 AM

Quote:

Originally Posted by Manichean

Doesn't the DOTALL only work if there's a dot as a quantifier? The only dot in that regex is a literal one.
(Edit: For some funny reason, in the quoted regex above, my browser displays vertical bars instead of backslashes in some cases. I assure you that there are no vertical bars there.)

DOTALL definitely affects '.'. What I am not sure of is whether or not it affects the character classes. The documentation implies that it does not, but I haven't tried it.

@janvanmaar: I didn't read your first post carefully. My understanding is that XPath produces a string for each HTML tag it visits. To know what the regexp will do, you must know the tag structure around the text. My guess is that you will find something like

Code:

<p>2.</p><p>Name of the chapter</p>

This input would account for '^[0-9]+\.$' working, because it is the content of an inner <p> tag. The rest will require multi-tag matching, which means regexps of a rather higher order. For example, you probably won't be able to use anchors.

When faced with this problem, I have done one of three things:
1. convert the PDF to epub, use an editor to enclose the chapter indicators in <h1>...</h1> tags, and convert again.
2. similar, but go through .txt so I can clean up other stuff such as paragraph endings.
3. live without a toc.
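Option 1 can be scripted instead of done by hand. The following is a rough sketch, not a tested recipe for any particular book: it assumes the intermediate markup has the shape seen in this thread (a paragraph holding only the number, immediately followed by a paragraph holding the title), and the function name `merge_chapter_headings` is made up for the example:

```python
import re

# Assumed shape: <p ...>N.</p> directly followed by <p ...>Title</p>.
CHAPTER = re.compile(
    r"<p[^>]*>\s*([0-9]+\.)\s*</p>\s*<p[^>]*>\s*([^<]+?)\s*</p>"
)

def merge_chapter_headings(xhtml: str) -> str:
    """Replace each number/title paragraph pair with one <h1> heading."""
    def build_heading(m):
        # Collapse line breaks inside the title into single spaces.
        title = " ".join(m.group(2).split())
        return f"<h1>{m.group(1)} {title}</h1>"
    return CHAPTER.sub(build_heading, xhtml)

sample = (
    '<p class="P-kapit">2.</p>\n'
    '<p class="P-P32">Name of chapter</p>\n'
    '<p>Body text.</p>'
)
print(merge_chapter_headings(sample))
```

After such a pass the heading is a single element, so an expression along the lines of `//h:h1` should be enough for the Level 1 TOC setting. A regex over markup is fragile, though, and will miss titles that contain inline tags.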

EDIT: Everything in this post has been covered above. I should just take a nap.

janvanmaar · 11-02-2010, 11:15 AM

Quote:

Everything in this post has been covered above.

Not everything: your last paragraph may be providing the right direction. Converting to a different format, modifying by hand/script and then converting to mobi should work... except that it is a bit tedious and I don't like it.

Anyway, thanks for all the comments, now I know what the problem is at least.

janvanmaar · 11-02-2010, 11:28 AM

Ok, looking at the problem, I think the main question is:

Is it possible to create the chapter name from multiple tags?
I know that I have something like

Code:

<p>2.</p>
<p>Name of the chapter</p>

in the XHTML and I can catch the first tag, e.g. by the regexp from my first post. Now is there a way to construct the chapter name "2. Name of the chapter" from the two tags above? Or is the interface of Calibre not general enough for this kind of task?

Manichean · 11-02-2010, 11:50 AM

Quote:

Originally Posted by janvanmaar

I think my suspicion was correct, the XHTML looks like this:

Code:

<p class="P-kapit">2.</p>
<p class="P-P32">Name of chapter</p>

So it seems that I cannot grep over multiple lines directly as they are not passed together to the TOC creation engine.

I've seen multiline matching working before, so I'm guessing that the creation engine sees the whole source at once. However, the problem here is that the regex (or XPath, which is what you'd have to use for TOC creation) doesn't match the source because of the tags present. I don't know XPath as well as I do regexes, but I'm guessing that

Code:

//h:p[re:test(., "[0-9]+\.")]/following-sibling::h:p[1][re:test(., "[a-z ]+", "i")]

should do the trick.

I don't know, however, if the matching works at all for multiple tags. Might be worth a try, though.

Manichean · 11-02-2010, 11:55 AM

Quote:

Originally Posted by chaley

DOTALL definitely affects '.'. What I am not sure of is whether or not it affects the character classes. The documentation implies that it does not, but I haven't tried it.

To test this, you'd need a character class (am I the only one who confuses this with RPGs?) that includes whitespace but by default doesn't include line breaks, wouldn't you? From what I know, the only character class including whitespace is \s, which by default includes the line break.
Also, like you said, the documentation implies otherwise.

