11-02-2010, 09:13 AM | #1 | |
Addict
Posts: 219
Karma: 404
Join Date: Nov 2010
Device: Kindle 3G, Samsung SIII
|
Multiple line regexp
I want to convert a PDF to MOBI, generating the table of contents based on a regexp. The format of a chapter header is as follows:
Quote:
Code:
^[0-9]+\.$
|
11-02-2010, 10:03 AM | #2 |
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
Try this:
Code:
^[0-9]+\.\s+[a-zA-Z ]+\s+[a-zA-Z ]+
Last edited by Manichean; 11-02-2010 at 10:05 AM. Reason: Fix regex
11-02-2010, 10:23 AM | #3 |
Addict
Posts: 219
Karma: 404
Join Date: Nov 2010
Device: Kindle 3G, Samsung SIII
|
Thanks for the reply. Unfortunately this does not work:
a) it does not catch multiple lines at all (which is what I need)
b) it catches e.g. a year or a date in the text
Simply put, I need the opposite of what this regexp does: not catching numbers followed by alphabetical text on the same line, but catching numbers followed by alphabetical text in the next N lines.
11-02-2010, 10:31 AM | #4 |
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
Huh. You sure? I just tested the expression in Python and it matches the test case you specified in your original post. Did you have a look at the source Calibre sees while applying the regex (use the magic wand symbol)?
Edit to add: Given your test case, I assumed you wanted a regex that catches a number followed by a dot, followed by two groups of letters (including spaces) separated by one or more whitespace characters. That's exactly what the regex I wrote matches.
Last edited by Manichean; 11-02-2010 at 10:33 AM.
11-02-2010, 10:33 AM | #5 |
Grand Sorcerer
Posts: 11,741
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
You probably need to use the DOTALL flag on your regexp so that \n is matched by '.' and the character classes.
Without looking at the regexp at all to see if my suggestion makes sense, try
Code:
(?s)^[0-9]+\.\s+[a-zA-Z ]+\s+[a-zA-Z ]+
11-02-2010, 10:35 AM | #6 | |
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
Quote:
(Edit: For some funny reason, in the quoted regex above, my browser displays vertical bars instead of backslashes in some cases. I assure you that there are no vertical bars there.)
|
11-02-2010, 10:45 AM | #7 |
Addict
Posts: 219
Karma: 404
Join Date: Nov 2010
Device: Kindle 3G, Samsung SIII
|
I suspect that lines are passed one by one to the TOC-creating code? That would explain the behaviour. Note also the trailing $ at the end of the *matching* regexp that I mentioned in my first post...
I am not sure I understand your magic wand source question. I thought the magic wand was just meant to specify the XPath in a semi-automatic way? Anyway, in case you are asking about the complete XPath I used, it was this (only trying to catch the first part of the chapter name to simplify things):
Code:
//*[re:test(., "^[0-9]+\.\s+[a-zA-Z]+")]
Last edited by janvanmaar; 11-02-2010 at 10:47 AM.
11-02-2010, 10:49 AM | #8 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Manichean's correct - DOTALL needs a dot. That said, \s+ will traverse newlines, so as long as that's in the right place the proposed regexes should be ok.
All that said, I can't tell what you're really trying to do. Where are you placing this regex so that it does something useful for you? PDF has some hard-coded regexes to do basically what you're asking for. One built-in PDF regex, the one with more false positives, is enabled only when preprocessing is enabled. I could have sworn that a single number followed by an optional dot (and optionally followed by a title on a second line) was already covered by the default regex. You could open a bug with your file if you like and we can take a look at it from there.
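The point about DOTALL and \s+ can be checked directly with Python's re module. This is a minimal sketch, assuming a two-line header like the hypothetical sample string below (the thread never shows the raw text Calibre sees):

```python
import re

# Hypothetical sample: a chapter number on one line, the title on the next.
sample = "2.\nName of chapter\n"

# \s matches newlines, so this pattern spans both lines without any flag.
spanning = re.compile(r"^[0-9]+\.\s+[a-zA-Z ]+")

# DOTALL only changes what '.' matches; it has no effect on \s or on
# character classes such as [a-zA-Z ].
with_dotall = re.compile(r"^[0-9]+\..+chapter", re.DOTALL)
without_dotall = re.compile(r"^[0-9]+\..+chapter")

spanning_match = spanning.match(sample)
dotall_match = with_dotall.match(sample)
plain_match = without_dotall.match(sample)

print(repr(spanning_match.group(0)))  # number and title, joined by the newline
print(dotall_match is not None)       # '.' may cross the newline here
print(plain_match is not None)        # '.' stops at the newline here
```

So a flag is only needed when the pattern relies on a literal dot to cross the line break; none of this helps if the text is handed to the regex one element at a time.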
11-02-2010, 10:56 AM | #9 |
Addict
Posts: 219
Karma: 404
Join Date: Nov 2010
Device: Kindle 3G, Samsung SIII
|
I think my suspicion was correct; the XHTML looks like this:
Code:
<p class="P-kapit">2.</p>
<p class="P-P32">Name of chapter</p>
Based on this, I think it should be simple to create the TOC with either the names of the chapters or their numbers (for this specific file). However, it is not clear to me how to construct an XPath that creates something like a "number: name" structure. Is that possible at all with the current Calibre interface?
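Why the multi-line pattern finds nothing can be illustrated outside Calibre. The sketch below uses only Python's standard library and assumes that the test is applied to the text of each paragraph separately, as the XHTML quoted above suggests; the `<body>` wrapper and the helper name `matching_paragraphs` are made up for the example and are not part of Calibre.

```python
import re
import xml.etree.ElementTree as ET

# Fragment modeled on the XHTML quoted above, wrapped in a hypothetical <body>.
xhtml = (
    "<body>"
    '<p class="P-kapit">2.</p>'
    '<p class="P-P32">Name of chapter</p>'
    "</body>"
)

multi_line = re.compile(r"^[0-9]+\.\s+[a-zA-Z]+")  # number and title together
number_only = re.compile(r"^[0-9]+\.$")            # number alone

def matching_paragraphs(root, pattern):
    """Return the text of each <p> whose own text matches the pattern."""
    hits = []
    for p in root.iter("p"):
        text = "".join(p.itertext())
        if pattern.search(text):
            hits.append(text)
    return hits

root = ET.fromstring(xhtml)
combined_hits = matching_paragraphs(root, multi_line)
number_hits = matching_paragraphs(root, number_only)

print(combined_hits)  # no single paragraph holds both number and title
print(number_hits)    # only the number paragraph matches
```

Under that assumption, no paragraph ever contains both the number and the title, so a pattern that needs both cannot match, while the number-only pattern with the trailing $ does.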
11-02-2010, 11:01 AM | #10 |
Addict
Posts: 219
Karma: 404
Join Date: Nov 2010
Device: Kindle 3G, Samsung SIII
|
ldolse: I sent the previous message before reading yours. What I was doing is simply using the following XPath in the Level 1 TOC settings in the conversion GUI:
Code:
//*[re:test(., "^[0-9]+\.\s+[a-zA-Z]+")]
11-02-2010, 11:05 AM | #11 | |
Grand Sorcerer
Posts: 11,741
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Quote:
@janvanmaar: I didn't read your first post carefully. My understanding is that XPath produces a string for each HTML tag, so to know what the regexp will do, you must know the tag structure around the text. My guess is that you will find something like
Code:
<p>2.</p><p>Name of the chapter</p>
When faced with this problem, I have done one of three things:
1. Convert the PDF to EPUB, use an editor to enclose the chapter indicators in <h1>...</h1> tags, and convert again.
2. Do the same, but go through .txt so I can clean up other things such as paragraph endings.
3. Live without a TOC.
EDIT: Everything in this post has been covered above. I should just take a nap.
|
11-02-2010, 11:15 AM | #12 | |
Addict
Posts: 219
Karma: 404
Join Date: Nov 2010
Device: Kindle 3G, Samsung SIII
|
Quote:
Anyway, thanks for all the comments; now at least I know what the problem is.
|
11-02-2010, 11:28 AM | #13 |
Addict
Posts: 219
Karma: 404
Join Date: Nov 2010
Device: Kindle 3G, Samsung SIII
|
Ok, looking at the problem, I think the main question is:
Is it possible to create the chapter name from multiple tags? I know that I have something like
Code:
<p>2.</p>
<p>Name of the chapter</p>
11-02-2010, 11:50 AM | #14 | |
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
Quote:
Code:
//h:p[re:test(., "[0-9]+\.")]//h:p[re:test(., "[a-z ]+", "i")]
I don't know, however, whether the matching works at all across multiple tags. Might be worth a try, though.
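As a rough sketch of the "number: name" idea, the pairing can be done outside Calibre with Python's standard library. This is not a Calibre feature. The sample markup, the helper name `toc_entries`, and the assumption that the number paragraph and the title paragraph are adjacent siblings in a flat list are all illustrative. Because the two paragraphs are siblings rather than nested, an XPath for this structure would probably need a following-sibling step instead of a descendant (`//`) step.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: number and title in adjacent sibling paragraphs,
# plus one body paragraph that contains a year.
xhtml = (
    "<body>"
    "<p>1.</p><p>First chapter</p><p>Some body text from 2010.</p>"
    "<p>2.</p><p>Name of the chapter</p>"
    "</body>"
)

NUMBER = re.compile(r"^[0-9]+\.$")               # a paragraph holding only "N."
TITLE = re.compile(r"^[a-z ]+$", re.IGNORECASE)  # letters and spaces only

def toc_entries(root):
    """Pair each number paragraph with the paragraph right after it."""
    paragraphs = ["".join(p.itertext()).strip() for p in root.iter("p")]
    entries = []
    # Walk over consecutive pairs; assumes a flat list of <p> elements.
    for number, title in zip(paragraphs, paragraphs[1:]):
        if NUMBER.match(number) and TITLE.match(title):
            entries.append(f"{number.rstrip('.')}: {title}")
    return entries

entries = toc_entries(ET.fromstring(xhtml))
print(entries)
```

Anchoring the number pattern to the whole paragraph keeps years and dates inside body text from being picked up as chapter numbers.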
|
11-02-2010, 11:55 AM | #15 | |
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
Quote:
Also, like you said, the documentation implies otherwise.
|
|