04-25-2010, 10:49 PM | #1 |
Member
Posts: 11
Karma: 10
Join Date: Nov 2009
Device: IPhone 3GS
|
Multiline Regex?
I'm having trouble trimming a multi-line match from a PDF with a regex.
The PDF is Ancestor by Scott Sigler, available free online: http://media.libsyn.com/media/scotts...cottSigler.pdf
Looking at it in the Regex builder, I see blocks like:
Code:
Pearcy pointed to another phone, this one built into the equipment-thick control panel. </p><p>
Order your copy from Amazon.com on April 1: Noon Eastern Time, 9AM Pacicific</p><p>
<a href="http://www.amazon.com/Ancestor-Scott-Sigler/dp/1896944736">http://www.amazon.com/Ancestor-Scott-Sigler/dp/1896944736</a></p><p>
Ancestor ~ Scott Sigler</p><p>
“That’s a straight line to Langley. Just pick it up and it will ring through.” </p><p>
Code:
^Order your copy.*Ancestor ~ Scott Sigler</p><p>$
Code:
(?mi)^Order.*$
Code:
Order your copy from Amazon.com on April 1: Noon Eastern Time, 9AM Pacicific</p><p>
But the minute I try multiple lines, it fails. E.g., if I use:
Code:
(?mi)^Order.*1896944736
How can I get a regex to match:
Code:
Order your copy from Amazon.com on April 1: Noon Eastern Time, 9AM Pacicific</p><p>
<a href="http://www.amazon.com/Ancestor-Scott-Sigler/dp/1896944736">http://www.amazon.com/Ancestor-Scott-Sigler/dp/1896944736</a></p><p>
Ancestor ~ Scott Sigler</p><p>
Ta, prk. |
04-25-2010, 11:57 PM | #2 |
Guru
Posts: 610
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-T3, Kobo Libra H2O
|
I don't know about Calibre, but the usual behavior of regexp libraries is for ^ to match only at the very beginning of the whole string rather than of each line, unless multiline mode is switched on. Same with $. If the anchors don't behave as expected, you could search for the newlines explicitly, with something like:
Code:
\nOrder your copy...\n |
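For Python specifically, the anchor behavior can be checked with a few lines. This is a minimal sketch, assuming the pattern ends up in Python's re module; the sample text is a shortened, made-up stand-in and the variable names are illustrative only. In Python, ^ and $ match only at the ends of the whole string by default, the m flag makes them match at every line, and an explicit \n works with or without the flag.

```python
import re

# Shortened, made-up stand-in for the converted PDF text.
text = "first line</p><p>\nOrder your copy</p><p>\nlast line</p><p>"

# No flags: ^ only matches at the very start of the whole string.
no_flag = re.findall(r"^Order.*$", text)

# m flag: ^ and $ also match at the start and end of each line.
with_m = re.findall(r"(?m)^Order.*$", text)

# Explicit newlines: works without any flag.
explicit = re.findall(r"\nOrder.*\n", text)

print(no_flag)
print(with_m)
print(explicit)
```

Note that none of these lets .* cross a line break; that is a separate question from where ^ and $ match.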
04-26-2010, 04:08 AM | #3 |
Member
Posts: 11
Karma: 10
Join Date: Nov 2009
Device: IPhone 3GS
|
Thanks pepak.
The problem I have is that .* isn't matching across lines, or rather that (?mi) or (?m) isn't making .* match across lines. E.g., it won't even match:
Code:
(?m)\nOrder your copy
Code:
(?m)Order your copy
Are you able to test any multiline regex (ideally from a PDF source) using either the remove header or the remove footer option, and have the regex builder highlight two or more lines? If it helps, I'm running calibre 0.6.49 now, but I couldn't get it working in 0.6.45 either. prk. |
04-26-2010, 04:48 AM | #4 |
Member
Posts: 11
Karma: 10
Join Date: Nov 2009
Device: IPhone 3GS
|
Okay, so I can make .* match across newlines by using the s flag.
And I can test it in Python to confirm it works, but it doesn't work in calibre. If I go to http://www.pythonregex.com/ and put in the regex as:
Code:
(?mis)Order your copy.*?Ancestor ~ Scott Sigler.*?$
Code:
Pearcy pointed to another phone, this one built into the equipment-thick control panel. </p><p>
Order your copy from Amazon.com on April 1: Noon Eastern Time, 9AM Pacicific</p><p>
<a href="http://www.amazon.com/Ancestor-Scott-Sigler/dp/1896944736">http://www.amazon.com/Ancestor-Scott-Sigler/dp/1896944736</a></p><p>
Ancestor ~ Scott Sigler</p><p>
“That’s a straight line to Langley. Just pick it up and it will ring through.” </p><p>
Code:
# Run findall
>>> regex.findall(string)
[u'Order your copy from Amazon.com on April 1: Noon Eastern Time, 9AM Pacicific</p><p>\r\n<a href="http://www.amazon.com/Ancestor-Scott-Sigler/dp/1896944736">http://www.amazon.com/Ancestor-Scott-Sigler/dp/1896944736</a></p><p>\r\nAncestor ~ Scott Sigler</p><p>\r']
Except... When I paste that exact same regex into the calibre regex builder, it doesn't highlight anything *tears more hair out* prk. |
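The effect of the s flag can be reproduced in plain Python. This is a minimal sketch, assuming Python 3's re module and \r\n line endings as they appear in the findall output; the variable names are illustrative only.

```python
import re

# Sample block from the post, joined with \r\n as in the findall output.
lines = [
    "Pearcy pointed to another phone, this one built into the equipment-thick control panel. </p><p>",
    "Order your copy from Amazon.com on April 1: Noon Eastern Time, 9AM Pacicific</p><p>",
    '<a href="http://www.amazon.com/Ancestor-Scott-Sigler/dp/1896944736">'
    "http://www.amazon.com/Ancestor-Scott-Sigler/dp/1896944736</a></p><p>",
    "Ancestor ~ Scott Sigler</p><p>",
    "“That’s a straight line to Langley. Just pick it up and it will ring through.” </p><p>",
]
text = "\r\n".join(lines)

# Without s, "." stops at "\n", so the pattern cannot span lines.
without_s = re.findall(r"(?mi)Order your copy.*?Ancestor ~ Scott Sigler.*?$", text)

# With s, "." also matches "\n" and the three footer lines are found as one match.
with_s = re.findall(r"(?mis)Order your copy.*?Ancestor ~ Scott Sigler.*?$", text)

print(len(without_s), len(with_s))
```

With the m flag, $ matches just before the \n, which is why the match in the output above ends with a stray \r.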
04-26-2010, 07:10 AM | #5 |
Guru
Posts: 610
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-T3, Kobo Libra H2O
|
I think you should only use the s flag, not both s and m.
|
04-27-2010, 08:47 AM | #6 |
Member
Posts: 11
Karma: 10
Join Date: Nov 2009
Device: IPhone 3GS
|
Hi pepak,
Alas, using
Code:
(?is)Order your copy.*?Ancestor ~ Scott Sigler.*?$
Have you been able to make any multiline regex work in Calibre? Given the above test, where the regex works fine in http://www.pythonregex.com/ but doesn't in Calibre, I suspect it's a Calibre bug. How do I go about confirming this and lodging it? prk. |
04-27-2010, 09:32 AM | #7 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
As to how to confirm, you can always look at the code, which is open. As to how to report it, the bug tracker is the place. As to a workaround, have you tried using two single line matches? Again, I'm no expert, but I have the impression that you could single line match the header and the footer to remove two single lines. Despite the fact that the header/footer removals are labeled as such, I think they match anywhere. |
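The workaround of using only single-line matches could look roughly like this in plain Python. It is a sketch only: both patterns are illustrative guesses rather than tested calibre settings, the sample text is shortened with a made-up link, and it assumes the removal behaves like re.sub with an empty replacement, which this thread does not confirm.

```python
import re

# Shortened stand-in for the converted text; "\n" line endings assumed.
text = (
    "control panel. </p><p>\n"
    "Order your copy from Amazon.com on April 1</p><p>\n"
    '<a href="http://www.amazon.com/x">http://www.amazon.com/x</a></p><p>\n'
    "Ancestor ~ Scott Sigler</p><p>\n"
    "“That’s a straight line to Langley.” </p><p>\n"
)

# Hypothetical patterns: each one only ever matches within a single line.
header_re = r'(?mi)^(?:Order your copy|<a href="http://www\.amazon\.com).*$\n?'
footer_re = r"(?mi)^Ancestor ~ Scott Sigler</p><p>$\n?"

cleaned = re.sub(header_re, "", text)
cleaned = re.sub(footer_re, "", cleaned)
print(cleaned)
```

Because no pattern has to cross a line break, this does not depend on the s flag at all. The catch is that each pattern must be specific enough not to hit ordinary content lines.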
|
04-27-2010, 10:51 AM | #8 | ||
Guru
Posts: 610
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-T3, Kobo Libra H2O
|
Quote:
Quote:
|
||
04-29-2010, 03:44 AM | #9 |
Connoisseur
Posts: 55
Karma: 10
Join Date: Jan 2010
Device: Nexus One
|
Just looked at the Bug tickets. It seems that there are two open tickets about this issue, but they're getting kinda old, and haven't received any recent attention.
|
04-29-2010, 04:12 AM | #10 | |
US Navy, Retired
Posts: 9,865
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
|
Quote:
|
|
04-29-2010, 10:38 AM | #11 | |
Connoisseur
Posts: 55
Karma: 10
Join Date: Jan 2010
Device: Nexus One
|
Quote:
I'm pretty sure it's not just this one document, though. Having attempted it on several different files, I've never gotten a multiline regex to work, even when it did work in the Python regex tester. And I'm not the only one. There have been a couple of other threads about it, none with a good resolution, and there are a couple of old bug tickets about multiline regex not working. It's not a fluke occurrence... |
|
04-29-2010, 04:12 PM | #12 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
|
04-29-2010, 05:49 PM | #13 |
Grand Sorcerer
Posts: 11,940
Karma: 7219261
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
There is in fact a problem here, but not the one the OP was suggesting.
First (the non-problem): the regex "(?ism)order your.*?736.*?$" removes the footer on this document, assuming that one remembers to check the remove footer box.
Second (the problem): the regexp tester does not show multi-line matches. It uses QSyntaxHighlighter, which my experimentation shows to be a line-oriented interface, so highlighting multiline matches is impossible. I think the regex tester should match directly against the text in the QTextEdit box, using something like setTextBackgroundColor to indicate matches. I admit that I haven't hacked the code to try this idea, but it seems plausible. I will file a ticket on this so that someone more acquainted than I am with the widgets can think about a solution. |
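Both points can be illustrated in plain Python without any Qt code. This is a rough sketch, assuming the footer removal amounts to a re.sub over the whole text and that a line-oriented highlighter only ever sees one line at a time; neither detail is taken from calibre's source, and the sample text is a stand-in joined with \n.

```python
import re

pattern = re.compile(r"(?ism)order your.*?736.*?$")

text = "\n".join([
    "control panel. </p><p>",
    "Order your copy from Amazon.com on April 1: Noon Eastern Time, 9AM Pacicific</p><p>",
    '<a href="http://www.amazon.com/Ancestor-Scott-Sigler/dp/1896944736">'
    "http://www.amazon.com/Ancestor-Scott-Sigler/dp/1896944736</a></p><p>",
    "Ancestor ~ Scott Sigler</p><p>",
    "“That’s a straight line to Langley.” </p><p>",
])

# Against the whole text, the match runs from "Order your" to the end of
# the line that contains the first "736".
whole_text_matches = pattern.findall(text)
cleaned = pattern.sub("", text)

# Fed one line at a time, the same pattern never matches, because no single
# line contains both "order your" and "736".
per_line_matches = [m for line in text.split("\n") for m in pattern.findall(line)]

print(len(whole_text_matches), len(per_line_matches))
```

With this stand-in text the match stops at the end of the link line, so the "Ancestor ~ Scott Sigler" line is left in place; the pattern would have to run on to "Sigler" to take that line out as well.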
04-29-2010, 09:35 PM | #14 | ||
Connoisseur
Posts: 55
Karma: 10
Join Date: Jan 2010
Device: Nexus One
|
Quote:
Quote:
|
||
04-29-2010, 11:20 PM | #15 |
Member
Posts: 15
Karma: 10
Join Date: Apr 2010
Device: PRS-300
|
I have a very similar problem stripping a footer and header pair, and as far as I know, I can't break it into two regexps because my header is just some characters followed by </p><p>, just like the actual content of the pages.
Multi-line doesn't want to work for me. Now, I am mostly familiar with perl regex, but I read this thread and several others, and I couldn't figure this out. |
|