Old 04-25-2010, 10:49 PM   #1
prky
Member
prky began at the beginning.
 
Posts: 11
Karma: 10
Join Date: Nov 2009
Device: IPhone 3GS
Multiline Regex?

I'm having trouble using a regex to trim a multi-line block from a PDF.

The PDF is Ancestor by Scott Sigler, available free online: http://media.libsyn.com/media/scotts...cottSigler.pdf

Looking at it in the Regex builder, I see blocks like:


Code:
Pearcy pointed to another phone, this one built into the equipment-thick control panel. </p><p>
Order your copy from Amazon.com on April 1: Noon Eastern Time, 9AM Pacicific</p><p>
<a href="http://www.amazon.com/Ancestor-Scott-Sigler/dp/1896944736">http://www.amazon.com/Ancestor-Scott-Sigler/dp/1896944736</a></p><p>
Ancestor ~ Scott Sigler</p><p>
“That’s a straight line to Langley. Just pick it up and it will ring through.” </p><p>
I want to trim

Code:
^Order your copy.*Ancestor ~ Scott Sigler</p><p>$
If I use:

Code:
(?mi)^Order.*$
It highlights the line:

Code:
Order your copy from Amazon.com on April 1: Noon Eastern Time, 9AM Pacicific</p><p>
Which is expected.

But the minute I try to match multiple lines, it fails. E.g., if I use:

Code:
(?mi)^Order.*1896944736
It highlights nothing.

How can I get a regex to match:

Code:
Order your copy from Amazon.com on April 1: Noon Eastern Time, 9AM Pacicific</p><p>
<a href="http://www.amazon.com/Ancestor-Scott-Sigler/dp/1896944736">http://www.amazon.com/Ancestor-Scott-Sigler/dp/1896944736</a></p><p>
Ancestor ~ Scott Sigler</p><p>
?

Ta,

prk.
Old 04-25-2010, 11:57 PM   #2
pepak
Guru
pepak has a spectacular aura about
 
Posts: 610
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-T3, Kobo Libra H2O
I don't know about Calibre, but usual behavior of regexp libraries in multiline mode is to match ^ at the very beginning of the whole string rather than a line. Same with $. In multiline mode, you should search for something like:

Code:
\nOrder your copy...\n
And be particularly careful with matching a dot (.), as it will match anything including newlines. Code such as .+ will match the whole document. If you really need to use the dot, and frankly I would recommend trying something like [^\r\n] instead, at least use the ungreedy versions of + and *: +?, *?
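For comparison, this is how the two flags behave in Python's `re` module. That calibre's matching goes through `re` is an assumption at this point in the thread, and the sample text is made up. A minimal sketch: `m` (MULTILINE) lets `^` and `$` match at line boundaries, while `s` (DOTALL) lets the dot match newlines.

```python
import re

# Hypothetical three-line sample, not taken from the PDF.
text = "first line\nOrder your copy\nlast line"

# Without flags, ^ only matches at the very start of the string.
no_flags = re.findall(r"^Order.*", text)

# With m (MULTILINE), ^ also matches right after each newline.
multiline = re.findall(r"(?m)^Order.*", text)

# The dot stops at a newline unless s (DOTALL) is set.
dot_default = re.search(r"Order.*line", text)
dot_all = re.search(r"(?s)Order.*line", text)

# A negated class such as [^\r\n] never crosses a line, whatever the flags.
same_line = re.search(r"(?s)Order[^\r\n]*", text).group(0)
```

Other regex libraries name and combine these modes differently, so the flags are worth checking against the documentation of whichever engine is in use.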
Old 04-26-2010, 04:08 AM   #3
prky
Member
prky began at the beginning.
 
Posts: 11
Karma: 10
Join Date: Nov 2009
Device: IPhone 3GS
Thanks pepak.

The problem I have is that .* isn't matching across lines, or rather (?mi) or (?m) isn't making .* match across lines.

E.g., it won't even match

Code:
(?m)\nOrder your copy
i.e. it doesn't like the \n on one line, then Order on the next, yet it matches perfectly on

Code:
(?m)Order your copy
The minute I do anything to the regex to make it match the line before or the line after, it doesn't match anything.

Are you able to test any multiline regex (ideally from a pdf source) either using the remove header or remove footer options, and have the regex builder highlight two or more lines?

If it helps, I'm running calibre 0.6.49 now, but I couldn't get it working in 0.6.45 either.

prk.
Old 04-26-2010, 04:48 AM   #4
prky
Member
prky began at the beginning.
 
Posts: 11
Karma: 10
Join Date: Nov 2009
Device: IPhone 3GS
Okay, so I can make .* match across newlines by using the s flag.

And I can test it in python to confirm it works, but it doesn't work in calibre.

If I go to http://www.pythonregex.com/ and put in the regex as:

Code:
(?mis)Order your copy.*?Ancestor ~ Scott Sigler.*?$
and the String as:

Code:
Pearcy pointed to another phone, this one built into the equipment-thick control panel. </p><p>
Order your copy from Amazon.com on April 1: Noon Eastern Time, 9AM Pacicific</p><p>
<a href="http://www.amazon.com/Ancestor-Scott-Sigler/dp/1896944736">http://www.amazon.com/Ancestor-Scott-Sigler/dp/1896944736</a></p><p>
Ancestor ~ Scott Sigler</p><p>
“That’s a straight line to Langley. Just pick it up and it will ring through.” </p><p>
Then that regex tester reports that it's found:

Code:
# Run findall
>>> regex.findall(string)
[u'Order your copy from Amazon.com on April 1: Noon Eastern Time, 9AM Pacicific</p><p>\r\n<a href="http://www.amazon.com/Ancestor-Scott-Sigler/dp/1896944736">http://www.amazon.com/Ancestor-Scott-Sigler/dp/1896944736</a></p><p>\r\nAncestor ~ Scott Sigler</p><p>\r']
That's what I want it to match - so it looks like we're in business.
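The same check can be run locally with Python's `re` module. This is a minimal sketch: the sample string is a shortened stand-in for the text above, and the `\r\n` line endings are an assumption based on the tester output.

```python
import re

# Shortened stand-in for the sample above; \r\n line endings are assumed
# because the tester output shows them.
sample = (
    "control panel. </p><p>\r\n"
    "Order your copy from Amazon.com</p><p>\r\n"
    '<a href="http://www.amazon.com/Ancestor-Scott-Sigler/dp/1896944736">link</a></p><p>\r\n'
    "Ancestor ~ Scott Sigler</p><p>\r\n"
    "Just pick it up. </p><p>"
)

pattern = re.compile(r"(?mis)Order your copy.*?Ancestor ~ Scott Sigler.*?$")

match = pattern.search(sample)
matched_text = match.group(0)

# With m set, $ matches just before "\n", so the match keeps the "\r".
ends_with_cr = matched_text.endswith("\r")

# Removing the match leaves the surrounding lines in place.
cleaned = pattern.sub("", sample)
```

Note that the match stops before the final `\n`, which is why the tester output ends in `\r`.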

Except...

When I paste that exact same regex into the calibre regex builder, it doesn't highlight anything.

*tears more hair out*

prk.
Old 04-26-2010, 07:10 AM   #5
pepak
Guru
pepak has a spectacular aura about
 
Posts: 610
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-T3, Kobo Libra H2O
I think you should only use the s flag, not both s and m.
Old 04-27-2010, 08:47 AM   #6
prky
Member
prky began at the beginning.
 
Posts: 11
Karma: 10
Join Date: Nov 2009
Device: IPhone 3GS
Hi pepak,

Alas, using
Code:
(?is)Order your copy.*?Ancestor ~ Scott Sigler.*?$
doesn't work either

Have you been able to make any multiline regex work in Calibre?

Given the above test, where the regex works fine in http://www.pythonregex.com/ but it doesn't in Calibre, I'm suspecting it's a Calibre bug.

How do I go about confirming this / lodging it?

prk.
Old 04-27-2010, 09:32 AM   #7
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by prky
Given the above test, where the regex works fine in http://www.pythonregex.com/ but it doesn't in Calibre, I'm suspecting it's a Calibre bug.

How do I go about confirming this / lodging it?
I've seen several discussions of problems with multiline regex matching. I don't do much conversion, so I've never looked at it, but I've read the threads, and the impression I get is that it may lie in the fact that the conversion process is a pipeline, and the regex may be applied at a later point in the pipeline than is expected. Or I may be totally off base.

As to how to confirm, you can always look at the code, which is open. As to how to report it, the bug tracker is the place.

As to a workaround, have you tried using two single line matches? Again, I'm no expert, but I have the impression that you could single line match the header and the footer to remove two single lines. Despite the fact that the header/footer removals are labeled as such, I think they match anywhere.
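The workaround can be sketched in plain Python. This assumes the patterns are applied one after another to the same text, which is a guess about how the header and footer options behave; the sample text and both patterns are hypothetical.

```python
import re

# Hypothetical sample: two unwanted lines between two lines of real text.
text = (
    "real text before</p><p>\n"
    "Order your copy from Amazon.com</p><p>\n"
    "Ancestor ~ Scott Sigler</p><p>\n"
    "real text after</p><p>\n"
)

# Each pattern stays on one line, so no s flag is needed.
# The trailing \n? also removes the line break that followed the line.
single_line_patterns = [
    r"(?mi)^Order your copy.*$\n?",
    r"(?mi)^Ancestor ~ Scott Sigler.*$\n?",
]

cleaned = text
for pattern in single_line_patterns:
    cleaned = re.sub(pattern, "", cleaned)
```

This only helps when every unwanted line has text that distinguishes it from the real content.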
Old 04-27-2010, 10:51 AM   #8
pepak
Guru
pepak has a spectacular aura about
 
Posts: 610
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-T3, Kobo Libra H2O
Quote:
Originally Posted by prky
Hi pepak,

Alas, using
Code:
(?is)Order your copy.*?Ancestor ~ Scott Sigler.*?$
doesn't work either
I am not surprised. I told you not to use ^ and $.

Quote:
Have you been able to make any multiline regex work in Calibre?
I only use Calibre for conversions, from command-line at that. I don't know (nor care, really) what regexp library it uses. My examples are made for PCRE.
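In Python's `re` module at least, dropping the m flag changes what the trailing `.*?$` in the quoted pattern does: `$` then matches only at the end of the whole string (or just before a final newline), so with s set the lazy `.*?` has to run to the end of the text. A minimal sketch with a hypothetical, shortened sample:

```python
import re

# Hypothetical shortened sample with text after the unwanted block.
sample = (
    "before</p><p>\n"
    "Order your copy</p><p>\n"
    "Ancestor ~ Scott Sigler</p><p>\n"
    "after</p><p>\n"
    "the end"
)

with_m = re.search(r"(?ism)Order your copy.*?Ancestor ~ Scott Sigler.*?$", sample)
without_m = re.search(r"(?is)Order your copy.*?Ancestor ~ Scott Sigler.*?$", sample)

# With m, the match stops at the end of the "Ancestor" line.
stops_at_line_end = with_m.group(0).endswith("Scott Sigler</p><p>")

# Without m, the match swallows everything up to the end of the string.
runs_to_end = without_m.group(0).endswith("the end")
```

Leaving the `$` off altogether, and ending the pattern on literal text instead, avoids the question of which mode is active.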
Old 04-29-2010, 03:44 AM   #9
tonyx3
Connoisseur
tonyx3 began at the beginning.
 
Posts: 55
Karma: 10
Join Date: Jan 2010
Device: Nexus One
Just looked at the Bug tickets. It seems that there are two open tickets about this issue, but they're getting kinda old, and haven't received any recent attention.
Old 04-29-2010, 04:12 AM   #10
DoctorOhh
US Navy, Retired
DoctorOhh ought to be getting tired of karma fortunes by now.
 
Posts: 9,864
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
Quote:
Originally Posted by prky

The PDF is Ancestor by Scott Sigler, available free online: http://media.libsyn.com/media/scotts...cottSigler.pdf
I was able to quickly eliminate your concern by importing the pdf into mobipocket creator (free) and viewing the resultant html file. But the creators of this document have purposely made it very difficult (via formatting) to easily convert it to any other format.
Old 04-29-2010, 10:38 AM   #11
tonyx3
Connoisseur
tonyx3 began at the beginning.
 
Posts: 55
Karma: 10
Join Date: Jan 2010
Device: Nexus One
Quote:
Originally Posted by dwanthny
I was able to quickly eliminate your concern by importing the pdf into mobipocket creator (free) and viewing the resultant html file. But the creators of this document have purposely made it very difficult (via formatting) to easily convert it to any other format.

I'm pretty sure it's not just this one document, though. Having attempted it on several different files, I've never gotten a multiline regex to work, even when it did work in the Python regex tester. And I'm not the only one. There have been a couple of other threads about it, none with a good resolution, and there are a couple of old bug tickets about multiline regex not working.

It's not a fluke occurrence...
Old 04-29-2010, 04:12 PM   #12
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by tonyx3
I've never gotten a multiline regex to work...
It's not a fluke occurrence...
Did you try my suggestion of using two matches (header and footer), instead of one multiline match?
Old 04-29-2010, 05:49 PM   #13
chaley
Grand Sorcerer
chaley ought to be getting tired of karma fortunes by now.
 
Posts: 11,703
Karma: 6658935
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
There is in fact a problem here, but not the one the OP was suggesting.

First (the non-problem): the regex "(?ism)order your.*?736.*?$" removes the footer on this document, assuming that one remembers to check the remove footer box.

Second (the problem): the regexp tester does not show multi-line matches. The problem is that it uses the QSyntaxHighlighter, which my experimentation shows to be a line-oriented interface, making highlighting multiline matches impossible. I think that the regex tester should match directly against the text in the QTextEdit box, using something like setTextBackgroundColor to indicate matches. I admit that I haven't hacked the code to try this idea, but it seems plausible. I will file a ticket on this so that someone more acquainted than I am with the widgets can think about a solution.
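The difference between the two approaches can be imitated in plain Python, without any Qt code. This is only a simulation under stated assumptions: `highlight_per_line` and `highlight_whole_text` are hypothetical helpers, not calibre functions, and the sample text is made up. Matching line by line, as a line-oriented highlighter would, never sees a match that spans a newline, while matching against the whole text does.

```python
import re

def highlight_per_line(pattern, text):
    """Return (line_number, start, end) for matches found one line at a time."""
    regex = re.compile(pattern)
    spans = []
    for number, line in enumerate(text.split("\n")):
        for m in regex.finditer(line):
            spans.append((number, m.start(), m.end()))
    return spans

def highlight_whole_text(pattern, text):
    """Return (start, end) offsets for matches found in the full text."""
    return [(m.start(), m.end()) for m in re.finditer(pattern, text)]

# Hypothetical sample text.
text = "before\nOrder your copy\nlink 736\nafter"
pattern = r"(?ism)order your.*?736.*?$"

per_line = highlight_per_line(pattern, text)
whole = highlight_whole_text(pattern, text)
```

Here `per_line` comes back empty while `whole` holds one span covering both middle lines, which mirrors a tester that shows nothing for a regex that still works during conversion.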
Old 04-29-2010, 09:35 PM   #14
tonyx3
Connoisseur
tonyx3 began at the beginning.
 
Posts: 55
Karma: 10
Join Date: Jan 2010
Device: Nexus One
Quote:
Originally Posted by Starson17
Did you try my suggestion of using two matches (header and footer), instead of one multiline match?
I suppose that might work in some cases (where the two lines wrap the same every time), but it's a workaround, not a solution for the bug.


Quote:
Originally Posted by chaley
The problem is that it uses the QSyntaxHighlighter, which my experimentation shows to be a line-oriented interface, making highlighting multiline matches impossible.
Ahh, I'm glad someone with some know-how identified the issue. So are you saying that if the regex is valid across lines, it will work, even if the regex tester in calibre doesn't highlight across lines?
Old 04-29-2010, 11:20 PM   #15
adolson
Member
adolson began at the beginning.
 
Posts: 15
Karma: 10
Join Date: Apr 2010
Device: PRS-300
I have a very similar problem stripping a footer and header pair, and as far as I know, I can't break it into two regexps because my header is just some characters followed by </p><p>, just like the actual content of the pages.

Multi-line doesn't want to work for me. I am mostly familiar with Perl regex, and even after reading this thread and several others I couldn't figure this out.