Multiline Regex? - Page 2

chaley · 04-30-2010, 12:51 AM

Quote:

Originally Posted by tonyx3

Ahh, I'm glad someone with some know-how identified the issue.

All of the people who posted on this topic have "some know-how". However, not everyone is willing or able to volunteer a couple of hours of his/her time looking at a problem that is of little personal interest.

Quote:

So are you saying that if the regex is valid across lines, it will work, even if the regex tester in calibre doesn't highlight across lines?

Yes. For example, the regexp I posted removes the multiline footers on OP's PDF.

tonyx3 · 04-30-2010, 01:21 AM

Ok, so I suppose I didn't word that very tactfully. I just meant that until you came along with an explanation of the technical cause of the issue, we were all just applying our knowledge of regex, but none of us could get it to work, and none of us knew why. It's also been an issue in multiple threads and a couple of bug tickets, so it's not isolated to this thread.

In any case, thanks for spending your time on it.

chaley · 04-30-2010, 10:47 AM

I gave a fix for the regexp tester to Kovid (ticket #5414). I imagine that the fix (or something like it) will be released without too much delay.

adolson · 04-30-2010, 05:18 PM

I see that Kovid accepted the fix and will put it in the next release, which I guess will be pretty soon, looking at the release history (I am 1 day old to Calibre and eBooks in general - this is an impressive project, and the frequent releases and fast development are amazing to me).

I may just wait until then, but in the meantime, I just installed the latest version and am running Ubuntu 10.04. Here's a clip of my PDF from the regex test page:

Quote:

anyone else to know I was using it. Ball Tongue and I would go score some on the sly, after band practice or some other time when the rest of the guys weren’t around. It went on that way for a few 50
the final piece
months, until I found out something interesting: Munky was doing speed on the days he was off, too. 
And guess what? So was Jonathan.

The red part (the "50" and the "the final piece" line) is what I want to get rid of. That's the page number (the footer of the book) and the chapter or book title (the header of each page). If I separate them into two, I can't come up with a header regex that works for removing the header line, because it matches the content (some characters and then a ).

Here is what I tried. I have been converting to TXT format for quick viewing, though EPUB gives the same result:

(?ism)\d+.*?$

(?m)(\d+.*?)

(?mi)(\d+.*?$)

(?mi)(\d+$^.*?)

...and many other variants...

I based these ideas on the regex given on page 1 that was said to work for multi-line, but I can't figure it out. I'm sure it's something obvious that I'm doing wrong, too. Can anyone help?

adolson · 04-30-2010, 10:03 PM

OK, so, my regex looks right... In the regex tester of the new release, .51, it highlights EXACTLY what I want to remove.

(?ism)(\d+</p><p>.*?</p><p>)

However, it doesn't actually work when I do the conversion, and yes, I did check off the box...

Edit: OK, I can't actually seem to get any regex to work, now. Do I need to install something in particular for it to work?

pepak · 05-01-2010, 12:26 AM

1) If it did work, it would be quite dangerous - it could easily remove text you don't want removed.

2) I still don't understand why most of you people keep using the "m" flag, which is NOT suitable for the usage cases displayed in this thread. For example, Adolson's regexp should only use ?is, not ?ism.
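
To illustrate point 1, here is a minimal Python sketch. The pattern shape (a number ending a paragraph, followed by one more paragraph) and the sample HTML, including the "1993" sentence, are assumptions made up for the illustration; they are not taken from the actual PDF or from calibre's real intermediate HTML.

```python
import re

# Hypothetical pattern: a number at the end of a paragraph,
# followed by the whole next paragraph.
pattern = re.compile(r"(?is)\d+</p>\s*<p>.*?</p>")

# Invented sample: one real footer/header pair, plus an ordinary
# paragraph that happens to end in a number.
html = (
    "<p>It went on that way for a few 50</p>\n"
    "<p>the final piece</p>\n"
    "<p>months. The band formed in 1993</p>\n"
    "<p>And guess what? So was Jonathan.</p>"
)

matches = pattern.findall(html)
for m in matches:
    print(repr(m))
```

The first match is the footer/header pair, but the second one swallows "1993" together with a paragraph of real content, which is the kind of accidental removal to watch for.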

adolson · 05-01-2010, 01:40 AM

Quote:

Originally Posted by pepak

1) If it did work, it would be quite dangerous - it could easily remove text you don't want removed.

2) I still don't understand why most of you people keep using the "m" flag, which is NOT suitable for the usage cases displayed in this thread. For example, Adolson's regexp should only use ?is, not ?ism.

1) When I put in my query, the Regex Builder highlights only stuff that I want removed. So therefore, it should work, right?

2) I put the m because chaley's post indicated a regex that was supposed to work, and I based mine on that. I tried without the m as well, it doesn't work either.

This one appears to have worked in vim, using the html generated by the debug output.
:%s/[0-9]\{1,}<\/p>\n$\s*\S$\{5,}<\/p>//g

kovidgoyal · 05-01-2010, 10:24 AM

Note that the HTML displayed in the regex builder is not absolutely identical to the html that is used in the conversion process, especially with regard to whitespace. So you have to make your regex tolerate differences in whitespace.
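
As a rough sketch of what "tolerate differences in whitespace" can look like in Python: the two sample strings below are invented, and the exact whitespace calibre produces during conversion is an assumption here, not something taken from its output.

```python
import re

# Two renderings of the same hypothetical footer/header pair;
# only the whitespace differs.
as_shown_in_builder = "few 50</p><p>the final piece</p><p>months"
as_seen_in_conversion = "few 50</p>\n\n<p> the final piece </p>\n<p>months"

# Strict: requires the tags to be directly adjacent.
strict = re.compile(r"\d+</p><p>[^<]*</p>")
# Tolerant: allows optional whitespace around the tags.
tolerant = re.compile(r"\d+\s*</p>\s*<p>[^<]*</p>")

for name, rx in (("strict", strict), ("tolerant", tolerant)):
    print(name,
          bool(rx.search(as_shown_in_builder)),
          bool(rx.search(as_seen_in_conversion)))
```

The strict pattern only matches the first string, while the tolerant one matches both, so a pattern that highlights correctly in the builder can still miss during conversion.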

chaley · 05-01-2010, 11:57 AM

Quote:

Originally Posted by pepak

2) I still don't understand why most of you people keep using the "m" flag, which is NOT suitable for the usage cases displayed in this thread. For example, Adolson's regexp should only use ?is, not ?ism.

Why is it not suitable? Multiline makes '^' and '$' match around internal newlines (see here for details), something useful and in fact suitable for the regexp I used to solve the OP's problem. What behavior are you objecting to?
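
A small Python sketch of that behavior, using an invented three-line string (not the real PDF text):

```python
import re

text = "a few 50\nthe final piece\nmonths, until"

# Default: '$' matches only at the end of the whole string.
without_m = re.findall(r"\d+$", text)
# With (?m): '$' also matches just before each internal newline.
with_m = re.findall(r"(?m)\d+$", text)
# With (?m): '^' also matches right after each internal newline.
header = re.findall(r"(?m)^the final piece$", text)

print(without_m, with_m, header)
```

Without the m flag the page number is not found because it does not sit at the end of the string; with it, both the number and the header line can be anchored to their own lines.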

prky · 05-01-2010, 09:54 PM

Quote:

Originally Posted by pepak

I am not surprised. I told you not to use ^ and $.

Cheers for that.

Following your suggestions, I looked up the behaviour of ^ and $ in python when using multi-line regex, and found that while they still match at the start / end of the whole string, they also match at the start / end of each line.

In combination with the s flag (which lets . match newlines, so .* can span multiple lines) and the suggestion of .*? (for a non-greedy, minimal match), I had a regex which worked in the python tester.
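
For anyone following along, the effect of the s flag and of .*? can be sketched in Python like this. The sample text is invented for the illustration.

```python
import re

text = "few 50\nthe final piece\nmonths, until\nthe end"

# Without the s flag, '.' stops at a newline, so the match cannot cross lines.
no_s = re.search(r"\d+.*?piece", text)
# With the s flag, '.' also matches '\n'.
with_s = re.search(r"(?s)\d+.*?piece", text)
# Greedy '.*' runs to the LAST "the"; non-greedy '.*?' stops at the FIRST.
greedy = re.search(r"(?s)\d+.*the", text)
lazy = re.search(r"(?s)\d+.*?the", text)

print(no_s)
print(repr(with_s.group()))
print(repr(greedy.group()))
print(repr(lazy.group()))
```

The greedy version is the one that tends to eat far more text than intended once the s flag is on, which is why the minimal match matters here.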

The bit I was missing was that I wasn't testing each regex on the conversion each time, as I was using the highlighting in the regex builder to see if it was matching.

Thanks for your assistance.

prk.

prky · 05-01-2010, 09:56 PM

Quote:

Originally Posted by chaley

There is in fact a problem here, but not the one the OP was suggesting.

Second (the problem): the regexp tester does not show multi-line matches. The problem is that it uses the QSyntaxHighlighter, which my experimentation shows to be a line-oriented interface, making highlighting multiline matches impossible. I think that the regex tester should match directly against the text in the QTextEdit box, using something like setTextBackgroundColor to indicate matches. I admit that I haven't hacked the code to try this idea, but it seems plausible. I will file a ticket on this so that someone more acquainted than I am with the widgets can think about a solution.
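
The difference between line-at-a-time matching and whole-text matching can be simulated in plain Python. This sketch does not use Qt or calibre's code at all; the per-line loop is only an assumed stand-in for how a line-oriented highlighter sees the text, and the sample string is invented.

```python
import re

text = "for a few 50\nthe final piece\nmonths, until"
rx = re.compile(r"(?s)\d+\n.*?piece")

# Whole-text matching finds the span that crosses the newline.
whole = [m.span() for m in rx.finditer(text)]

# Line-at-a-time matching never sees both lines at once,
# so a pattern that needs the newline finds nothing.
per_line = [m.span() for line in text.split("\n") for m in rx.finditer(line)]

print(whole, per_line)
```

The regex is valid and matches the full text, yet the per-line pass reports no matches, which mirrors a tester that shows no highlighting for a pattern that works in the conversion.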

Sweet.

Thank you so much for diagnosing that, and working out it's the display in the tester which was the issue (once I'd eventually got a working regex).

Much appreciated.

prk.



