MobileRead Forums - View Single Post - Need help with a conversion regex

ereader123 · 03-26-2010, 08:30 PM

I am trying to clean up a PDF from Packt to read on my Kindle. The first problem is an annoying footer on every page, made up of text and an image. I cannot get the regex to work, even though it matches when I test it in pyreb. BTW, pyreb is a cool regex editor that uses Python regex.

When I click the wizard symbol for the footer regex, this is what I see in the regex editor for the offending footer:


<p>
This material is copyright and is licensed for the sole use by Sam Smith on 19th March 2010</p><p>
1234 Main Street, , Seattle, , 12345</p><p>
<img src="index-3_1.jpg"/></p>

(name and address changed for this post).

I can get the regex builder to recognize the text, but when I add the leading <p> tag, it no longer recognizes the text. I thought the problem might be the newline between the <p> and the word This, so I tried a \s. Also, I noticed that the name of the image changes, so I added a \d in the image tag to match all of the image tags that have this preceding text.

<p>\s*This material is copyright and is licensed for the sole use by Sam Smith on 19th March 2010</p><p>\s*1234 Main Street, , Seattle, , 12345</p><p>\s*<img src="index-\d_1.jpg"/></p>

but it did not work. I cannot seem to match the newline character after the <p> tag.

I looked at the parsed output (which matched the html in the regex builder) in a hex editor and the character after the <p> is 0A, or a newline, which should be matched by \s. There are two newlines in the pattern I need to match - how do I match them?
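
For reference, a minimal sketch of checking this outside the conversion tool, assuming Python's standard re module. The sample string is an assumption: it places a single newline (0x0A) directly after each <p>, as the hex dump suggests. The \d+ and the escaped dot are small tightenings that are not in the pattern above.

```python
import re

# Sample footer as described in the post. The exact position of the
# newlines is an assumption based on the hex dump (0x0A after each <p>).
footer = (
    "<p>\nThis material is copyright and is licensed for the sole use "
    "by Sam Smith on 19th March 2010</p><p>\n"
    "1234 Main Street, , Seattle, , 12345</p><p>\n"
    '<img src="index-3_1.jpg"/></p>'
)

# \s matches any whitespace character, including "\n".
# \d+ (instead of \d) also covers image names with multi-digit numbers.
pattern = re.compile(
    r"<p>\s*This material is copyright and is licensed for the sole use "
    r"by Sam Smith on 19th March 2010</p><p>\s*"
    r"1234 Main Street, , Seattle, , 12345</p><p>\s*"
    r'<img src="index-\d+_1\.jpg"/></p>'
)

newline_matches = re.fullmatch(r"\s", "\n") is not None
footer_matches = pattern.fullmatch(footer) is not None
print(newline_matches, footer_matches)
```

If both checks pass in plain Python but the footer is still not removed during conversion, that would suggest the text the conversion actually runs the regex against differs from what the regex builder shows, rather than a problem with \s itself.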

Thanks!
