Old 03-26-2010, 08:30 PM   #1
ereader123
Junior Member
Join Date: Mar 2010
Device: Kindle
Need help with a conversion regex - can't match newline

I am trying to clean up a PDF from Packt to read on my Kindle. The first problem is an annoying footer on every page that contains text and an image. I cannot get the regex to work in the conversion, even though it matches when I test it in pyreb (a handy regex editor that uses Python's regex syntax).

When I click the wizard symbol for the footer regex, this is what the regex editor shows for the offending footer:

<p>
This material is copyright and is licensed for the sole use by Sam Smith on 19th March 2010</p><p>
1234 Main Street, , Seattle, , 12345</p><p>
<img src="index-3_1.jpg"/></p>

(name and address changed for this post).

I can get the regex builder to recognize the text, but as soon as I add the leading <p> tag it no longer matches. I thought the problem might be the newline between the <p> and the word "This", so I tried \s* there. I also noticed that the image name changes from page to page, so I put a \d in the image tag to match every image tag that follows this text:

<p>\s*This material is copyright and is licensed for the sole use by Sam Smith on 19th March 2010</p><p>\s*1234 Main Street, , Seattle, , 12345</p><p>\s*<img src="index-\d_1.jpg"/></p>

But it did not work: I cannot seem to match the newline character after the <p> tag.
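
For anyone who wants to reproduce this outside the regex builder, here is a minimal sketch using Python's standard `re` module. It assumes the parsed HTML is exactly the sample shown above (with the placeholder name and address) and it does not model whatever the conversion tool does to the text before applying the regex. The helper `footer_matches` and the `\d+` variant are illustrative additions, not part of the original attempt.

```python
import re

# Footer sample as shown in the post (name and address are placeholders).
SAMPLE = (
    "<p>\n"
    "This material is copyright and is licensed for the sole use by "
    "Sam Smith on 19th March 2010</p><p>\n"
    "1234 Main Street, , Seattle, , 12345</p><p>\n"
    '<img src="index-3_1.jpg"/></p>'
)

# The pattern from the post, unchanged.
PATTERN = (
    r"<p>\s*This material is copyright and is licensed for the sole use by "
    r"Sam Smith on 19th March 2010</p><p>\s*1234 Main Street, , Seattle, , 12345"
    r'</p><p>\s*<img src="index-\d_1.jpg"/></p>'
)

# Hypothetical variant: \d+ accepts image numbers with more than one digit.
PATTERN_MULTI_DIGIT = PATTERN.replace(r"\d_1", r"\d+_1")


def footer_matches(pattern, html):
    """Return True if the pattern is found anywhere in the html string."""
    return re.search(pattern, html) is not None


# Same footer, but with a two-digit image number.
page_12 = SAMPLE.replace("index-3_1", "index-12_1")

print(footer_matches(PATTERN, SAMPLE))               # single-digit image number
print(footer_matches(PATTERN, page_12))              # \d covers only one digit
print(footer_matches(PATTERN_MULTI_DIGIT, page_12))  # \d+ covers one or more
```

In plain Python the original pattern does match the sample, newlines included, so the mismatch presumably comes from something other than `\s` itself. Separately, a single `\d` only covers one digit, so image names with two or more digits would need something like `\d+`.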

I looked at the parsed output (which matches the HTML shown in the regex builder) in a hex editor, and the byte after the <p> is 0A, a newline, which \s should match. There are two newlines in the pattern I need to match. How do I match them?
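
The hex-editor check can also be sketched in a few lines of Python. This is only an illustration: the two input strings are made-up fragments, and `whitespace_after_p` is a hypothetical helper, not something provided by the conversion tool.

```python
import re


def whitespace_after_p(html):
    """Return the whitespace run that follows each <p> tag, for inspection."""
    return re.findall(r"<p>(\s*)", html)


# Made-up fragments: one with bare newlines, one with a CR+LF pair.
unix_style = "<p>\nThis material</p><p>\n1234 Main Street</p>"
windows_style = "<p>\r\nThis material</p>"

# Show the bytes of each whitespace run as hex, like a hex editor would.
print([run.encode("utf-8").hex() for run in whitespace_after_p(unix_style)])
print([run.encode("utf-8").hex() for run in whitespace_after_p(windows_style)])

# \s matches the 0A byte (newline) on its own.
print(re.fullmatch(r"\s", "\x0a") is not None)
```

Printing the hex of each run makes it easy to spot anything other than a bare 0A after the tag, such as a carriage return or a non-breaking space.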

Thanks!