Need help with a conversion regex - can't match newline
I am trying to clean up a PDF from Packt to read on my Kindle. The first problem is an annoying footer on every page, made up of some text and an image. I cannot get the regex to work, even though it matches when I test it in pyreb. BTW, pyreb is a cool regex editor that uses Python regular expressions.
When I click on the wizard symbol for the footer regex, this is what I see in the regex editor for the offending footer:
<p>
This material is copyright and is licensed for the sole use by Sam Smith on 19th March 2010</p><p>
1234 Main Street, , Seattle, , 12345</p><p>
<img src="index-3_1.jpg"/></p>
(name and address changed for this post).
I can get the regex builder to recognize the text, but when I add the leading <p> tag, it no longer recognizes the text. I thought the problem might be the newline between the <p> and the word This, so I added \s* after each <p>. I also noticed that the number in the image name changes, so I put a \d in the image tag to match all of the image tags that follow this text.
<p>\s*This material is copyright and is licensed for the sole use by Sam Smith on 19th March 2010</p><p>\s*1234 Main Street, , Seattle, , 12345</p><p>\s*<img src="index-\d_1.jpg"/></p>
That did not work. I cannot seem to match the newline character after the <p> tag.
I looked at the parsed output (which matched the html in the regex builder) in a hex editor and the character after the <p> is 0A, or a newline, which should be matched by \s. There are two newlines in the pattern I need to match - how do I match them?
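For anyone who wants to reproduce this outside the conversion tool, here is a minimal Python sketch of the check, assuming the footer text is exactly what the regex builder shows above. The names footer, FOOTER_RE and strip_footer are made up for illustration. I also used \d+ and an escaped dot in the image name, on the assumption that page numbers can run past one digit; that is a small change from the pattern I posted.

```python
import re

# Footer as displayed in the regex builder (name and address are placeholders).
# Each <p> is followed by a newline (0x0A), as seen in the hex editor.
footer = (
    "<p>\n"
    "This material is copyright and is licensed for the sole use by "
    "Sam Smith on 19th March 2010</p><p>\n"
    "1234 Main Street, , Seattle, , 12345</p><p>\n"
    '<img src="index-3_1.jpg"/></p>'
)

# \s* absorbs the newline after each <p>; \d+ is an assumption so that
# multi-digit page numbers (index-12_1.jpg) are also covered.
FOOTER_RE = re.compile(
    r"<p>\s*This material is copyright and is licensed for the sole use by "
    r"Sam Smith on 19th March 2010</p><p>\s*"
    r"1234 Main Street, , Seattle, , 12345</p><p>\s*"
    r'<img src="index-\d+_1\.jpg"/></p>'
)


def strip_footer(html: str) -> str:
    """Remove every occurrence of the footer from the given HTML text."""
    return FOOTER_RE.sub("", html)


if __name__ == "__main__":
    page = "<p>Real content</p>" + footer + "<p>More content</p>"
    print(strip_footer(page))
```

In plain Python the \s* does match the newline, so the pattern itself looks sound. If it still fails during conversion, one thing to check is whether the text the converter actually searches is identical to what the regex builder displays.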
Thanks!