Need help with a conversion regex - can't match newline

ereader123 · 03-26-2010, 08:30 PM

I am trying to clean up a PDF from Packt to read on my Kindle. The first problem is an annoying footer on every page that has text and an image. I cannot get the regex to work, even though it matches when I test it in pyreb. BTW, pyreb is a cool regex editor that uses Python regexes.

When I click on the wizard symbol for the footer regex, this is what I see in the regex editor for the offending footer:

<p> This material is copyright and is licensed for the sole use by Sam Smith on 19th March 2010</p><p> 1234 Main Street, , Seattle, , 12345</p><p> <img src="index-3_1.jpg"/></p>

(name and address changed for this post).

I can get the regex builder to recognize the text, but when I add the leading <p> tag, it no longer recognizes the text. I thought the problem might be the newline between the <p> and the word This, so I tried a \s. Also, I noticed that the name of the image changes, so I added a \d in the image tag to match all of the image tags that have this preceding text.

<p>\s*This material is copyright and is licensed for the sole use by Sam Smith on 19th March 2010</p><p>\s*1234 Main Street, , Seattle, , 12345</p><p>\s*<img src="index-\d_1.jpg"/></p>

but it did not work. I cannot seem to match the newline character after the <p> tag.

I looked at the parsed output (which matched the HTML in the regex builder) in a hex editor, and the character after the <p> is 0A, or a newline, which should be matched by \s. There are two newlines in the pattern I need to match - how do I match them?

Thanks!
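For anyone who wants to reproduce the pyreb test without pyreb, here is a minimal sketch in plain Python. The sample string is an assumption: it places a 0A newline right after each <p>, which is my reading of the description above, not a copy of the real file. It only shows how the pattern behaves in Python's re module and says nothing about how calibre applies it. One thing it does surface: a single \d only matches one digit, so a \d+ variant (my addition, not part of the original pattern) is needed once the image numbers reach two digits.

```python
import re

# Footer as described in the post. The newline positions are an assumption
# (a 0x0A byte directly after each <p>).
sample = (
    "<p>\nThis material is copyright and is licensed for the sole use by "
    "Sam Smith on 19th March 2010</p><p>\n"
    "1234 Main Street, , Seattle, , 12345</p><p>\n"
    '<img src="index-3_1.jpg"/></p>'
)

# The pattern from the post, unchanged.
posted = re.compile(
    r"<p>\s*This material is copyright and is licensed for the sole use by "
    r"Sam Smith on 19th March 2010</p><p>\s*1234 Main Street, , Seattle, , "
    r'12345</p><p>\s*<img src="index-\d_1.jpg"/></p>'
)

# Variant (not from the post): \d+ accepts multi-digit image numbers and
# the dot before "jpg" is escaped so it only matches a literal dot.
variant = re.compile(posted.pattern.replace(r"\d_1.jpg", r"\d+_1\.jpg"))

# The same footer as it might appear on a later page (hypothetical name).
page12 = sample.replace("index-3_", "index-12_")

print("posted vs sample: ", bool(posted.search(sample)))
print("posted vs page12: ", bool(posted.search(page12)))
print("variant vs page12:", bool(variant.search(page12)))
```

If the posted pattern matches the sample here but not inside the conversion, the difference is more likely in how the text reaches the regex than in \s itself.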

Manichean · 03-28-2010, 02:12 PM

I had a similar problem in this thread. So far, I haven't been able to get it to work (though I have to admit I stopped trying a few days after making that last post).

ereader123 · 03-29-2010, 10:58 AM

It appears to me that the regex matching in calibre is broken. Using a Python script, I was able to successfully edit the parsed HTML file with a regex that uses \n for a line break, but the same regex does not work when I use it through the calibre interface or from the calibre command line tool ebook-convert.

For anyone having issues with the header or footer regex in calibre, one way to get around it is to:
1. Enable debug so that the files for the intermediate conversion steps are written to a directory, and then convert your ebook without worrying about the footer regex.
2. Create your regex normally, but use it from a Python prompt to modify the file. The file you want to modify is the index.html in the parsed directory from step 1.
3. Load the modified index.html back into calibre as a new book. Then complete the conversion to whatever format you want from that book.
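Step 2 can be sketched roughly as follows. This is not the exact script I used, only an illustration: the FOOTER pattern, the function names, and the file path in the usage note are all assumptions, and the name and address are the changed ones from the first post. It allows optional whitespace between the tags and uses \d+ for the image number.

```python
import re
from pathlib import Path

# Hypothetical footer pattern; adjust the literal text to your own file.
FOOTER = re.compile(
    r"<p>\s*This material is copyright and is licensed for the sole use by "
    r"Sam Smith on 19th March 2010</p>\s*<p>\s*1234 Main Street, , Seattle, , "
    r'12345</p>\s*<p>\s*<img src="index-\d+_1\.jpg"/></p>'
)


def strip_footers(html: str) -> tuple[str, int]:
    """Return the HTML with every footer removed, plus the number removed."""
    return FOOTER.subn("", html)


def clean_file(path: str, encoding: str = "utf-8") -> int:
    """Rewrite the file at `path` in place without the footers."""
    source = Path(path)
    cleaned, count = strip_footers(source.read_text(encoding=encoding))
    source.write_text(cleaned, encoding=encoding)
    return count
```

From a Python prompt, something like clean_file("debug/parsed/index.html") would then do the edit, where the path is a placeholder for wherever your debug output was written. Keep a copy of the original file, since the function overwrites it.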

Not very elegant, but it does work.

I will submit a bug report on this topic.



