03-26-2010, 08:30 PM | #1 |
Junior Member
Posts: 4
Karma: 10
Join Date: Mar 2010
Device: Kindle
|
Need help with a conversion regex - can't match newline
I am trying to clean up a PDF from Packt to read on my Kindle. The first problem is an annoying footer on every page, made up of text and an image. I cannot get the regex to work in calibre, even though it matches when I test it in pyreb. BTW, pyreb is a cool regex editor that uses Python regex.
When I click the wizard symbol for the footer regex, this is what I see in the regex editor for the offending footer (name and address changed for this post):
<p> This material is copyright and is licensed for the sole use by Sam Smith on 19th March 2010</p><p> 1234 Main Street, , Seattle, , 12345</p><p> <img src="index-3_1.jpg"/></p>
I can get the regex builder to recognize the text, but when I add the leading <p> tag, it no longer recognizes it. I thought the problem might be the newline between the <p> and the word This, so I tried a \s. I also noticed that the name of the image changes, so I added a \d in the image tag to match all of the image tags that have this preceding text:
<p>\s*This material is copyright and is licensed for the sole use by Sam Smith on 19th March 2010</p><p>\s*1234 Main Street, , Seattle, , 12345</p><p>\s*<img src="index-\d_1.jpg"/></p>
It did not work. I cannot seem to match the newline character after the <p> tag. I looked at the parsed output (which matched the HTML in the regex builder) in a hex editor, and the character after the <p> is 0A, a newline, which should be matched by \s. There are two newlines in the pattern I need to match. How do I match them? Thanks! |
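For reference, here is a minimal sketch of how the footer pattern behaves in plain Python `re`, outside calibre. The sample string is built from the footer quoted in the post, with a 0A newline after each <p>. The function name `strip_footer` is made up for illustration. Two details are assumptions rather than something the post confirms: `\d+` instead of `\d` (in case the image number runs past 9), and an extra `\s*` between paragraphs. This only shows that `\s` does match the newline in Python itself; it does not explain why the same pattern fails inside calibre.

```python
import re

# Sample footer as described in the post: a newline (0x0A) follows each <p>.
# The name and address are the placeholders used in the post.
SAMPLE = (
    "<p>\nThis material is copyright and is licensed for the sole use "
    "by Sam Smith on 19th March 2010</p>"
    "<p>\n1234 Main Street, , Seattle, , 12345</p>"
    '<p>\n<img src="index-3_1.jpg"/></p>'
)

# \s* absorbs the newline after each <p>; \d+ (an assumption) allows
# multi-digit image numbers; the dot in ".jpg" is escaped to match literally.
FOOTER = re.compile(
    r"<p>\s*This material is copyright and is licensed for the sole use "
    r"by Sam Smith on 19th March 2010</p>\s*"
    r"<p>\s*1234 Main Street, , Seattle, , 12345</p>\s*"
    r'<p>\s*<img src="index-\d+_1\.jpg"/></p>'
)


def strip_footer(html):
    """Return html with every occurrence of the footer removed."""
    return FOOTER.sub("", html)


if __name__ == "__main__":
    # repr() makes any leftover whitespace visible.
    print(repr(strip_footer("<p>Page text</p>" + SAMPLE)))
```

If this matches in a Python prompt but not in calibre, the pattern itself is probably not the problem.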
03-28-2010, 02:12 PM | #2 |
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
I had a similar problem in this thread. So far, I haven't been able to get it to work (though I have to admit I stopped trying a few days after making that last post).
|
03-29-2010, 10:58 AM | #3 |
Junior Member
Posts: 4
Karma: 10
Join Date: Mar 2010
Device: Kindle
|
It appears to me that the regex matching in calibre is broken. Using a Python script, I was able to successfully edit the parsed HTML file with a regex that uses \n for the line break, but the identical regex does not work through the calibre interface or from the calibre command line (ebook-convert).
For anyone having issues with the header or footer regex in calibre, one way to get around it is to:
1. Enable debug so the files for the intermediate conversion steps are written to a directory, then convert your ebook without worrying about the footer regex.
2. Create your regex normally, but use it from a Python prompt to modify the file. The file you want to modify is the index.html in the parsed directory from step 1.
3. Load the modified index.html back into calibre as a new book, then complete the conversion from that book to whatever format you want.
Not very elegant, but it does work. I will submit a bug report on this topic. |
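Step 2 of the workaround could look roughly like the sketch below. It is not the exact script used here. The helper name `strip_footer_in_file` and the path in the usage comment are hypothetical, and the sketch assumes the parsed index.html is UTF-8. You supply your own compiled footer pattern.

```python
import re
from pathlib import Path


def strip_footer_in_file(path, pattern, encoding="utf-8"):
    """Delete every match of pattern from the file at path, in place.

    Returns the number of matches removed, so you can check that the
    regex actually hit something before reloading the file into calibre.
    """
    path = Path(path)
    text = path.read_text(encoding=encoding)
    cleaned, count = pattern.subn("", text)
    path.write_text(cleaned, encoding=encoding)
    return count


# Example use from a Python prompt (the path is hypothetical; point it at
# the index.html in the parsed directory of your own debug output):
#
#   footer = re.compile(r"<p>\s*This material is copyright .*?</p>")
#   strip_footer_in_file("debug/parsed/index.html", footer)
```

A return value of 0 means the pattern did not match anything in the file, which is worth checking before moving on to step 3.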