Old 03-26-2010, 08:30 PM   #1
ereader123
Junior Member
Device: Kindle
Need help with a conversion regex - can't match newline

I am trying to clean up a PDF from Packt to read on my Kindle. The first problem is an annoying footer on every page that contains text and an image. I cannot get the regex to work in calibre, even though it matches when I test it in pyreb. BTW, pyreb is a cool regex editor that uses Python regexes.

When I click on the wizard symbol for the footer regex, this is what I see in the regex editor for the offending footer:

<p>
This material is copyright and is licensed for the sole use by Sam Smith on 19th March 2010</p><p>
1234 Main Street, , Seattle, , 12345</p><p>
<img src="index-3_1.jpg"/></p>

(name and address changed for this post).

I can get the regex builder to recognize the text, but when I add the leading <p> tag, it no longer recognizes it. I thought the problem might be the newline between the <p> and the word "This", so I tried a \s. I also noticed that the name of the image changes, so I added a \d in the image tag to match all of the image tags that have this preceding text.

<p>\s*This material is copyright and is licensed for the sole use by Sam Smith on 19th March 2010</p><p>\s*1234 Main Street, , Seattle, , 12345</p><p>\s*<img src="index-\d_1.jpg"/></p>

but it did not work. I cannot seem to match the newline character after the <p> tag.

I looked at the parsed output (which matched the html in the regex builder) in a hex editor and the character after the <p> is 0A, or a newline, which should be matched by \s. There are two newlines in the pattern I need to match - how do I match them?
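For reference, here is a minimal sketch of the same check in plain Python, outside calibre. It only shows that \s* does match the 0A newlines in the footer as posted; it says nothing about how calibre applies the pattern. The variable names and the second, more general pattern are illustrative assumptions, not something taken from calibre.

```python
import re

# Footer HTML as shown in the post (name and address already changed there).
sample = (
    "<p>\n"
    "This material is copyright and is licensed for the sole use by "
    "Sam Smith on 19th March 2010</p><p>\n"
    "1234 Main Street, , Seattle, , 12345</p><p>\n"
    '<img src="index-3_1.jpg"/></p>'
)

# The pattern from the post, unchanged (only split across lines for readability).
original = re.compile(
    r"<p>\s*This material is copyright and is licensed for the sole use by "
    r"Sam Smith on 19th March 2010</p><p>\s*"
    r"1234 Main Street, , Seattle, , 12345</p><p>\s*"
    r'<img src="index-\d_1.jpg"/></p>'
)

# Assumed variant: \d+ allows image numbers with more than one digit,
# and \. matches a literal dot instead of any character.
general = re.compile(
    r"<p>\s*This material is copyright and is licensed for the sole use by "
    r"Sam Smith on 19th March 2010</p><p>\s*"
    r"1234 Main Street, , Seattle, , 12345</p><p>\s*"
    r'<img src="index-\d+_1\.jpg"/></p>'
)

# Hypothetical footer from a later page, where the image number has two digits.
later_page = sample.replace("index-3_1.jpg", "index-12_1.jpg")

matches_posted_footer = original.search(sample) is not None
original_matches_later = original.search(later_page) is not None
general_matches_later = general.search(later_page) is not None
```

Note that a single \d matches exactly one digit, so the pattern as posted would stop matching once the image number reaches two digits, if the file names continue in the same scheme.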

Thanks!
Old 03-28-2010, 02:12 PM   #2
Manichean
Wizard
Device: Cybook Gen3
I had a similar problem in this thread. So far, I haven't been able to get it to work (though I have to admit I stopped trying a few days after making that last post).
Old 03-29-2010, 10:58 AM   #3
ereader123
Junior Member
Device: Kindle
It appears to me that regex matching in calibre is broken. Using a Python script, I was able to edit the parsed HTML file with a regex that uses \n for the line break, but the identical regex does not work when I use it through the calibre interface or from the calibre command line (ebook-convert).

For anyone having issues with the header or footer regex in calibre, one way to get around it is to:
1. Enable debug so that the files for the intermediate conversion steps are written to a directory, then convert your ebook without worrying about the footer regex.
2. Create your regex normally, but use it from a Python prompt to modify the file. The file you want to modify is the index.html in the parsed directory from step 1.
3. Load the modified index.html back into calibre as a new book, then complete the conversion to whatever format you want from that book.

Not very elegant, but it does work.
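Step 2 could look roughly like the sketch below. This is not the exact script I ran, just one way to do it under a few assumptions: the parsed index.html is UTF-8, the footer text is the anonymized one from the first post, and the function names and the .bak backup are made up for illustration. The pattern also allows optional whitespace between the paragraphs and uses \d+ for the image number, which goes slightly beyond the original regex.

```python
import re
from pathlib import Path

# Footer pattern based on the one in the first post. Assumptions beyond it:
# \s* between paragraphs, \d+ for multi-digit image numbers, escaped dot.
FOOTER_RE = re.compile(
    r"<p>\s*This material is copyright and is licensed for the sole use by "
    r"Sam Smith on 19th March 2010</p>\s*"
    r"<p>\s*1234 Main Street, , Seattle, , 12345</p>\s*"
    r'<p>\s*<img src="index-\d+_1\.jpg"/></p>'
)


def strip_footers(html):
    """Return (cleaned_html, number_of_footers_removed)."""
    return FOOTER_RE.subn("", html)


def clean_file(path):
    """Remove footers from the file in place, keeping a .bak copy.

    Returns the number of footers removed. UTF-8 is an assumption
    about the encoding of the parsed file.
    """
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    cleaned, count = strip_footers(text)
    # Keep the untouched original next to the file before overwriting it.
    path.with_name(path.name + ".bak").write_text(text, encoding="utf-8")
    path.write_text(cleaned, encoding="utf-8")
    return count


# Example call (the directory name is hypothetical; use whatever debug
# directory you chose in step 1):
# clean_file("debug/parsed/index.html")
```

Checking the returned count against the number of pages is a quick way to see whether any footers were missed before loading the file back into calibre.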

I will submit a bug report on this topic.