Old 12-17-2010, 04:13 PM   #1
winterminute
Junior Member
winterminute began at the beginning.
 
Posts: 3
Karma: 10
Join Date: Dec 2010
Device: NookColor, maybe
My RegEx isn't doing what I hoped to remove page numbers and a fixed string

The person who built the PDF I'm using used a trial version of some XML formatter, which adds a line of text to every page. The text is hidden in the PDF, but it shows up when I convert to ePUB. I figured I could just remove it using a RegEx in the Header/Footer setting, but no luck.

Code:
String:
<a href="http://www.antennahouse.com">Antenna House XSL Formatter (Evaluation)  http://www.antennahouse.com</a><br>

RegEx:
<a href="http://www.antennahouse.com">Antenna House XSL Formatter (Evaluation)  http://www.antennahouse.com</a><br>
I'd also like to remove page numbers and page titles. Here's an example:

Code:
String:
<A name=13></a><IMG src="index-13_1.jpg"><br>Title <br>11 <br>

RegEx:
<A name=[0-9][0-9][0-9]></a><IMG src="index-[0-9][0-9][0-9]_1.jpg"><br>Title <br>[0-9][0-9][0-9] <br>

Did I completely misunderstand how regular expressions work?
Old 12-17-2010, 04:18 PM   #2
kiwidude
calibre/Sigil Developer
kiwidude ought to be getting tired of karma fortunes by now.
 
Posts: 4,223
Karma: 1333994
Join Date: Oct 2010
Location: London, UK
Device: Kindle Paperwhite 3G, iPad 3, iPad Air
Replace [0-9][0-9][0-9] with \d+

Your current expression says you must have exactly three digits - either use [0-9]+ or \d+ to say you want one or more digits.

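Also, you might want to escape that minus sign after index (use \- instead of -), although a hyphen only has a special meaning inside a [...] character class, so it should match either way.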

i.e. your final regex is:
Code:
<A name=\d+></a><IMG src="index\-\d+_1.jpg"><br>Title <br>\d+ <br>
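
To see the difference outside calibre, here is a minimal sketch using Python's `re` module, on the assumption that the conversion dialog accepts the same regular expression syntax. The sample string is the one from the first post; the variable names are only for illustration.

```python
import re

# Sample line from the first post (anchor 13, printed page number 11).
sample = '<A name=13></a><IMG src="index-13_1.jpg"><br>Title <br>11 <br>'

# Original attempt: each [0-9][0-9][0-9] demands exactly three digits.
three_digits = re.compile(
    r'<A name=[0-9][0-9][0-9]></a><IMG src="index-[0-9][0-9][0-9]_1.jpg">'
    r'<br>Title <br>[0-9][0-9][0-9] <br>'
)

# Suggested fix: \d+ accepts one or more digits.
one_or_more = re.compile(
    r'<A name=\d+></a><IMG src="index\-\d+_1.jpg"><br>Title <br>\d+ <br>'
)

three_digits_found = three_digits.search(sample) is not None
one_or_more_found = one_or_more.search(sample) is not None

print(three_digits_found)  # False: "13" and "11" have only two digits
print(one_or_more_found)   # True
```

The three-digit version would only start matching at page 100, which is why it finds nothing in the early pages.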
Old 12-17-2010, 04:22 PM   #3
kiwidude
Forgot to answer your other query. Make sure you escape any regex special characters in your expression, such as the periods and parentheses in your example. So anywhere you see a . or a ( or ) put a \ in front of it in your regex so it becomes \. \( \) etc.
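
As a rough illustration of why the escaping matters, the sketch below uses Python's `re` module (assuming calibre follows the same syntax) on the fixed string from the first post. `re.escape()` is a standard-library helper that inserts the backslashes for you, which is one way to check which characters need them.

```python
import re

# Fixed string from the first post (two spaces after the closing parenthesis).
literal = ('<a href="http://www.antennahouse.com">Antenna House XSL Formatter '
           '(Evaluation)  http://www.antennahouse.com</a><br>')

# Used unescaped, "(Evaluation)" is a group that matches "Evaluation"
# without the parentheses, so the pattern cannot match the literal text.
unescaped_found = re.search(literal, literal) is not None

# re.escape() puts a backslash in front of every special character.
escaped_pattern = re.escape(literal)
escaped_found = re.search(escaped_pattern, literal) is not None

print(unescaped_found)  # False
print(escaped_found)    # True
```

Printing `escaped_pattern` shows where backslashes were added. The unescaped periods would still have matched (a `.` matches any character, including a period); it is the parentheses that stop the original expression from matching.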
Old 12-17-2010, 04:40 PM   #4
winterminute
Wow - That was a quick reply. Thank you. I made the changes, but still no dice. Turns out that the IMG tag only exists in the first few pages and I also realized that I can combine both the fixed string and the title/page.

So, here's what I'm working with:
Code:
<a href="http://www.antennahouse.com">Antenna House XSL Formatter (Evaluation)  http://www.antennahouse.com</a><br><hr><A name=14></a>Title <br>12 <br>
and here's my modified RegEx:
Code:
<a href="http://www\.antennahouse\.com">Antenna House XSL Formatter \(Evaluation\)  http://www\.antennahouse\.com</a><br><hr><A name=\d+></a>Title <br>\d+ <br>
However, the Test button doesn't appear to do anything which I assume means my RegEx is wrong. Am I missing some special chars? Is there a list of what needs to be escaped?
Old 12-17-2010, 06:35 PM   #5
kiwidude
Expression looks fine to me. When you hit the test button, what it should do is highlight in yellow all the places it found a match. So if you scroll the window to where you know it should find a match then you should see it turn yellow instantly when you click Test.

If you mean it does nothing in the epub output, make sure you remembered to actually tick the "Remove headers" checkbox itself, not just set the regular expression. I have made that mistake a few times...
Old 12-17-2010, 08:54 PM   #6
ldolse
Wizard
ldolse is an accomplished Snipe hunter.
 
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
You should try replacing all empty space with \s+, \s*, or \s. I find that most times it won't work if you actually try to leave the empty space in.
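
A small sketch of the idea in Python's `re` module (assuming calibre uses the same syntax). The three variants are hypothetical: they stand in for the ways the converter might actually render the gap that looks like two spaces on screen.

```python
import re

# Hypothetical variants of the same footer text; only the whitespace differs.
variants = [
    'Formatter (Evaluation)  http://www.antennahouse.com',   # two spaces
    'Formatter (Evaluation) http://www.antennahouse.com',    # one space
    'Formatter (Evaluation)\thttp://www.antennahouse.com',   # a tab
]

# Literal spaces: matches only the exact spacing typed into the pattern.
literal_spaces = re.compile(
    r'Formatter \(Evaluation\)  http://www\.antennahouse\.com')

# \s+ matches any run of one or more whitespace characters.
flexible = re.compile(
    r'Formatter\s+\(Evaluation\)\s+http://www\.antennahouse\.com')

literal_hits = [literal_spaces.search(v) is not None for v in variants]
flexible_hits = [flexible.search(v) is not None for v in variants]

print(literal_hits)   # [True, False, False]
print(flexible_hits)  # [True, True, True]
```

Use `\s*` instead of `\s+` where the whitespace may be absent altogether.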

It's also a good idea to build the regex in pieces so you can use the test function at each stage. If something goes wrong it's easier to figure it out that way.
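
The same piece-by-piece approach can be sketched in Python (assuming calibre's regex syntax matches Python's `re`). The helper `first_failing_piece` and the sample text are hypothetical; the text differs from the post #4 sample by a single space so that there is something to find.

```python
import re

def first_failing_piece(pieces, text):
    """Return the index of the first piece whose addition breaks the match,
    or None if all pieces joined together match somewhere in text."""
    for count in range(1, len(pieces) + 1):
        prefix = ''.join(pieces[:count])
        if re.search(prefix, text) is None:
            return count - 1
    return None

# The expression from post #4, split into pieces.
pieces = [
    r'<a href="http://www\.antennahouse\.com">',
    r'Antenna House XSL Formatter \(Evaluation\)  http://www\.antennahouse\.com',
    r'</a><br><hr>',
    r'<A name=\d+></a>',
    r'Title <br>\d+ <br>',
]

# Hypothetical input: one space instead of two after "(Evaluation)".
text = ('<a href="http://www.antennahouse.com">Antenna House XSL Formatter '
        '(Evaluation) http://www.antennahouse.com</a><br><hr>'
        '<A name=14></a>Title <br>12 <br>')

failing = first_failing_piece(pieces, text)
print(failing)  # 1: the match breaks when the second piece is added
```

In the calibre dialog the equivalent is to paste one piece, click Test, and keep appending pieces until the yellow highlight disappears.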
Old 12-19-2010, 10:55 PM   #7
winterminute
Alright, I'm an idiot. There were some CRLFs in the source that I was stripping out while editing. ldolse's suggestion to get it working in smaller pieces is what led me to realize that the line breaks were where the mistake was.

Thanks for all the help.
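
For anyone hitting the same problem, here is a minimal sketch in Python's `re` module (assuming calibre uses the same syntax). The positions of the CRLFs in `raw_page` are an assumption, since the posted sample had them stripped; the trailing sentence is made-up page content.

```python
import re

# Hypothetical raw text: the sample from post #4 with CRLF line breaks
# restored at assumed positions, followed by made-up page content.
raw_page = ('<a href="http://www.antennahouse.com">Antenna House XSL Formatter '
            '(Evaluation)  http://www.antennahouse.com</a><br>\r\n'
            '<hr>\r\n'
            '<A name=14></a>Title <br>\r\n'
            '12 <br>\r\n'
            'First paragraph of the page.')

# Expression from post #4, unchanged: assumes everything is on one line.
single_line = re.compile(
    r'<a href="http://www\.antennahouse\.com">Antenna House XSL Formatter '
    r'\(Evaluation\)  http://www\.antennahouse\.com</a><br><hr>'
    r'<A name=\d+></a>Title <br>\d+ <br>'
)

# Same expression with \s* wherever a line break may occur.
tolerant = re.compile(
    r'<a href="http://www\.antennahouse\.com">Antenna House XSL Formatter '
    r'\(Evaluation\)  http://www\.antennahouse\.com</a><br>\s*<hr>\s*'
    r'<A name=\d+></a>Title <br>\s*\d+ <br>\s*'
)

single_line_found = single_line.search(raw_page) is not None
cleaned = tolerant.sub('', raw_page)

print(single_line_found)  # False: the CRLFs sit between the tags
print(cleaned)            # First paragraph of the page.
```

Because `\s` also matches `\r` and `\n`, the tolerant version works whether or not the line breaks are present.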
All times are GMT -4.