Old 02-27-2011, 03:36 PM   #1
captainslow
RegEx: Removing Page Numbers that have Spaces

I've tried my best to solve a problem I'm currently facing and I can't figure it out.

I'm going from PDF to ePub and I'm trying to remove page numbers. I've gone into search and replace in the PDF and I've found the following:

<hr>
<A name=258></a>[page numbers with spaces] <br>

So every time the page number is listed the numbers are separated with spaces along with two trailing spaces. For example, an actual entry is as follows:

<hr>
<A name=258></a>2 4 8 <br>

I can't figure out how to have Calibre simply find those page numbers and remove them. What I want it to do is either:

1. search for the </a> and <br> and ignore what's in between them. That way it doesn't matter how many digits and spaces are in between those two tags

2. Tell Calibre to search for anything that has one digit, OR two digits, OR three digits. That'll get rid of everything.

I've come up with this that clearly doesn't work:

<hr>\n<A name=\d{1,3}></a>\d\s\d\s\d\s\s<br>

The only problem with that is that it will only search for entries that it finds with three digits. I don't know how to make it search for one digit, or two, or three, or X.
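
To make the limitation concrete, here is a minimal Python sketch (calibre's regular expressions follow Python's syntax). The sample markers and the `page_marker` helper are made up for illustration; they assume digits separated by single spaces plus two trailing spaces, as described above.

```python
import re

# The fixed-width pattern from the post: exactly three digits, each followed
# by one whitespace character, plus one extra whitespace before <br>.
FIXED = re.compile(r"<hr>\n<A name=\d{1,3}></a>\d\s\d\s\d\s\s<br>")


def page_marker(anchor: int, page: int) -> str:
    """Hypothetical helper: build a marker like the ones in the converted PDF."""
    spaced = " ".join(str(page))  # 248 -> "2 4 8"
    return f"<hr>\n<A name={anchor}></a>{spaced}  <br>"


# One-, two- and three-digit page numbers (anchor values are arbitrary).
samples = {page: page_marker(page + 10, page) for page in (7, 42, 248)}
matched = {page: bool(FIXED.search(text)) for page, text in samples.items()}
print(matched)  # only the three-digit page number is found
```

Because every `\d\s` pair in the pattern is mandatory, shorter page numbers never match.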

The <A name=???> is easy because there are no spaces but once spaces are introduced I can't wrap my head around it. Any help would be awesome!
Old 02-27-2011, 03:40 PM   #2
captainslow
Well, I tried .* and that seemed to work:

<hr>\n<A name=\d{1,3}></a>.*<br>

I'm not sure why it works so would anyone be able to explain?

My understanding is that . matches any one character and * matches any number of the previous character, so .* would be saying "please match any characters".
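
As a rough check of that reading, the sketch below runs the `.*` pattern in plain Python over a few invented markers (the anchor and page values are assumptions, not taken from the actual book). It only shows that the number of characters between `</a>` and `<br>` no longer matters, including zero characters.

```python
import re

# The pattern from this post: anything (except a newline) between </a> and <br>.
pattern = re.compile(r"<hr>\n<A name=\d{1,3}></a>.*<br>")

# Invented markers with one, two and three digits, and one with nothing at all.
markers = [
    "<hr>\n<A name=17></a>7  <br>",
    "<hr>\n<A name=52></a>4 2  <br>",
    "<hr>\n<A name=258></a>2 4 8  <br>",
    "<hr>\n<A name=300></a><br>",
]

# Replace each match with an empty string, as a search-and-replace rule would.
cleaned = [pattern.sub("", marker) for marker in markers]
print(cleaned)  # every marker is removed completely
```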

Last edited by captainslow; 02-27-2011 at 03:42 PM.
Old 02-27-2011, 04:14 PM   #3
Manichean
Using .* can be a little dangerous, because the asterisk on its own is greedy: it tries to match as much text as possible. Try .*? instead, which matches as little text as possible. You could also just use a character set, for example:
Code:
<hr>\n<A name=\d{1,3}></a>[0-9 ]+<br>
(Notice the space in the set.) That should work.
A note on your understanding of ".*": the dot matches any character (except the newline, which requires a flag), and the asterisk matches 0 or more repetitions of the previous expression. The plus sign as a quantifier matches 1 or more of the previous expression, and the question mark matches 0 or 1 of it (except when it follows another quantifier, as in .*? above, where it makes that quantifier non-greedy).
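
To illustrate the difference between the greedy, non-greedy and character-set variants, here is a small Python sketch. The input strings are hypothetical: they assume a page marker that shares its line with body text, and a heading that sits in the same position as a page number.

```python
import re

# Hypothetical input: a second <br> follows on the same line as the marker.
html = "<hr>\n<A name=258></a>2 4 8 <br>Real text we want to keep.<br>\nNext line"
# Hypothetical heading placed where a page number would normally be.
heading = "<hr>\n<A name=12></a>Chapter 1<br>"

greedy = re.compile(r"<hr>\n<A name=\d{1,3}></a>.*<br>")
lazy = re.compile(r"<hr>\n<A name=\d{1,3}></a>.*?<br>")
char_set = re.compile(r"<hr>\n<A name=\d{1,3}></a>[0-9 ]+<br>")

# Greedy .* runs to the LAST <br> on the line and swallows the body text.
greedy_result = greedy.sub("", html)
# Non-greedy .*? stops at the FIRST <br>, so the body text survives.
lazy_result = lazy.sub("", html)
# The character set only accepts digits and spaces, so it also stops early.
char_set_result = char_set.sub("", html)

# .*? still accepts any text, so it deletes the heading; the set does not match.
lazy_heading = lazy.sub("", heading)
char_set_heading = char_set.sub("", heading)

for name, value in [
    ("greedy", greedy_result),
    ("lazy", lazy_result),
    ("char set", char_set_result),
    ("lazy on heading", lazy_heading),
    ("char set on heading", char_set_heading),
]:
    print(f"{name:>20}: {value!r}")
```

The character set is the strictest of the three: it leaves anything that is not purely digits and spaces alone.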

Edit: I guess a still safer way to write the expression would be
Code:
<hr>\s+<A name=\d{1,3}></a>[0-9 ]+<br>
(Notice the \s+ after the <hr>.) That will match differently encoded line breaks, while your expression only matches line breaks encoded as a single newline. Might be academic, though.
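
A quick way to see the effect in plain Python, using made-up samples with the three common line-break encodings. Whether calibre normalises line endings before applying the expression is not something this sketch assumes either way.

```python
import re

newline_only = re.compile(r"<hr>\n<A name=\d{1,3}></a>[0-9 ]+<br>")
any_whitespace = re.compile(r"<hr>\s+<A name=\d{1,3}></a>[0-9 ]+<br>")

# Invented samples that differ only in how the line break is encoded.
samples = {
    "lf": "<hr>\n<A name=258></a>2 4 8 <br>",
    "crlf": "<hr>\r\n<A name=258></a>2 4 8 <br>",
    "cr": "<hr>\r<A name=258></a>2 4 8 <br>",
}

# For each sample: (matched by the \n pattern, matched by the \s+ pattern).
results = {
    name: (bool(newline_only.search(text)), bool(any_whitespace.search(text)))
    for name, text in samples.items()
}
print(results)
```

Only the plain newline sample is found by the `\n` version, while `\s+` finds all three.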

Last edited by Manichean; 02-27-2011 at 04:17 PM.