Old 08-16-2011, 02:53 PM   #9
camilou
Junior Member
camilou began at the beginning.
 
Posts: 4
Karma: 10
Join Date: Aug 2011
Device: Nook
Quote:
Originally Posted by droylynn
I am trying to convert a different PDF and the author's name is on every other page header but the first name is on one line with a <br> then the surname is on the one below and mixed in with it is page numbers or what I think are page numbers.

I've tried following the examples by copying and pasting the whole thing then using [0-9] three times to cover the variables but it isn't working. Can anyone help with this sneaky problem?
OK, from what you've said, it seems the header looks something like this:

Quote:
Lewis<br>
Carroll 23
Right? Well, in that case the regular expression I'd use would be:
Quote:
Lewis<br>\nCarroll\s\d{1,3}
Let's go through it:
"Lewis" - This matches the string "Lewis", nothing weird about this one.
"<br>" - Matches the "<br>" that is used to make a line break.
"\n" - Depending on how it looks you might have to include this or not. For instance, if your text looks like this:
Quote:
"Lewis<br>Carroll"
Then you don't need it. But if it looks like:
Quote:
"Lewis<br>
Carroll"
As you can see, Carroll is on a new line, so you need to include the newline character; otherwise it won't be matched.
"Carroll\s" - matches the string "Carroll" followed by one whitespace character
"\d{1,3}" - matches numbers with 1 to 3 digits. It'd then match 11, but not 1234. I set that to three because must books have less than 999 pages. If you have a really long book you can add another digit by changing the text inside braces to "{1,4}". Same thing if you have a shorter book with less than 99 pages.

And that's it, I think. Try that one and tell me how it went.