04-05-2010, 05:13 AM   #1
A.T.E.
Join Date: Mar 2010
Location: Switzerland
Device: Sony PRS-300
Help with a regex

I would like to remove a footer that looks like this in the intermediate HTML output:

<b>1</b></p><p>FOOTER - CHAPTER, PARAGRAPH AND PAGE</p><p>

As you can see, the actual page number follows the "<b>" tag. After the first "<p>" comes a fixed text (FOOTER), and after the "-" comes a text that changes (CHAPTER, PARAGRAPH AND PAGE), which prevents me from using a simple "Remove All" command.

The input is a PDF file, the output can be an EPUB or whatever.

I have looked at the regex documentation, but since I have never done anything like this, it is too steep a mountain to climb for a beginner.

Basically I would like to have a regex that tells calibre to remove an expression that starts with:

<b>

that ends with:

</p><p>

that contains, between "<b>" and "</b>", a number that changes on every page

that contains "</b></p><p>FOOTER -"

and that contains, between "-" and "</p><p>", something variable (text and numbers), whatever it is.

It should simply be removed, not replaced by anything.
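
For illustration, here is a minimal sketch of a pattern that meets the conditions listed above, written in Python (an assumption, since the post names no language). The names FOOTER_RE, strip_footers and the sample string are hypothetical. It also assumes that the page number consists of digits only and that the variable part after "FOOTER -" never contains a "<" character. Because calibre is written in Python, the same pattern will probably work in its regex field too, but that has not been verified here.

```python
import re

# <b>, one or more digits, then the fixed "</b></p><p>FOOTER -" part,
# then any run of characters that are not "<", then the closing "</p><p>".
FOOTER_RE = re.compile(r"<b>\d+</b></p><p>FOOTER -[^<]*</p><p>")


def strip_footers(html: str) -> str:
    """Return html with every matching footer removed (replaced by nothing)."""
    return FOOTER_RE.sub("", html)


# Hypothetical sample in the shape described in the post.
sample = (
    "<p>end of page one <b>1</b></p><p>FOOTER - CHAPTER 1, PARAGRAPH 2, PAGE 1</p><p>"
    "start of page two</p>"
)
cleaned = strip_footers(sample)
print(cleaned)
```

Using "[^<]*" rather than ".*" keeps the match from running past the footer's own "</p><p>" into later paragraphs. The same function could be applied to the contents of a file on Windows or Linux by reading it into a string first.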

If calibre cannot do it, is there a way to do it with a script running on either a Windows or a Linux computer?

Thanks to anyone who can help me or at least point me in the right direction.