Help with a regex
I would like to remove a footer that in the HTML intermediate output looks like this:
<b>1</b></p><p>FOOTER - CHAPTER, PARAGRAPH AND PAGE</p><p>
As you can see, the "<b>" tag holds the actual page number. After the first "<p>" comes a text that is always the same (FOOTER), and after the "-" comes a text that changes (CHAPTER, PARAGRAPH AND PAGE), which prevents me from using a simple "Remove All" command.
The input is a PDF file; the output can be an EPUB or any other format.
I have had a look at the regex documentation, but as I have never done anything like this, it is too steep a mountain to climb for a beginner.
Basically I would like to have a regex that tells calibre to remove an expression that starts with:
<b>
that ends with:
</p><p>
that contains between "<b>" and "</b>" a number that changes at every page
that contains "</b></p><p>FOOTER -"
and that contains between "-" and "</p><p>" something variable (text and numbers), whatever it is.
It should be just removed. Not replaced by anything.
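A minimal sketch of a pattern that fits the description above, assuming Python-style regular expression syntax and assuming the footer text itself never contains a "<" character. The names FOOTER_RE, strip_footers and sample are made up for illustration, and the optional whitespace (\s*) between tags is a guess in case the real HTML has line breaks there.

```python
import re

# <b>, digits (the page number), </b></p><p>FOOTER -, any text up to
# the next tag, then the closing </p><p>.
FOOTER_RE = re.compile(r"<b>\d+</b>\s*</p>\s*<p>FOOTER\s*-[^<]*</p>\s*<p>")


def strip_footers(html: str) -> str:
    """Remove every footer occurrence, replacing it with nothing."""
    return FOOTER_RE.sub("", html)


# Hypothetical sample built from the footer shown in the question.
sample = "some text <b>1</b></p><p>FOOTER - CHAPTER 2, PARAGRAPH 3, PAGE 1</p><p>more text"
print(strip_footers(sample))  # prints "some text more text"
```

If the same pattern string is pasted into a search field with the replacement left empty, it should behave the same way, but that has not been tested here.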
If calibre cannot do it, is there a way to do it with a script running either on a Windows or Linux computer?
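For the script route, a rough sketch in Python, which runs on both Windows and Linux. It assumes the intermediate HTML is a single UTF-8 file and that the footer text never contains a "<" character; the function name clean_file and the file names in the usage line are hypothetical.

```python
import re
import sys
from pathlib import Path

# Same idea as described above: page number in <b>...</b>, fixed
# "FOOTER -" text, variable text up to the next tag, closing </p><p>.
FOOTER_RE = re.compile(r"<b>\d+</b>\s*</p>\s*<p>FOOTER\s*-[^<]*</p>\s*<p>")


def clean_file(src: Path, dst: Path) -> int:
    """Copy src to dst with all footers removed; return how many were removed."""
    html = src.read_text(encoding="utf-8")  # assumption: UTF-8 input
    cleaned, count = FOOTER_RE.subn("", html)
    dst.write_text(cleaned, encoding="utf-8")
    return count


# Runs only when called with exactly two arguments: input and output file.
if __name__ == "__main__" and len(sys.argv) == 3:
    removed = clean_file(Path(sys.argv[1]), Path(sys.argv[2]))
    print(f"removed {removed} footer(s)")
```

Saved as, say, strip_footer.py, it would be run as: python strip_footer.py input.html output.html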
Thanks to those who will help me or at least point me in the right direction.