MobileRead Forums - View Single Post

A.T.E. · 04-05-2010, 05:13 AM


#1
A.T.E. · Member · Posts: 14 · Karma: 10 · Join Date: Mar 2010 · Location: Switzerland · Device: Kobo Clara HD, Kobo Aura H2O, Sony PRS-300, FBReader

Help with a regex

I would like to remove a footer that looks like this in the intermediate HTML output:

<b>1</b></p><p>FOOTER - CHAPTER, PARAGRAPH AND PAGE</p><p>

As you can see, the actual page number comes after the "<b>" tag. After the first "<p>" there is text that is always the same (FOOTER), and after the "-" there is text that changes (CHAPTER, PARAGRAPH AND PAGE), which prevents me from using a simple "Remove All" command.

The input is a PDF file; the output can be an EPUB or whatever.

I have had a look at the regex documentation, but as I have never done anything like this, it is too steep a mountain to climb for a beginner.

Basically, I would like a regex that tells calibre to remove an expression that:

starts with: <b>
ends with: </p><p>
contains, between "<b>" and "</b>", a number that changes on every page
contains "</b></p><p>FOOTER -"
contains, between "-" and "</p><p>", something variable (text and numbers), whatever it is

It should just be removed, not replaced by anything.

If calibre cannot do it, is there a way to do it with a script running on either a Windows or a Linux computer?

Thanks to anyone who can help me or at least point me in the right direction.
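
The requirements listed in the post can be written down as a single regular expression. The following is only a minimal sketch in Python, which runs on both Windows and Linux. The pattern is built purely from the description above and has not been tested against the real PDF output. The names FOOTER_RE and strip_footers are made up for this illustration, and the optional whitespace between tags is an assumption in case the converter inserts line breaks there.

```python
import re

# Assumed pattern, derived only from the description in the post:
#   <b> + page number (digits) + </b></p><p> + "FOOTER -" + variable text + </p><p>
# \s* tolerates optional whitespace between tags (an assumption).
# .*? is non-greedy, so the match stops at the first </p><p> after "FOOTER -".
FOOTER_RE = re.compile(
    r"<b>\s*\d+\s*</b>\s*</p>\s*<p>\s*FOOTER\s*-.*?</p>\s*<p>",
    re.DOTALL,
)


def strip_footers(html: str) -> str:
    """Return html with every matching footer removed (replaced by nothing)."""
    return FOOTER_RE.sub("", html)


if __name__ == "__main__":
    sample = (
        "<p>end of page one "
        "<b>1</b></p><p>FOOTER - CHAPTER, PARAGRAPH AND PAGE</p><p>"
        "start of page two</p>"
    )
    print(strip_footers(sample))
    # prints: <p>end of page one start of page two</p>
```

Because the whole span from "<b>" to the closing "</p><p>" is removed, the paragraph before the footer joins directly with the one after it. The same expression might also work in a regex-based search-and-replace with an empty replacement, but that should be tried on a copy of the book first, since a bold number followed by a paragraph starting with "FOOTER -" anywhere else in the text would be removed as well.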