Removing headers from pdf file

fotobox · 08-29-2010, 05:52 AM

Hi All

I apologise in advance if this has been asked before, but I've been trying to get what is probably quite a straightforward header and footer removed from a pdf when I convert it to .mobi

If I go to the remove header/footer wizard the code shows (for example; the number increases each page):

<b>Page 1</b><br>

and also (for example; the number increases each page):

<A name=6></a>

What do I need to enter into the line of code in the wizard to remove these?

Cheers in advance!

Adoby · 08-29-2010, 07:44 AM

I will give you several answers:

Answer 1:
Just write a regular expression that matches the header/footer.

Answer 2:
Learn how to write regular expressions, and then try Answer 1 above. It is a really fun and exciting skill to have: http://docs.python.org/library/re.html#re-syntax

Answer 3:
Try asking on an online forum, and hope that someone can be bothered to answer you.

Answer 4:
Follow these directions.

I did a footer removal on a PDF. The footer looks like this:

Code:

_<br>
www.eboat.net                                              Page                                                           eboat.net<br>
1<br>
<hr>

The pattern to match this might be built using the following parts:

() surround a part to match. So we need () around it all.

The first part is easy:

(_<br>)

This matches the first line. To skip ahead to the next line we need to match some whitespace: tabs, spaces and newlines. A special code exists for this, namely \s. There may be no whitespace at all, or a lot of it. A * after a pattern makes the pattern match zero or more times. So now we have:

(_<br>\s*)
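
A quick way to see what \s* does is to try it in Python, whose re module uses the syntax linked above. This is only a small sketch: the sample strings are made up from the first line of the footer shown earlier (_<br>), and the variable names are mine.

```python
import re

# First line of the sample footer, followed by a newline and some spaces.
text = "_<br>\n   www.eboat.net"

# \s* matches zero or more whitespace characters (spaces, tabs, newlines).
pattern = re.compile(r"(_<br>\s*)")

match = pattern.match(text)
print(repr(match.group(1)))  # '_<br>\n   '

# It still matches when there is no whitespace at all.
no_space = pattern.match("_<br>www.eboat.net")
print(repr(no_space.group(1)))  # '_<br>'
```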

The next few parts should now be obvious:

(_<br>\s*www.eboat.net\s*Page\s*eboat.net<br>\s*)

Next comes a page number. That can change so we match one or more digits instead. The code \d will match a digit. If we add a + it will match one or more digits:

(_<br>\s*www.eboat.net\s*Page\s*eboat.net<br>\s*\d+)
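
The difference the + makes can be checked in a few lines of Python. The sample text is invented for illustration:

```python
import re

# \d matches one digit; \d+ matches a run of one or more digits.
digits = re.compile(r"\d+")

found = digits.findall("Page 1, Page 27, Page 305")
print(found)  # ['1', '27', '305']

# With no digit present there is no match, because + needs at least one.
missing = digits.search("Page")
print(missing)  # None
```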

Just a few finishing touches, and we are done:

(_<br>\s*www.eboat.net\s*Page\s*eboat.net<br>\s*\d+\s*<br>\s*<hr>)
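
If you want to try a pattern like this outside the wizard, a rough Python sketch could look like the one below. It assumes the footer has the four-line shape shown in the Code sample above (_<br>, the eboat.net line ending in <br>, the page number with <br>, then <hr>). The body text and the variable names are made up, and the wizard itself may apply the pattern somewhat differently.

```python
import re

# Two made-up pages, each ending with the footer from the sample above.
html = (
    "First page text.<br>\n"
    "_<br>\n"
    "www.eboat.net      Page      eboat.net<br>\n"
    "1<br>\n"
    "<hr>\n"
    "Second page text.<br>\n"
    "_<br>\n"
    "www.eboat.net      Page      eboat.net<br>\n"
    "2<br>\n"
    "<hr>\n"
)

# Footer pattern built up step by step: first line, site line, page number, rule.
footer = re.compile(
    r"(_<br>\s*www.eboat.net\s*Page\s*eboat.net<br>\s*\d+\s*<br>\s*<hr>)"
)

hits = footer.findall(html)
print(len(hits))  # 2

# Replace every footer with nothing and keep the rest of the text.
cleaned = footer.sub("", html)
print(cleaned)
```

Note that an unescaped . matches any character, so www.eboat.net works here but is looser than it looks. Writing www\.eboat\.net would match only a literal dot.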

I don't actually remember all these codes. I look them up when I need them. But some I do remember. Write a few regular expressions, and it will become easier every time.

http://docs.python.org/library/re.html#re-syntax

Now it should be easy for you to write your own regular expressions that match your examples.

One of them would be (have not tested, so it could be wrong):

(<b>Page\s*\d+</b><br>)

The other:

(<A\s*name=\d+></a>)

But this might actually be useful to keep, to allow navigation in the book. It is a bookmark that you can navigate to from a table of contents.
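
Both suggestions can be tried quickly in Python before pasting them into the wizard. This sketch assumes the converted text contains <b>Page 6</b><br> and <A name=6></a> in the form quoted in the question. The sample line and the variable names are invented.

```python
import re

page_header = re.compile(r"(<b>Page\s*\d+</b><br>)")
anchor = re.compile(r"(<A\s*name=\d+></a>)")

# Made-up fragment of converted text with both pieces in it.
sample = "<A name=6></a><b>Page 6</b><br>\nSome text from the book.<br>\n"

# Remove only the page header and leave the bookmark anchor in place.
without_header = page_header.sub("", sample)
print(without_header)

# The anchor pattern finds the bookmark, should you want to remove it too.
anchors = anchor.findall(sample)
print(anchors)  # ['<A name=6></a>']
```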

Regular expressions are written using a rudimentary language, with synonyms and many different ways to express the same thing. Some ways may be better/smarter/prettier/more robust than others.
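
As a small illustration of "different ways to express the same thing", here are three spellings of the page-header pattern that behave the same on plain ASCII input like this. The variants are my own examples, not something the wizard requires.

```python
import re

# Three ways to say "one or more digits" inside the same pattern.
variants = [
    r"<b>Page\s*\d+</b><br>",
    r"<b>Page\s*[0-9]+</b><br>",
    r"<b>Page\s*\d{1,}</b><br>",
]

sample = "<b>Page 12</b><br>"

# Each variant should match the whole sample string.
results = [bool(re.fullmatch(v, sample)) for v in variants]
print(results)  # [True, True, True]
```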

fotobox · 08-30-2010, 03:59 AM

@ Adoby: Worked a treat, thanks so much!


