This is driving me nuts...

Spit · 08-15-2009, 06:13 AM

Got my new Sony 505 yesterday.

Loving the hardware so far.

I have a few PDFs I am trying to convert and it is becoming very frustrating. I put up another post yesterday, but I have moved on a bit since.

I want to remove the header and footer. I have done a lot of reading, searching and playing. This is what I have done so far:

I use NitroPDF to crop the PDF (all pages option) to remove the header, which contains the chapter title, and the footer, which contains the page numbers.

I then use calibre to convert them to either Epub or LRF (tried both several times).

Copy the file to the reader using explorer.

THE HEADER AND FOOTER TEXT IS STILL IN THE BOOK ALL THE WAY THROUGH!!

If I load my cropped PDF into Adobe Reader there is no header and footer. Where is it getting the information from?

I have also tried reinstalling Calibre in case it was picking anything up from when I wasn't using NitroPDF, still no joy.

All I want to do is read without the header and footer all the way through the text, LOL

user_none · 08-15-2009, 07:47 AM

The PDF cropping application you are using is not removing the cropped portion from the file. It is just hiding it. Think of it this way. You have a page, within the page is a display area. The application is just making the display area smaller so that the header and footer are not shown when you view the file but they are still there.
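The "display area" idea can be sketched with a toy model. This is not how any real PDF library is structured; the `TextItem` and `Page` classes and the coordinates are made up purely for illustration. It only shows why shrinking the visible box leaves the header and footer text in the file for a converter to find.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    text: str
    y: float  # vertical position on the page, in points (hypothetical values)


@dataclass
class Page:
    items: list
    crop_bottom: float
    crop_top: float

    def visible_text(self):
        """What a viewer shows: only items inside the crop box."""
        return [i.text for i in self.items
                if self.crop_bottom <= i.y <= self.crop_top]

    def extracted_text(self):
        """What a converter may read: every item stored on the page."""
        return [i.text for i in self.items]


page = Page(
    items=[
        TextItem("Chapter One", 770),          # header
        TextItem("It was a dark night.", 400),  # body
        TextItem("- 12 -", 30),                 # footer
    ],
    crop_bottom=0,
    crop_top=792,
)

# "Cropping" only moves the box; the items list is untouched.
page.crop_bottom, page.crop_top = 60, 740

print(page.visible_text())    # body only
print(page.extracted_text())  # header, body and footer are all still there
```

In this model the crop changes two numbers and nothing else, which matches what Adobe Reader shows versus what ends up in the converted book.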

calibre includes the ability to remove headers and footers based on regular expressions. However, I haven't gotten around to producing any GUI tools to make it easier. The process right now is a bit difficult: run ebook-convert in.pdf .epub --debug-input ./out_dir, look at the HTML produced, create a regex that matches the header or footer, then convert again using --remove-header and --remove-footer with the header and footer regexes set.

Spit · 08-15-2009, 07:52 AM

Thanks. That makes sense for the crop. No way of deleting this info then?

So I convert my PDF to HTML first, then convert to EPUB and use remove header and footer?

user_none · 08-15-2009, 07:55 AM

Remove header and footer is supported by all formats. What happens is: the input (no matter what format) is converted to HTML and run through a preprocessor before being turned into an internal OEB book. The preprocessor applies the regex to remove the header and footer as well as doing other things. The OEB is then turned into whatever your output format is.
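A minimal sketch of the regex step described here, run on a made-up HTML fragment. This is not calibre's actual preprocessor code; the function name `strip_header_footer`, the sample HTML, and both patterns are assumptions chosen for illustration. In practice the patterns would be written after looking at the HTML from the debug output.

```python
import re


def strip_header_footer(html, header_regex=None, footer_regex=None):
    """Hypothetical stand-in for the preprocessing step: delete every
    match of the header and footer patterns from the intermediate HTML."""
    for pattern in (header_regex, footer_regex):
        if pattern:
            html = re.sub(pattern, "", html)
    return html


# Made-up intermediate HTML with a repeated chapter title and page numbers.
html = (
    "<p>CHAPTER ONE</p>\n"
    "<p>It was a dark night.</p>\n"
    "<p>- 12 -</p>\n"
    "<p>CHAPTER ONE</p>\n"
    "<p>The rain kept falling.</p>\n"
    "<p>- 13 -</p>\n"
)

cleaned = strip_header_footer(
    html,
    header_regex=r"<p>CHAPTER ONE</p>\n",
    footer_regex=r"<p>- \d{1,3} -</p>\n",
)
print(cleaned)
```

Because the substitution happens on the intermediate HTML, the same patterns apply regardless of the input or output format, which is the point made above.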

PM me your PDF and I'll look at it to see what kind of regex you will need to remove the header and footer.

JIGACE · 08-15-2009, 12:19 PM

Hi, I used to have the same problem with PDF files. I'm no expert, but I think the only ways to remove those are either to use software that converts the PDF to Word and then remove the header and footer manually, or to use a regular expression. A long time ago I found a program that converts files to PDB (Palm ebooks) in three steps: first it converts the PDF to TXT, then it cleans the TXT (the result may not be perfect, but it deletes the header and footer almost every time), and then it converts the clean TXT to PDB. You can also take the clean TXT file and import it into calibre (that's what I do), so maybe it will suit you. I hope I made myself clear, because my English is not very good. Anyway, here's the link: http://www.reblusoft.com/index.php?o...d=20&Itemid=49

DerSchwarzePrinz · 08-16-2009, 11:53 AM

Quote:

Originally Posted by user_none

Remove header and footer is supported by all formats. What happens is: the input (no matter what format) is converted to HTML and run through a preprocessor before being turned into an internal OEB book. The preprocessor applies the regex to remove the header and footer as well as doing other things. The OEB is then turned into whatever your output format is.

PM me your PDF and I'll look at it to see what kind of regex you will need to remove the header and footer.

What regular expression should I use for

- xxx -

(xxx = page number, one to three digits)?

I tried to use the following (simple) RE:

\- [0-9]+ \-

But it removes only

- x - (- 1 - to - 9 -)
and
- xxx - (- 100 - to - 999 -)

but not

- xx - (- 10 - to - 99 -)

What's wrong?

Thanks a lot for your help,
DSP

user_none · 08-16-2009, 01:28 PM

Quote:

Originally Posted by DerSchwarzePrinz

What regular expression should I use for

- xxx -

(xxx = page number, one to three digits)?

Code:

- \d{1,3} -

Also, you might find this useful.

DerSchwarzePrinz · 08-16-2009, 03:54 PM

Quote:

Originally Posted by user_none

Code:

- \d{1,3} -

Also, you might find this useful.

Same effect: only one- and three-digit numbers are removed. Any ideas?

If I use "\d{1,3}" all the page numbers are removed, but naturally the hyphens stay behind.
If I use "- \d{1,3} -" all page numbers including hyphens are removed, except the page numbers that have two digits.
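One way to narrow this down is to test the patterns outside calibre. The sketch below uses made-up sample strings, and the cause in this particular book is not established anywhere in the thread. It shows that on plain text the strict pattern matches two-digit numbers just as well as one- and three-digit ones, so a failure on two-digit pages probably points at something in the converted HTML (an extra space, a non-breaking space, an entity such as `&nbsp;`, or a tag) rather than at the digit count.

```python
import re

strict = re.compile(r"- \d{1,3} -")      # exactly one plain space on each side
loose = re.compile(r"-\s*\d{1,3}\s*-")   # a whitespace-tolerant variant

# Hypothetical footers as they might appear in the converted HTML.
samples = [
    "- 7 -",
    "- 42 -",
    "- 123 -",
    "-  42 -",        # two spaces before the number
    "- 42\u00a0-",    # non-breaking space character after the number
    "- 42&nbsp;-",    # non-breaking space left as an HTML entity
]

for s in samples:
    print(f"{s!r:16} strict={bool(strict.fullmatch(s))} "
          f"loose={bool(loose.fullmatch(s))}")
```

If even the tolerant variant fails on the real file, the text between the hyphens is probably not pure digits and whitespace, and the debug HTML is the place to check what is actually there.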

Spit · 08-16-2009, 04:56 PM

I sorted it another way, thanks for the help guys.

I used NitroPDF to export the PDF to Word and chose the option to delete header and footer info. Then I saved the new doc to PDF from Word, and used calibre to convert it to EPUB. Long-winded, but it worked a treat. Seeing as I will only be converting 4-5 books a month, it's OK.

Ralob · 08-16-2009, 05:37 PM

I use Abbyy FR9 to scan the PDF, then I export it to HTML. Then I use snowsoft's htmlbookfixer to clean everything up. Works perfectly.

user_none · 08-16-2009, 07:52 PM

Quote:

Originally Posted by DerSchwarzePrinz

Same effect: only one- and three-digit numbers are removed. Any ideas?

If I use "\d{1,3}" all the page numbers are removed, but naturally the hyphens stay behind.
If I use "- \d{1,3} -" all page numbers including hyphens are removed, except the page numbers that have two digits.

Hm... It could be an extra space somewhere between the -s. As long as there are no spaces between the numbers themselves this should work:

Code:

"-\s*\d{1,3}\s*-"

Quote:

Originally Posted by Spit

I sorted it another way, thanks for the help guys.

Great. As you can tell, the header removal is very rough at the moment. Eventually I'm going to write a GUI tool to help. I also want it to auto-detect the headers eventually, instead of requiring a regex. But both of those are toward the bottom of my todo list.

doreenjoy · 08-16-2009, 07:59 PM

I have had the same problem and never successfully got Calibre to remove the headers (some of the footers in particular are complicated, with 2 lines of footer). So I used the Adobe Acrobat cropping feature, saved as PDF, and suffered through reading the PDF.

RandallFlagg · 08-16-2009, 08:10 PM

I don't know about the software you're using, but I use Acrobat Pro, and I usually crop the header and footer first. Then I go to the Document tab and choose Examine Document. After it searches the doc, it lists the items it has found. I uncheck metadata and bookmarks, as I don't usually want those removed. I do, however, want to remove hidden data, which is the headers and footers I can no longer view. After that I can convert it with Calibre. The software you're using should offer a way to remove the hidden data once you have cropped your document. I hope this helps.

DerSchwarzePrinz · 08-17-2009, 10:10 AM

Quote:

Originally Posted by user_none

Hm... It could be an extra space somewhere between the -s. As long as there are no spaces between the numbers themselves this should work:

Code:

"-\s*\d{1,3}\s*-"

Nope, doesn't work. I tried this:

"-\s*\d.+-"

That seems to work ...

What RE could I use for "normal" page numbers, e.g. 1, 2, 3 ...?
I want to avoid removing numbers from the document, so I need something like "remove any number from one to three digits on a single line". Could you please give me a hint?
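A hedged sketch of both points in this post. It rests on assumptions not confirmed in the thread: the sample text is made up, and the two "alone on a line" patterns assume the page number really does sit by itself, either on a plain-text line or inside its own `<p>` element in the converted HTML. The first part shows why "-\s*\d.+-" deserves caution: `.+` is greedy and runs to the last hyphen on the line, so it can take body text with it.

```python
import re

# ".+" is greedy: the match extends to the LAST hyphen on the line.
greedy = re.compile(r"-\s*\d.+-")
risky_line = "- 42 - and later a well-known phrase"
greedy_result = greedy.sub("", risky_line)
print(repr(greedy_result))

# Assumption: the page number is alone on its own line of plain text.
# MULTILINE makes ^ and $ match at every line boundary.
bare_number = re.compile(r"^[ \t]*\d{1,3}[ \t]*$\n?", re.MULTILINE)
text = (
    "He bought 3 apples in 1999.\n"
    "12\n"
    "The next page starts here.\n"
    "  7  \n"
    "Room 101 was empty.\n"
)
cleaned = bare_number.sub("", text)
print(cleaned)

# Assumption: in the HTML, the number is alone inside a <p> element.
html_number = re.compile(r"<p>\s*\d{1,3}\s*</p>\n?")
html = "<p>He bought 3 apples.</p>\n<p>12</p>\n<p>Room 101 was empty.</p>\n"
html_cleaned = html_number.sub("", html)
print(html_cleaned)
```

Numbers inside sentences survive because the patterns require the number to be the only content of the line or paragraph. The exact markup around page numbers varies from book to book, so the HTML pattern would need adjusting to whatever the debug output shows.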

dionymnia · 09-02-2009, 01:44 AM

Quote:

Originally Posted by RandallFlagg

I don't know about the software you're using, but I use Acrobat Pro, and I usually crop the header and footer first. Then I go to the Document tab and choose Examine Document. After it searches the doc, it lists the items it has found. I uncheck metadata and bookmarks, as I don't usually want those removed. I do, however, want to remove hidden data, which is the headers and footers I can no longer view. After that I can convert it with Calibre. The software you're using should offer a way to remove the hidden data once you have cropped your document. I hope this helps.

This worked perfectly for me - thanks for suggesting it!

- D


