Old 08-15-2009, 06:13 AM   #1
Spit
Member
Spit began at the beginning.
 
Posts: 10
Karma: 10
Join Date: Aug 2009
Device: none
This is driving me nuts...

Got my new Sony 505 yesterday. Loving the hardware so far.

I have a few PDFs I am trying to convert and it is becoming very frustrating. I put another post on yesterday but I have moved on a bit since.

I want to remove the header and footer. I have done a lot of reading, searching and playing. This is what I have done so far:

I use NitroPDF to crop the PDF (all pages option) to remove the header which has chapter title in and the footer which has the page numbers in.

I then use calibre to convert them to either Epub or LRF (tried both several times).

Copy the file to the reader using explorer.

THE HEADER AND FOOTER TEXT IS STILL IN THE BOOK ALL THE WAY THROUGH!!

If I load my cropped PDF into Adobe Reader there is no header and footer. Where is it getting the information from?

I have also tried reinstalling Calibre in case it was picking anything up from when I wasn't using NitroPDF, still no joy.

All I want to do is read without the header and footer all the way through the text, LOL
Old 08-15-2009, 07:47 AM   #2
user_none
Sigil & calibre developer
user_none ought to be getting tired of karma fortunes by now.
Posts: 2,487
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
The PDF cropping application you are using is not removing the cropped portion from the file. It is just hiding it. Think of it this way. You have a page, within the page is a display area. The application is just making the display area smaller so that the header and footer are not shown when you view the file but they are still there.

calibre includes the ability to remove headers and footers based on regular expressions. However, I haven't gotten around to producing any GUI tools to make it easier. The process right now is a bit difficult: run ebook-convert in.pdf .epub --debug-input ./out_dir, look at the HTML produced, create a regex that matches the header or footer, then convert again using --remove-header and --remove-footer and set the header and footer regexes.
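
The "look at the HTML, create a regex" step can be tried outside calibre first. What follows is only a rough sketch, not part of calibre: it assumes the debug directory contains .htm/.html files, and the helper names find_matches and scan_debug_dir are made up for illustration. calibre is written in Python, so Python's re module should be a close stand-in for how the pattern will behave.

```python
import re
from pathlib import Path


def find_matches(html, pattern):
    """Return every substring of html that the candidate regex would remove."""
    return [m.group(0) for m in re.finditer(pattern, html)]


def scan_debug_dir(debug_dir, pattern):
    """Apply the candidate regex to each HTML file found under debug_dir."""
    results = {}
    for path in sorted(Path(debug_dir).rglob("*.htm*")):
        text = path.read_text(encoding="utf-8", errors="replace")
        results[str(path)] = find_matches(text, pattern)
    return results


# Hypothetical fragment standing in for the HTML written by --debug-input.
sample = "<p>Chapter One</p><p>It was a dark night.</p><p>- 12 -</p>"
print(find_matches(sample, r"- \d{1,3} -"))  # ['- 12 -']
```

Calling scan_debug_dir("./out_dir", your_regex) lists what the pattern would strip from each file, so you can check that it catches every header and footer and nothing else before running the real conversion.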
Old 08-15-2009, 07:52 AM   #3
Spit
Member
Spit began at the beginning.
 
Posts: 10
Karma: 10
Join Date: Aug 2009
Device: none
Thanks. That makes sense for the crop. No way of deleting this info then?

So convert my PDF to html 1st then convert to epub and use remove header and footer?
Old 08-15-2009, 07:55 AM   #4
user_none
Sigil & calibre developer
Posts: 2,487
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
Remove header and footer is supported by all formats. What happens is: the input (no matter what format) is converted to HTML and run through a preprocessor before being turned into an internal OEB book. The preprocessor applies the regex to remove the header and footer, as well as doing other things. The OEB is then turned into whatever your output format is.

PM me your PDF and I'll look at it to see what kind of regex you will need to remove the header and footer.
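
To make the preprocessing idea concrete, here is a minimal sketch of a regex pass over intermediate HTML. It is an illustration only and not calibre's actual code: the function name preprocess, its parameters, and the sample page are all assumptions.

```python
import re


def preprocess(html, header_regex=None, footer_regex=None):
    """Strip text matching the header/footer patterns from intermediate HTML."""
    for pattern in (header_regex, footer_regex):
        if pattern:
            html = re.sub(pattern, "", html)
    return html


# Hypothetical page as it might look after the input has been turned into HTML.
page = "<p>CHAPTER 3: THE STORM</p><p>The wind rose.</p><p>Page 41</p>"
cleaned = preprocess(
    page,
    header_regex=r"<p>CHAPTER \d+: [A-Z ]+</p>",
    footer_regex=r"<p>Page \d+</p>",
)
print(cleaned)  # <p>The wind rose.</p>
```

Because the patterns run against HTML rather than against the PDF, the same regex works whatever the input format was, which is why the option is not PDF-specific.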
Old 08-15-2009, 12:19 PM   #5
JIGACE
Member
JIGACE began at the beginning.
 
Posts: 21
Karma: 10
Join Date: Jul 2008
Device: EZ Reader Pocket Pro
Hi. I used to have the same problem with PDF files. I'm no expert, but I think the only ways to remove those are either to use software that converts the PDF to Word and then remove the header and footer manually, or to use a regular expression. A long time ago I found a program that converts files to PDB (Palm ebooks) in three steps: first it converts the PDF to TXT, then it cleans the TXT (the result may not be perfect, but it deletes the header and footer almost every time), and then it converts the clean TXT to PDB. You can take the clean TXT file and import it into calibre instead, which is what I do, so maybe it will suit you. I hope I've made myself clear, because my English is not very good. Anyway, here's the link: http://www.reblusoft.com/index.php?o...d=20&Itemid=49
Old 08-16-2009, 11:53 AM   #6
DerSchwarzePrinz
Enthusiast
DerSchwarzePrinz began at the beginning.
 
Posts: 25
Karma: 16
Join Date: Aug 2009
Device: Pocketbook 360, Sony PRS-T1
Quote:
Originally Posted by user_none View Post
Remove header and footer is supported by all formats. What happens is: the input (no matter what format) is converted to HTML and run through a preprocessor before being turned into an internal OEB book. The preprocessor applies the regex to remove the header and footer, as well as doing other things. The OEB is then turned into whatever your output format is.

PM me your PDF and I'll look at it to see what kind of regex you will need to remove the header and footer.
What regular expression should I use for

- xxx -

(xxx = page number, one to three digits)?

I tried to use the following (simple) RE:

\- [0-9]+ \-

But it removes only

- x - (- 1 - to - 9 -)
and
- xxx - (- 100 - to - 999 -)

but not

- xx - (- 10 - to - 99 -)

What's wrong?

Thanks a lot for your help,
DSP

Last edited by DerSchwarzePrinz; 08-16-2009 at 12:37 PM.
Old 08-16-2009, 01:28 PM   #7
user_none
Sigil & calibre developer
Posts: 2,487
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
Quote:
Originally Posted by DerSchwarzePrinz View Post
What regular expression should I use for

- xxx -

(xxx = page number, one to three digits)?
Code:
- \d{1,3} -
Also, you might find this useful.
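
A quick way to sanity-check the pattern on its own is to run it against plain strings. This is just a sketch using Python's re module with made-up test lines, not output from a real book.

```python
import re

pattern = re.compile(r"- \d{1,3} -")

# Made-up test lines: one, two, three and four digits, plus a doubled space.
for line in ["- 7 -", "- 42 -", "- 123 -", "- 1234 -", "-  42 -"]:
    print(repr(line), bool(pattern.fullmatch(line)))
# '- 7 -' True
# '- 42 -' True
# '- 123 -' True
# '- 1234 -' False
# '-  42 -' False
```

On clean text the pattern treats one, two and three digits alike, so if only the two-digit page numbers survive a conversion, the text around those numbers in the intermediate HTML is probably not exactly "hyphen, space, digits, space, hyphen".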
Old 08-16-2009, 03:54 PM   #8
DerSchwarzePrinz
Enthusiast
DerSchwarzePrinz began at the beginning.
 
Posts: 25
Karma: 16
Join Date: Aug 2009
Device: Pocketbook 360, Sony PRS-T1
Quote:
Originally Posted by user_none View Post
Code:
- \d{1,3} -
Also, you might find this useful.
Same effect, only one and three digit numbers are removed. Any ideas?

If I use "\d{1,3}" all the page numbers are removed, but naturally the hyphens stay behind.
If I use "- \d{1,3} -" all page numbers including hyphens are removed, except the page numbers that have two digits.

Last edited by DerSchwarzePrinz; 08-16-2009 at 04:52 PM.
Old 08-16-2009, 04:56 PM   #9
Spit
Member
Spit began at the beginning.
 
Posts: 10
Karma: 10
Join Date: Aug 2009
Device: none
I sorted it another way, thanks for the help guys.

I used NitroPDF to export the PDF to Word and chose the option to delete the header and footer info. Then in Word I saved the new doc to PDF. Then I used calibre to convert to EPUB. Long-winded, but it worked a treat. Seeing as I will only be converting 4-5 books a month, it's OK.
Old 08-16-2009, 05:37 PM   #10
Ralob
Connoisseur
Ralob began at the beginning.
 
Posts: 53
Karma: 10
Join Date: Feb 2008
Device: iPad Pro, Kobo Libra 2, PW4
I use Abbyy FR9 to scan the pdf then I export the pdf to html. Then, I use snowsoft's htmlbookfixer to clean everything up. Works perfectly.
Old 08-16-2009, 07:52 PM   #11
user_none
Sigil & calibre developer
Posts: 2,487
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
Quote:
Originally Posted by DerSchwarzePrinz View Post
Same effect, only one and three digit numbers are removed. Any ideas?

If I use "\d{1,3}" all the page numbers are removed, but naturally the hyphens stay behind.
If I use "- \d{1,3} -" all page numbers including hyphens are removed, except the page numbers that have two digits.
Hm... It could be an extra space somewhere between the -s. As long as there are no spaces between the numbers themselves this should work:

Code:
"-\s*\d{1,3}\s*-"

Quote:
Originally Posted by Spit
I sorted it another way, thanks for the help guys.
Great. As you can tell the header removal is very rough at the moment. Eventually I'm going to write a GUI tool to help. Also, eventually I want to have it auto detect the headers instead of having to use a regex. But both of those are toward the bottom of my todo list.
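
Going back to the whitespace-tolerant pattern earlier in this post, here is a small sketch of what \s* does and does not absorb. The sample strings are invented, and which of these shapes (if any) shows up in a given book's HTML is something only the debug output can tell you.

```python
import re

pattern = re.compile(r"-\s*\d{1,3}\s*-")

# Invented examples of how a page-number line might look in the HTML.
samples = {
    "plain": "- 42 -",
    "double space": "-  42  -",
    "no-break space": "-\u00a042\u00a0-",
    "entity": "-&nbsp;42&nbsp;-",
    "inline tag": "- <b>42</b> -",
}
for name, text in samples.items():
    print(name, bool(pattern.search(text)))
# plain True
# double space True
# no-break space True
# entity False
# inline tag False
```

In Python 3, \s covers extra spaces and the no-break space character, but not a literal &nbsp; entity or markup sitting between the hyphens and the digits. If the pattern still misses some page numbers, those two cases are worth looking for in the HTML.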
Old 08-16-2009, 07:59 PM   #12
doreenjoy
01000100 01001010
doreenjoy ought to be getting tired of karma fortunes by now.
Posts: 1,889
Karma: 2400000
Join Date: Mar 2009
Device: Polyamorous
I have had the same problem and never successfully got Calibre to remove the headers (some of the footers in particular are complicated, with 2 lines of footer). So I used the Adobe Acrobat cropping feature, saved as PDF, and suffered through reading the PDF.
Old 08-16-2009, 08:10 PM   #13
RandallFlagg
Hey Trashcan Man
RandallFlagg will become famous soon enough
Posts: 66
Karma: 658
Join Date: Jan 2008
Location: So Cal
Device: Nook color, prs 505, Axim x30, psp, Acer Aspire One [running xp]
I don't know about the software you're using, but I use Acrobat Pro and usually I crop the header and footer first. Then I go to the Document tab and choose Examine Document. After it searches the doc, it lists the items it has found. I uncheck metadata and bookmarks, as I don't usually want those removed. I do, however, want to remove hidden data, which is the header and footers I can no longer see. After that I can convert it with calibre. The software you're using should offer a way to remove the hidden data once you have cropped your document. I hope this helps.
Old 08-17-2009, 10:10 AM   #14
DerSchwarzePrinz
Enthusiast
DerSchwarzePrinz began at the beginning.
 
Posts: 25
Karma: 16
Join Date: Aug 2009
Device: Pocketbook 360, Sony PRS-T1
Quote:
Originally Posted by user_none View Post
Hm... It could be an extra space somewhere between the -s. As long as there are no spaces between the numbers themselves this should work:

Code:
"-\s*\d{1,3}\s*-"
Nope, doesn't work. I tried this:

"-\s*\d.+-"

That seems to work ...

What RE could I use for "normal" page numbers, e.g. 1, 2, 3 ...?
I want to avoid removing numbers from the document, so I need something like "remove any number from one to three digits on a single line". Could you please give me a hint?
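
One possible shape for "a number of one to three digits alone on a line" is sketched below in Python's re syntax. The two layouts (a bare line of text, or a paragraph holding only the number) are assumptions about what the intermediate HTML might look like, and the sample text is invented. The last part shows why the greedy ".+" above needs care.

```python
import re

# Number alone on its own line; (?m) makes ^ and $ match at each line break.
LINE_NUMBER = re.compile(r"(?m)^[ \t]*\d{1,3}[ \t]*$")
# Number alone inside its own paragraph (assumed HTML shape).
PARA_NUMBER = re.compile(r"<p>\s*\d{1,3}\s*</p>")

text = "In 1984 there were 3 ships.\n42\nThe next page starts here.\n"
print(repr(LINE_NUMBER.sub("", text)))
# 'In 1984 there were 3 ships.\n\nThe next page starts here.\n'

html = "<p>In 1984 there were 3 ships.</p><p>42</p><p>Next page.</p>"
print(PARA_NUMBER.sub("", html))
# <p>In 1984 there were 3 ships.</p><p>Next page.</p>

# Hazard: a greedy ".+" runs on to the LAST hyphen on the line.
greedy = re.compile(r"-\s*\d.+-")
line = "- 12 - He paused - then left."
print(repr(greedy.sub("", line)))
# ' then left.'
```

Anchoring the pattern to the start and end of the line (or to the enclosing paragraph tags) is what keeps numbers inside running text, such as "1984" and "3" here, from being touched.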
Old 09-02-2009, 01:44 AM   #15
dionymnia
Junior Member
dionymnia began at the beginning.
 
Posts: 1
Karma: 32
Join Date: Sep 2009
Device: Kindle 2
Quote:
Originally Posted by RandallFlagg View Post
I don't know about the software you're using, but I use Acrobat Pro and usually I crop the header and footer first. Then I go to the Document tab and choose Examine Document. After it searches the doc, it lists the items it has found. I uncheck metadata and bookmarks, as I don't usually want those removed. I do, however, want to remove hidden data, which is the header and footers I can no longer see. After that I can convert it with calibre. The software you're using should offer a way to remove the hidden data once you have cropped your document. I hope this helps.
This worked perfectly for me - thanks for suggesting it!

- D
All times are GMT -4. The time now is 12:55 AM.

