Old 08-25-2009, 05:01 PM   #1
davef
Member
Posts: 13
Karma: 60
Join Date: Feb 2009
Device: PRS-700
Evading headers in PDF->EPUB conversion

Some things are just only available in PDF. (boo!)
Calibre does a pretty good job converting basic text PDFs to EPUBs. (yea!)
Except for the freakin' headers/footers. (boo!)

If it was just the author's name over and over... Or the book title over and over.. I could easily do a global replace in the EPUB source and get rid of 'em.
But there always have to be page numbers which makes them all different.
Just eliminating the author/title leaves these numbers strewn through the book.

So I wondered... is there some way to get Calibre to ignore anything it finds in the top (or bottom) inch (or so) of the PDF page?

Or some other approach...

Thanks!
Dave
Old 08-26-2009, 03:19 AM   #2
itimpi
Wizard
Posts: 3,942
Karma: 777817
Join Date: Nov 2008
Device: Sony PRS-950, iphone/ipad (Marvin/iBooks/QuickReader)
In the latest 0.6.x series of Calibre there is the facility to provide a regular expression to identify the headers and/or footers.

This is a bit esoteric in that many people do not understand regular expressions, but I believe that there are plans to provide a GUI interface to this at some time in the future.
Old 08-26-2009, 04:46 AM   #3
HarryT
eBook Enthusiast
Posts: 60,974
Karma: 38191451
Join Date: Nov 2006
Location: UK
Device: Kindle PW2, iPad Retina Mini, iPhone 4, MS Surface Pro
Another approach would be to use a tool like Book Designer (which does a better job at PDF import than Calibre, anyway). Import the PDF into BD, get rid of all the extraneous stuff, and then save as HTML, and import the HTML into Calibre.
Old 08-26-2009, 11:30 AM   #4
davef
Member
Posts: 13
Karma: 60
Join Date: Feb 2009
Device: PRS-700
Thanks to both!
I downloaded Book Designer and will have to find some time to play.
I also grabbed Harry's Book Designer tutorial which I'm sure will be a huge help.
Thanks again...
Old 08-28-2009, 07:20 PM   #5
igorsk
Wizard
Posts: 3,443
Karma: 52235
Join Date: Sep 2006
Location: Belgium
Device: PRS-500/505/700, Kindle, Cybook Gen3, Words Gear
If you can afford it, get ABBYY Finereader or PDF Transformer. They do a great job of converting PDFs, both text- and image-based.
Old 08-29-2009, 11:01 AM   #6
JvdW
Zealot
Posts: 115
Karma: 146
Join Date: Jul 2008
Location: Netherlands Veenendaal
Device: Palm T5, Sony PRS-505, Nook Color
Another good one is from Nuance (http://www.nuance.com/imaging/pdfcon...ofessional.asp). This one managed to convert a pdf to something that actually resembled the original. Tried the same pdf with Acrobat and it ended up as one big paragraph :-((

Regards,

Joop
Old 08-29-2009, 03:26 PM   #7
JvdW
Zealot
Posts: 115
Karma: 146
Join Date: Jul 2008
Location: Netherlands Veenendaal
Device: Palm T5, Sony PRS-505, Nook Color
Just to let you know that I might have found something that might help you too regarding the removal of headers/footers.
The following is what I copied from the debug output of Calibre (0.6.10) and want removed:

Code:
<br>
5<br>
<hr>
<A name=7></a>
After playing around with the remove footer regexp I came up with the following:
Code:
(?ims)<br>\s*\d{1,3}\s*<br>\s<hr>\s<a name=\d{1,3}></a>
This could probably be improved but it works for me.
It isn't perfect, because sentences that continue onto the next page aren't always joined back together, but it beats manually removing page numbers ;-)
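
For anyone who wants to try the expression outside Calibre first, here is a minimal Python sketch. It assumes the pattern is applied to HTML shaped like the debug output quoted above; the sample text and the names FOOTER_RE, sample and cleaned are made up for illustration, and this is not Calibre's own code.

```python
import re

# Pattern quoted from the post above.
# (?ims) turns on ignore-case, multiline and dot-all matching.
FOOTER_RE = re.compile(r"(?ims)<br>\s*\d{1,3}\s*<br>\s<hr>\s<a name=\d{1,3}></a>")

# Hypothetical fragment shaped like the debug output: a page number on
# its own line, a rule, then the anchor that marks the next page.
sample = (
    "end of one page<br>\n"
    "5<br>\n"
    "<hr>\n"
    "<A name=7></a>\n"
    "start of the next page"
)

# Replace every footer match with nothing and show what is left.
cleaned = FOOTER_RE.sub("", sample)
print(cleaned)
```

Note that the line break before "start of the next page" survives, which is the same limitation as above: the footer goes away, but a sentence split across the page boundary is not rejoined.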

Googling for some help I found two programs that really helped me, YMMV:
Regex Coach : http://weitz.de/regex-coach/
Kodos : http://kodos.sourceforge.net/
I found Regex Coach the better of the two, with more options and better feedback on what is happening.

Regards,

Joop
All times are GMT -4.