
MobileRead Forums > E-Book Software > Calibre

Old 09-18-2009, 03:26 PM   #1
kovidgoyal
creator of calibre
Posts: 26,097
Karma: 5101571
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
PDF samples

Hi all,

I'm starting work on a new PDF conversion engine for calibre that will hopefully handle header and footer extraction and multiple column extraction as well.

I'm asking for a few sample PDF files that I can use as a test corpus. I'd appreciate it if you could just extract a few pages with different typographical features and make a new PDF file with them.

Note that this new engine will not handle mathematics/tables/vector diagrams, etc. so don't provide samples for those.

Also, this is a bit of a long-term project, so don't expect results too quickly.
Old 09-18-2009, 04:09 PM   #2
Elfwreck
Grand Sorcerer
Posts: 5,142
Karma: 24387938
Join Date: Nov 2008
Location: SF Bay Area, California, USA
Device: Clié; PRS-505; EZR Pocket Pro, PRS-600, Kobo Mini
Would you prefer a single PDF with extracts from different works, or separate PDFs?
Old 09-18-2009, 04:11 PM   #3
kovidgoyal
creator of calibre
Separate PDFs, as the algorithm is likely to take overall document structure into account as well. I just don't want very large PDFs, and if they're copyrighted, it's best to extract just a small subsection.
Old 09-18-2009, 04:32 PM   #4
Elfwreck
Grand Sorcerer
Sample PDFs, all Creative Commons or other ok-to-distribute files.

They range from simple mainstream novel PDFs to nightmarish magazine formatting. Some have pictures; some have links.
  • ANAT annual report 2008: Colored text, multi-column layout. (CC)
  • James Boyle's The Public Domain: Changes in margins & leading; lists (CC)
  • Helen Keller's essay I Learned to Speak: Non-crucial font & margin changes; should convert well. (PD text)
  • Lenz v Universal: standard legal document layout (PD)
  • Lowry Pei's For Adam: Novel with irregular page breaks & nonstandard headers; should convert well, might be worth noting how the spacing carried over. (CC)
  • TWC-1: Journal; columns with pictures (permission to share)
  • Wick's Houses of the Blooded: RPG game book; nightmarish; not expected to convert well at all. (permission to share)

Metadata's likely all over the place. Some have it; some don't.
Attached Files
File Type: zip Sample PDF Extracts.zip (1.11 MB, 113 views)

Last edited by Elfwreck; 09-18-2009 at 04:37 PM.
Old 09-18-2009, 04:34 PM   #5
kovidgoyal
creator of calibre
Thanks, will come in handy.
Old 09-19-2009, 08:30 AM   #6
neilmarr
neilmarr
Posts: 7,229
Karma: 6000059
Join Date: Apr 2009
Location: Monaco-Menton, France
Device: sony
Hello there, Kovid: I just sent you an email, but in case it gets lost: I head up the editorial team at a small indie publishing house that's covered all its paperback novels with PDF versions for the past eight years (we have about 120 titles at www.bewrite.net). If you think it would be of any help, please drop me a line and I'll send as many PDF ebooks as you might need to experiment with. Most are straight text, all well-formatted; some use more than one font and a few carry inside illustrations. All have front covers.

Cheers and good luck. Neil

Last edited by neilmarr; 09-19-2009 at 08:31 AM. Reason: to check my email address was included in sig line
Old 09-20-2009, 03:20 PM   #7
darkmonk
Connoisseur
Posts: 58
Karma: 12
Join Date: Jan 2009
Device: none
Here's my contribution, a book I've tried several times to convert using xpath to remove headers/footers. I just left a few sections with different things in them. If you want the full book, it's here. It's legal to distribute, too, I think.
I'm really happy you're finally trying to do this, but I'd still like it if you kept the option to specify what header/footer to remove.
Attached Files
File Type: pdf output.pdf (263.3 KB, 165 views)
Old 09-20-2009, 03:48 PM   #8
Pablo
Guru
Posts: 741
Karma: 3058207
Join Date: Mar 2009
Location: Rosario, Argentina
Device: SONY PRS-505, PRS-T2
Quote:
Originally Posted by darkmonk View Post
Here's my contribution, a book I've tried several times to convert using xpath to remove headers/footers. I just left a few sections with different things in them. If you want the full book, it's here. It's legal to distribute, too, I think.
I'm really happy you're finally trying to do this, but I'd still like it if you kept the option to specify what header/footer to remove.
What an interesting book! I wonder if it is legal to reformat and redistribute it. It doesn't include any license information.
Old 09-20-2009, 03:54 PM   #9
kovidgoyal
creator of calibre
@darkmonk: Well, the new engine won't use text content, but rather position on the page, to detect headers and footers. And I will probably remove the current remove header/footer option and replace it with a generic "remove content" option.
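
To make the position-based idea concrete, here is a minimal sketch. It is not calibre's engine; the `Line` type, the function names, and the thresholds (`band`, `tol`, `min_share`) are all assumptions made up for illustration. It treats a line as a header or footer when it sits in the top or bottom margin band and the same vertical position is occupied on most pages, without ever looking at the text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    top: float  # distance from the top edge of the page (hypothetical field)
    text: str


def bucket(line, tol):
    """Quantize a vertical position so nearly equal positions compare equal."""
    return round(line.top / tol)


def in_margin(line, page_height, band):
    """True if the line sits in the top or bottom `band` fraction of the page."""
    return line.top < band * page_height or line.top > (1 - band) * page_height


def repeated_margin_buckets(pages, page_height, band=0.12, tol=2.0, min_share=0.6):
    """Return margin positions that are occupied on at least `min_share` of pages."""
    counts = Counter()
    for lines in pages:
        # A set, so each position is counted at most once per page.
        counts.update({bucket(l, tol) for l in lines if in_margin(l, page_height, band)})
    needed = min_share * len(pages)
    return {b for b, n in counts.items() if n >= needed}


def strip_headers_footers(pages, page_height, band=0.12, tol=2.0, min_share=0.6):
    """Drop margin lines whose position recurs across pages; text is never inspected."""
    repeated = repeated_margin_buckets(pages, page_height, band, tol, min_share)
    return [
        [l for l in lines
         if not (in_margin(l, page_height, band) and bucket(l, tol) in repeated)]
        for lines in pages
    ]


# Three made-up pages, 800 units tall. The header text differs from page to
# page and the third page has no header at all.
pages = [
    [Line(30.0, "Chapter One"), Line(200.0, "a"), Line(770.0, "1")],
    [Line(30.5, "Some Title"), Line(200.0, "b"), Line(770.0, "2")],
    [Line(200.0, "c"), Line(400.0, "d"), Line(770.0, "3")],
]
cleaned = strip_headers_footers(pages, page_height=800)
print([[l.text for l in page] for page in cleaned])
```

Because only positions are compared, running headers that change per chapter and footers that carry page numbers are caught the same way; the trade-off is that any body text that happens to sit at a recurring margin position would be dropped too.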
Old 09-20-2009, 04:58 PM   #10
user_none
Sigil & calibre developer
Posts: 2,459
Karma: 986493
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
Quote:
Originally Posted by kovidgoyal View Post
@darkmonk: Well, the new engine won't use text content, but rather position on the page, to detect headers and footers. And I will probably remove the current remove header/footer option and replace it with a generic "remove content" option.
It might be a good idea to keep it. Just redefine it as "remove content", since it can actually match any text in the document. This could be helpful for poor PDF conversions, for instance, where the headers and footers are interspersed in the middle of the text in, say, Mobi or Epub files.
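
As a rough illustration of matching "any text in the document", here is a small sketch using Python's standard `re` module. The `remove_content` function and the sample pattern are hypothetical and are not calibre's actual option or code; the sample text imitates a converted book where a running header landed in the middle of a paragraph.

```python
import re


def remove_content(text, patterns):
    """Delete every match of each regular expression from the text."""
    for pattern in patterns:
        # MULTILINE lets ^ match at the start of every line, not just the text.
        text = re.sub(pattern, "", text, flags=re.MULTILINE)
    return text


# A running header ("My Book Title" plus a page number) stranded mid-paragraph.
converted = "It was a dark night and the\nMy Book Title 12\nwind howled.\n"
cleaned_text = remove_content(converted, [r"^My Book Title \d+\n"])
print(cleaned_text)
```

The same call works for any stray content, not just headers and footers, which is the point of the more generic name.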
Old 09-20-2009, 05:13 PM   #11
acidzebra
Liseuse Lover
Posts: 869
Karma: 1035404
Join Date: Jul 2008
Location: Netherlands
Device: PRS-505
Quote:
Originally Posted by kovidgoyal View Post
And I will probably remove the current remove header/footer option and replace it with a generic "remove content" option.
I hope there will still be a "show advanced/all features" button or something like that.
Old 09-20-2009, 05:25 PM   #12
kovidgoyal
creator of calibre
Quote:
Originally Posted by user_none View Post
It might be a good idea to keep it. Just redefine it as "remove content", since it can actually match any text in the document. This could be helpful for poor PDF conversions, for instance, where the headers and footers are interspersed in the middle of the text in, say, Mobi or Epub files.
Yeah, that's what remove content will do.
Old 09-22-2009, 10:27 AM   #13
aleks
Connoisseur
Posts: 89
Karma: 205
Join Date: Jul 2006
Location: Upstate NY
Device: Rocket eBook & Sony Reader
Hi Kovid,

Here is something with columns for you to work on.

Kozak
Old 09-23-2009, 06:40 AM   #14
mrmikel
Book Twiddler
Posts: 2,086
Karma: 1444487
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
some pdf to work on

Here is a PDF from the UK, the early history of their air force. I have tried and tried to convert this, and everything I did was unsuccessful without endless manual work, so good luck!
Attached Files
File Type: pdf earlyraf.pdf (657.0 KB, 136 views)