#1 |
|
Creator of calibre
Posts: 22,501
Karma: 2944574
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
PDF samples
I'm starting work on a new PDF conversion engine for calibre that will hopefully handle header and footer extraction as well as multi-column extraction. I'm asking for a few sample PDF files that I can use as a test corpus. I'd appreciate it if you could extract a few pages with different typographical features and make a new PDF file from them. Note that this new engine will not handle mathematics, tables, vector diagrams, etc., so don't provide samples for those. Also, this is a bit of a long-term project, so don't expect results too quickly.
__________________
Get calibre
Notice to all: I cannot provide assistance with DRM removal, for legal reasons, so please do not contact me about it. |
|
|
|
|
|
#2 |
|
Wizard
Posts: 4,988
Karma: 21320468
Join Date: Nov 2008
Location: SF Bay Area, California, USA
Device: Clié; PRS-505; EZR Pocket Pro
|
Would you prefer a single PDF with extracts from different works, or separate PDFs?
__________________
When I'm not here, I'm somewhere else. Mind the rainbows. |
|
|
|
|
|
#3 |
|
Creator of calibre
|
Separate PDFs, as the algorithm is likely to take overall document structure into account as well. I just don't want very large PDFs, and if they're copyrighted, it's best to extract only a small subsection.
|
|
|
|
|
#4 |
|
Wizard
|
Sample PDFs, all Creative Commons or other ok-to-distribute files.
They range from simple mainstream novel PDFs to nightmarish magazine formatting. Some have pictures; some have links.
Metadata's likely all over the place. Some have it; some don't.
Last edited by Elfwreck; 09-18-2009 at 04:37 PM. |
|
|
|
|
|
#5 |
|
Creator of calibre
|
Thanks, will come in handy.
|
|
|
|
|
#6 |
|
neilmarr
Posts: 7,233
Karma: 5881759
Join Date: Apr 2009
Location: Monaco-Menton, France
Device: sony
|
Hello there, Kovid: I just sent you an email, but in case it gets lost: I head up the editorial team at a small indie publishing house that has covered all its paperback novels with PDF versions for the past eight years (we have about 120 titles at www.bewrite.net). If you think it would be of any help, please drop me a line and I'll send as many PDF ebooks as you might need to experiment with. Most are straight text, all are well formatted, some use more than one font, and a few carry inside illustrations. All have front covers.
Cheers and good luck. Neil
Last edited by neilmarr; 09-19-2009 at 08:31 AM. Reason: to check my email address was included in sig line |
|
|
|
|
|
#7 |
|
Connoisseur
Posts: 58
Karma: 12
Join Date: Jan 2009
Device: none
|
Here's my contribution, a book I've tried several times to convert, using XPath to remove headers/footers. I just left a few sections with different features. If you want the full book, it's here. It's legal to distribute, too, I think.
I'm really happy you're finally trying to do this, but I'd still like it if you kept the option to specify what header/footer to remove. |
|
|
|
|
|
#8 |
|
Guru
Posts: 627
Karma: 1901287
Join Date: Mar 2009
Location: Rosario, Argentina
Device: SONY PRS-505, PRS-T2
|
Quote:
|
|
|
|
|
|
|
#9 |
|
Creator of calibre
|
@darkmonk: Well, the new engine won't use text content, but rather position on the page, to detect headers and footers. And I will probably remove the current remove header/footer option and replace it with a generic "remove content" option.
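
To make the position-based idea concrete, here is a minimal Python sketch. It is not calibre's actual algorithm, which the thread does not describe; the `Block` type, the `find_header_footer_blocks` function, and the `margin`, `min_fraction`, and `tolerance` parameters are all hypothetical names chosen for illustration. It assumes each text block's vertical extent has already been extracted, and it flags blocks that sit in the top or bottom band of the page at a vertical position that recurs on most pages, without ever comparing the text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical text block: page index plus vertical extent in points."""
    page: int
    top: float      # distance from the top edge of the page
    bottom: float
    text: str


def find_header_footer_blocks(blocks, page_height, num_pages,
                              margin=0.12, min_fraction=0.6, tolerance=2.0):
    """Return blocks lying in the top or bottom margin band at a vertical
    position that recurs on at least min_fraction of the pages.

    Only geometry is used; block text is never inspected.
    """
    def in_band(b):
        return (b.bottom <= page_height * margin
                or b.top >= page_height * (1 - margin))

    def slot(b):
        # Quantise the top coordinate so small jitter lands in the same slot.
        return round(b.top / tolerance)

    candidates = [b for b in blocks if in_band(b)]

    # Count how many distinct pages use each vertical slot.
    pages_per_slot = Counter()
    for key in {(slot(b), b.page) for b in candidates}:
        pages_per_slot[key[0]] += 1

    needed = min_fraction * num_pages
    return [b for b in candidates if pages_per_slot[slot(b)] >= needed]
```

Because only coordinates are compared, a running header whose text changes per chapter and a page number that changes on every page are both caught, while a one-off block that happens to sit near the page edge is left alone.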
|
|
|
|
|
#10 |
|
Sigil & calibre developer
Posts: 2,384
Karma: 848775
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
|
It might be a good idea to keep it. Just redefine it as "remove content", since it can actually match any text in the document. This could be helpful for poor PDF conversions, for instance, where the headers and footers are interspersed in the middle of the text in, say, MOBI or EPUB files.
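
As a rough illustration of a generic "remove content" step, the Python sketch below deletes every match of a list of regular expressions from already-converted text. The function name `remove_content` and the sample patterns are assumptions made for this example; they are not calibre's option names or implementation.

```python
import re


def remove_content(text, patterns):
    """Delete every match of each regular expression in patterns from text.

    re.MULTILINE lets ^ and $ anchor at line boundaries, so a pattern can
    target a stray header or footer line stranded in the middle of the text.
    """
    for pattern in patterns:
        text = re.sub(pattern, "", text, flags=re.MULTILINE)
    return text


# Hypothetical converted text with a running header and a page-number
# footer left in the middle of a sentence.
converted = (
    "First paragraph continues\n"
    "My Book Title\n"
    "Page 12\n"
    "onto the next page.\n"
)
cleaned = remove_content(converted, [r"^My Book Title\n", r"^Page \d+\n"])
```

Since the patterns match on text alone, the same step would work on any text-based output, not only on PDF input.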
|
|
|
|
|
|
#11 |
|
Liseuse Lover
Posts: 869
Karma: 8000
Join Date: Jul 2008
Location: Netherlands
Device: PRS-505
|
|
|
|
|
|
|
#12 |
|
Creator of calibre
|
Quote:
|
|
|
|
|
|
#13 |
|
Connoisseur
Posts: 89
Karma: 205
Join Date: Jul 2006
Location: Upstate NY
Device: Rocket eBook & Sony Reader
|
Hi Kovid,
Here is something with columns for you to work on. Kozak |
|
|
|
|
|
#14 |
|
Book Twiddler
Posts: 975
Karma: 1087515
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
|
some pdf to work on
Here is a PDF from the UK, the early history of their air force. I have tried and tried to convert this, and everything I did was unsuccessful without endless manual work, so good luck!
|
|
|
|