Old 02-09-2009, 05:32 PM   #1
Student1
Groupie
Posts: 159
Karma: 170
Join Date: Feb 2009
Device: PRS-505
Expert help required: Cleaning bad PDF scans

Hi guys,

Well, here's my first post. I have looked around a bit for an answer but couldn't find anything, so here is my problem:

I have about 20 books of 100 pages like these, all in PDF format. Needless to say, I need to clean them up.

Here is what I need to do:

1: Extract all PDFs to images (which format is best? TIFF?)
2: Remove the black borders (crop them)
3: Divide the images in half so only one book page shows per PDF page.
4: Center all the pages and make them viewable on a PRS-505.

I'm guessing I will have to extract the images from the PDFs and work with those, but doing it one by one for each 200+ page book seems crazy. Is there an easy way of doing this in batch, and what software would I need, free or not, to extract, crop, divide and resize?
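Steps 2 and 3 above can be sketched in a few lines. This is only a minimal illustration, assuming each scan has already been loaded as a grayscale grid (a list of rows, 0 = black, 255 = white) and that the two book pages sit side by side with the gutter in the middle. The function names and the threshold values are made up for the example and are not taken from any of the tools mentioned in this thread.

```python
def is_border(pixels, dark_threshold=64, min_dark_fraction=0.9):
    """Return True if a row or column is almost entirely dark."""
    if not pixels:
        return False
    dark = sum(1 for p in pixels if p < dark_threshold)
    return dark >= min_dark_fraction * len(pixels)


def crop_dark_border(img):
    """Trim dark rows and columns from the four edges of the scan."""
    top, bottom = 0, len(img)
    while top < bottom and is_border(img[top]):
        top += 1
    while bottom > top and is_border(img[bottom - 1]):
        bottom -= 1

    def column(x):
        return [img[y][x] for y in range(top, bottom)]

    left, right = 0, len(img[0])
    while left < right and is_border(column(left)):
        left += 1
    while right > left and is_border(column(right - 1)):
        right -= 1
    return [row[left:right] for row in img[top:bottom]]


def split_halves(img):
    """Cut a two-page scan down the middle into (left page, right page)."""
    if not img:
        return [], []
    mid = len(img[0]) // 2
    return [row[:mid] for row in img], [row[mid:] for row in img]


# Tiny synthetic "scan": a white sheet surrounded by a 1-pixel black border.
scan = [[0] * 10] + [[0] + [255] * 8 + [0] for _ in range(4)] + [[0] * 10]
cropped = crop_dark_border(scan)
left_page, right_page = split_halves(cropped)
print(len(cropped), len(cropped[0]), len(left_page[0]), len(right_page[0]))
```

Because the crop is found per image, the same loop could run over every page of every book without setting a crop box by hand, although real scans with skew or a dark gutter would likely need a more forgiving test than this one.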


Any help is appreciated, thanks!

Edit: Software tested: Adobe Acrobat, Readiris, Photoshop, Amber PDF Converter and some more. I couldn't find any way to batch things up even a bit. It will probably take 4 or so steps, but hopefully I won't have to manually crop over 1000 pages.

Last edited by Student1; 02-09-2009 at 05:51 PM. Reason: adding information
Old 02-10-2009, 04:51 AM   #2
owl123
Addict
Posts: 234
Karma: 214
Join Date: Nov 2008
Device: Galaxy Note 3, Galaxy NotePro 12.2, InkBook
You could use Adobe Acrobat (not Reader). It supports both cropping and dividing. There would be no need to extract images.

Batch processing won't work here as my guess is that all these 20 documents have different sizes (i.e. need to have cropping set individually).
Old 02-10-2009, 05:30 AM   #3
Student1
Groupie
Posts: 159
Karma: 170
Join Date: Feb 2009
Device: PRS-505
Quote:
Originally Posted by owl123 View Post
You could use Adobe Acrobat (not Reader). It supports both cropping and dividing. There would be no need to extract images.

Batch processing won't work here as my guess is that all these 20 documents have different sizes (i.e. need to have cropping set individually).

Thanks, I have tried to use it, but it just took too long. I have found software that claims it can crop based on black borders: PixEdit. I will give it a try and report back!

Last edited by Student1; 02-10-2009 at 05:44 AM.
Old 02-10-2009, 10:34 AM   #4
slayda
Retired & reading more!
Posts: 2,764
Karma: 1884247
Join Date: Sep 2006
Location: North Alabama, USA
Device: Kindle 1, iPad Air 2, iPhone 6S+, Kobo Aura One
I have ABBYY FineReader, an OCR program. It does all those things (but only semi-batch, since all the crops are slightly different), plus you can OCR the result, save it to an MS Word file and use BookDesigner to change it to LRF.

Note: it will require more work if there are images but it works well with text and few images. (Also not cheap)
Old 02-10-2009, 01:49 PM   #5
Elfwreck
Grand Sorcerer
Posts: 5,185
Karma: 25133758
Join Date: Nov 2008
Location: SF Bay Area, California, USA
Device: Pocketbook Touch HD3 (Past: Kobo Mini, PEZ, PRS-505, Clié)
I adore Pixedit; have been using it for years. It will indeed crop the borders, but (unless there's a newer, updated version) won't do so automatically.

Quickest crop is using the selection tool (the one that's on by default when you open a document) to surround the content you want to keep, and then shift-delete to remove everything outside of it.

You can also use the auto-deskew, either page-by-page or for a whole document. However, it sometimes picks an odd horizontal point and skews the whole page badly; you can Undo this (on each page individually, rather than the whole set at once) and manually deskew the page instead.

Filtering to remove objects smaller than 3 pixels will get rid of most speckles for up-to-300dpi scans. When re-setting the filter numbers, watch your I's and .'s; when it starts to remove dots from letters & punctuation, it's set too high.
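For anyone curious what a "remove objects smaller than N pixels" filter does, here is a rough sketch. It is not PixEdit's actual implementation; it assumes a black-and-white page stored as a grid of 1 (ink) and 0 (paper), treats diagonally touching pixels as one object, and uses a made-up function name.

```python
def remove_small_objects(img, min_pixels=3):
    """Erase connected groups of ink pixels that have fewer than min_pixels."""
    height, width = len(img), len(img[0])
    out = [row[:] for row in img]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if img[y][x] != 1 or seen[y][x]:
                continue
            # Flood fill to collect one connected object (8-connectivity).
            stack = [(y, x)]
            seen[y][x] = True
            component = []
            while stack:
                cy, cx = stack.pop()
                component.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
            if len(component) < min_pixels:
                for cy, cx in component:
                    out[cy][cx] = 0
    return out


# Two 1-pixel specks plus a 4-pixel "dot" that belongs to the text.
page = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [1, 0, 0, 0, 0],
]
cleaned = remove_small_objects(page, min_pixels=3)
too_aggressive = remove_small_objects(page, min_pixels=5)
```

With the limit at 3 the specks go and the 4-pixel dot stays; raising it to 5 wipes the dot as well, which is the "watch your I's and .'s" problem described above.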
Old 02-10-2009, 03:48 PM   #6
Student1
Groupie
Posts: 159
Karma: 170
Join Date: Feb 2009
Device: PRS-505
Thanks guys, I'll try these programs and see how it goes! I tried a few OCR programs but wasn't satisfied; they always seem to mess up on some pages (when there are graphics). FineReader seems interesting; if it can split pages in half, it's a done deal! Thanks for your help, guys. I'll explore a bit and go from there!
Old 02-10-2009, 04:46 PM   #7
Student1
Groupie
Posts: 159
Karma: 170
Join Date: Feb 2009
Device: PRS-505
Wow, ABBYY FineReader is a BEAST! It does the job almost perfectly, even with pictures, and that's just on the default settings without adjusting anything. It converted the book almost flawlessly to Word with all images intact! It still shows 2 pages, but I haven't messed with the options yet. I just wanted to thank you for letting me know about this powerful software! Even OmniPage 16 Pro could not do what ABBYY FineReader can do! I am amazed!!!
Old 02-10-2009, 07:59 PM   #8
Hodapp87
Junior Member
Posts: 6
Karma: 10
Join Date: Jan 2009
Device: Bebook
I would probably do something like...
1) Use pdfimages to extract all images.
2) Open a few images and figure out a color curve that pushes all text to black and most background to white. Save a gradient of this that imagemagick can grok. Figure out where to set some basic crop boxes. This assumes
3) Use imagemagick to crop and split the series of images all at once.
4) Use imagemagick to apply the color curve to all the images.
5) Do something useful with the series of converted images... DjVu, OCR, whatever.
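The steps above could be scripted roughly like this. It is a sketch under several assumptions: `pdfimages` (from xpdf/poppler) and ImageMagick's `convert` are on the PATH (ImageMagick 7 installs call the binary `magick` instead), every scan in a book can share one crop box, and a plain `-level` stretch stands in for the saved gradient described in step 2, which is a simpler technique than the one suggested. The file names and the crop geometry are hypothetical.

```python
import subprocess


def extract_cmd(pdf_path, prefix):
    """Command that dumps every embedded image from the PDF as prefix-NNN.*"""
    return ["pdfimages", pdf_path, prefix]


def clean_cmd(src, dst_pattern, crop_box, black="20%", white="80%"):
    """Command that crops the border, splits the scan in two and boosts contrast."""
    return [
        "convert", src,
        "-crop", crop_box, "+repage",      # cut away the black border
        "-crop", "50%x100%", "+repage",    # tile into left and right halves
        "-level", f"{black},{white}",      # push text to black, paper to white
        dst_pattern,                       # %d in the name numbers the halves
    ]


def run(cmd):
    """Execute one command and stop on the first failure."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the commands only; call run() on them once the values look right.
    print(extract_cmd("book.pdf", "pages/img"))
    print(clean_cmd("pages/img-000.ppm", "clean/img-000-%d.png", "1600x1100+40+30"))
```

Looping `clean_cmd` over every extracted file gives the "all at once" part; the crop box and level percentages would still need to be picked by opening a few images first, as step 2 says.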
Old 02-11-2009, 02:13 PM   #9
Student1
Groupie
Posts: 159
Karma: 170
Join Date: Feb 2009
Device: PRS-505
Thanks for the alternative way. What I used was easy reader 9 to crop and auto-divide the pages. The cropping tools were very fast: I had to do the pages one by one, but I could keep the same dimensions when switching to the next page. Then I converted to TIFF, as I wanted to save as image-only PDF/A (no OCR). I used Adobe Acrobat to rebuild the TIFFs into one PDF and saved it as PDF/A. Worked perfectly! Thanks for all your help!!!
Old 02-11-2009, 08:28 PM   #10
Student1
Groupie
Posts: 159
Karma: 170
Join Date: Feb 2009
Device: PRS-505
Maybe one last question: it seems my PDFs are 10 times larger (40 MB) than the original files (4 MB). I surely didn't add any more detail, as I used the same scanned images from the original. I did save as PDF/A, so any tips on reducing the size without changing the original quality?
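One possible explanation, offered only as a guess since the thread does not confirm it: the intermediate TIFFs may have been stored at a higher bit depth (or with weaker compression) than the original scans. The arithmetic below shows how much the raw, uncompressed size of a page depends on bit depth alone. The page dimensions are an assumption (a letter-size page at 300 dpi), not the actual files from this thread.

```python
def raw_page_bytes(width_px, height_px, bits_per_pixel):
    """Uncompressed size of one page, with each row padded to whole bytes."""
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


# Assumed page size: 8.5 x 11 inches at 300 dpi.
WIDTH, HEIGHT = 2550, 3300
bitonal = raw_page_bytes(WIDTH, HEIGHT, 1)     # pure black and white
grayscale = raw_page_bytes(WIDTH, HEIGHT, 8)   # 256 gray levels
color = raw_page_bytes(WIDTH, HEIGHT, 24)      # 24-bit RGB

for name, size in (("1-bit", bitonal), ("8-bit", grayscale), ("24-bit", color)):
    print(f"{name}: {size / 1_000_000:.1f} MB per page before compression")
```

If that is what happened, checking the bit depth and compression setting of the TIFFs before rebuilding the PDF would be the first thing to try.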
Old 02-11-2009, 09:23 PM   #11
nrapallo
GuteBook/Mobi2IMP Creator
Posts: 2,958
Karma: 2530691
Join Date: Dec 2007
Location: Toronto, Canada
Device: REB1200 EBW1150 Device: T1 NSTG iLiad_v2 NC Device: Asus_TF Next1 WPDN
Try unpaper (post-processing scanned and photocopied book pages) to clean up bad .pdf scans.

It's an option callable from within PDFRead v1.8. You enter its options in the (white) input box as seen in the GUI input screen below:


Last edited by nrapallo; 02-11-2009 at 09:46 PM. Reason: added GUI screen picture
Old 02-11-2009, 10:09 PM   #12
Student1
Groupie
Posts: 159
Karma: 170
Join Date: Feb 2009
Device: PRS-505
Thanks, going to give it a try!
Old 03-03-2009, 05:57 AM   #13
HOH
Junior Member
Posts: 1
Karma: 10
Join Date: Mar 2009
Device: none
PixEdit

Quote:
Originally Posted by Elfwreck View Post
I adore Pixedit; have been using it for years. It will indeed crop the borders, but (unless there's a newer, updated version) won't do so automatically.

Quickest crop is using the selection tool (the one that's on by default when you open a document) to surround the content you want to keep, and then shift-delete to remove everything outside of it.

You can also use the auto-deskew, either page-by-page or for a whole document. However, it sometimes picks an odd horizontal point and skews the whole page badly; you can Undo this (on each page individually, rather than the whole set at once) and manually deskew the page instead.

Filtering to remove objects smaller than 3 pixels will get rid of most speckles for up-to-300dpi scans. When re-setting the filter numbers, watch your I's and .'s; when it starts to remove dots from letters & punctuation, it's set too high.
Where did you download PixEdit? I would like to try it too!
Regards
HOH