Expert help required: Cleaning bad PDF scans


Student1
02-09-2009, 05:32 PM
Hi guys,

Well, here's my first post. I have looked around a bit for an answer but couldn't find anything, so here is my problem:

http://filebeam.com/46ff08e931f15517146b4991b922b5f9.jpg

I have about 20 books of 100 pages like these, all in PDF format. Needless to say, I need to clean them up.

Here is what I need to do:

1: Extract all PDF pages to images. (What format is best? TIFF?)
2: Remove the black borders (crop them).
3: Divide the images in half so only one page shows per PDF page.
4: Center all the pages and make them viewable on a PRS-505.

I'm guessing I will have to extract the images from the PDFs and work with those, but doing it one by one for each 200+ page book seems crazy. Is there an easy way to do this in batch, and what software would I need, free or not, to extract, crop, divide and resize?


Any help is appreciated, thanks!

Edit: Software tested: Adobe Acrobat, Readiris, Photoshop, Amber PDF Converter and some more... I couldn't find any way to batch things up, even a bit. It will probably take 4 or so steps, but hopefully not manually cropping over 1000 pages... ;)

owl123
02-10-2009, 04:51 AM
You could use Adobe Acrobat (not Reader). It supports both cropping and dividing. There would be no need to extract images.

Batch processing won't work here as my guess is that all these 20 documents have different sizes (i.e. need to have cropping set individually).

Student1
02-10-2009, 05:30 AM
Thanks, I have tried Acrobat, but it just took too long. I have found a program that claims it can crop based on black borders: Pixedit. I will give it a try and report back!
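For anyone wondering what "crop based on black borders" and "divide in half" amount to, here is a minimal sketch in plain Python. It is not how Pixedit or any other tool in this thread works internally. The function names, the `black_max` darkness cutoff and the `dark_fraction` ratio are all assumptions made for illustration, and the page is assumed to be a grayscale image held as a list of rows of 0-255 values.

```python
def content_box(gray, black_max=40, dark_fraction=0.9):
    """Find the box left after trimming mostly-black rows and columns.

    gray is a list of rows of 0-255 values (0 = black).
    Returns (left, top, right, bottom) with right/bottom exclusive.
    black_max and dark_fraction are illustrative guesses, not tuned values.
    """
    height = len(gray)
    width = len(gray[0]) if height else 0

    def mostly_black(values):
        dark = sum(1 for v in values if v <= black_max)
        return dark >= dark_fraction * len(values)

    # Trim black rows from the top and bottom.
    top = 0
    while top < height and mostly_black(gray[top]):
        top += 1
    bottom = height
    while bottom > top and mostly_black(gray[bottom - 1]):
        bottom -= 1

    # Trim black columns, looking only at the rows that survived.
    rows = gray[top:bottom]
    left = 0
    while left < width and mostly_black([row[left] for row in rows]):
        left += 1
    right = width
    while right > left and mostly_black([row[right - 1] for row in rows]):
        right -= 1
    return left, top, right, bottom


def split_spread(box):
    """Split a two-page spread box down the middle into left and right pages."""
    left, top, right, bottom = box
    middle = (left + right) // 2
    return (left, top, middle, bottom), (middle, top, right, bottom)
```

The two boxes returned by `split_spread` can then be handed to whatever image tool does the actual cropping. Splitting at the exact middle assumes the gutter sits in the center of the scan, which will not hold for every book.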

slayda
02-10-2009, 10:34 AM
I have ABBYY FineReader, an OCR program. It does all those things (but only semi-batch, since all the crops are slightly different), plus you can OCR the result, save it to an MS Word file and use BookDesigner to change it to LRF.

Note: it will require more work if there are images, but it works well with text and few images. (Also, it's not cheap.)

Elfwreck
02-10-2009, 01:49 PM
I adore Pixedit; have been using it for years. It will indeed crop the borders, but (unless there's a newer, updated version) won't do so automatically.

Quickest crop is using the selection tool (the one that's on by default when you open a document) to surround the content you want to keep, and then shift-delete to remove everything outside of it.

You can also use the auto-deskew, either page-by-page or for a whole document. However, it sometimes picks an odd horizontal point and skews the whole page badly; you can Undo this (on each page individually, rather than the whole set at once) and manually deskew the page instead.

Filtering to remove objects smaller than 3 pixels will get rid of most speckles for up-to-300dpi scans. When re-setting the filter numbers, watch your I's and .'s; when it starts to remove dots from letters & punctuation, it's set too high.
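To make the speckle filter idea concrete, here is a minimal sketch of a "remove objects smaller than N pixels" filter in plain Python. It is only an illustration, not Pixedit's actual filter: it assumes a black-and-white page stored as a list of rows (1 = ink, 0 = paper), treats "size" as the pixel count of an 8-connected blob, and the name `despeckle` is made up for this example.

```python
from collections import deque


def despeckle(bitmap, min_size=3):
    """Return a copy of bitmap with ink blobs smaller than min_size erased.

    bitmap is a list of rows; 1 = ink (black), 0 = paper (white).
    A blob is a group of ink pixels touching side-to-side or diagonally.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    cleaned = [row[:] for row in bitmap]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill one blob and collect its pixels.
            blob = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for ny in range(cy - 1, cy + 2):
                    for nx in range(cx - 1, cx + 2):
                        if (0 <= ny < height and 0 <= nx < width
                                and bitmap[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Blobs below the threshold are treated as speckles.
            if len(blob) < min_size:
                for by, bx in blob:
                    cleaned[by][bx] = 0
    return cleaned
```

Raising `min_size` removes more noise but, as noted above, eventually starts eating the dots on i's and periods, since those are small blobs too.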

Student1
02-10-2009, 03:48 PM
Thanks guys, I'll try these programs and see how it goes! I tried a few OCR programs as a test but wasn't satisfied; they always seem to mess up on some pages (when there are graphics). FineReader seems interesting; if it can split pages in half, it's a done deal :)! Thanks for your help guys, I'll explore a bit and go from there!

Student1
02-10-2009, 04:46 PM
Wow, ABBYY FineReader is a BEAST :)! It does the job almost perfectly! Even with pictures, and that's just on the default settings without adjusting anything... it converted the book almost flawlessly to Word with all images intact! The output still has 2 pages per sheet, but I haven't messed with the options yet; I just wanted to thank you for letting me know about this powerful software! Even OmniPage 16 Pro could not do what ABBYY FineReader can do! I am amazed!!!

Hodapp87
02-10-2009, 07:59 PM
I would probably do something like...
1) Use pdfimages to extract all images.
2) Open a few images and figure out a color curve that pushes all text to black and most background to white. Save a gradient of this that imagemagick can grok. Figure out where to set some basic crop boxes. This assumes
3) Use imagemagick to crop and split the series of images all at once.
4) Use imagemagick to apply the color curve to all the images.
5) Do something useful with the series of converted images... DjVu, OCR, whatever.
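The steps above can be sketched as a small Python driver around the two command-line tools. This is a rough sketch under assumptions, not a tested recipe: the `shave` and `level` values are placeholders you would pick after opening a few pages, the helper names are made up, and it assumes `pdfimages` (from xpdf/poppler) and ImageMagick's `convert` are on the PATH (newer ImageMagick installs may call it `magick` instead).

```python
import subprocess
from pathlib import Path


def extract_cmd(pdf, prefix):
    # pdfimages writes prefix-000.ppm, prefix-001.ppm, ... (.pbm for 1-bit scans).
    return ["pdfimages", str(pdf), str(prefix)]


def clean_cmd(src, dst_pattern, shave="60x60", level="25%,75%"):
    # shave and level are placeholder values; tune them on a few sample pages.
    return [
        "convert", str(src),
        "-shave", shave,        # trim a fixed border from every edge
        "-crop", "50%x100%",    # cut the spread into left and right halves
        "+repage",              # reset the canvas offsets after cropping
        "-level", level,        # push text toward black, background toward white
        str(dst_pattern),       # e.g. page-%d.png, one file per half
    ]


def run_pipeline(pdf, workdir):
    """Extract every image from pdf, then crop, split and level each one."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(extract_cmd(pdf, workdir / "scan"), check=True)
    for src in sorted(workdir.glob("scan-*.p?m")):
        dst = workdir / f"{src.stem}-%d.png"
        subprocess.run(clean_cmd(src, dst), check=True)
```

Calling `run_pipeline("book.pdf", "out")` would leave two PNGs per scanned spread in `out`. A fixed `-shave` only works if every page in a book has roughly the same border, so each book may need its own values.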

Student1
02-11-2009, 02:13 PM
Thanks for the alternative approach. What I used was Easy Reader 9 to crop and auto-divide the pages. The cropping tools were very fast: I had to do the pages one by one, but I could keep the same crop dimensions when switching to the next page. Then I converted to TIFF, as I wanted to save as image-only PDF/A (no OCR). I used Adobe Acrobat to rebuild the TIFFs into one PDF and saved it as PDF/A. Worked perfectly! Thanks for all your help!!!

Student1
02-11-2009, 08:28 PM
Maybe one last question: my PDFs are 10 times larger (40 MB) than the original files (4 MB). I surely didn't add any detail, as I used the same scanned images from the originals. I did save as PDF/A, so any tips on reducing the size without changing the original quality?

nrapallo
02-11-2009, 09:23 PM
Try unpaper (http://unpaper.berlios.de/), a tool for post-processing scanned and photocopied book pages, to clean up bad PDF scans.

It's an option callable from within PDFRead v1.8 (http://www.mobileread.com/forums/showthread.php?t=21906). You enter its options in the (white) input box as seen in the GUI input screen below:

http://www.mobileread.com/forums/attachment.php?attachmentid=11860&d=1206899466

Student1
02-11-2009, 10:09 PM
Thanks, going to give it a try! :)

HOH
03-03-2009, 05:57 AM
Elfwreck, where did you download PixEdit? I would like to try it too!
Regards
HOH