#1
Fanatic
Posts: 584
Karma: 914
Join Date: Mar 2008
Device: iliad
Processing scanned data into nice PDFs

Hi people! Now that I've become comfortable with my iLiad, I need to fill it with content.

I have scanned a book, and for my latest course the organizers gave me the PDFs they had scanned, for use on the iLiad; normally they print the course lectures and sell them to the students at copying cost. The problem, however, is formatting. When I scanned the book, I put it face down on the scanner, with two pages fitting on one A4 scan. Now I have all the PNGs (I used OmniPage to scan and export them), but I need to split the images and rotate them.

So, is there any tool that can comfortably crop and rotate a whole batch of images? Ideally one should be able to specify the settings on one page, since the clipping region differs from book to book (or scanned paper) but usually stays the same throughout a whole title.

Right now I have my first 10-page article readable on the iLiad, without zooming/panning woes. I exported the PDF I got into a set of PNGs, opened all of them in GIMP, copied and pasted the pages into new images, rotated them, and saved them. Then I imported them back into OmniPage, printed to PDF, and downloaded the result onto the iLiad. Voilà! This was doable with a 10-page article and took an hour (including figuring some things out). I could actually have read the article on printed paper in that time instead.

But well, does anyone know of a tool for this, or know how to do such tasks? I guess it is something many people will encounter when using their iLiad...
#2
Addict
Posts: 325
Karma: 1725
Join Date: Dec 2007
Location: Münster, Germany
Device: iRex iLiad v2
I faced a very similar problem after scanning a book with our university's scanner system. I couldn't find suitable software for the purpose, so I simply wrote some myself.
It's basically a Linux bash script that extracts the two pages from the PDF, does some image manipulation such as contrast and gamma correction (using ImageMagick), and writes back a PDF file with the optimized version. If you're interested and have a Linux box around, I will upload a pre-alpha version of it...
#3
Zealot
Posts: 127
Karma: 9856
Join Date: Dec 2007
Location: Ontario, Canada
Device: Sony PRS-300/Kindle Keyboard/iPad Mini
Two options come to mind, one free and the other not.

The free option (for Windows only) would be Irfanview. Its batch conversion option can be used to crop to specified coordinates, rotate, etc., and it can automatically name the resulting files based on a template. One pass through your files would give you the "odd" pages and a second, with a different set of cropping coordinates, would give you the "even" pages.

The commercial software that I'd recommend is Vuescan, from Hamrick Software, which is available for Windows, Mac OS X, and Linux. Despite its name, it isn't just for use with scanners - it will treat files from disk as if they were from a scanner. Assuming that your files are of uniform size, you can set the area you want it to "scan" on one file and tell it to process the whole directory, naming the resulting files automatically. You could definitely do two runs through the files, one for odd pages and a second for even pages, but I think you might also be able to use the multi-crop option to get both pages in one pass through the files. The only catch here is that I don't think Vuescan supports PNG.

I love my Vuescan Professional software; I use it to scan everything to either RAW TIFF or DNG (digital negative) format, then use Vuescan again at a later date to process the raw files to produce a multi-page PDF or whatever I need. It's a truly outstanding piece of software with exceptionally good licensing terms (you get the right to install it on up to three computers for a very reasonable fee).
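The two-pass idea (one set of crop coordinates for the left-hand pages, another for the right-hand ones, with template-based naming) can be sketched as a bash dry run. This is only an illustration: the function names `split_geometry` and `page_name`, the 3600x2400 scan size, and the `page_%04d.png` template are assumptions, not anything taken from Irfanview or Vuescan, and the ImageMagick `convert` command is only printed, not executed.

```shell
#!/bin/bash
# Hypothetical helper: print the crop geometry (WIDTHxHEIGHT+X+Y) for the
# left or right half of a scan that holds two book pages side by side.
split_geometry() {
    local width=$1 height=$2 side=$3
    local half=$(( width / 2 ))
    if [ "$side" = left ]; then
        echo "${half}x${height}+0+0"
    else
        echo "${half}x${height}+${half}+0"
    fi
}

# Hypothetical naming template: scan N yields output pages 2N-1 (left half)
# and 2N (right half).
page_name() {
    local scan=$1 side=$2 page
    if [ "$side" = left ]; then
        page=$(( 2 * scan - 1 ))
    else
        page=$(( 2 * scan ))
    fi
    printf 'page_%04d.png\n' "$page"
}

# Dry run for three scans of an assumed 3600x2400 pixel size:
# print the commands instead of running them.
for scan in 1 2 3; do
    for side in left right; do
        echo "convert scan_${scan}.png -crop $(split_geometry 3600 2400 "$side") +repage $(page_name "$scan" "$side")"
    done
done
```

Real scans rarely split exactly down the middle, so in practice the two geometries would be measured once per book, as the post suggests, rather than computed.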
#4
Fanatic
Posts: 584
Karma: 914
Join Date: Mar 2008
Device: iliad
-thomas-, yes, I'm interested, and I have Linux around everywhere.
dcalder, thanks, I will check out both!
#5
Addict
Posts: 325
Karma: 1725
Join Date: Dec 2007
Location: Münster, Germany
Device: iRex iLiad v2
OK, here is a bash script for converting scanned PDFs into single-page PDFs. Actually, it's more the concept of a bash script, as there is no error checking etc. So if one of the commands exits with an error, the script will go on as if nothing happened.
Usage instructions are inside the script, so please read the comments :-) Have fun!
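For readers who want the missing error checking, one common bash pattern is to wrap every step in a small helper and chain the steps so that a failure stops the run. The sketch below is not taken from the attached script; the names `run_step` and `process_all` are hypothetical, and `true`/`false` stand in for the real processing commands.

```shell
#!/bin/bash
# Hypothetical wrapper: run one processing step and report a failure
# instead of carrying on as if nothing happened.
run_step() {
    local description=$1
    shift
    echo "running: $description"
    if ! "$@"; then
        echo "error: step '$description' failed" >&2
        return 1
    fi
}

# Chain the steps with && so that a failed step skips everything after it.
# The first step runs whatever command is passed in; the second is a stand-in.
process_all() {
    run_step "first step" "$@" &&
    run_step "second step" true &&
    echo "all steps finished"
}

# Demo with stand-in commands: 'true' succeeds, 'false' fails.
process_all true
process_all false || echo "pipeline stopped early"
```

A blunter alternative is to put `set -e` near the top of the script, which makes bash exit as soon as a simple command fails; the wrapper has the advantage of saying which step went wrong.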
#6
Fanatic
Posts: 584
Karma: 914
Join Date: Mar 2008
Device: iliad
-Thomas-, thanks, I used it and it works very well! I will continue to use it!
(Not that I wouldn't dream of a visual GUI where you can select the rectangles.)
#7
Fanatic
Posts: 584
Karma: 914
Join Date: Mar 2008
Device: iliad
Oh dear, I just noticed that the quality of the text/images is greatly reduced in this split-up PDF compared to the original PDF zoomed to fit the screen. I can hardly read it anymore.
I guess there are too many compression/decompression steps going on in this process. Well, I will see how it looks when just using "convert" to manipulate the TIFF array and -adjoin it into a PDF. "convert" has cropping too, BTW!
#8
Addict
Posts: 325
Karma: 1725
Join Date: Dec 2007
Location: Münster, Germany
Device: iRex iLiad v2
Quote:
Quote:
If you make some improvements or have an interesting hack for the script, please let me know... A GUI for it is on my list too, but at the moment I don't have enough time to code it...
#9
Fanatic
Posts: 584
Karma: 914
Join Date: Mar 2008
Device: iliad
You are right! convert -adjoin creates an uncompressed PDF: 1.2 GB here for a 350-page book (besides taking an eternity)!

I experimented a lot over the last few days to convert a book I scanned; below is the script that gives me a result I'm quite pleased with. I scanned the book at 300 dpi in black and white. I don't know whether grayscale would have given better results; at the time I thought the iLiad works well with black and white, so that's what I wanted... Maybe one should pick other settings in future? I scanned with OmniPage and exported the images as uncompressed TIFFs to avoid any quality loss.

The problem was that the strokes of the fonts in the scanned images are sometimes only 1 or 2 pixels wide, and when the iLiad zooms out to fit the screen, the result gets quite ugly. So I apply a very small Gaussian blur and then turn everything that has become a bit gray back to pitch black. This gives a nice, slightly bolder font that looks clean when zoomed.

tiff2ps conveniently takes multiple images and joins them into one PS file. Strangely enough, tiff2pdf cannot do this (one would think the same command processor is behind both, but oh well). ps2pdf, in the last step, automatically compresses the images.

The remaining problem I have is PAPERSIZE in ps2pdf: it defaults to A4, and I set it to A5 because that approximately matches the pages of this book. I'd like to either set the size in pixels (but it takes units of 1/72 inch instead) or, ideally, have it keep the size of the trimmed images. Dunno how to do that.

The whole script still takes an hour on my 1.2 GHz notebook for this 350-page book, but it's worth it, as long as the result is something you want.

Code:
    #!/bin/bash
    # parameter: tif files are prefixed with this:
    NAME=schone_
    # parameter: crop areas (WIDTHxHEIGHT+OffsetLeft+OffsetTop)
    CROPLEFT=1800x2200+10+70
    CROPRIGHT=1800x2200+1800+70
    # this makes the text a bit bolder, so it looks nice when resized on the iliad:
    CONVERTOPTS="-trim -blur 1 -threshold 65534"

    shopt -s extglob
    mkdir tmp

    #determine number of pages there..
    MAXPAGE=1
    while test -f $NAME*(0)$MAXPAGE.TIF; do
        let MAXPAGE++
    done
    let MAXPAGE--
    echo "converting $MAXPAGE pages, this will take a while!"

    echo "STEP 1 OF 4: preparing left pages"
    for ((i = 1; $i <= $MAXPAGE; i++)); do
        echo "@$i of $MAXPAGE"
        convert -crop $CROPLEFT $CONVERTOPTS $NAME*(0)$i.TIF tmp/l$i.TIF
    done

    echo "STEP 2 OF 4: preparing right pages"
    for ((i = 1; $i <= $MAXPAGE; i++)); do
        echo "@$i of $MAXPAGE"
        convert -crop $CROPRIGHT $CONVERTOPTS $NAME*(0)$i.TIF tmp/r$i.TIF
    done

    echo "STEP 3 OF 4: adjoining pages"
    ALL=
    for ((i = 1; $i <= $MAXPAGE; i++)); do
        ALL="$ALL tmp/l$i.TIF tmp/r$i.TIF"
    done
    tiff2ps $ALL > output.ps

    echo "STEP 4 OF 4: converting/compressing to PDF"
    ps2pdf -sPAPERSIZE=a5 output.ps
    echo "done!"
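On the open paper-size question: a PostScript point is 1/72 inch, so a pixel size at a known scan resolution converts as points = pixels * 72 / dpi. The sketch below works this out for the 1800x2200 crop at 300 dpi and prints (rather than runs) a ps2pdf call using Ghostscript's `-dDEVICEWIDTHPOINTS` and `-dDEVICEHEIGHTPOINTS` switches. The function name `px_to_pt` is hypothetical, and because `-trim` shrinks each page by a varying amount, the untrimmed crop size is only an approximation of the final page size.

```shell
#!/bin/bash
# Convert a length in pixels to PostScript points (1 point = 1/72 inch),
# rounding to the nearest whole point. The function name is hypothetical.
px_to_pt() {
    local pixels=$1 dpi=$2
    echo $(( (pixels * 72 + dpi / 2) / dpi ))
}

# Crop size taken from the script above; 300 dpi is the stated scan resolution.
WIDTH_PX=1800
HEIGHT_PX=2200
DPI=300

W_PT=$(px_to_pt "$WIDTH_PX" "$DPI")
H_PT=$(px_to_pt "$HEIGHT_PX" "$DPI")

# Dry run: print the ps2pdf call rather than executing it.
echo "ps2pdf -dDEVICEWIDTHPOINTS=$W_PT -dDEVICEHEIGHTPOINTS=$H_PT output.ps"
```

For this crop the result is 432 x 528 points, noticeably larger than A5 (roughly 420 x 595 points) in width and smaller in height, which is why a fixed named paper size only approximately fits.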
#10
Guru
Posts: 860
Karma: 4380
Join Date: Feb 2008
Location: Almada, Portugal
Device: Cybook Gen3, Sony PRS 505, Kindle DXG and Samsung Galaxy Note
Hi

My advice:

1 - Scan your book with the scanner's own scanning tool and save in TIFF format. Every such tool has a rectangular "just scan this part" selection; use it to define your scan page. Scan one page, preview, move the rectangular mark to the next page, and scan. Go on to the next two pages and repeat the process. This is slower than scanning two pages at a time, but you get your pages cropped the way you want and turned the correct way up (so no problems with some pages in the right orientation and others upside down);

2 - OCR with Omnipage and create your PDF.

It will seem slower, but in the long term it becomes faster and you are in control of the whole process.

Best regards,
#11
Fanatic
Posts: 584
Karma: 914
Join Date: Mar 2008
Device: iliad
I haven't got acceptable results with OCR... every page requires a lot of manual work until the text is correct. At least the way OmniPage works for me, it's not an option. (I did try the Google OCR as well; it was even worse.) OCR helps because it's faster than typing, but by the time I had a whole book OCRed I could easily have read it on conventional paper.

I don't think scanning with two rectangles will give better image-based PDFs; I think we have solved this quite nicely here: scan the whole page at once and let the script pick it apart! I guess which method is faster depends on the size of the text bundle you want to digitize. Even scanning two pages at once, it took me two afternoons to scan this 350-page book. I don't want it to take any longer...
#12
Guru
Posts: 860
Karma: 4380
Join Date: Feb 2008
Location: Almada, Portugal
Device: Cybook Gen3, Sony PRS 505, Kindle DXG and Samsung Galaxy Note
Hi

I'm sorry, I misunderstood. You said that you scan into Omnipage, and Omnipage is an OCR program, so I thought you were running OCR on your images and then building the PDF. If you are just making image PDFs, I still think using the scanner's own scanning interface is faster and easier than using Omnipage.

Best regards,
#13
Fanatic
Posts: 584
Karma: 914
Join Date: Mar 2008
Device: iliad
#14
Fanatic
Posts: 584
Karma: 914
Join Date: Mar 2008
Device: iliad
If anybody is interested, I optimized my script. Some computation time is now saved by doing the processor-intensive tasks only once, using an intermediate file, and the resulting PDF now has exactly the size of the images.

This time I converted a book I scanned a year ago, back then in grayscale. It required slightly different conversion parameters to look nice again, but the end result is approximately the same, so in my experience it doesn't matter whether the original scan is gray or monochrome. Note that you can watch the pages being built up by running "gthumb tmp" in another shell in the same directory, so you don't have to run through the whole script to see whether the pages will look the way you want. Oh yes, the script doesn't clean up; you'll have to do that by hand.

Code:
    #!/bin/bash
    # parameter: tif files are prefixed with this:
    NAME=turkle_
    # parameter: crop areas (WIDTHxHEIGHT+OffsetLeft+OffsetTop)
    CROPLEFT=750x1100+200+0
    CROPRIGHT=750x1100+1000+0
    # this embosses text for B/W scans, so it looks nice when resized on the iliad:
    #CONVERTOPTS="-trim -blur 1 -threshold 65534"
    # this embosses text for Grayscans, so it looks nice when resized on the iliad:
    CONVERTOPTS="-trim -threshold 45000 -gaussian 1x0.3 -threshold 55534"

    shopt -s extglob
    mkdir tmp

    #determine number of pages there..
    MAXPAGE=1
    while test -f $NAME*(0)$MAXPAGE.TIF; do
        let MAXPAGE++
    done
    let MAXPAGE--
    echo "converting $MAXPAGE pages, this will take a while!"

    echo "STEP 1 OF 3: preparing pages"
    for ((i = 1; $i <= $MAXPAGE; i++)); do
        echo "@$i of $MAXPAGE"
        convert $CONVERTOPTS $NAME*(0)$i.TIF tmp/tmp.TIF
        convert -crop $CROPLEFT tmp/tmp.TIF tmp/p$((($i-1)*2)).TIF
        convert -crop $CROPRIGHT tmp/tmp.TIF tmp/p$((($i-1)*2+1)).TIF
        ALL="$ALL tmp/p$((($i-1)*2)).TIF tmp/p$((($i-1)*2+1)).TIF"
    done

    echo "STEP 2 OF 3: adjoining pages"
    echo ALL=$ALL
    tiff2ps $ALL > output.ps

    echo "STEP 3 OF 3: converting/compressing to PDF"
    DW=$(grep "^%%BoundingBox: " output.ps | cut -d' ' -f 4 | sed -e "s/[^0123456789]//g")
    DH=$(grep "^%%BoundingBox: " output.ps | cut -d' ' -f 5 | sed -e "s/[^0123456789]//g")
    ps2pdf -dDEVICEWIDTHPOINTS=$DW -dDEVICEHEIGHTPOINTS=$DH output.ps
    echo "done!"
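The page-size step relies on the `%%BoundingBox:` comment in the PostScript header. A slightly more defensive way to read it is sketched below, assuming the comment has the usual form `%%BoundingBox: llx lly urx ury` with integer values and that only the first numeric occurrence matters. The helper name `ps_page_size` and the sample header are made up for the demo; the real input would be output.ps.

```shell
#!/bin/bash
# Hypothetical helper: read PostScript on stdin and print "WIDTH HEIGHT" in
# points, taken from the first %%BoundingBox comment that carries numbers
# (a deferred "%%BoundingBox: (atend)" line is skipped).
ps_page_size() {
    awk '/^%%BoundingBox: *-?[0-9]/ {
             # fields: $2=llx $3=lly $4=urx $5=ury
             print $4 - $2, $5 - $3
             exit
         }'
}

# Made-up sample header standing in for the first lines of output.ps.
sample_header='%!PS-Adobe-3.0
%%Creator: demo
%%BoundingBox: 0 0 180 264
%%EndComments'

size=$(printf '%s\n' "$sample_header" | ps_page_size)
DW=${size% *}   # text before the space: width
DH=${size#* }   # text after the space: height

# Dry run: print the ps2pdf call rather than executing it.
echo "ps2pdf -dDEVICEWIDTHPOINTS=$DW -dDEVICEHEIGHTPOINTS=$DH output.ps"
```

Stopping at the first match keeps DW and DH single values even if the file happens to contain more than one bounding-box line.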
#15
Addict
Posts: 325
Karma: 1725
Join Date: Dec 2007
Location: Münster, Germany
Device: iRex iLiad v2
Hey Axel,

thanks for the new script, I'll give it a try as soon as I have to digitize another p-book. I'm glad we have some really fast scanning/copying units at our university; scanning a page is about 3 times faster than with a typical home scanner. Scanning a ~250-page book took me only about 30 minutes.

I'm already thinking about a way to create a GUI for it. Generally I would just show a dialog for the rectangle selection and the convert parameters, with a preview for both. After setting the appropriate options I would feed them to the script to do the processing steps... do you have any other brainstorming ideas?
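One way a GUI could feed its settings to the script is through command-line options, so the crop rectangles and convert parameters no longer need to be edited in the file. The sketch below uses bash's built-in `getopts`; the option letters `-l`, `-r`, `-o` and the function name `parse_options` are assumptions for illustration, and the defaults are simply values copied from the scripts in this thread.

```shell
#!/bin/bash
# Hypothetical option parsing: a front end would call the script as
#   script.sh -l 750x1100+200+0 -r 750x1100+1000+0 -o "-trim -blur 1"
# Defaults are values taken from the scripts posted in this thread.
parse_options() {
    CROPLEFT=750x1100+200+0
    CROPRIGHT=750x1100+1000+0
    CONVERTOPTS="-trim -blur 1 -threshold 65534"
    OPTIND=1   # reset so the function can be called more than once
    while getopts "l:r:o:" opt "$@"; do
        case $opt in
            l) CROPLEFT=$OPTARG ;;
            r) CROPRIGHT=$OPTARG ;;
            o) CONVERTOPTS=$OPTARG ;;
            *) echo "usage: $0 [-l geometry] [-r geometry] [-o convert-options]" >&2
               return 1 ;;
        esac
    done
}

# Demo: override only the left rectangle and keep the other defaults.
parse_options -l 700x1000+150+20
echo "left:  $CROPLEFT"
echo "right: $CROPRIGHT"
echo "opts:  $CONVERTOPTS"
```

With this in place, a preview in the GUI could call the same script on a single scan with the currently selected values before processing the whole book.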