processing scanned data into nice pdfs.

axel77 · 03-06-2008, 05:57 PM

Hi people! Now that I've gotten comfortable with my iLiad, I need to fill it with content.

As I want to use it for academic purposes, there are 3 major sources of data: PDFs you can download, books you scan yourself, and course readers you get already in scanned format.

I have now scanned a whole book. For my latest course, they also gave me the PDFs they had scanned so I could use them on the iLiad; normally they print the course reader and sell it to students at copying cost.

However, the problem is formatting. When I scanned the book, for example, I put it face down on the scanner, with 2 pages fitting on one A4 scan. Now I have all the .pngs (I used OmniPage to scan and export them), but I need to split the images and rotate them.

So is there any tool that can comfortably clip/rotate a whole batch of images? Ideally one should be able to specify the settings on one page, since the clipping region differs from book to book (or from one scanned paper to the next) but usually stays the same throughout a whole title.

So far I have gotten my first 10-page article readable on the iLiad without zooming/panning woes. I exported the PDF I got into a set of PNGs, opened all of them in Gimp, copy/pasted the pages into new images, rotated them, and saved them. Then I imported them back into OmniPage, printed to PDF, and downloaded the result to the iLiad. Voilà!

This was doable for a 10-page article, but it took an hour (including figuring some things out). I could have read the article on printed paper in that time instead.

So, does anyone know of a tool, or know how to do such tasks? I guess it is something many people will run into when using their iLiad...

-Thomas- · 03-06-2008, 07:07 PM

I faced a very similar problem after scanning a book on our university's scanner system. I didn't find suitable software for that purpose, so I simply wrote some myself.

It's basically a Linux bash script that extracts the two pages from the PDF, does some image manipulation such as contrast and gamma correction (using ImageMagick), and writes back a PDF file with the optimized version.

If you're interested and have a Linux box around, I will upload a pre-alpha version of it...

dcalder · 03-06-2008, 09:09 PM

Two options come to mind, one free and the other not.

The free option (for Windows only) would be Irfanview. Its batch conversion option can be used to crop to specified coordinates, rotate, etc, and it can automatically name the resulting files based on a template. One pass through your files would give you the "odd" pages and a second, with a different set of cropping coordinates, would give you the "even" pages.

The commercial software that I'd recommend is Vuescan, from Hamrick Software, which is available for Windows, Mac OS-X, and Linux. Despite its name, it isn't just for use with scanners - it will treat files from disk as if they were from a scanner. Assuming that your files are of uniform size, you can set the area you want it to "scan" on one file and tell it to process the whole directory, naming the resulting files automatically. You could definitely do two runs through the files, one for odd pages and a second for even pages, but I think you might also be able to use the multi-crop option to get both pages in one pass through the files. The only catch here is that I don't think Vuescan supports PNG, so you might have to convert to TIFF or JPEG to process the files (Irfanview will easily batch-convert files).

I love my Vuescan Professional software; I use it to scan everything to either RAW TIFF or DNG (digital negative) format, then use Vuescan again at a later date to process the raw files to produce a multi-page PDF or whatever I need. It's a truly outstanding piece of software with exceptionally good licensing terms (you get the right to install it on up to three computers for a very reasonable fee).
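
The two-pass approach described above comes down to two crop rectangles, one for left-hand pages and one for right-hand pages. Here is a minimal bash sketch of that arithmetic. It assumes the scan is split exactly down the middle and uses the WIDTHxHEIGHT+X+Y geometry notation that ImageMagick understands; the function name split_geometry and the 3600x2400 scan size are made up for illustration.

```bash
#!/bin/bash
# Print two crop rectangles (left half, then right half) for a scan
# of the given pixel size. Assumes the book gutter sits exactly in
# the middle of the scan, which real scans rarely do; adjust by hand.
split_geometry() {
    local height=$2
    local half=$(( $1 / 2 ))
    echo "${half}x${height}+0+0"        # left-hand page
    echo "${half}x${height}+${half}+0"  # right-hand page
}

split_geometry 3600 2400
```

In practice the rectangles would be nudged by eye on one sample scan and then reused for the whole book.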

axel77 · 03-07-2008, 03:02 AM

-thomas-, yes I'm interested and have linux around everywhere.

dcalder, thanks I will check out both!

-Thomas- · 03-08-2008, 06:08 PM

OK, here is a bash script for converting scanned PDFs into single-page PDFs. Actually, it's more the concept of a bash script, as there is no error checking at all. So if one of the commands exits with an error, the script will carry on as if nothing happened.

Usage instructions are inside the script, so please read the comments :-)

Have fun!
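
For anyone who wants the missing error checking, one possible pattern is to wrap every command in a small helper that stops the script on the first failure. This is only a sketch and not part of the posted script; the helper names die and step are invented here, and the demo commands stand in for the real conversion steps.

```bash
#!/bin/bash
# Print a message to stderr and stop the whole script.
die() {
    echo "error: $*" >&2
    exit 1
}

# Run one command with its arguments; abort if it reports failure.
step() {
    "$@" || die "command failed: $*"
}

# Demo: harmless stand-ins for the real processing steps.
workdir=$(mktemp -d) || die "cannot create a work directory"
step mkdir -p "$workdir/pages"
step touch "$workdir/pages/p1.TIF"
echo "all steps succeeded in $workdir"
rm -r "$workdir"
```

An explicit helper like this behaves the same no matter where it is called from, which is why the sketch uses it rather than relying on shell-wide options.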

axel77 · 03-17-2008, 09:43 AM

-Thomas-, thanks, I used it and it works very well! I will continue to use it!

(Not that I wouldn't dream of a GUI where you can select the rectangles visually, but this will work well for now!)

axel77 · 03-17-2008, 02:03 PM

Oh dear, I just noticed that the quality of the text/images is greatly reduced in this split-up PDF compared to the original PDF zoomed to fit the screen. I can hardly read it anymore.

I guess there are too many compression/decompression steps going on in this process.

Well, I will see how it looks when just using "convert" to manipulate the TIFFs and -adjoin them into a PDF. BTW, "convert" can crop too!

-Thomas- · 03-17-2008, 05:02 PM

Quote:

Originally Posted by axel77

Oh dear, I just noticed that the quality of the text/images is greatly reduced in this split-up PDF compared to the original PDF zoomed to fit the screen. I can hardly read it anymore.

I guess there are too many compression/decompression steps going on in this process.

Does it happen if you set CONVERTOPTIONS="" (no image converting)? Maybe it has to do with the dpi of the generated images; I think there is a switch for convert to increase the resolution...
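
On the resolution question: ImageMagick's -density option sets the dpi used when a PDF is rasterised. A rough way to pick a value is to divide the pixel width you want by the page width in inches. The sketch below does that in whole numbers and only prints a command line; the function name density_for, the 2480-pixel target, the 210 mm page width and the file names are all assumptions for illustration.

```bash
#!/bin/bash
# Smallest whole dpi at which a page of width_mm millimetres
# rasterises to at least target_px pixels across (1 inch = 25.4 mm).
density_for() {
    local target_px=$1 width_mm=$2
    echo $(( (target_px * 254 + width_mm * 10 - 1) / (width_mm * 10) ))
}

dpi=$(density_for 2480 210)
# Only print the command; input.pdf and the output pattern are placeholders.
echo "convert -density $dpi input.pdf page-%03d.tif"
```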

Quote:

Well, I will see how it looks when just using "convert" to manipulate the TIFFs and -adjoin them into a PDF. BTW, "convert" can crop too!

Hmm, I remember there was a reason why I didn't let ImageMagick convert my TIFFs to PDF directly... maybe no TIFF compression support or something?

If you make some improvements or have an interesting hack for the script please let me know... A GUI for it is on my list too, but at the moment I don't have enough time to code it...

axel77 · 03-18-2008, 09:35 AM

You are right! convert -adjoin creates an uncompressed PDF, 1.2 GB here for a 350-page book! (Besides taking an eternity.)

I experimented a lot over the last few days to convert a book I scanned; below is the script that gives me a result I'm quite pleased with.

I scanned the book at 300 dpi in black & white. I don't know whether grayscale would have given better results; at the time I thought the iLiad works well with black and white, so that's what I wanted... Maybe in future one should pick other settings?

I scanned with OmniPage and exported the images as uncompressed TIFFs so as not to lose any quality. The problem was that the strokes of the letters in the scanned images are sometimes only 1 or 2 pixels wide, and when the iLiad zooms out to fit the screen, it gets quite ugly. So I apply a very small Gaussian blur and then turn everything that has become even a bit gray back to pitch black. This gives a nice, slightly fatter font that looks clean when zoomed.

tiff2ps conveniently takes multiple images and joins them into one PS file. Strangely enough, tiff2pdf cannot do that (one would think the same command processor is behind both, but oh well). In the last step, ps2pdf automatically compresses the images.

The remaining problem I have is PAPERSIZE in ps2pdf. It defaults to A4, and I set it to A5 because that approximately matches the pages of this book. I'd like either to set the size in pixels (but it takes units of 1/72 inch as its argument instead) or, ideally, to have it keep the size of the trimmed images. Dunno how to do that.
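
The 1/72 inch units are PostScript points, so a pixel size converts as points = pixels * 72 / dpi. A small sketch of that conversion for the 300 dpi scans and the 1800x2200 crop used in this post; the function name px_to_pt is made up, and since -trim changes the real image size the numbers are only approximate. The command is printed rather than run, using Ghostscript's DEVICEWIDTHPOINTS and DEVICEHEIGHTPOINTS parameters.

```bash
#!/bin/bash
# Convert a pixel count to PostScript points (1 pt = 1/72 inch),
# rounded to the nearest whole point.
px_to_pt() {
    local px=$1 dpi=$2
    echo $(( (px * 72 + dpi / 2) / dpi ))
}

W=$(px_to_pt 1800 300)
H=$(px_to_pt 2200 300)
# Print the resulting command instead of running it.
echo "ps2pdf -dDEVICEWIDTHPOINTS=$W -dDEVICEHEIGHTPOINTS=$H output.ps"
```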

The whole script still takes an hour on my 1.2 GHz notebook for this 350-page book, but it's worth it, as long as the result is something you want.

Code:

#!/bin/bash 
# parameter: tif files are prefixed with this:
NAME=schone_
# parameter: crop areas (WIDTHxHEIGHT+OffsetLeft+OffsetTop)
CROPLEFT=1800x2200+10+70
CROPRIGHT=1800x2200+1800+70
# this makes the text a bit bolder, so it looks nice when resized on the iliad:
CONVERTOPTS="-trim -blur 1 -threshold 65534"

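# extglob enables the *(0) pattern used below, which matches any number of leading zeros
shopt -s extglob
# -p: do not fail if tmp is left over from an earlier run
mkdir -p tmp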

#determine number of pages there..
MAXPAGE=1;
while test -f $NAME*(0)$MAXPAGE.TIF; do let MAXPAGE++; done
let MAXPAGE--

echo "converting $MAXPAGE pages, this will take a while!"

echo "STEP 1 OF 4: preparing left pages"
for ((i = 1; $i <= $MAXPAGE; i++)) do echo "@$i of $MAXPAGE"; convert -crop $CROPLEFT $CONVERTOPTS $NAME*(0)$i.TIF tmp/l$i.TIF; done

echo "STEP 2 OF 4: preparing right pages"
for ((i = 1; $i <= $MAXPAGE; i++)) do echo "@$i of $MAXPAGE"; convert -crop $CROPRIGHT $CONVERTOPTS $NAME*(0)$i.TIF tmp/r$i.TIF; done

echo "STEP 3 OF 4: adjoining pages"
ALL=
for ((i = 1; $i <= $MAXPAGE; i++)) do ALL="$ALL tmp/l$i.TIF tmp/r$i.TIF"; done
tiff2ps $ALL > output.ps

echo "STEP 4 OF 4: converting/compressing to PDF"
ps2pdf -sPAPERSIZE=a5 output.ps
echo "done!"
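
The page counting in the script relies on bash's extglob pattern *(0), which matches zero or more "0" characters, so files named PREFIX1.TIF, PREFIX01.TIF or PREFIX001.TIF are all found. Below is a standalone sketch of the same idea that can be tried without any scans. The function name count_pages and the scan_ prefix are made up, and nullglob is an addition here so that a missing page yields an empty match.

```bash
#!/bin/bash
# extglob must be switched on before the pattern below is parsed.
shopt -s extglob nullglob

# Count consecutively numbered files PREFIX<zeros><n>.TIF, starting at n=1.
count_pages() {
    local prefix=$1 n=0 matches
    while :; do
        matches=( "$prefix"*(0)"$((n + 1))".TIF )
        (( ${#matches[@]} > 0 )) || break
        n=$((n + 1))
    done
    echo "$n"
}

# Demo with empty placeholder files in a temporary directory.
demo=$(mktemp -d)
touch "$demo/scan_001.TIF" "$demo/scan_002.TIF" "$demo/scan_003.TIF"
count_pages "$demo/scan_"   # prints 3
rm -r "$demo"
```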

DDHarriman · 03-18-2008, 04:03 PM

Hi

My advice:

1 - Scan your book with the scanner's own scanning tool and save in TIFF format. Every such tool has a rectangular “just scan this part” selection; use it to define your scan page. Scan one page, preview, move the rectangular “mark” to the next page and scan. Go to the next “2 pages” and repeat the process.
This is slower than scanning 2 pages at a time, but you get your pages cropped the way you want and turned the correct way up (so no problems with some pages in the right direction and others upside down);

2 - OCR with Omnipage and create your PDF.

It looks slower at first, but in the long run it becomes faster, and you are in control of the whole process.

Best regards,

axel77 · 03-18-2008, 07:17 PM

I haven't gotten acceptable results with OCR... every page requires a lot of manual work until the text is correct. At least the way OmniPage works for me, it's not an option. (And I did try the Google OCR too; it was even worse.) OCR helps because it's faster than typing the text in... but by the time I have a whole book OCRed, I could easily have read it on conventional paper.

I don't think scanning with 2 rectangles will give better image-based PDFs; I think we have solved this quite nicely here. Scan the whole double page at once, and let the script pick it apart!

I guess which method is faster depends on the size of the text bundle you want to digitize. Even scanning 2 pages at once, it still took me 2 afternoons to scan this 350-page book. I don't want it to take any longer...

DDHarriman · 03-19-2008, 06:37 AM

Hi

I’m sorry, I misunderstood.

You said that you scan into Omnipage, and Omnipage is an OCR program, so I thought you were running OCR on your images and then building the PDF.

If you are just making image PDFs, I still think using the scanner's own scanning interface is faster and easier than using Omnipage.

Best regards,

axel77 · 03-19-2008, 08:25 AM

Quote:

Originally Posted by DDHarriman

If you are just making image PDFs, I still think using the scanner's own scanning interface is faster and easier than using Omnipage.

OK, I'll try that.

axel77 · 03-19-2008, 09:38 AM

If anybody is interested, I optimized my script. Some computation time is now saved by doing the processor-intensive tasks only once, using an intermediate file, and the resulting PDF file now has exactly the size of the images.

This time I converted a book I scanned a year ago, back then in grayscale. It required slightly different conversion parameters to look nice again, but the end result is approximately the same, so in my experience it doesn't matter whether the original scan is gray or monochrome.

Note that you can watch the pages being built by running "gthumb tmp" in another shell in the same directory, so you don't have to run through the whole script to see whether the pages will look the way you want.

Oh yes, the script doesn't clean up after itself; you'll have to do that by hand.

(Honestly, I don't want it to, since often enough I only want to redo certain steps and comment the others out.)

Code:

#!/bin/bash 
# parameter: tif files are prefixed with this:
NAME=turkle_
# parameter: crop areas (WIDTHxHEIGHT+OffsetLeft+OffsetTop)
CROPLEFT=750x1100+200+0
CROPRIGHT=750x1100+1000+0
# this embosses text for B/W scans, so it looks nice when resized on the iliad:
#CONVERTOPTS="-trim -blur 1 -threshold 65534"
# this embosses text for Grayscans, so it looks nice when resized on the iliad:
CONVERTOPTS="-trim -threshold 45000 -gaussian 1x0.3 -threshold 55534"

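# extglob enables the *(0) pattern used below, which matches any number of leading zeros
shopt -s extglob
# -p: do not fail if tmp is left over from an earlier run
mkdir -p tmp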

#determine number of pages there..
MAXPAGE=1;
while test -f $NAME*(0)$MAXPAGE.TIF; do let MAXPAGE++; done
let MAXPAGE--

echo "converting $MAXPAGE pages, this will take a while!"

echo "STEP 1 OF 3: preparing pages"
for ((i = 1; $i <= $MAXPAGE; i++)) do
   echo "@$i of $MAXPAGE";
   convert $CONVERTOPTS $NAME*(0)$i.TIF tmp/tmp.TIF; 
   convert -crop $CROPLEFT tmp/tmp.TIF tmp/p$((($i-1)*2)).TIF; 
   convert -crop $CROPRIGHT tmp/tmp.TIF tmp/p$((($i-1)*2+1)).TIF;
   ALL="$ALL tmp/p$((($i-1)*2)).TIF tmp/p$((($i-1)*2+1)).TIF"; 
done

echo "STEP 2 OF 3: adjoining pages"
echo ALL=$ALL
tiff2ps $ALL > output.ps

echo "STEP 3 OF 3: converting/compressing to PDF"
DW=$(grep "^%%BoundingBox: " output.ps | cut -d' ' -f 4 | sed -e "s/[^0123456789]//g")
DH=$(grep "^%%BoundingBox: " output.ps | cut -d' ' -f 5 | sed -e "s/[^0123456789]//g")
ps2pdf -dDEVICEWIDTHPOINTS=$DW -dDEVICEHEIGHTPOINTS=$DH output.ps
echo "done!"
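
The grep/cut/sed lines in step 3 take the third and fourth numbers of the %%BoundingBox comment as width and height, which is fine as long as the box starts at 0 0. A slightly more defensive sketch in pure bash is shown below: it subtracts the lower-left corner and skips a "(atend)" placeholder. The function name bbox_size and the sample file contents are made up for illustration, and the command is only printed.

```bash
#!/bin/bash
# Print "WIDTH HEIGHT" in points from the first numeric
# "%%BoundingBox: llx lly urx ury" line of a PostScript file.
bbox_size() {
    local tag llx lly urx ury
    while read -r tag llx lly urx ury; do
        if [[ $tag == "%%BoundingBox:" && $llx =~ ^-?[0-9]+$ ]]; then
            echo "$(( urx - llx )) $(( ury - lly ))"
            return 0
        fi
    done < "$1"
    return 1
}

# Demo on a tiny hand-written header instead of a real output.ps.
sample=$(mktemp)
printf '%%!PS-Adobe-3.0\n%%%%BoundingBox: 0 0 432 528\n' > "$sample"
size=$(bbox_size "$sample")
DW=${size% *}
DH=${size#* }
echo "ps2pdf -dDEVICEWIDTHPOINTS=$DW -dDEVICEHEIGHTPOINTS=$DH output.ps"
rm "$sample"
```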

-Thomas- · 03-19-2008, 02:04 PM

Hey Axel,

thanks for the new script, I'll give it a try as soon as I have to e another p-book.

I'm glad we have some really fast scanning/copying units at our university; scanning a page is about 3 times faster than with a typical home scanner. A ~250-page book took me only about 30 minutes to scan.

I'm already thinking about a way to create a GUI for it. Generally I would just show a dialog for the rectangle selection and the convert parameters, with a preview for both. After setting the appropriate options, I would feed them to the script to do the processing steps... do you have any other ideas?



