MobileRead Forums - View Single Post - Expert help required : Cleaning bad pdf scans

Hodapp87 · 02-10-2009, 07:59 PM

I would probably do something like...
1) Use pdfimages to extract all images.
2) Open a few images and figure out a color curve that pushes all text to black and most background to white. Save that curve as a gradient image that ImageMagick can grok. Figure out where to set some basic crop boxes. This assumes
3) Use ImageMagick to crop and split the whole series of images in one batch.
4) Use ImageMagick to apply the color curve to all the images.
5) Do something useful with the series of converted images... DjVu, OCR, whatever.
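
For what it's worth, steps 1, 3 and 4 could be scripted roughly like this. It's only a sketch under a few assumptions that aren't in the post: poppler's pdfimages (for the -png flag), an ImageMagick install that still answers to "convert" (newer ones use "magick"), each scan being a two-page spread so that "split" means cutting it in half, and the curve from step 2 saved as a gradient image to be used as a lookup table with -clut. The function names, folder layout and crop box are made up for illustration.

```python
import subprocess
from pathlib import Path


def extract_cmd(pdf, out_root):
    """Step 1: pdfimages writes out_root-000.png, out_root-001.png, ...
    (-png assumes the poppler build of pdfimages)."""
    return ["pdfimages", "-png", str(pdf), str(out_root)]


def crop_split_cmd(src, dst_pattern, crop_box, im="convert"):
    """Step 3: crop to crop_box (WxH+X+Y), then cut the result into a
    2x1 grid, i.e. left and right halves. Assumes two-page spreads."""
    return [im, str(src),
            "-crop", crop_box, "+repage",   # trim the margins
            "-crop", "2x1@", "+repage",     # split into two equal tiles
            str(dst_pattern)]               # %d in the name numbers the tiles


def apply_curve_cmd(src, curve, dst, im="convert"):
    """Step 4: remap tones through the gradient image saved in step 2."""
    return [im, str(src), str(curve), "-clut", str(dst)]


def run_pipeline(pdf, workdir, crop_box, curve, im="convert"):
    """Run steps 1, 3 and 4 over every image in the PDF."""
    work = Path(workdir)
    raw, pages, clean = work / "raw", work / "pages", work / "clean"
    for d in (raw, pages, clean):
        d.mkdir(parents=True, exist_ok=True)

    subprocess.run(extract_cmd(pdf, raw / "img"), check=True)

    for src in sorted(raw.glob("img-*.png")):
        dst = pages / f"{src.stem}-%d.png"
        subprocess.run(crop_split_cmd(src, dst, crop_box, im), check=True)

    for src in sorted(pages.glob("*.png")):
        subprocess.run(apply_curve_cmd(src, curve, clean / src.name, im),
                       check=True)
```

You'd call it as something like run_pipeline("scan.pdf", "work", "2400x3200+100+150", "curve.png"), where the crop box and curve.png are whatever you worked out by hand in step 2. If the scans are single pages, drop the second -crop. The cleaned images land in work/clean, ready for step 5.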
