Quote:
Originally Posted by cacapee
Have you looked into unpaper?
I was in touch with the author a few weeks ago to learn how to pipe JPG images to unpaper and save the results as JPG as well. His advice worked on Linux, but I have not yet managed to reproduce it in DOS (the DOS version packaged with pdfread does work well with ppm/pbm/pgm files, however).
Both versions, Linux and DOS, work nicely in batch mode. After some experimenting with the parameters, one can clean black spots, lines and blobs from the image, leaving masks (blocks of text or image) surrounded by whitespace. His algorithm for conversion to black and white is based on the threshold method, which is not sufficient for poor-quality originals. Keep in mind that even when the cleaning parameters are tuned to clean one page, the same settings may damage other pages by removing text as well.
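To show why a plain threshold struggles with poor originals, here is a minimal Python sketch. It is not unpaper's actual code, just a generic global threshold; the function name, the cutoff of 128 and the tiny sample "scans" are my own assumptions for illustration.

```python
def threshold(gray, cutoff=128):
    """Global threshold: pixels darker than cutoff become black (0), the rest white (255).

    `gray` is a list of rows of grayscale values in the range 0-255.
    The cutoff of 128 is an arbitrary illustrative choice.
    """
    return [[0 if px < cutoff else 255 for px in row] for row in gray]


# Clean scan: dark text (30) on bright paper (230).
clean = [[230, 30, 230],
         [230, 30, 230]]

# Poor scan: the right-hand column lies in shadow, so the paper there (110)
# is darker than the cutoff even though it carries no text.
poor = [[230, 30, 110],
        [230, 30, 110]]

clean_bw = threshold(clean)
poor_bw = threshold(poor)

print(clean_bw)
print(poor_bw)
```

On the clean scan only the text column turns black. On the poor scan the shadowed paper turns black too, which is the kind of damage a single cutoff cannot avoid.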
A nice feature of unpaper is splitting of double pages and replacing the dark shadow between the pages or at the margins with whitespace.
I asked the author to consider trimming the whitespace automatically once the program has recognized the masks. He may do that in the future, but not too soon.
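What I have in mind could look roughly like the Python sketch below. This is only my assumption of how such trimming might work, not anything unpaper does today: it takes an already cleaned black-and-white page and crops it to the bounding box of the non-white pixels. The function name and the sample page are hypothetical.

```python
def trim_whitespace(bw, white=255):
    """Crop a black-and-white page to the bounding box of its non-white pixels.

    `bw` is a list of equal-length rows; returns [] if the page is entirely white.
    """
    # Rows and columns that contain at least one non-white pixel.
    rows = [y for y, row in enumerate(bw) if any(px != white for px in row)]
    if not rows:
        return []
    cols = [x for x in range(len(bw[0])) if any(row[x] != white for row in bw)]

    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in bw[top:bottom + 1]]


# A 4x5 page with a small block of "text" surrounded by white margins.
page = [
    [255, 255, 255, 255, 255],
    [255,   0,   0, 255, 255],
    [255, 255,   0, 255, 255],
    [255, 255, 255, 255, 255],
]

trimmed = trim_whitespace(page)
print(trimmed)
```

Note that this only works well after cleaning: a single leftover black speck near the edge would stretch the bounding box back out to the margin.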
For now, it is a good free preprocessing tool for pdflrf.