Quote:
Originally Posted by roger64
Yes, I tried Scan Tailor Advanced (but it's last year's version now...) and I was disappointed: too complex for me, unstable.
Looks exactly the same as Scan Tailor to me, just has a few optional tabs/buttons in some steps.
But maybe the Linux version is less stable. The Windows version for me has been solid as a rock (and
much faster than all the previous ones, since it's heavily multi-threaded).
Quote:
Originally Posted by roger64
Yuck, looks about as bad as mine... but yours can be solved!
I was able to follow most of the steps here:
Removing noise from scanned text document
Step 1
Get the PDF into PNGs:
Code:
convert -density 300 input.pdf output.png
Since that PDF is awful, and has enormous whitespace around it, I would suggest trimming:
Code:
convert -density 300 input.pdf -trim output.png
That would focus more on the text itself.
Alternate #1: You could also use the magick.exe command:
Code:
magick.exe -density 300 input.pdf -trim .\output-%d.png
ImageMagick ran out of memory on my end, so if you want to convert the PDF in pieces, you can adjust the [0-30] to fit whatever page numbers you want to export (the numbering starts at 0, so [0-30] is the first 31 pages):
Code:
magick.exe -density 300 input.pdf[0-30] -trim .\output-%d.png
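If you'd rather not work out the ranges by hand, the chunking can be scripted. Here is a rough sketch in POSIX shell (so Linux, WSL, or Git Bash rather than a .bat file). The helper name page_ranges, the 70-page count, the chunk size of 31, and the chunkN- output prefix are all made up for illustration. It only prints the magick commands, so you can check them before running anything.

```shell
#!/bin/sh
# Print zero-based, inclusive page ranges ("start-end") that cover
# TOTAL pages in chunks of at most SIZE pages.
# page_ranges is a hypothetical helper, not part of ImageMagick.
page_ranges() {
  total=$1
  size=$2
  start=0
  while [ "$start" -lt "$total" ]; do
    end=$((start + size - 1))
    if [ "$end" -ge "$total" ]; then
      end=$((total - 1))
    fi
    echo "$start-$end"
    start=$((end + 1))
  done
}

# Assumed example: a 70-page PDF, 31 pages per run.
# Each chunk gets its own output prefix so runs cannot overwrite each other.
# The commands are only printed; drop the "echo" to actually run them.
for r in $(page_ranges 70 31); do
  echo magick -density 300 "input.pdf[$r]" -trim "chunk${r%%-*}-%d.png"
done
```

Smaller chunks mean less memory per run, at the cost of more runs.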
Side Note: Not sure why it's only cropping vertically. There's probably another method to crop the left/right whitespace as well, which would probably speed up the later steps too.
Step 2
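Now, I followed much of the post linked above. As far as I can tell, this labels the connected regions (the 4 means 4-connectivity), then thresholds and negates the result: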
Code:
convert output.png -connected-components 4 -threshold 0 -negate output-negate.png
Step 3
It seems like area-threshold looks for "chunks of pixels smaller than X pixels" and merges them into whatever surrounds them.
I tested with area-threshold=30:
Code:
convert output.png -define connected-components:area-threshold=30 -connected-components 4 -threshold 0 -negate output-cc30.png
but I found that this PDF needed more. So I adjusted by 10s all the way up to 80:
Code:
convert output.png -define connected-components:area-threshold=80 -connected-components 4 -threshold 0 -negate output-cc80.png
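To avoid retyping the command for every value, the sweep can be generated with a small loop. This is only a sketch in POSIX shell; sweep_thresholds is a hypothetical helper name, and it prints the commands instead of running them so you can look them over first.

```shell
#!/bin/sh
# Print one convert command per area-threshold value from START to STOP
# (inclusive) in steps of STEP.
# sweep_thresholds is a hypothetical helper, not an ImageMagick tool.
sweep_thresholds() {
  input=$1
  t=$2
  stop=$3
  step=$4
  while [ "$t" -le "$stop" ]; do
    echo convert "$input" \
      -define "connected-components:area-threshold=$t" \
      -connected-components 4 -threshold 0 -negate \
      "output-cc$t.png"
    t=$((t + step))
  done
}

# The values tried in this post: 30, 40, ... 80.
sweep_thresholds output.png 30 80 10
```

Piping the output into sh would run the commands, which should be fine as long as the filenames are simple (no spaces or special characters). Then compare the output-ccNN.png files by eye and pick the lowest threshold that removes the specks without eating punctuation.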
Step 4
Then I was able to take the images from Step 1 + Step 3 and create a diff:
Code:
convert output.png output-cc80.png -compose minus -composite output-diff.png
Step 5
Then use the images from Step 1 + Step 4 to remove the noise:
Code:
convert output.png ( -clone 0 -fill white -colorize 100% ) output-diff.png -compose over -composite output-diff-composite.png
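For anyone on Linux, here is a rough sketch of Steps 3 to 5 chained together for a single page in POSIX shell. The names clean_page, run, and DRY_RUN are invented for this example, and the output filenames just follow the pattern used above. One thing to watch: in a Unix shell the parentheses from Step 5 have to be quoted (or escaped), otherwise the shell tries to interpret them itself. By default the sketch only prints the commands.

```shell
#!/bin/sh
# Run (or, by default, just print) Steps 3-5 for one page image.
# clean_page, run and DRY_RUN are hypothetical names for this sketch.
DRY_RUN=${DRY_RUN:-1}

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

clean_page() {
  src=$1                # page image from Step 1, e.g. output-0.png
  t=$2                  # area-threshold value, e.g. 80
  base=${src%.png}
  # Step 3: connected-components pass with the chosen area-threshold
  run convert "$src" -define "connected-components:area-threshold=$t" \
    -connected-components 4 -threshold 0 -negate "$base-cc$t.png"
  # Step 4: diff between the Step 1 image and the Step 3 image
  run convert "$src" "$base-cc$t.png" -compose minus -composite "$base-diff.png"
  # Step 5: composite with the diff (parentheses quoted for the shell)
  run convert "$src" '(' -clone 0 -fill white -colorize 100% ')' \
    "$base-diff.png" -compose over -composite "$base-diff-composite.png"
}

clean_page output-0.png 80
```

Setting DRY_RUN=0 runs the commands for real. Wrapping the last line in a loop such as `for f in output-*.png; do clean_page "$f" 80; done` would process every page from Step 1.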
Here's the Original (Step 1) + Diff (Step 4) + Cleaned (Step 5):
Finalized
Here are a few more before/after pages out of the book:
I attached a ZIP with Windows .bat files that batch convert the images using these steps. It's a giant mess, and it does create a lot of blank/duplicate images, but it chugs through everything eventually. I already spent hours writing this tutorial up, and don't feel like debugging the rest.
But hopefully that'll get you much cleaner input into Scan Tailor + better OCR.