Old 09-05-2019, 09:56 PM   #6
Tex2002ans
Wizard
Quote:
Originally Posted by roger64
Yes, I tried Scan Tailor Advanced (but it's last year's version now...) and I was disappointed: too complex for me, and unstable.
Looks exactly the same as Scan Tailor to me, just has a few optional tabs/buttons in some steps.

But maybe the Linux version is less stable. The Windows version for me has been solid as a rock (and much faster than all the previous ones, since it's heavily multi-threaded).

Quote:
Originally Posted by roger64
Some PDFs, though, are beyond repair (but that's the exception): https://gallica.bnf.fr/ark:/12148/bp...811.texteImage
Yuck, looks about as bad as mine... but yours can be solved!

I was able to follow most of the steps here:

Removing noise from scanned text document

Step 1

Get the PDF into PNGs:

Code:
convert -density 300 input.pdf output.png
Attached image: page22.png

Since that PDF is awful, and has enormous whitespace around it, I would suggest trimming:

Code:
convert -density 300 input.pdf -trim output.png
That would focus more on the text itself.

Attached image: [Step1]page22-trim.png

Alternate #1: You could also use the magick.exe command:

Code:
magick.exe -density 300 input.pdf -trim .\output-%d.png
ImageMagick ran out of memory on my end, so if you want to convert the PDF in pieces, you can adjust the [0-30] to fit whatever page numbers you want to export:

Code:
magick.exe -density 300 input.pdf[0-30] -trim .\output-%d.png
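You could also try capping ImageMagick's resource limits with -limit (just a guess on my end; the 1GiB/2GiB values are placeholders to tune). It's supposed to spill the pixel cache to disk instead of running out of memory, so it's slower, but it might let you do the whole PDF in one go:

Code:
magick.exe -limit memory 1GiB -limit map 2GiB -density 300 input.pdf -trim .\output-%d.png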
Side Note: I'm not sure why it's only trimming vertically; there's probably another way to crop the left/right whitespace too, which would also speed up the later steps.
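If the left/right margins are surviving because they're full of faint scanner noise rather than pure white, adding a -fuzz tolerance before -trim might do it (untested guess for this scan; the 15% is only a starting value to tune, and +repage just resets the page offsets after the trim):

Code:
convert -density 300 input.pdf -fuzz 15% -trim +repage output.png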

Step 2

Now, I followed much of that forum post above.

Code:
convert output.png -connected-components 4 -threshold 0 -negate output-negate.png
Step 3

It seems like area-threshold looks for "chunks of pixels that are X pixels or less".

I tested with area-threshold=30:

Code:
convert output.png -define connected-components:area-threshold=30 -connected-components 4 -threshold 0 -negate output-cc30.png
but I found that this PDF needed more. So I adjusted by 10s all the way up to 80:

Code:
convert output.png -define connected-components:area-threshold=80 -connected-components 4 -threshold 0 -negate output-cc80.png
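If you want to spit out a test image for every value in one go, an untested loop like this should do it from a Windows cmd prompt (double the % signs if you put it inside a .bat file):

Code:
for %T in (30 40 50 60 70 80) do convert output.png -define connected-components:area-threshold=%T -connected-components 4 -threshold 0 -negate output-cc%T.png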
Attached image: [Step3]page22-diff.png

Step 4

Then I was able to take the images from Step 1 + Step 3 and create a diff:

Code:
convert output.png output-cc80.png -compose minus -composite output-diff.png
Attached image: [Step4]page22-cc80.png

Step 5

Then use the images from Step 1 + Step 4 to remove the speckles from the original:

Code:
convert output.png ( -clone 0 -fill white -colorize 100% ) output-diff.png -compose over -composite output-diff-composite.png
Here's the Original (Step 1) + Diff (Step 3) + Cleaned (Step 5):

Attached images: [Step1]page22-trim.png, [Step3]page22-diff.png, [Step5]page22-diff-composite.png

Finalized

Here are a few more before/after pages out of the book:

Attached images: [Before]page10.png, [After]page10-diff-composite.png
Attached images: [Before]TOC.png, [After]TOC-diff-composite.png

I attached a ZIP with Windows .bat files that batch convert the images using these steps. It's a giant mess, and it does create a lot of blank/duplicate images, but it chugs through everything eventually. I already spent hours writing this tutorial up, and don't feel like debugging the rest.
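The core of the idea is roughly this (a simplified sketch, NOT the attached script; it assumes the Step 1 pages are named output-0.png, output-1.png, ... and sit in the current folder):

Code:
@echo off
rem Sketch only: run Steps 3-5 on every trimmed page PNG.
for %%F in (output-*.png) do (
  convert "%%F" -define connected-components:area-threshold=80 -connected-components 4 -threshold 0 -negate "%%~nF-cc80.png"
  convert "%%F" "%%~nF-cc80.png" -compose minus -composite "%%~nF-diff.png"
  convert "%%F" ^( -clone 0 -fill white -colorize 100%% ^) "%%~nF-diff.png" -compose over -composite "%%~nF-clean.png"
)
rem The generated -cc80/-diff/-clean files also match output-*.png, so re-running will re-process them (same quirk as the attached .bat).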

But hopefully that'll get you much cleaner input into Scan Tailor + better OCR.
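If you'd rather have the cleaned pages back as a single PDF instead of loose PNGs, ImageMagick can bundle them too (untested; filenames assumed from the sketch above, and watch the page order, since output-10 sorts before output-2 unless the numbers are zero-padded):

Code:
magick.exe output-*-clean.png cleaned.pdf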
Attached Files
Tex.BAT.ImageMagick.Cleanup.BadSpeckleLines.zip (1.3 KB)
