Old 09-05-2019, 09:56 PM   #6
Tex2002ans
Wizard
Quote:
Originally Posted by roger64
Yes, I tried Scan Tailor Advanced (but it's last year's version now...) and I was disappointed: too complex for me, and unstable.
Looks exactly the same as Scan Tailor to me, just has a few optional tabs/buttons in some steps.

But maybe the Linux version is less stable. The Windows version for me has been solid as a rock (and much faster than all the previous ones, since it's heavily multi-threaded).

Quote:
Originally Posted by roger64
Some PDFs, though, are beyond repair (but that's the exception): https://gallica.bnf.fr/ark:/12148/bp...811.texteImage
Yuck, looks about as bad as mine... but yours can be solved!

I was able to follow most of the steps here:

Removing noise from scanned text document

Step 1

Get the PDF into PNGs:

Code:
convert -density 300 input.pdf output.png
Attached image: page22.png

Since that PDF is awful, and has enormous whitespace around it, I would suggest trimming:

Code:
convert -density 300 input.pdf -trim output.png
That would focus more on the text itself.

Attached image: [Step1]page22-trim.png

Alternate #1: You could also use the magick.exe command:

Code:
magick.exe -density 300 input.pdf -trim .\output-%d.png
ImageMagick ran out of memory on my end, so if you want to convert the PDF in pieces, you can adjust the [0-30] to fit whatever page numbers you want to export:

Code:
magick.exe -density 300 input.pdf[0-30] -trim .\output-%d.png
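You could also try capping ImageMagick's resource limits with -limit (just a guess on my end; the 1GiB/2GiB values are placeholders to tune). It's supposed to spill the pixel cache to disk instead of running out of memory, so it's slower, but it might let you do the whole PDF in one go:

Code:
magick.exe -limit memory 1GiB -limit map 2GiB -density 300 input.pdf -trim .\output-%d.png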
Side Note: I'm not sure why it's only trimming vertically; there's probably another way to crop the left/right whitespace too, which would also speed up the later steps.
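If the left/right margins are surviving because they're full of faint scanner noise rather than pure white, adding a -fuzz tolerance before -trim might do it (untested guess for this scan; the 15% is only a starting value to tune, and +repage just resets the page offsets after the trim):

Code:
convert -density 300 input.pdf -fuzz 15% -trim +repage output.png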

Step 2

Now, I followed much of that forum post above.

Code:
convert output.png -connected-components 4 -threshold 0 -negate output-negate.png
Step 3

It seems like area-threshold looks for "chunks of pixels that are X pixels or less".

I tested with area-threshold=30:

Code:
convert output.png -define connected-components:area-threshold=30 -connected-components 4 -threshold 0 -negate output-cc30.png
but I found that this PDF needed more. So I adjusted by 10s all the way up to 80:

Code:
convert output.png -define connected-components:area-threshold=80 -connected-components 4 -threshold 0 -negate output-cc80.png
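If you want to spit out a test image for every value in one go, an untested loop like this should do it from a Windows cmd prompt (double the % signs if you put it inside a .bat file):

Code:
for %T in (30 40 50 60 70 80) do convert output.png -define connected-components:area-threshold=%T -connected-components 4 -threshold 0 -negate output-cc%T.png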
Attached image: [Step3]page22-diff.png

Step 4

Then I was able to take the images from Step 1 + Step 3 and create a diff:

Code:
convert output.png output-cc80.png -compose minus -composite output-diff.png
Attached image: [Step4]page22-cc80.png

Step 5

Then use the images from Step 1 + Step 4 to remove the speckles from the original:

Code:
convert output.png ( -clone 0 -fill white -colorize 100% ) output-diff.png -compose over -composite output-diff-composite.png
Here's the Original (Step 1) + Diff (Step 3) + Cleaned (Step 5):

Attached images: [Step1]page22-trim.png, [Step3]page22-diff.png, [Step5]page22-diff-composite.png

Finalized

Here are a few more before/after pages out of the book:

Attached images: [Before]page10.png, [After]page10-diff-composite.png
Attached images: [Before]TOC.png, [After]TOC-diff-composite.png

I attached a ZIP with Windows .bat files that batch convert the images using these steps. It's a giant mess, and it does create a lot of blank/duplicate images, but it chugs through everything eventually. I already spent hours writing this tutorial up, and don't feel like debugging the rest.
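The core of the idea is roughly this (a simplified sketch, NOT the attached script; it assumes the Step 1 pages are named output-0.png, output-1.png, ... and sit in the current folder):

Code:
@echo off
rem Sketch only: run Steps 3-5 on every trimmed page PNG.
for %%F in (output-*.png) do (
  convert "%%F" -define connected-components:area-threshold=80 -connected-components 4 -threshold 0 -negate "%%~nF-cc80.png"
  convert "%%F" "%%~nF-cc80.png" -compose minus -composite "%%~nF-diff.png"
  convert "%%F" ^( -clone 0 -fill white -colorize 100%% ^) "%%~nF-diff.png" -compose over -composite "%%~nF-clean.png"
)
rem The generated -cc80/-diff/-clean files also match output-*.png, so re-running will re-process them (same quirk as the attached .bat).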

But hopefully that'll get you much cleaner input into Scan Tailor + better OCR.
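If you'd rather have the cleaned pages back as a single PDF instead of loose PNGs, ImageMagick can bundle them too (untested; filenames assumed from the sketch above, and watch the page order, since output-10 sorts before output-2 unless the numbers are zero-padded):

Code:
magick.exe output-*-clean.png cleaned.pdf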
Attached Files
Tex.BAT.ImageMagick.Cleanup.BadSpeckleLines.zip (1.3 KB)
