10-12-2018, 05:46 PM   #1
Tex2002ans
Detecting/Removing Vertical Scanlines from Scans

Does anyone have a method for detecting/removing lines in scans WITHIN text?

I think this would really help towards cleaning up PDF scans from Archive.org and elsewhere.

The Problem

(These sample images were cleaned up manually: Before = ScanTailor, After = Manual Cleanup.)

"Easy" Lines

This type of line occurs where the line barely intersects the text:

Attached images: [Easy]Page.-.0033.png (before), [Cleaned]Page.-.0033.png (after)

"Hard" Lines

The worst, though, are lines that go through the middle of the text.

What I'm ultimately aiming for is this Before/After:

Attached images: [ScanTailor]Page.-.0247.png (before), [Manual]Page.-.0247.png (after)

Potential Solutions

This is some of the research I've done so far:

#1: Hough Lines (ImageMagick)

While looking for ways to detect lines within images, I ran across the "Hough-Line Detector" in the ImageMagick manual.

This site+script shows how Hough Lines can detect lines within images (photographs):

http://www.fmwconcepts.com/imagemagick/houghlines/
https://www.imagemagick.org/discours...ic.php?t=25476

Towards the bottom of the fmwconcepts link, it shows how to detect the horizon (or a fence) by inverting the image.

Maybe there's a way to use inversion to detect vertical lines more easily:

Attached image: Page.-.0247[Inversion].png
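
To make the Hough idea concrete, here is a minimal pure-Python sketch (not ImageMagick's implementation) that only votes for near-vertical angles. It assumes the page has already been binarized into a list of rows with 1 = ink; that is the same reason inversion matters, since Hough voting counts foreground pixels. The function name `hough_vertical` and its parameters are made up for illustration.

```python
import math

def hough_vertical(img, max_tilt_deg=2.0, step_deg=0.5):
    """Find the strongest near-vertical line in a binary image (1 = ink).

    Returns (votes, theta_deg, rho) where the line satisfies
    x*cos(theta) + y*sin(theta) = rho, or None if there is no ink.
    Assumption: img is a list of equal-length rows of 0/1 values.
    """
    n_steps = int(round(max_tilt_deg / step_deg))
    thetas = [i * step_deg for i in range(-n_steps, n_steps + 1)]
    votes = {}
    for y, row in enumerate(img):
        for x, px in enumerate(row):
            if not px:
                continue
            # Every ink pixel votes once per candidate angle.
            for t in thetas:
                rad = math.radians(t)
                rho = round(x * math.cos(rad) + y * math.sin(rad))
                votes[(t, rho)] = votes.get((t, rho), 0) + 1
    if not votes:
        return None
    # Most votes wins; on ties, prefer the angle closest to vertical.
    (t, rho), count = max(votes.items(),
                          key=lambda kv: (kv[1], -abs(kv[0][0])))
    return count, t, rho
```

Because each pixel votes independently, gaps in the line only lower the vote count rather than breaking detection, which is what makes this attractive for lines that are interrupted by text. On a full-size scan a pure-Python loop like this would be slow; it is only meant to show the voting logic.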

Related: And here's a different forum post using Hough Lines to detect page boundaries:

http://www.imagemagick.org/discourse...ic.php?t=31321

(Although I find ScanTailor already does a decent enough job at page boundary removal.)

#2: ImageMagick (Removing Lines Outside of Text)

Some of these examples show removing horizontal/vertical lines... but these are straight lines with no breaks in between:

Remove Vertical Lines for Pre OCR (Tesseract)

These show removing table borders from an image:

https://stackoverflow.com/questions/...s-programmatic

https://stackoverflow.com/questions/...ize-from-image
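
One common approach for unbroken strokes like table borders (not necessarily what the linked answers use) is a morphological opening with a tall, 1-pixel-wide structuring element, which amounts to keeping only long vertical runs of ink. A rough pure-Python sketch, assuming a binarized image as a list of rows with 1 = ink; `long_vertical_runs` and `min_run` are hypothetical names:

```python
def long_vertical_runs(img, min_run):
    """Return a mask keeping only ink in vertical runs of >= min_run pixels.

    Equivalent to a morphological opening with a 1-pixel-wide,
    min_run-pixel-tall structuring element.
    Assumption: img is a list of equal-length rows of 0/1 values.
    """
    height = len(img)
    width = len(img[0]) if img else 0
    mask = [[0] * width for _ in range(height)]
    for x in range(width):
        y = 0
        while y < height:
            if not img[y][x]:
                y += 1
                continue
            start = y
            # Walk to the end of this vertical run of ink.
            while y < height and img[y][x]:
                y += 1
            if y - start >= min_run:
                for yy in range(start, y):
                    mask[yy][x] = 1
    return mask
```

This also shows the limitation: a scan line with a single gap splits into two shorter runs, and if both fall below `min_run` the whole line is missed, while lowering `min_run` starts to catch tall glyph strokes.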

#3: Baseline Detection + Vertical Line Detection

Perhaps there is a way to combine "baseline detection" with a "vertical line detector"... to at least delete MOST of the vertical line automatically.

Even this would be a huge help over complete manual cleanup.

(I roughly put these images together in GIMP. Still haven't figured out how to get ImageMagick to do this.)

Step 1. Detect Baselines + Expand:

Attached images: Page.-.0247[Baselines].png, Page.-.0247[Baselines2].png
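
A possible stand-in for Step 1 is a horizontal projection profile: count the ink in each row, treat rows above a threshold as text, and pad each group of rows a little. This is a swapped-in technique rather than true baseline detection, and it assumes a deskewed, binarized page (list of rows, 1 = ink). `text_bands`, `min_ink_frac`, and `pad` are hypothetical names.

```python
def text_bands(img, min_ink_frac=0.1, pad=1):
    """Return [(top, bottom), ...] inclusive row ranges that look like text.

    A row counts as text when its ink covers at least min_ink_frac of the
    width. Assumption: the stray scan line is much thinner than that, so
    it does not turn every row into "text".
    """
    height = len(img)
    width = len(img[0]) if img else 0
    threshold = max(1, min_ink_frac * width)
    is_text = [sum(row) >= threshold for row in img]
    bands = []
    y = 0
    while y < height:
        if not is_text[y]:
            y += 1
            continue
        start = y
        while y < height and is_text[y]:
            y += 1
        # Expand the band by `pad` rows on each side, clamped to the page.
        top = max(0, start - pad)
        bottom = min(height - 1, y - 1 + pad)
        if bands and top <= bands[-1][1] + 1:
            bands[-1] = (bands[-1][0], bottom)  # merge touching bands
        else:
            bands.append((top, bottom))
    return bands
```

The `pad` value plays the role of the "expand" part of this step: larger values protect ascenders and descenders at the cost of leaving more of the scan line behind.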

Step 2. Detect Vertical Line (potentially use Hough Lines?):

Attached image: Page.-.0247[LineDetection].png
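
If the page is already deskewed so the scan line is truly vertical, a simpler alternative to Hough Lines is a column projection: flag any column whose ink spans most of the page height. A minimal sketch under that assumption (binarized image as a list of rows, 1 = ink; `line_columns` and `min_height_frac` are hypothetical names):

```python
def line_columns(img, min_height_frac=0.8):
    """Return x positions whose ink covers >= min_height_frac of the rows.

    Assumption: the scan line is vertical and runs most of the page
    height, while text columns only have ink inside the text lines.
    """
    height = len(img)
    if height == 0:
        return []
    width = len(img[0])
    needed = min_height_frac * height
    return [x for x in range(width)
            if sum(row[x] for row in img) >= needed]
```

A slightly tilted line would smear across several columns and drop below the threshold, which is where a Hough-style detector would be the better fit.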

Step 3. Automatically remove the parts of the vertical line that DON'T overlap the baselines:

Attached image: Page.-.0247[LineBaselineRemoveal].png
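
The removal itself can be sketched in a few lines of plain Python: given the x positions of the scan line and a list of (top, bottom) text bands, however those were obtained, blank the line everywhere except inside the bands. The function name and argument layout here are assumptions for illustration, with the image as a list of rows and 1 = ink.

```python
def erase_line_outside_text(img, columns, bands):
    """Return a copy of img with `columns` blanked outside the text bands.

    columns: x positions of the detected scan line (hypothetical input).
    bands:   inclusive (top, bottom) row ranges to leave untouched.
    """
    protected = set()
    for top, bottom in bands:
        protected.update(range(top, bottom + 1))
    cleaned = [row[:] for row in img]  # do not modify the input image
    for y, row in enumerate(cleaned):
        if y in protected:
            continue  # inside a text band: leave the line alone
        for x in columns:
            row[x] = 0
    return cleaned
```

Inside the bands the line is left untouched, so the glyphs it crosses are not damaged; those remaining fragments are the part that would still need manual cleanup.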

While not as good as manual cleanup, this would make the text much more readable + OCR-friendly. See Original + "Automatic" + Manual pages side-by-side:

Attached images (side by side): [ScanTailor]Page.-.0247.png (original), Page.-.0247[LineBaselineRemoveal].png ("automatic"), [Manual]Page.-.0247.png (manual)

Related: Here's ImageMagick being used to add Line Numbers to an image. Perhaps something from here can help detect baselines.

* * * *

Hoping everyone can put their minds together and come up with some solution.

Last edited by Tex2002ans; 10-12-2018 at 06:09 PM.