Does anyone have a method for detecting/removing lines in scans WITHIN text?
I think this would really help towards cleaning up PDF scans from Archive.org and elsewhere.
The Problem
(These sample images were cleaned up manually: Before = ScanTailor, After = Manual Cleanup.)
"Easy" Lines
These types of lines occur where the line barely intersects the text:
"Hard" Lines
The worst, though, are lines that run through the middle of the text.
What I would ultimately aim for is Before/After:
Potential Solutions
This is some of the research I've done so far:
#1: Hough Lines (ImageMagick)
To detect lines within images, I ran across "Hough-Line Detector" in the ImageMagick Manual.
This site+script shows how Hough Lines can detect lines within images (photographs):
http://www.fmwconcepts.com/imagemagick/houghlines/
https://www.imagemagick.org/discours...ic.php?t=25476
Towards the bottom of the fmwconcepts link, it shows how to detect the horizon (or a fence) by inverting the image.
Maybe there's a way to use inversion to detect vertical lines more easily:
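To make the idea concrete, here is a rough pure-Python sketch of a Hough-style vote restricted to near-vertical angles. This is NOT the ImageMagick implementation, just an illustration of the technique. It assumes the page has already been binarized into a list of rows where 1 = ink and 0 = background (i.e. the "inverted" view, with the lines as the bright pixels). The function name and the parameters (max_tilt_deg, step_deg, min_votes) are made up for this example.

```python
import math

def hough_vertical_lines(img, max_tilt_deg=2.0, step_deg=0.5, min_votes=None):
    """Vote for near-vertical lines in a binary image (1 = ink, 0 = background).

    Uses the normal form rho = x*cos(theta) + y*sin(theta), where theta = 0
    is a perfectly vertical line at x = rho.
    Returns a list of (votes, theta_deg, rho), strongest candidates first.
    """
    height = len(img)
    if min_votes is None:
        # Assumption: a "real" line covers at least 60% of the page height.
        min_votes = int(0.6 * height)

    n_steps = int(round(2 * max_tilt_deg / step_deg))
    angles = [-max_tilt_deg + i * step_deg for i in range(n_steps + 1)]

    results = []
    for theta_deg in angles:
        theta = math.radians(theta_deg)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        votes = {}
        # Every ink pixel votes for the line (at this angle) passing through it.
        for y, row in enumerate(img):
            for x, ink in enumerate(row):
                if ink:
                    rho = int(round(x * cos_t + y * sin_t))
                    votes[rho] = votes.get(rho, 0) + 1
        for rho, count in votes.items():
            if count >= min_votes:
                results.append((count, theta_deg, rho))

    results.sort(reverse=True)
    return results
```

Ordinary text rarely stacks up enough pixels along one near-vertical path to pass the vote threshold, so in principle only the scanner line should survive. On real scans the threshold and the angle range would need tuning, and neighbouring angles will often report the same line more than once.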
Related: Here's a different forum post that uses Hough Lines to detect page boundaries:
http://www.imagemagick.org/discourse...ic.php?t=31321
(Although I find ScanTailor already does a decent enough job at page boundary removal.)
#2: ImageMagick (Removing Lines Outside of Text)
Some of these examples show removing horizontal/vertical lines... but these are straight lines with no breaks in between:
Remove Vertical Lines for Pre OCR (Tesseract)
These show removing table borders from an image:
https://stackoverflow.com/questions/...s-programmatic
https://stackoverflow.com/questions/...ize-from-image
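For reference, the general idea behind those table-border removals (as far as I understand it) is to find long unbroken runs of dark pixels in one direction and erase them. Here is a minimal pure-Python sketch of that idea for vertical runs. It is not taken from the linked answers; the function name and the min_run parameter are my own, and it assumes a binarized image stored as a list of rows with 1 = ink and 0 = background.

```python
def remove_long_vertical_runs(img, min_run):
    """Erase every unbroken vertical run of ink that is at least min_run pixels tall.

    img is a list of rows (1 = ink, 0 = background). A new image is returned;
    the input is left untouched.
    """
    height = len(img)
    width = len(img[0])
    out = [row[:] for row in img]

    for x in range(width):
        y = 0
        while y < height:
            if img[y][x]:
                start = y
                # Walk down to the end of this run of ink.
                while y < height and img[y][x]:
                    y += 1
                if y - start >= min_run:
                    for yy in range(start, y):
                        out[yy][x] = 0
            else:
                y += 1
    return out
```

This shows exactly why the approach falls short for lines WITHIN text: wherever the line crosses a letter, the run simply swallows the letter's pixels in that column, and wherever the line is broken or faded, the run can drop below min_run and get left behind.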
#3: Baseline Detection + Vertical Line Detection
Perhaps there is a way to combine "baseline detection" with a "vertical line detector"... to at least delete MOST of the vertical line automatically.
Even this would be a huge help over complete manual cleanup.
(I roughly put these images together in GIMP. Still haven't figured out how to get ImageMagick to do this.)
Step 1. Detect Baselines + Expand:
Step 2. Detect Vertical Line (potentially use Hough Lines?):
Step 3. Automatically remove the parts of the vertical line that DON'T overlap the baselines:
While not as good as manual cleanup, this would make the text much more readable + OCR-friendly. See Original + "Automatic" + Manual pages side-by-side:
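The three steps above could be sketched roughly like this in pure Python. Big caveats: this is only an illustration, not something I have running on real scans. Instead of true baseline detection it uses a simple horizontal projection (rows with enough ink count as "text rows"), and instead of Hough Lines it uses a per-column ink count. All function names and parameters (min_ink, pad, min_fraction) are hypothetical, and the image is assumed to be binarized as a list of rows with 1 = ink and 0 = background.

```python
def text_band_rows(img, min_ink, pad):
    """Step 1: mark rows that contain text, then expand each band by pad rows."""
    height = len(img)
    has_text = [sum(row) >= min_ink for row in img]
    expanded = [False] * height
    for y, flag in enumerate(has_text):
        if flag:
            for yy in range(max(0, y - pad), min(height, y + pad + 1)):
                expanded[yy] = True
    return expanded

def line_columns(img, min_fraction):
    """Step 2: find columns that are ink for at least min_fraction of the height."""
    height = len(img)
    width = len(img[0])
    cols = []
    for x in range(width):
        count = sum(img[y][x] for y in range(height))
        if count >= min_fraction * height:
            cols.append(x)
    return cols

def erase_line_outside_text(img, min_ink=3, pad=1, min_fraction=0.8):
    """Step 3: erase the detected line columns, but only in rows outside the text bands."""
    protected = text_band_rows(img, min_ink, pad)
    cols = line_columns(img, min_fraction)
    out = [row[:] for row in img]
    for y, is_text in enumerate(protected):
        if not is_text:
            for x in cols:
                out[y][x] = 0
    return out
```

The line is left alone wherever it crosses a text band, so the letters stay intact and only the stretches between the text lines get cleaned. A faded or slightly tilted line would not fill a single column, so on real pages Step 2 would likely need something more tolerant (such as Hough Lines).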
Related: Here's ImageMagick being used to add Line Numbers to an image. Perhaps something from here can help detect baselines.
* * * *
Hoping everyone can put their minds together and come up with some solution.