10-12-2018, 05:46 PM | #1 |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Detecting/Removing Vertical Scanlines from Scans
Does anyone have a method for detecting/removing lines in scans WITHIN text?

I think this would really help with cleaning up PDF scans from Archive.org and elsewhere.

The Problem

(These sample images were cleaned up manually: Before = ScanTailor, After = Manual Cleanup.)

"Easy" Lines

These types of lines occur where the line barely intersects with text:

"Hard" Lines

The worst, though, are lines which go through the middle of text.

What I would ultimately aim for is Before/After:

Potential Solutions

This is some of the research I've done so far:

#1: Hough Lines (ImageMagick)

To detect lines within images, I ran across "Hough-Line Detector" in the ImageMagick Manual.

This site+script shows how Hough Lines can detect lines within images (photographs):

http://www.fmwconcepts.com/imagemagick/houghlines/
https://www.imagemagick.org/discours...ic.php?t=25476

Towards the bottom of the fmwconcepts link, it shows how to detect the horizon (or a fence) by inverting the image. Maybe there's a way to use inversion to detect vertical lines more easily:

Related: And here's a different forum post using Hough Lines to detect page boundaries:

http://www.imagemagick.org/discourse...ic.php?t=31321

(Although I find ScanTailor already does a decent enough job at page boundary removal.)

#2: ImageMagick (Removing Lines Outside of Text)

Some of these examples show removing horizontal/vertical lines... but these are straight lines with no breaks in between:

Remove Vertical Lines for Pre OCR (Tesseract)

These show removing table borders from an image:

https://stackoverflow.com/questions/...s-programmatic
https://stackoverflow.com/questions/...ize-from-image

#3: Baseline Detection + Vertical Line Detection

Perhaps there is a way to combine "baseline detection" with a "vertical line detector"... to at least delete MOST of the vertical line automatically. Even this would be a huge help over complete manual cleanup.

(I roughly put these images together in GIMP. Still haven't figured out how to get ImageMagick to do this.)

Step 1. Detect Baselines + Expand:

Step 2. Detect Vertical Line (potentially use Hough Lines?):

Step 3. Automatically remove the parts of the vertical line that DON'T overlap the baselines:

While not as good as manual cleanup, this would make the text much more readable + OCR-friendly. See Original + "Automatic" + Manual pages side-by-side:

Related: Here's ImageMagick being used to add Line Numbers to an image. Perhaps something from here can help detect baselines.

* * * *

Hoping everyone can put their minds together and come up with some solution.

Last edited by Tex2002ans; 10-12-2018 at 06:09 PM. |
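Steps 1-3 of approach #3 could be sketched roughly like this in plain Python. This is only an illustration under some loud assumptions: the page is already binarized into a list of rows (1 = ink, 0 = paper), "baseline detection" is approximated by flagging rows with a lot of ink (a simple row projection, not a real baseline detector), and the vertical line is found by checking which columns are inked in most rows (a swapped-in stand-in for Hough Lines). All function names and thresholds here are made up for the example.

```python
def text_bands(img, min_ink, pad=0):
    """Step 1: rows holding at least min_ink ink pixels count as text,
    then each such row is expanded by pad rows above and below."""
    height = len(img)
    rows = set()
    for y, row in enumerate(img):
        if sum(row) >= min_ink:
            for yy in range(max(0, y - pad), min(height, y + pad + 1)):
                rows.add(yy)
    return rows


def find_vertical_lines(img, min_fraction=0.8):
    """Step 2: columns that are inked in at least min_fraction of all rows."""
    height, width = len(img), len(img[0])
    return [
        x for x in range(width)
        if sum(img[y][x] for y in range(height)) >= min_fraction * height
    ]


def remove_line_outside_text(img, min_ink, pad=0, min_fraction=0.8):
    """Step 3: erase the detected columns everywhere except inside text bands."""
    protected = text_bands(img, min_ink, pad)
    columns = find_vertical_lines(img, min_fraction)
    out = [row[:] for row in img]  # copy, so the input stays untouched
    for y in range(len(img)):
        if y in protected:
            continue  # leave pixels inside text rows alone
        for x in columns:
            out[y][x] = 0
    return out
```

Note that min_ink has to be larger than the number of scanline columns, since the scanline itself adds ink to every row. Like the GIMP mock-up, this only clears the line BETWEEN text rows; the part running through the letters is left for manual cleanup.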
10-12-2018, 06:41 PM | #2 |
null operator (he/him)
Posts: 20,570
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
Have a look for imaging forensic tools. Twenty years ago some of them could detect inconsistent and unusual edges, can't recall if they could remove them - probably. Back then they cost four figures and ran on SG and similar kit.
BR |
10-14-2018, 02:03 AM | #3 |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Hmmm... that may be another angle to research.

I know a lot of times they "average" pixel colors of entire rows/columns to get semi-unique fingerprints. Perhaps something like that could be used to detect lines too.

* * *

Tonight I was dabbling a bit more with Hough Lines. I had quite a bit of success locating the line through the text.

ImageMagick Hough Lines

Original Image:

Step 1: Invert the scan using ImageMagick's canny edge detection, which leaves white edges on a black background (see fmwconcepts link in Post #1):

Code:
convert test.png -canny 0x1+10%+40% test_inverse.png

Step 2: Then calculate Hough Lines. From testing, on this specific book, I found a threshold between 500-700 worked. The higher the threshold, the more "false positives" disappeared.

Step 3: Overlay Hough Lines with image:

Here are the same steps with another page:

* * *

Side Note: To see what a Hough Line calculation is actually doing, I found this part of the video did a decent job explaining it visually:

https://youtu.be/4zHbI-fFIlI?t=219

It goes row-by-row detecting each white pixel, then spins a line through a full rotation around that pixel. Plotting the results gives points of varying strength (which tell you the probable locations + angles of lines).

Last edited by Tex2002ans; 10-14-2018 at 01:45 PM. |
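For anyone who wants to see the voting idea from the Side Note in code, here is a minimal pure-Python sketch of a Hough accumulator. It is not ImageMagick's implementation, just the textbook idea under a few assumptions: the input is an edge image as a list of rows (nonzero = white edge pixel), angles are sampled in 1-degree steps from 0 to 179 (enough to cover every line orientation), and the threshold is simply a minimum vote count, similar in spirit to the threshold used in Step 2. The function name and parameters are invented for the example.

```python
import math


def hough_lines(edges, threshold, theta_steps=180):
    """Vote for lines through every edge pixel.

    Each line is described by (theta, rho): theta is the angle of the
    line's normal in degrees, rho its distance from the origin in pixels.
    theta = 0 therefore means a vertical line at x = rho.
    Returns (votes, theta_degrees, rho) tuples, strongest first.
    """
    height, width = len(edges), len(edges[0])
    # Precompute cos/sin for every sampled angle.
    trig = [
        (math.cos(math.pi * t / theta_steps), math.sin(math.pi * t / theta_steps))
        for t in range(theta_steps)
    ]
    votes = {}
    for y in range(height):
        for x in range(width):
            if not edges[y][x]:
                continue
            # "Spin" a line around this pixel: one vote per sampled angle.
            for t, (c, s) in enumerate(trig):
                rho = round(x * c + y * s)
                votes[(t, rho)] = votes.get((t, rho), 0) + 1
    peaks = [
        (count, t * 180.0 / theta_steps, rho)
        for (t, rho), count in votes.items()
        if count >= threshold
    ]
    return sorted(peaks, reverse=True)
```

On a page scan, a long vertical scanline would show up as a strong peak with theta near 0 (or near 180), while text mostly spreads its votes thinly. That is why raising the threshold makes the "false positives" drop out.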
10-14-2018, 10:39 AM | #4 |
A Hairy Wizard
Posts: 3,095
Karma: 18727053
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
|
They have some image correction/ocr software over at diybookscanner.org that may be of help...it's some pretty amazing stuff.
|
10-14-2018, 07:34 PM | #5 | |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
I've been using Scan Tailor Advanced lately, and that's initially how I cleaned up those "Original" images. It does a fantastic job cropping and correcting the distortions. (And runs so much faster/better than the original Scan Tailor OR Scan Tailor Enhanced.)

... but this line-through-text issue has been niggling away at me for a while now. Usually it's not so bad, and I can manually clean it, but this particular book scan had a vertical line on EVERY even page. |
|