Old 10-12-2018, 05:46 PM   #1
Tex2002ans
Detecting/Removing Vertical Scanlines from Scans

Does anyone have a method for detecting/removing lines in scans WITHIN text?

I think this would really help towards cleaning up PDF scans from Archive.org and elsewhere.

The Problem

(These sample images were cleaned up manually: Before = ScanTailor, After = Manual Cleanup.)

"Easy" Lines

This type of line occurs where the line barely intersects the text:

(Attached images: [Easy]Page.-.0033.png, [Cleaned]Page.-.0033.png)

"Hard" Lines

The worst, though, are lines that go through the middle of the text.

What I would ultimately aim for is Before/After:

(Attached images: [ScanTailor]Page.-.0247.png, [Manual]Page.-.0247.png)

Potential Solutions

This is some of the research I've done so far:

#1: Hough Lines (ImageMagick)

To detect lines within images, I ran across "Hough-Line Detector" in the ImageMagick Manual.

This site+script shows how Hough Lines can detect lines within images (photographs):

http://www.fmwconcepts.com/imagemagick/houghlines/
https://www.imagemagick.org/discours...ic.php?t=25476

Towards the bottom of the fmwconcepts link, it shows how to detect the horizon (or a fence) by inverting the image.

Maybe there's a way to use inversion to detect vertical lines more easily:

(Attached image: Page.-.0247[Inversion].png)

Related: And here's a different forum post using Hough Lines to detect page boundaries:

http://www.imagemagick.org/discourse...ic.php?t=31321

(Although I find ScanTailor already does a decent enough job at page boundary removal.)

#2: ImageMagick (Removing Lines Outside of Text)

Some of these examples show removing horizontal/vertical lines... but these are straight lines with no breaks in between:

Remove Vertical Lines for Pre OCR (Tesseract)

These show removing table borders from an image:

https://stackoverflow.com/questions/...s-programmatic

https://stackoverflow.com/questions/...ize-from-image

#3: Baseline Detection + Vertical Line Detection

Perhaps there is a way to combine "baseline detection", with a "vertical line detector"... to at least delete MOST of the vertical line automatically.

Even this would be a huge help over complete manual cleanup.

(I roughly put these images together in GIMP. Still haven't figured out how to get ImageMagick to do this.)

Step 1. Detect Baselines + Expand:

(Attached images: Page.-.0247[Baselines].png, Page.-.0247[Baselines2].png)

Step 2. Detect Vertical Line (potentially use Hough Lines?):

(Attached image: Page.-.0247[LineDetection].png)

Step 3. Automatically remove the parts of the vertical line that DON'T overlap the baselines:

(Attached image: Page.-.0247[LineBaselineRemoveal].png)

While not as good as manual cleanup, this would make the text much more readable + OCR-friendly. See Original + "Automatic" + Manual pages side-by-side:

(Attached images: [ScanTailor]Page.-.0247.png, Page.-.0247[LineBaselineRemoveal].png, [Manual]Page.-.0247.png)

Related: Here's ImageMagick being used to add Line Numbers to an image. Perhaps something from here can help detect baselines.
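
For anyone who wants to experiment, the three steps above could be sketched roughly like this in plain Python. This is only a toy illustration, not something from the thread: it assumes the page is already a grid of 0-255 grayscale values, that the scanline is perfectly vertical, one pixel wide, and at a known x position (`line_x`), and it stands in for real baseline detection by simply treating any row with enough dark pixels as "text". The function names, `min_ink`, and `pad` are all made up for the example.

```python
def text_bands(pixels, line_x, min_ink=2):
    """Step 1: return (top, bottom) row ranges that look like text.

    A row counts as text when it has at least `min_ink` dark pixels,
    ignoring the scanline column itself.
    """
    bands, start = [], None
    for y, row in enumerate(pixels):
        ink = sum(1 for x, v in enumerate(row) if v < 128 and x != line_x)
        if ink >= min_ink and start is None:
            start = y                      # a text band begins
        elif ink < min_ink and start is not None:
            bands.append((start, y - 1))   # the band ended on the previous row
            start = None
    if start is not None:
        bands.append((start, len(pixels) - 1))
    return bands


def erase_line_outside_text(pixels, line_x, pad=0):
    """Step 3: whiten the scanline wherever it misses the (padded) text bands."""
    bands = [(top - pad, bottom + pad) for top, bottom in text_bands(pixels, line_x)]
    cleaned = [row[:] for row in pixels]   # work on a copy
    for y in range(len(cleaned)):
        if not any(top <= y <= bottom for top, bottom in bands):
            cleaned[y][line_x] = 255
    return cleaned


# Toy 7x8 page: scanline at x = 4, "text" in rows 1-2 and row 5.
page = [[255] * 8 for _ in range(7)]
for y in range(7):
    page[y][4] = 0
for y in (1, 2):
    for x in (1, 2, 3):
        page[y][x] = 0
page[5][5] = page[5][6] = 0

cleaned = erase_line_outside_text(page, line_x=4)
print(text_bands(page, line_x=4))      # [(1, 2), (5, 5)]
print([row[4] for row in cleaned])     # [255, 0, 0, 255, 255, 0, 255]
```

The `pad` argument plays the role of the "Expand" in Step 1: a larger value leaves more of the line alone around each row of text. The pieces of line left inside the text bands would still need manual cleanup.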

* * * *

Hoping everyone can put their minds together and come up with some solution.

Last edited by Tex2002ans; 10-12-2018 at 06:09 PM.
Old 10-12-2018, 06:41 PM   #2
BetterRed
Have a look for imaging forensic tools. Twenty years ago some of them could detect inconsistent and unusual edges, can't recall if they could remove them - probably. Back then they cost four figures and ran on SG and similar kit.

BR
Old 10-14-2018, 02:03 AM   #3
Tex2002ans
Quote:
Originally Posted by BetterRed
Have a look for imaging forensic tools.
Hmmm... that may be another angle to research.

I know a lot of times they "average" pixel colors of entire rows/columns to get semi-unique fingerprints. Perhaps something like that could be used to detect lines too.
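
A rough sketch of that row/column averaging idea, in plain Python. It is not taken from any forensic tool: it assumes the page is a grid of 0-255 grayscale values and that the scanline is perfectly vertical, and the function names and the 0.8 cutoff are made up for illustration.

```python
def column_darkness(pixels):
    """Average darkness (0 = all white, 1 = all black) of every column.

    `pixels` is a list of rows; each row is a list of 0-255 grayscale values.
    """
    height = len(pixels)
    width = len(pixels[0])
    return [
        sum(255 - pixels[y][x] for y in range(height)) / (255 * height)
        for x in range(width)
    ]


def find_line_columns(pixels, min_darkness=0.8):
    """Return the x positions whose average darkness is at least `min_darkness`."""
    return [x for x, d in enumerate(column_darkness(pixels)) if d >= min_darkness]


# Toy 6x8 page: white background, a few text-like pixels, solid scanline at x = 5.
page = [[255] * 8 for _ in range(6)]
for y in range(6):
    page[y][5] = 0                # the scanline runs the full height
page[1][1] = page[1][2] = 0       # a bit of "text"
page[4][2] = page[4][3] = 0

print(find_line_columns(page))    # [5]
```

Text only darkens a column here and there, while a scanline darkens it from top to bottom, so the line's column stands out in the averages. A slightly tilted line would smear across several columns, which is where something like Hough Lines should do better.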

* * *

Tonight I was dabbling a bit more with Hough Lines.

I had quite a bit of success locating the line through the text.

ImageMagick Hough Lines

Original Image:

(Attached image: pg241.png)

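Step 1: Run ImageMagick's canny edge detection on the scan. This gives white edges on a black background, so the result looks like an inverse of the page (see fmwconcepts link in Post #1):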

Code:
convert test.png -canny 0x1+10%+40% test_inverse.png
(Attached image: pg241_inverse.png)

Step 2: Then calculate Hough Lines:

From testing, on this specific book, I found a threshold between 500 and 700 worked:

Code:
convert test_inverse.png -fill red -hough-lines 5x5+500 -transparent white test_lines_500.png
convert test_inverse.png -fill red -hough-lines 5x5+550 -transparent white test_lines_550.png
convert test_inverse.png -fill red -hough-lines 5x5+600 -transparent white test_lines_600.png
convert test_inverse.png -fill red -hough-lines 5x5+650 -transparent white test_lines_650.png
convert test_inverse.png -fill red -hough-lines 5x5+700 -transparent white test_lines_700.png


(Attached images: pg241_lines_500.png, pg241_lines_550.png, pg241_lines_600.png)

The higher the threshold, the more "false positives" disappeared.

Step 3: Overlay Hough Lines with image:

Code:
convert test.png test_lines_500.png -compose over -composite test4_composite_500.png
convert test.png test_lines_550.png -compose over -composite test4_composite_550.png
convert test.png test_lines_600.png -compose over -composite test4_composite_600.png
convert test.png test_lines_650.png -compose over -composite test4_composite_650.png
convert test.png test_lines_700.png -compose over -composite test4_composite_700.png


(Attached images: pg241_composite_500.png, pg241_composite_550.png, pg241_composite_600.png)

Here are the same steps with another page:

(Attached images: pg093.png, pg093_inverse.png, pg093_lines_600.png, pg093_composite_600.png)

* * *

Side Note: To see what a Hough Line calculation is actually doing, I found this part of the video did a decent job explaining it visually:

https://youtu.be/4zHbI-fFIlI?t=219

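It goes row by row looking for white pixels, and for each one it spins a line through that pixel over a full sweep of angles, adding a vote for every (angle, distance) combination it passes through. Plotting the votes leads to points of various strength (which tells you probable locations + angles of lines).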
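
To make that voting idea concrete, here is a minimal sketch of a Hough accumulator in plain Python. It is not ImageMagick's implementation, just the textbook version under some assumptions: the white (edge) pixels are given as a list of (x, y) points, a line is described by rho = x*cos(theta) + y*sin(theta), and the function names, the 5-degree step, and the toy data are made up for illustration.

```python
import math
from collections import Counter


def hough_votes(points, angle_step_deg=5):
    """Vote for every (angle, distance) line that passes through each point.

    theta is the angle of the line's normal, so theta = 0 means a vertical
    line at x = rho. Sweeping 0-179 degrees covers every line direction.
    """
    votes = Counter()
    for x, y in points:
        for theta_deg in range(0, 180, angle_step_deg):
            theta = math.radians(theta_deg)
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(theta_deg, rho)] += 1
    return votes


def strongest_lines(points, threshold, angle_step_deg=5):
    """Keep only the (angle, distance) cells with at least `threshold` votes."""
    votes = hough_votes(points, angle_step_deg)
    return sorted(cell for cell, count in votes.items() if count >= threshold)


# Toy edge image: a vertical line at x = 7 (20 pixels tall) plus stray "text" pixels.
edge_pixels = [(7, y) for y in range(20)] + [(2, 3), (4, 11), (12, 5), (15, 17)]

print(strongest_lines(edge_pixels, threshold=18))   # [(0, 7)]
```

All 20 pixels of the line vote for the same cell (angle 0, distance 7), while the stray pixels scatter their votes, which is why raising the threshold makes the "false positives" drop out. A finer angle step would also report near-duplicate cells next to the true line, so some kind of peak picking is needed in practice.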

Last edited by Tex2002ans; 10-14-2018 at 01:45 PM.
Old 10-14-2018, 10:39 AM   #4
Turtle91
They have some image correction/OCR software over at diybookscanner.org that may be of help...it's some pretty amazing stuff.
Old 10-14-2018, 07:34 PM   #5
Tex2002ans
Quote:
Originally Posted by Turtle91
They have some image correction/ocr software over at diybookscanner.org that may be of help...it's some pretty amazing stuff.
I'll definitely have to check out their forums again. Haven't visited them in a few years, perhaps someone tackled a similar issue.

I've been using Scan Tailor Advanced lately, and that's initially how I cleaned up those "Original" images. It does a fantastic job cropping and correcting the distortions. (And runs so much faster/better than the original Scan Tailor OR Scan Tailor Enhanced.)

... but this line-through-text issue has been niggling away at me for a while now. Usually it's not so bad, and I can manually clean it, but this particular book scan had a vertical line on EVERY even page.