Old 10-12-2018, 05:46 PM   #1
Tex2002ans
Wizard
Posts: 1,047
Karma: 5860915
Join Date: Jul 2012
Device: Nook
Detecting/Removing Vertical Scanlines from Scans

Does anyone have a method for detecting/removing lines in scans WITHIN text?

I think this would really help towards cleaning up PDF scans from Archive.org and elsewhere.

The Problem

(These sample images were cleaned up manually: Before = ScanTailor, After = Manual Cleanup.)

"Easy" Lines

These types of lines occur where the line barely intersects the text:

(Image: [Easy]Page.-.0033.png)
(Image: [Cleaned]Page.-.0033.png)

"Hard" Lines

The worst, though, are lines that run through the middle of the text.

What I would ultimately aim for is Before/After:

(Image: [ScanTailor]Page.-.0247.png)
(Image: [Manual]Page.-.0247.png)

Potential Solutions

This is some of the research I've done so far:

#1: Hough Lines (ImageMagick)

To detect lines within images, I ran across "Hough-Line Detector" in the ImageMagick Manual.

This site+script shows how Hough Lines can detect lines within images (photographs):

http://www.fmwconcepts.com/imagemagick/houghlines/
https://www.imagemagick.org/discours...ic.php?t=25476

Towards the bottom of the fmwconcepts link, it shows how to detect the horizon (or a fence) by inverting the image.

Maybe there's a way to use inversion to detect vertical lines more easily:

(Image: Page.-.0247[Inversion].png)

Related: And here's a different forum post using Hough Lines to detect page boundaries:

http://www.imagemagick.org/discourse...ic.php?t=31321

(Although I find ScanTailor already does a decent enough job at page boundary removal.)

#2: ImageMagick (Removing Lines Outside of Text)

Some of these examples show removing horizontal/vertical lines... but those are straight, continuous lines with no breaks in them:

Remove Vertical Lines for Pre OCR (Tesseract)

These show removing table borders from an image:

https://stackoverflow.com/questions/...s-programmatic

https://stackoverflow.com/questions/...ize-from-image

#3: Baseline Detection + Vertical Line Detection

Perhaps there is a way to combine "baseline detection", with a "vertical line detector"... to at least delete MOST of the vertical line automatically.

Even this would be a huge help over complete manual cleanup.

(I roughly put these images together in GIMP. Still haven't figured out how to get ImageMagick to do this.)

Step 1. Detect Baselines + Expand:

(Image: Page.-.0247[Baselines].png)
(Image: Page.-.0247[Baselines2].png)

Step 2. Detect Vertical Line (potentially use Hough Lines?):

(Image: Page.-.0247[LineDetection].png)

Step 3. Automatically remove the parts of the vertical line that DON'T overlap the baselines:

(Image: Page.-.0247[LineBaselineRemoveal].png)

While not as good as manual cleanup, this would make the text much more readable + OCR-friendly. See Original + "Automatic" + Manual pages side-by-side:

(Image: [ScanTailor]Page.-.0247.png)
(Image: Page.-.0247[LineBaselineRemoveal].png)
(Image: [Manual]Page.-.0247.png)

Related: Here's ImageMagick being used to add Line Numbers to an image. Perhaps something from here can help detect baselines.
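
Here is a rough sketch of how Steps 1 and 3 above could fit together, written in plain Python rather than ImageMagick. It assumes the page is already binarized into a list of rows (1 = ink, 0 = paper) and that the x position of the scan line is already known from Step 2. The function names, `min_ink`, and `expand` are made up for illustration and would need tuning on real scans.

```python
def text_rows(page, min_ink=3, expand=1):
    """Flag rows that belong to a line of text.

    A row counts as text when it holds at least min_ink dark pixels;
    each flagged band is then grown by `expand` rows on both sides
    (the "Detect Baselines + Expand" step).
    """
    height = len(page)
    dense = [sum(row) >= min_ink for row in page]
    flagged = [False] * height
    for y, is_dense in enumerate(dense):
        if is_dense:
            for yy in range(max(0, y - expand), min(height, y + expand + 1)):
                flagged[yy] = True
    return flagged


def erase_line_between_text(page, line_x, min_ink=3, expand=1):
    """Blank column line_x on every row that is NOT part of a text band."""
    keep = text_rows(page, min_ink, expand)
    cleaned = [row[:] for row in page]  # copy, so the input stays untouched
    for y, row in enumerate(cleaned):
        if not keep[y]:
            row[line_x] = 0
    return cleaned


# Toy page: rows 2 and 7 are "text"; column 4 is the full-height scan line.
page = [[0] * 8 for _ in range(10)]
for y in (2, 7):
    page[y] = [1, 1, 1, 0, 1, 1, 1, 0]
for row in page:
    row[4] = 1

cleaned = erase_line_between_text(page, line_x=4)
for row in cleaned:
    print("".join("#" if v else "." for v in row))
```

The line survives only inside the expanded text bands, which is the "not as good as manual cleanup" result described above. Note that `min_ink` has to be larger than the width of the scan line itself, otherwise every row looks like text.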

* * * *

Hoping everyone can put their minds together and come up with some solution.

Last edited by Tex2002ans; 10-12-2018 at 06:09 PM.
Old 10-12-2018, 06:41 PM   #2
BetterRed
null operator
Posts: 11,065
Karma: 10563148
Join Date: Mar 2012
Location: Sydney Australia
Device: none
Have a look for imaging forensic tools. Twenty years ago some of them could detect inconsistent and unusual edges, can't recall if they could remove them - probably. Back then they cost four figures and ran on SG and similar kit.

BR
Old Yesterday, 02:03 AM   #3
Tex2002ans
Quote:
Originally Posted by BetterRed
Have a look for imaging forensic tools.

Hmmm... that may be another angle to research.

I know a lot of times they "average" pixel colors of entire rows/columns to get semi-unique fingerprints. Perhaps something like that could be used to detect lines too.
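
A minimal sketch of that column-averaging idea, assuming the page has already been loaded as a list of rows of grayscale values (0 = black, 255 = white). The function names, the toy page, and the `max_mean` cutoff are all invented for illustration. The reasoning is that a scan line darkens one column over the full page height, while ordinary text columns are broken up by line spacing and stay much lighter on average.

```python
def column_means(page):
    """Average grayscale value of every column (0 = black, 255 = white)."""
    height = len(page)
    width = len(page[0])
    return [sum(row[x] for row in page) / height for x in range(width)]


def find_dark_columns(page, max_mean=100):
    """Return x positions whose column average is at or below max_mean.

    max_mean is an illustrative cutoff, not a tuned value.
    """
    return [x for x, mean in enumerate(column_means(page)) if mean <= max_mean]


# Toy 6-row by 8-column "page": white background, a few text-like
# dark pixels, and a full-height dark line in column 5.
page = [[255] * 8 for _ in range(6)]
page[1][1] = page[1][2] = page[4][2] = 0   # scattered "text" pixels
for row in page:
    row[5] = 0                             # the vertical scan line

print(find_dark_columns(page))
```

On the toy page only column 5 is reported. Real scans are rarely perfectly vertical, so a slightly skewed line would smear across neighbouring columns and would need a looser cutoff or deskewing first.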

* * *

Tonight I was dabbling a bit more with Hough Lines.

I had quite a bit of success locating the line through the text.

ImageMagick Hough Lines

Original Image:

(Image: pg241.png)

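Step 1: Run ImageMagick's canny edge detector on the scan (see fmwconcepts link in Post #1). The result is white edges on a black background, which is why it looks like an "inverse" of the page: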

Code:
convert test.png -canny 0x1+10%+40% test_inverse.png

(Image: pg241_inverse.png)

Step 2: Then calculate Hough Lines:

From testing on this specific book, I found that a threshold between 500 and 700 worked:

Code:
convert -hough-lines 5x5+500 -fill red -transparent white test_inverse.png test_lines_500.png
convert -hough-lines 5x5+550 -fill red -transparent white test_inverse.png test_lines_550.png
convert -hough-lines 5x5+600 -fill red -transparent white test_inverse.png test_lines_600.png
convert -hough-lines 5x5+650 -fill red -transparent white test_inverse.png test_lines_650.png
convert -hough-lines 5x5+700 -fill red -transparent white test_inverse.png test_lines_700.png


(Image: pg241_lines_500.png)
(Image: pg241_lines_550.png)
(Image: pg241_lines_600.png)

The higher the threshold, the more "false positives" disappeared.

Step 3: Overlay Hough Lines with image:

Code:
convert test.png test_lines_500.png -compose over -composite test4_composite_500.png
convert test.png test_lines_550.png -compose over -composite test4_composite_550.png
convert test.png test_lines_600.png -compose over -composite test4_composite_600.png
convert test.png test_lines_650.png -compose over -composite test4_composite_650.png
convert test.png test_lines_700.png -compose over -composite test4_composite_700.png


(Image: pg241_composite_500.png)
(Image: pg241_composite_550.png)
(Image: pg241_composite_600.png)

Here are the same steps with another page:

(Image: pg093.png)
(Image: pg093_inverse.png)
(Image: pg093_lines_600.png)
(Image: pg093_composite_600.png)

* * *

Side Note: To see what a Hough Line calculation is actually doing, I found this part of the video did a decent job explaining it visually:

https://youtu.be/4zHbI-fFIlI?t=219

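It goes row by row looking for white pixels, and at each one it spins a line through a full sweep of angles, casting a vote for every (angle, distance) combination that could pass through that pixel. Plotting the votes leads to points of various strength, and the strongest points tell you the probable locations + angles of lines.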
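
To make the voting concrete, here is a toy version in plain Python. It is only a sketch of the textbook algorithm, not what ImageMagick does internally: the function names, the 5-degree `angle_step`, and the toy edge pixels are all invented for illustration. The angle here is the angle of the line's normal, so 0 means a vertical line; ImageMagick's own convention may differ. A sweep of 0 to 179 degrees is used because opposite directions describe the same line.

```python
import math


def hough_votes(points, angle_step=5):
    """Vote in (angle, distance) space for every edge pixel.

    Each point (x, y) votes once per angle for the line
    rho = x*cos(theta) + y*sin(theta), with rho rounded to a whole pixel.
    """
    votes = {}
    for x, y in points:
        for angle in range(0, 180, angle_step):
            theta = math.radians(angle)
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(angle, rho)] = votes.get((angle, rho), 0) + 1
    return votes


def strongest_lines(votes, threshold):
    """Keep only (angle, rho) cells with at least `threshold` votes."""
    return sorted(cell for cell, count in votes.items() if count >= threshold)


# Toy edge map: a vertical line at x = 7 (20 pixels) plus a few stray pixels.
points = [(7, y) for y in range(20)] + [(2, 3), (12, 9), (4, 15)]

votes = hough_votes(points)
print(strongest_lines(votes, threshold=15))
```

With a threshold of 15 the only surviving cell is angle 0 at distance 7, which is the vertical line. Dropping the threshold to 10 lets a couple of slightly tilted neighbours through as well, which matches the observation above that a higher threshold makes the "false positives" disappear.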

Last edited by Tex2002ans; Yesterday at 01:45 PM.
Old Yesterday, 10:39 AM   #4
Turtle91
A Hairy Wizard
Posts: 1,673
Karma: 11766398
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 6/5/iPad 1,2 & Air/Surface Pro/Kindle PW
They have some image correction/ocr software over at diybookscanner.org that may be of help...it's some pretty amazing stuff.
Old Yesterday, 07:34 PM   #5
Tex2002ans
Quote:
Originally Posted by Turtle91
They have some image correction/ocr software over at diybookscanner.org that may be of help...it's some pretty amazing stuff.

I'll definitely have to check out their forums again. I haven't visited them in a few years; perhaps someone has tackled a similar issue.

I've been using Scan Tailor Advanced lately, and that's initially how I cleaned up those "Original" images. It does a fantastic job cropping and correcting the distortions. (And runs so much faster/better than the original Scan Tailor OR Scan Tailor Enhanced.)

... but this line-through-text issue has been niggling away at me for a while now. Usually it's not so bad, and I can manually clean it, but this particular book scan had a vertical line on EVERY even page.