MobileRead Forums > E-Book Formats > Workshop
Old 09-04-2019, 04:54 AM   #1
roger64
Wizard
Posts: 2,403
Karma: 2457001
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
Converting pdf to png images

Hi

To pre-process image files with Scan Tailor, I may have to convert some source PDFs to PNG files.

There are some online services that do this, but I prefer using ImageMagick.

A second try on a 14-page PDF extract from a bigger book gave this:

Code:
convert garnier.pdf garnier.png
convert: profile 'icc': 'RGB ': RGB color space not permitted on grayscale PNG `garnier.png' @ warning/png.c/MagickPNGWarningHandler/1748.
[roger@lenovo roger]$
It converted all the pages nearly instantly, which is pretty good, but I am not sure I understand the warning above. Does anybody know what it means?

Even with parameters like -quality 100 or -density 300, one such image is only 27 KB, while the same page processed with, say, the pdfcandy online service at medium resolution is 55 KB (see screenshot). Could this difference hinder OCR later?

The second image (001) comes from pdfcandy.
Attached Thumbnails
garnier-0.png (25.6 KB)
garnier_p001.png (54.3 KB)

Last edited by roger64; 09-04-2019 at 05:22 AM. Reason: quality
Old 09-04-2019, 09:28 PM   #2
Tex2002ans
Wizard
Posts: 1,370
Karma: 6862783
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by roger64 View Post
To pre-process image files with Scan Tailor, I may have to convert some source PDFs to PNG files.

There are some online services that do this, but I prefer using ImageMagick.
Good choice.

Quote:
Originally Posted by roger64 View Post
Code:
convert garnier.pdf garnier.png
convert: profile 'icc': 'RGB ': RGB color space not permitted on grayscale PNG `garnier.png' @ warning/png.c/MagickPNGWarningHandler/1748.
That warning can probably be completely ignored.

From what I could tell, what's happening is that ICC (color) metadata from the PDF is being embedded in the PNG... (see technical note below).

If you want the warning to go away, and don't care about the metadata, just add a -strip:

Code:
convert -strip garnier.pdf garnier.png
You could continue to add whatever other adjustments you want:

Code:
convert -density 300 -strip garnier.pdf garnier.png
You could also remove the transparency and make the background white:

Code:
convert -density 300 -strip garnier.pdf -background white -alpha off garnier.png
or even use the mogrify command instead:

Code:
mogrify -format png -density 300 -strip -background white -alpha off garnier.pdf
Side Note: For more info on mogrify and batch processing, see the ol' IMv6 Basic Usage (mogrify).

Quote:
Originally Posted by roger64 View Post
It converted all the pages nearly instantly, which is pretty good, but I am not sure I understand the warning above. Does anybody know what it means?
Technical Note: I tested a PDF on my end, and got a similar "RGB color space not permitted" error. When I used:

Code:
identify -verbose output.png
on it and compared the stripped/unstripped PNGs, these were the chunks of metadata that -strip removed:

Spoiler:
Code:
  Resolution: 300x300
  Print size: 8.5x11
  [...]
    icc:copyright: Copyright Artifex Software 2011
    icc:description: Artifex Software sRGB ICC Profile
    pdf:Version: PDF-1.5 
	[...]
    png:bKGD: chunk was found (see Background color, above)
    png:pHYs: x_res=300, y_res=300, units=0
    png:text: 4 tEXt/zTXt/iTXt chunks were found
    png:text-encoded profiles: 1 were found
    png:tIME: 2019-09-04T23:45:09Z
	[...]
  Profiles:
    Profile-icc: 2576 bytes
	[...]


I assume the few icc lines were what ImageMagick was warning about.

The PNG itself says it's grayscale, but the embedded ICC metadata within the PNG was trying to say it was some sort of sRGB.

Probably carryovers from the PDF metadata when the original person generated/scanned those in.
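
A rough sketch of that stripped/unstripped comparison, in case anyone wants to repeat it: dump `identify -verbose` for both files to text, then list the lines that only appear in the unstripped dump. The helper name `only_in_first` and the file names are made up for illustration, and the two dumps will also differ on unrelated lines such as the file name and timestamps.

```shell
# Print the lines that appear in the first text file but not in the second.
# Meant for two saved `identify -verbose` dumps: unstripped first, stripped second.
only_in_first() {
  # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file
  grep -Fxv -f "$2" "$1"
}

# Hypothetical usage (needs ImageMagick; the file names are placeholders):
# identify -verbose unstripped.png > unstripped.txt
# identify -verbose stripped.png   > stripped.txt
# only_in_first unstripped.txt stripped.txt
```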

Quote:
Originally Posted by roger64 View Post
Even with parameters like -quality 100 or -density 300, one such image is only 27 KB, while the same page processed with, say, the pdfcandy online service at medium resolution is 55 KB (see screenshot). Could this difference hinder OCR later?
... who knows what kinds of commands they run on that online service. With ImageMagick, you control the entire workflow.

And every PDF is going to be different, so you may need to do different kinds of tweaks for different things (DPI, speckling cleanup, etc.).

ImageMagick Note: PNG is lossless... so -quality on PNG only changes how much compression it's running on the file.

JPG is lossy, so -quality is a sliding scale from 1-100 on how hideous you want the images to be.

See ImageMagick's page on -quality for more info.
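
For what it's worth, ImageMagick's -quality documentation describes the PNG value as two digits: the tens digit is the zlib compression level and the ones digit is the PNG filter type, with 75 as the default. A tiny sketch of that split (the function name `png_quality_parts` is just for illustration):

```shell
# Split an ImageMagick PNG -quality value into its two documented parts:
# tens digit = zlib compression level, ones digit = PNG filter type.
png_quality_parts() {
  q=$1
  printf 'zlib_level=%d filter_type=%d\n' "$((q / 10))" "$((q % 10))"
}

png_quality_parts 75   # documented PNG default: prints zlib_level=7 filter_type=5
png_quality_parts 95   # prints zlib_level=9 filter_type=5
```

Either way the pixels come out identical; only the file size and the encoding time change.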

Last edited by Tex2002ans; 09-04-2019 at 09:52 PM.
Old 09-05-2019, 03:02 AM   #3
roger64
@Tex2002ans

Thank you so much for your comments, which reassure me about using ImageMagick and the PNG format for the task at hand. I shall trust ImageMagick's output when using the basic parameters above (quality, density) and leave aside all other options that could lower the image quality (such as cleanup).

After all, the only goal of this stage is to get a PNG image that can later be processed with Scan Tailor.

I did this kind of conversion with two different PDF scans from Gallica. The size of the output .png files varied widely, from 26 KB (first book) to a whopping 1.7 MB (second book)! As this difference also shows up with online conversion services, it can only be explained by the nature of the PDF.

Happily, these oversized files are only temporary, because Scan Tailor later outputs standardized and much smaller .tif images.
Old 09-05-2019, 03:26 AM   #4
Tex2002ans
Quote:
Originally Posted by roger64 View Post
After all, the only goal of this stage is to get a PNG image that can later be processed with Scan Tailor.
And have you been using Scan Tailor Advanced?

https://github.com/4lex4/scantailor-advanced/releases

It includes all the enhancements from all the different Scan Tailor forks over the years:

https://github.com/4lex4/scantailor-...ed#description

Quote:
Originally Posted by roger64 View Post
I did this kind of conversion with two different PDF scans from Gallica. The size of the output .png files varied widely, from 26 KB (first book) to a whopping 1.7 MB (second book)! As this difference also shows up with online conversion services, it can only be explained by the nature of the PDF.
Yeah, it'll be completely different depending on the PDFs: What DPI they were originally scanned at, vector/bitmapped, color, markings, etc.

Like one of the books I was working on (problem still not solved) had vertical lines slashed right through the middle (along with an incredibly low resolution scan).

Quote:
Originally Posted by roger64 View Post
Happily, these oversized files are only temporary, because Scan Tailor later outputs standardized and much smaller .tif images.
You could also output as PDF->TIFF straight from ImageMagick, but the workflow you're using seems fine. I also prefer outputting to PNGs.

Last edited by Tex2002ans; 09-05-2019 at 03:28 AM.
Old 09-05-2019, 02:08 PM   #5
roger64
Yes, I tried Scan Tailor Advanced (though what I tried is last year's version by now...) and I was disappointed: too complex for me, and unstable... I am quite happy with the "experimental" version from the Arch repository.

After that, I get quite good results with Tesseract OCR.

Some PDFs, though, are beyond repair (but that's the exception): https://gallica.bnf.fr/ark:/12148/bp...811.texteImage
Old 09-05-2019, 10:56 PM   #6
Tex2002ans
Quote:
Originally Posted by roger64 View Post
Yes, I tried Scan Tailor Advanced (though what I tried is last year's version by now...) and I was disappointed: too complex for me, and unstable.
Looks exactly the same as Scan Tailor to me, just has a few optional tabs/buttons in some steps.

But maybe the Linux version is less stable. The Windows version for me has been solid as a rock (and much faster than all the previous ones, since it's heavily multi-threaded).

Quote:
Originally Posted by roger64 View Post
Some PDFs, though, are beyond repair (but that's the exception): https://gallica.bnf.fr/ark:/12148/bp...811.texteImage
Yuck, looks about as bad as mine... but yours can be solved!

I was able to follow most of the steps here:

Removing noise from scanned text document

Step 1

Get the PDF into PNGs:

Code:
convert -density 300 input.pdf output.png
Attached image: page22.png (112.3 KB)

Since that PDF is awful, and has enormous whitespace around it, I would suggest trimming:

Code:
convert -density 300 input.pdf -trim output.png
that would focus more on the text itself.

Attached image: [Step1]page22-trim.png (97.2 KB)

Alternate #1: You could also use the magick.exe command:

Code:
magick.exe -density 300 input.pdf -trim .\output-%d.png
ImageMagick ran out of memory on my end, so if you want to convert the PDF in pieces, you can adjust the [0-30] to fit whatever page numbers you want to export:

Code:
magick.exe -density 300 input.pdf[0-30] -trim .\output-%d.png
Side Note: Not sure why it's only cropping vertically, there's probably another method to crop the left/right whitespace too. It would probably speed up the later steps too.
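
If the whole PDF will not fit in memory, the page-range trick above can be looped. A minimal sketch for a POSIX shell, assuming you already know the page count (70 here is a made-up number) and that ImageMagick 7's `magick` is on the PATH; `page_ranges` is a hypothetical helper, not an ImageMagick feature:

```shell
# Print zero-based page ranges ("first-last"), one per line, that cover
# TOTAL pages in chunks of at most SIZE pages.
page_ranges() {
  total=$1
  size=$2
  start=0
  while [ "$start" -lt "$total" ]; do
    end=$((start + size - 1))
    # Clamp the last chunk to the final page index.
    [ "$end" -ge "$total" ] && end=$((total - 1))
    printf '%s-%s\n' "$start" "$end"
    start=$((end + 1))
  done
}

page_ranges 70 30   # prints 0-29, 30-59 and 60-69 on separate lines

# Hypothetical usage, commented out because it needs ImageMagick and a real PDF:
# for r in $(page_ranges 70 30); do
#   magick -density 300 "input.pdf[$r]" -trim "output-$r-%d.png"
# done
```

Putting the range into the output name is only a precaution, so that one chunk cannot overwrite another if the %d numbering restarts.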

Step 2

Now, I followed much of that forum post above.

Code:
convert output.png -connected-components 4 -threshold 0 -negate output-negate.png
Step 3

It seems like area-threshold looks for "chunks of pixels that are X pixels or less".

I tested with area-threshold=30:

Code:
convert output.png -define connected-components:area-threshold=30 -connected-components 4 -threshold 0 -negate output-cc30.png
but I found that this PDF needed more. So I adjusted by 10s all the way up to 80:

Code:
convert output.png -define connected-components:area-threshold=80 -connected-components 4 -threshold 0 -negate output-cc80.png
Attached image: [Step3]page22-diff.png (16.8 KB)
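
The trial and error over thresholds can be scripted. This sketch only prints the commands (pipe the output to `sh` to actually run them, which needs ImageMagick and a real output.png); `cc_sweep_cmds` and the output file naming are assumptions modelled on the commands above:

```shell
# Print one connected-components command per area threshold,
# going from LO to HI (inclusive) in steps of STEP.
# Usage: cc_sweep_cmds INPUT.png LO HI STEP
cc_sweep_cmds() {
  src=$1
  t=$2
  hi=$3
  step=$4
  base=${src%.png}
  while [ "$t" -le "$hi" ]; do
    printf '%s\n' "convert $src -define connected-components:area-threshold=$t -connected-components 4 -threshold 0 -negate ${base}-cc$t.png"
    t=$((t + step))
  done
}

# Same range as in the post: 30, 40, ... 80.
cc_sweep_cmds output.png 30 80 10
```

Comparing the resulting images side by side shows at which threshold the speckles stop being kept.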

Step 4

Then I was able to take the images from Step 1 + Step 3 and create a diff:

Code:
convert output.png output-cc80.png -compose minus -composite output-diff.png
Attached image: [Step4]page22-cc80.png (77.9 KB)

Step 5

Then use the images from Step 1 + Step 4 to remove the noise:

Code:
convert output.png ( -clone 0 -fill white -colorize 100% ) output-diff.png -compose over -composite output-diff-composite.png
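
The commands in this post are written for the Windows prompt. In a Linux shell such as bash, the parentheses in Step 5 have to be escaped as `\(` and `\)`, otherwise the shell tries to interpret them itself. A sketch that prints Steps 3 to 5 for one page in that form; `clean_page_cmds` is a made-up name, and it assumes file names without spaces:

```shell
# Print the Step 3, Step 4 and Step 5 commands for one page,
# with the parentheses escaped for a POSIX shell.
# Usage: clean_page_cmds INPUT.png AREA_THRESHOLD
clean_page_cmds() {
  src=$1
  thr=$2
  base=${src%.png}
  # Step 3: connected-components pass with the given area threshold.
  printf '%s\n' "convert $src -define connected-components:area-threshold=$thr -connected-components 4 -threshold 0 -negate ${base}-cc$thr.png"
  # Step 4: diff between the original and the Step 3 image.
  printf '%s\n' "convert $src ${base}-cc$thr.png -compose minus -composite ${base}-diff.png"
  # Step 5: composite using the original, a white clone and the diff.
  printf '%s\n' "convert $src \\( -clone 0 -fill white -colorize 100% \\) ${base}-diff.png -compose over -composite ${base}-diff-composite.png"
}

clean_page_cmds output.png 80
```

Piping the output to `sh` would run the three commands in order, given ImageMagick 6's `convert` and an existing output.png.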
Here's the Original (Step 1) + Diff (Step 3) + Cleaned (Step 5):

Attached images: [Step1]page22-trim.png (97.2 KB), [Step3]page22-diff.png (16.8 KB), [Step5]page22-diff-composite.png (87.9 KB)

Finalized

Here's a few more before/after pages out of the book:

Attached images: [Before]page10.png (106.7 KB), [After]page10-diff-composite.png (94.2 KB)
Attached images: [Before]TOC.png (68.8 KB), [After]TOC-diff-composite.png (55.1 KB)

I attached a ZIP with Windows .bat files that batch convert the images using these steps. It's a giant mess, and it does create a lot of blank/duplicate images, but it chugs through everything eventually. I already spent hours writing this tutorial up, and don't feel like debugging the rest.

But hopefully that'll get you much cleaner input into Scan Tailor + better OCR.
Attached Files
File Type: zip Tex.BAT.ImageMagick.Cleanup.BadSpeckleLines.zip (1.3 KB, 17 views)

Last edited by Tex2002ans; 09-06-2019 at 12:03 AM.
Old 09-06-2019, 02:02 AM   #7
Tex2002ans
Quote:
Originally Posted by Tex2002ans View Post
Code:
magick.exe -density 300 input.pdf[0-30] -trim .\output-%d.png
Side Note: Not sure why it's only cropping vertically, there's probably another method to crop the left/right whitespace too. It would probably speed up the later steps too.
Actually, I just figured it out. For this specific set of images, if you add another -trim, it cuts the left/right as well:

Code:
magick.exe -density 300 input.pdf[0-30] -trim -trim .\output-%d.png
And with the double-trimmed images, about 30 pages turned completely black in Step 2, so to get around that, I added a white border. This Step 1 is much better:

Code:
magick.exe -density 300 input.pdf[0-30] -trim -bordercolor white -border 40x40 .\output-%d.png
Old 09-06-2019, 10:49 AM   #8
roger64
Congratulations! That's an impressive demo, worthy of an ImageMagick manual. I had previously failed on a ten-page extract of this, let's say... curious PDF.

I'll have to take back my words: it can be done. I'm still not keen on converting the whole book (my computer has 8 GB of RAM).
Old 09-06-2019, 06:58 PM   #9
Tex2002ans
Quote:
Originally Posted by roger64 View Post
Congratulations! That's an impressive demo, worthy of an ImageMagick manual. I had previously failed on a ten-page extract of this, let's say... curious PDF.
I'll PM you with my:
  • Finalized ImageMagick files
  • ScanTailor images (I just skimmed through correcting content areas, etc.)
  • Finereader PDF

Still hideous OCR... but better than what's currently there.

It's just a bad and low quality scan in the first place...

Quote:
Originally Posted by roger64 View Post
I'll have to take back my words: it can be done. I'm still not keen on converting the whole book (my computer has 8 GB of RAM).
No wonder Scan Tailor crashes on you, some of this image manipulation takes up tons of GBs of RAM. :P