Converting pdf to png images

roger64 · 09-04-2019, 03:54 AM

Hi

In order to pre-process image files with Scan Tailor, I may have to convert some source PDFs to PNG files.

There are some online services that do this, but I prefer doing it with ImageMagick.

My second try, on a 14-page PDF extract from a bigger book, gave this:

Code:

convert garnier.pdf garnier.png
convert: profile 'icc': 'RGB ': RGB color space not permitted on grayscale PNG `garnier.png' @ warning/png.c/MagickPNGWarningHandler/1748.
[roger@lenovo roger]$

It converted all the pages nearly instantly, which is pretty good, but I am not sure I understand the message above. Does anybody know what it means?

Even with parameters like -quality 100 or -density 300, one such image is only 27k, while the same image processed with, say, the pdfcandy online service at medium resolution is 55k (see screenshot). Could this difference hinder the OCR process later?

The second image (001) comes from pdfcandy

Tex2002ans · 09-04-2019, 08:28 PM

Quote:

Originally Posted by roger64

In order to pre-process image files with Scan Tailor, I may have to convert some source PDFs to PNG files.

There are some online services that do this, but I prefer doing it with ImageMagick.

Good choice.

Quote:

Originally Posted by roger64

Code:

convert garnier.pdf garnier.png
convert: profile 'icc': 'RGB ': RGB color space not permitted on grayscale PNG `garnier.png' @ warning/png.c/MagickPNGWarningHandler/1748.

That warning can probably be completely ignored.

From what I could tell, what's happening is that ICC (color) metadata from the PDF is being embedded in the PNG... (see technical note below).

If you want the warning to go away, and don't care about the metadata, just add a -strip:

Code:

convert -strip garnier.pdf garnier.png

You could continue to add whatever other adjustments you want:

Code:

convert -density 300 -strip garnier.pdf garnier.png
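For a feel of what -density does to the image size: a PDF page is measured in points (72 per inch), and ImageMagick rasterizes at 72 DPI when no -density is given. Here is a rough sketch of the arithmetic. The helper name points_to_px is made up for illustration, and it uses plain integer division, so ImageMagick's own rounding may differ by a pixel.

```shell
# Hypothetical helper: approximate pixel length of one side of a PDF page.
# $1 = length in PDF points (1 point = 1/72 inch), $2 = -density value (DPI).
points_to_px() {
    echo $(( $1 * $2 / 72 ))
}

# US Letter is 612 x 792 points.
points_to_px 612 72     # width at the default density, prints 612
points_to_px 612 300    # width at -density 300, prints 2550
```

So the same page has roughly four times as many pixels per side at -density 300 as at the default.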

You could also remove the transparency and make the background white:

Code:

convert -density 300 -strip garnier.pdf -background white -alpha remove -alpha off garnier.png

or even use the mogrify command instead:

Code:

mogrify -format png -density 300 -strip -background white -alpha remove -alpha off garnier.pdf

Side Note: For more info on mogrify and batch processing, see the ol' IMv6 Basic Usage (mogrify).

Quote:

Originally Posted by roger64

It converted all the pages nearly instantly, which is pretty good, but I am not sure I understand the message above. Does anybody know what it means?

Technical Note: I tested a PDF on my end, and got a similar "RGB color space not permitted" warning. When I used:

Code:

identify -verbose output.png

on it and compared the stripped/unstripped PNGs, these were the chunks of metadata that -strip removed:

Spoiler:

I assume the few icc lines were what ImageMagick was warning about.

The PNG itself says it's grayscale, but the embedded ICC metadata within the PNG was trying to say it was some sort of sRGB.

Probably carryovers from the PDF metadata when the original person generated/scanned those in.
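The stripped/unstripped comparison can be sketched as a small shell function. The name metadata_removed and the file names are made up for illustration; the function only relies on identify -verbose, diff, sed and mktemp. Expect a few unrelated lines in the result too (the file name, for one), since every line that differs is listed.

```shell
# Hypothetical helper: print the `identify -verbose` lines that appear for
# the first image but not for the second (e.g. unstripped vs. stripped).
metadata_removed() {
    before_txt=$(mktemp) || return 1
    after_txt=$(mktemp) || { rm -f "$before_txt"; return 1; }
    identify -verbose "$1" > "$before_txt"
    identify -verbose "$2" > "$after_txt"
    # diff prefixes lines found only in the first file with "< ".
    diff "$before_txt" "$after_txt" | sed -n 's/^< //p'
    rm -f "$before_txt" "$after_txt"
}

# Example (assumes ImageMagick is installed; [0] selects the first page):
#   convert 'garnier.pdf[0]' raw.png
#   convert -strip 'garnier.pdf[0]' stripped.png
#   metadata_removed raw.png stripped.png
```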

Quote:

Originally Posted by roger64

Even with parameters like -quality 100 or -density 300, one such image is only 27k, while the same image processed with, say, the pdfcandy online service at medium resolution is 55k (see screenshot). Could this difference hinder the OCR process later?

... who knows what kinds of commands they run on that online service. With ImageMagick, you control the entire workflow.

And every PDF is going to be different, so you may need to do different kinds of tweaks for different things (DPI, speckling cleanup, etc.).

ImageMagick Note: PNG is lossless... so -quality on a PNG only changes how hard ImageMagick compresses the file, not what the image looks like.

JPG is lossy, so -quality is a sliding scale from 1-100 on how hideous you want the images to be.

See ImageMagick's page on -quality for more info.
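To make the PNG case concrete: as far as I understand ImageMagick's documentation, for PNG the -quality number is read as two digits, the tens digit being the zlib compression level and the ones digit the PNG filter type. A tiny sketch of that split (the function name is my own, not an ImageMagick command):

```shell
# Hypothetical helper: split a PNG -quality value into its two parts.
# Prints "<zlib compression level> <PNG filter type>".
png_quality_parts() {
    echo "$(( $1 / 10 )) $(( $1 % 10 ))"
}

png_quality_parts 75    # ImageMagick's default PNG quality, prints "7 5"
png_quality_parts 92    # prints "9 2"
```

Either way the decoded pixels are identical; only the file size and the time spent compressing change.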

roger64 · 09-05-2019, 02:02 AM

@Tex2002ans

Thank you so much for your comments, which reassure me about using ImageMagick and the PNG format for the task at hand. I shall trust ImageMagick's output when using the basic parameters above (quality, density) and leave aside all other options that could possibly lower the image quality (such as cleanup).

After all, the only goal of this stage is to get a PNG image that can later be processed with Scan Tailor.

I did this kind of conversion with two different PDF scans from Gallica. The size of the output .png files varied widely, from 26k (1st book) to a whopping 1.7 MB (2nd book)! As this difference also shows up with online conversion services, it can only be explained by the nature of the PDF.

Happily, these oversized files are only temporary, because the later Scan Tailor step outputs standardized and much smaller .tif images.

Tex2002ans · 09-05-2019, 02:26 AM

Quote:

Originally Posted by roger64

After all, the only goal of this stage is to get a PNG image that can later be processed with Scan Tailor.

And have you been using Scan Tailor Advanced?

https://github.com/4lex4/scantailor-advanced/releases

It includes all the enhancements from all the different Scan Tailor forks over the years:

https://github.com/4lex4/scantailor-...ed#description

Quote:

Originally Posted by roger64

I did this kind of conversion with two different PDF scans from Gallica. The size of the output .png files varied widely, from 26k (1st book) to a whopping 1.7 MB (2nd book)! As this difference also shows up with online conversion services, it can only be explained by the nature of the PDF.

Yeah, it'll be completely different depending on the PDFs: What DPI they were originally scanned at, vector/bitmapped, color, markings, etc.

Like one of the books I was working on (problem still not solved) had vertical lines slashed right through the middle (along with an incredibly low resolution scan).

Quote:

Originally Posted by roger64

Happily, these oversized files are only temporary, because the later Scan Tailor step outputs standardized and much smaller .tif images.

You could also output as PDF->TIFF straight from ImageMagick, but the workflow you're using seems fine. I also prefer outputting to PNGs.

roger64 · 09-05-2019, 01:08 PM

Yes, I tried Scan Tailor Advanced (but it's last year's version now...) and I was disappointed: too complex for me, and unstable... I am quite happy with the "experimental" version from the Arch repository.

After it, I get quite good results with Tesseract OCR.

Some PDFs, though, are beyond repair (but that's the exception): https://gallica.bnf.fr/ark:/12148/bp...811.texteImage

Tex2002ans · 09-05-2019, 09:56 PM

Quote:

Originally Posted by roger64

Yes, I tried Scan Tailor Advanced (but it's last year's version now...) and I was disappointed: too complex for me, and unstable.

Looks exactly the same as Scan Tailor to me, just has a few optional tabs/buttons in some steps.

But maybe the Linux version is less stable. The Windows version for me has been solid as a rock (and much faster than all the previous ones, since it's heavily multi-threaded).

Quote:

Originally Posted by roger64

Some PDFs, though, are beyond repair (but that's the exception): https://gallica.bnf.fr/ark:/12148/bp...811.texteImage

Yuck, looks about as bad as mine... but yours can be solved!

I was able to follow most of the steps here:

Removing noise from scanned text document

Step 1

Get the PDF into PNGs:

Code:

convert -density 300 input.pdf output.png

[Attached image: page22.png]

Since that PDF is awful, and has enormous whitespace around it, I would suggest trimming:

Code:

convert -density 300 input.pdf -trim output.png

that would focus more on the text itself.

[Attached image: [Step1]page22-trim.png]

Alternate #1: You could also use the magick.exe command:

Code:

magick.exe -density 300 input.pdf -trim .\output-%d.png

ImageMagick ran out of memory on my end, so if you want to convert the PDF in pieces, you can adjust the [0-30] to fit whatever page numbers you want to export:

Code:

magick.exe -density 300 input.pdf[0-30] -trim .\output-%d.png
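If you'd rather not type the ranges by hand, the "convert in pieces" idea can be sketched in plain shell. Everything here is an assumption for illustration: the helper name page_ranges, the 75-page total, and the output naming. The loop only prints the commands (a dry run), and each output name carries its range so that one chunk cannot overwrite another.

```shell
# Hypothetical helper: print zero-based page ranges ("first-last"),
# one per line, covering $1 pages in chunks of at most $2 pages.
page_ranges() {
    total=$1
    chunk=$2
    first=0
    while [ "$first" -lt "$total" ]; do
        last=$(( first + chunk - 1 ))
        if [ "$last" -ge "$total" ]; then
            last=$(( total - 1 ))
        fi
        echo "$first-$last"
        first=$(( last + 1 ))
    done
}

# Dry run for a made-up 75-page PDF, 31 pages at a time like [0-30] above.
page_ranges 75 31 | while read -r range; do
    echo "magick -density 300 'input.pdf[$range]' -trim 'output-$range-%d.png'"
done
```

This prints three commands, for pages 0-30, 31-61 and 62-74. Paste them into a shell (or replace echo with the real call) once the ranges look right.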

Side Note: Not sure why it's only cropping vertically, there's probably another method to crop the left/right whitespace too. It would probably speed up the later steps too.

Step 2

Now, I followed much of that forum post above.

Code:

convert output.png -connected-components 4 -threshold 0 -negate output-negate.png

Step 3

It seems like area-threshold looks for "chunks of pixels that are X pixels or less".

I tested with area-threshold=30:

Code:

convert output.png -define connected-components:area-threshold=30 -connected-components 4 -threshold 0 -negate output-cc30.png

but I found that this PDF needed more. So I adjusted by 10s all the way up to 80:

Code:

convert output.png -define connected-components:area-threshold=80 -connected-components 4 -threshold 0 -negate output-cc80.png
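The "adjust by 10s" experiment can be sketched as a loop, so all the candidates land side by side for comparison. The function name sweep_thresholds and the MAGICK variable are my own inventions: MAGICK defaults to convert, and setting MAGICK=echo turns the whole thing into a dry run that just prints the arguments.

```shell
# Sketch: run the Step 3 command for area-threshold 30, 40, ... 80
# on one page, writing <name>-cc30.png ... <name>-cc80.png.
sweep_thresholds() {
    page=$1
    t=30
    while [ "$t" -le 80 ]; do
        "${MAGICK:-convert}" "$page" \
            -define connected-components:area-threshold="$t" \
            -connected-components 4 -threshold 0 -negate \
            "${page%.png}-cc$t.png"
        t=$(( t + 10 ))
    done
}

# Example (assumes ImageMagick is installed):
#   sweep_thresholds output.png
```

Then flip through the six results and keep the lowest threshold that removes the specks without eating into the text.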

[Attached image: [Step3]page22-diff.png]

Step 4

Then I was able to take the images from Step 1 + Step 3 and create a diff:

Code:

convert output.png output-cc80.png -compose minus -composite output-diff.png

[Attached image: [Step4]page22-cc80.png]

Step 5

Then use the images from Step 1 + Step 4 to remove the noise:

Code:

convert output.png ( -clone 0 -fill white -colorize 100% ) output-diff.png -compose over -composite output-diff-composite.png

Here's the Original (Step 1) + Diff (Step 3) + Cleaned (Step 5):

[Attached image: [Step5]page22-diff-composite.png]

Finalized

Here's a few more before/after pages out of the book:

[Attached image: [Before]page10.png]

[Attached image: [After]page10-diff-composite.png]

[Attached image: [Before]TOC.png]

[Attached image: [After]TOC-diff-composite.png]

I attached a ZIP with Windows .bat files that batch convert the images using these steps. It's a giant mess, and it does create a lot of blank/duplicate images, but it chugs through everything eventually. I already spent hours writing this tutorial up, and don't feel like debugging the rest.
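For anyone on Linux, here is a rough shell sketch of Steps 3-5 for a single page. To be clear, this is not a port of the attached .bat files; the function name clean_page, the MAGICK variable and the file naming are assumptions for illustration. The one real difference from the commands above is that the parentheses in Step 5 are quoted, because an unquoted ( is special to the shell.

```shell
# Sketch of Steps 3-5 for one Step 1 page (e.g. output-0.png).
# MAGICK defaults to "convert"; set MAGICK=echo to print the argument
# lists instead of running anything.
clean_page() {
    page=$1
    base=${page%.png}
    magick_cmd=${MAGICK:-convert}

    # Step 3: connected-components pass with area-threshold=80.
    "$magick_cmd" "$page" \
        -define connected-components:area-threshold=80 \
        -connected-components 4 -threshold 0 -negate \
        "$base-cc80.png" || return 1

    # Step 4: diff of the Step 1 image and the Step 3 image.
    "$magick_cmd" "$page" "$base-cc80.png" \
        -compose minus -composite "$base-diff.png" || return 1

    # Step 5: composite as in the tutorial, with the parentheses quoted.
    "$magick_cmd" "$page" '(' -clone 0 -fill white -colorize 100% ')' \
        "$base-diff.png" -compose over -composite \
        "$base-diff-composite.png"
}
```

To process a whole book, call clean_page once per Step 1 image, for example from a for loop over the files. Keep the Step 1 images in a directory of their own, since a pattern like output-*.png would also match the intermediate files this writes.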

But hopefully that'll get you much cleaner input into Scan Tailor + better OCR.

Tex2002ans · 09-06-2019, 01:02 AM

Quote:

Originally Posted by Tex2002ans

Code:

magick.exe -density 300 input.pdf[0-30] -trim .\output-%d.png

Side Note: Not sure why it's only cropping vertically, there's probably another method to crop the left/right whitespace too. It would probably speed up the later steps too.

Actually, I just figured it out. For this specific set of images, if you add another -trim, it cuts the left/right as well:

Code:

magick.exe -density 300 input.pdf[0-30] -trim -trim .\output-%d.png

And with the double-trimmed images, about 30 pages turned completely black in Step 2, so to get around that, I added a white border. This Step 1 is much better:

Code:

magick.exe -density 300 input.pdf[0-30] -trim -bordercolor white -border 40x40 .\output-%d.png

roger64 · 09-06-2019, 09:49 AM

Congratulations! That's an impressive demo, worth inserting into an ImageMagick manual. I had previously failed on a ten-page extract of this, let's say... curious PDF.

I'll have to take back my words: it can be done. I'm still not keen on converting the whole book (my computer has 8 GB of RAM).

Tex2002ans · 09-06-2019, 05:58 PM

Quote:

Originally Posted by roger64

Congratulations! That's an impressive demo, worth inserting into an ImageMagick manual. I had previously failed on a ten-page extract of this, let's say... curious PDF.

I'll PM you with my:

Finalized ImageMagick files
ScanTailor images (I just skimmed through correcting content areas, etc.)
Finereader PDF

Still hideous OCR... but better than what's currently there.

It's just a bad and low quality scan in the first place...

Quote:

Originally Posted by roger64

I'll have to take back my words: it can be done. I'm still not keen on converting the whole book (my computer has 8 GB of RAM)

No wonder Scan Tailor crashes on you, some of this image manipulation takes up tons of GBs of RAM. :P
