#1 | roger64
Wizard
Posts: 2,624
Karma: 3120635
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
Converting pdf to png images
Hi
To pre-process image files with Scan Tailor, I sometimes have to convert a source PDF to PNG files. There are online services that do this, but I prefer using ImageMagick. My second try, on a 14-page PDF extracted from a bigger book, gave this:
Code:
convert garnier.pdf garnier.png
convert: profile 'icc': 'RGB ': RGB color space not permitted on grayscale PNG `garnier.png' @ warning/png.c/MagickPNGWarningHandler/1748.
[roger@lenovo roger]$
Even after adding parameters like -quality 100 or -density 300, one such image is only 27k in size, while the same image processed with, say, the pdfcandy online service at medium resolution is 55k (see screenshot). Could this difference hinder the OCR process later? The second image (001) comes from pdfcandy.
Last edited by roger64; 09-04-2019 at 04:22 AM. Reason: quality
#2 | Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Quote:
From what I could tell, ICC (color) metadata from the PDF is being embedded in the PNG (see the technical note below). If you want the warning to go away and don't care about the metadata, just add -strip:
Code:
convert -strip garnier.pdf garnier.png
Code:
convert -density 300 -strip garnier.pdf garnier.png
Code:
convert -density 300 -strip garnier.pdf -background white -alpha off garnier.png
Code:
mogrify -format png -density 300 -strip -background white -alpha off garnier.pdf
Quote:
Code:
identify -verbose output.png
Spoiler:
I assume the few icc lines were what ImageMagick was warning about. The PNG itself says it's grayscale, but the ICC metadata embedded in the PNG claimed it was some sort of sRGB. These are probably carryovers from the PDF metadata, left over from when the original person generated or scanned the pages.
Quote:
And every PDF is going to be different, so you may need different tweaks for different files (DPI, speckle cleanup, etc.).
ImageMagick Note: PNG is lossless, so -quality on a PNG only changes how much compression is applied to the file. JPG is lossy, so -quality is a sliding scale from 1-100 on how hideous you want the images to be. See ImageMagick's page on -quality for more info.
Last edited by Tex2002ans; 09-04-2019 at 08:52 PM.
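To illustrate the note on -quality: ImageMagick's documentation describes the PNG quality value as two digits, where the tens digit is the zlib compression level and the ones digit is the PNG filter type, so it never touches the pixel data. The helper name below (png_quality_parts) is made up for this sketch; it only does the arithmetic and does not call ImageMagick.

```shell
#!/bin/sh
# Hypothetical helper: split an ImageMagick PNG -quality value into its parts.
# Per the ImageMagick docs, for PNG output:
#   tens digit = zlib compression level, ones digit = PNG filter type.
# Neither part changes the pixels, because PNG is lossless.
png_quality_parts() {
    q=$1
    printf '%s\n' "zlib_level=$((q / 10)) filter_type=$((q % 10))"
}

png_quality_parts 75   # 75 is documented as the default PNG quality
png_quality_parts 95
```

So two PNGs of different file size can still hold identical pixels; comparing dimensions and bit depth with identify says more about OCR suitability than the size on disk.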
#3 | roger64
@Tex2002ans
Thank you so much for your comments, which reassure me about using ImageMagick and the PNG format for the task at hand. I shall trust ImageMagick's output when using the basic parameters above (quality, density) and leave aside all other options that could possibly lower the image quality (such as cleanup). After all, the only goal of this stage is to get a PNG image that can later be processed with Scan Tailor. I did this kind of conversion with two different PDF scans from Gallica. The size of the output .png files varied widely, from 26k (1st book) to a whopping 1.7 MB (2nd book)! As this difference also shows up with online conversion services, it can only be explained by the nature of the PDF. Happily, these oversized files are only of temporary use, because Scan Tailor later outputs standardized and much smaller .tif images.
#4 | Tex2002ans
Quote:
https://github.com/4lex4/scantailor-advanced/releases
It includes all the enhancements from all the different Scan Tailor forks over the years:
https://github.com/4lex4/scantailor-...ed#description
Quote:
Like one of the books I was working on (problem still not solved), which had vertical lines slashed right through the middle (along with an incredibly low-resolution scan).
You could also go PDF->TIFF straight from ImageMagick, but the workflow you're using seems fine. I also prefer outputting to PNGs.
Last edited by Tex2002ans; 09-05-2019 at 02:28 AM.
#5 | roger64
Yes, I tried Scan Tailor Advanced (though that was last year's version...) and I was disappointed: too complex for me, unstable... I am quite happy with the "experimental" version from the Arch repository.
After it, I get quite good results with Tesseract OCR. Some PDFs, though, are beyond repair (but they are the exception): https://gallica.bnf.fr/ark:/12148/bp...811.texteImage
#6 | Tex2002ans
Quote:
But maybe the Linux version is less stable. The Windows version has been solid as a rock for me (and much faster than all the previous ones, since it's heavily multi-threaded).
Quote:
I was able to follow most of the steps here: Removing noise from scanned text document
Step 1
Get the PDF into PNGs:
Code:
convert -density 300 input.pdf output.png
Since that PDF is awful and has enormous whitespace around it, I would suggest trimming:
Code:
convert -density 300 input.pdf -trim output.png
Alternate #1: You could also use the magick.exe command:
Code:
magick.exe -density 300 input.pdf -trim .\output-%d.png
Code:
magick.exe -density 300 input.pdf[0-30] -trim .\output-%d.png
Step 2
Now, I followed much of that forum post above.
Code:
convert output.png -connected-components 4 -threshold 0 -negate output-negate.png
It seems like area-threshold looks for "chunks of pixels that are X pixels or less". I tested with area-threshold=30:
Code:
convert output.png -define connected-components:area-threshold=30 -connected-components 4 -threshold 0 -negate output-cc30.png
Code:
convert output.png -define connected-components:area-threshold=80 -connected-components 4 -threshold 0 -negate output-cc80.png
Step 4
Then I was able to take the images from Step 1 + Step 3 and create a diff:
Code:
convert output.png output-cc80.png -compose minus -composite output-diff.png
Step 5
Then use the images from Step 1 + Step 4 to remove the noise:
Code:
convert output.png ( -clone 0 -fill white -colorize 100% ) output-diff.png -compose over -composite output-diff-composite.png
Finalized
Here are a few more before/after pages out of the book:
I attached a ZIP with Windows .bat files that batch convert the images using these steps. It's a giant mess, and it does create a lot of blank/duplicate images, but it chugs through everything eventually. I already spent hours writing this tutorial up, and don't feel like debugging the rest.
But hopefully that'll get you much cleaner input into Scan Tailor + better OCR.
Last edited by Tex2002ans; 09-05-2019 at 11:03 PM.
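For anyone on Linux, the despeckle steps can be chained per page image. The sketch below restates the same commands for a POSIX shell, where the parentheses in the last command have to be escaped. The function name (denoise_cmds) and the derived file names are assumptions for illustration, and the function prints the commands instead of running them, so they can be checked first.

```shell
#!/bin/sh
# Sketch: print the despeckle commands from the steps above for one page image.
# denoise_cmds and the output file names are hypothetical. The default
# threshold of 80 is the value tested in the post and will likely need tuning.
# Usage: denoise_cmds <page.png> [area_threshold]
# Assumes file names without spaces.
denoise_cmds() {
    page=$1
    area=${2:-80}
    base=${page%.png}
    # Connected-components pass with an area threshold, as in the post.
    printf '%s\n' "convert $page -define connected-components:area-threshold=$area -connected-components 4 -threshold 0 -negate $base-cc.png"
    # Diff between the original page and the filtered image (Step 4).
    printf '%s\n' "convert $page $base-cc.png -compose minus -composite $base-diff.png"
    # Composite white over the original, using the diff as the mask (Step 5).
    # In a POSIX shell the parentheses must be escaped.
    printf '%s\n' "convert $page \\( -clone 0 -fill white -colorize 100% \\) $base-diff.png -compose over -composite $base-clean.png"
}

denoise_cmds output-0.png
```

Once the printed commands look right, they could be piped to sh, or the function could be called in a loop over output-*.png.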
#7 | Tex2002ans
Quote:
Code:
magick.exe -density 300 input.pdf[0-30] -trim -trim .\output-%d.png
Code:
magick.exe -density 300 input.pdf[0-30] -trim -bordercolor white -border 40x40 .\output-%d.png
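Since a page range such as input.pdf[0-30] is accepted, a long book could be converted a few pages at a time to keep memory use down. The sketch below is one way to do that from a POSIX shell. The function name (chunk_cmds), the chunk size, and the use of -scene to keep the %d numbering continuous across runs are assumptions for illustration, not something from the posts above. It only prints the commands.

```shell
#!/bin/sh
# Sketch: print one ImageMagick command per range of PDF pages, so that only
# a few pages are rasterized per run. Names and defaults are hypothetical.
# Usage: chunk_cmds <pdf> <total_pages> <pages_per_chunk>
chunk_cmds() {
    pdf=$1; total=$2; step=$3
    start=0
    while [ "$start" -lt "$total" ]; do
        end=$((start + step - 1))
        # Clamp the last range to the final page (page indexes start at 0).
        if [ "$end" -ge "$total" ]; then
            end=$((total - 1))
        fi
        # The range is quoted because [ ] are glob characters in a POSIX shell.
        # -scene sets the first number used for %d in the output names.
        printf '%s\n' "magick -density 300 '${pdf}[${start}-${end}]' -trim -scene ${start} output-%d.png"
        start=$((end + 1))
    done
}

chunk_cmds input.pdf 70 30
```

The page count has to be supplied by hand here; a smaller chunk size trades speed for lower peak memory.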
#8 | roger64
Congratulations! That's an impressive demo, worthy of inclusion in an ImageMagick manual. I had previously failed on a ten-page extract of this, say... curious PDF. I'll have to take back my words: it can be done. I'm still not keen to convert the whole book (my computer has 8 GB of RAM).
#9 | Tex2002ans
Quote:
Still hideous OCR... but better than what's currently there. It's just a bad, low-quality scan in the first place... No wonder Scan Tailor crashes on you: some of this image manipulation takes up tons of GBs of RAM. :P