[SOLVED] How to unlock this PDF file?

Shohreh · 05-28-2020, 08:46 AM

Hello,

This PDF file is built to prevent users from selecting and copying text. I can zoom in and out without it getting blurry, so it does not seem to be an image but a real (vector) PDF. All it allows is "Select All".

I tried "qpdf.exe --decrypt", to no avail. I don't know if cpdf or mutool can help.

Why is that? Is there a way to remove this restriction?

Thank you.

Shohreh · 05-28-2020, 09:53 AM

Turns out the PDF seems to contain images of high-definition text, which explains why it still looks OK even when zooming in.

So the only solution would be to run it through OCR… which is too much work if I want a clean layout.

willus · 05-29-2020, 07:45 PM

k2pdfopt -mode copy -odpi 200 -ocr t -ocrlang fra -ocrd p protected.pdf

Result attached.

Shohreh · 06-05-2020, 12:51 AM

Thanks very much!

https://willus.com/k2pdfopt/help/options.shtml

-mode copy: source pages are simply copied to the output file, but rendered as bitmaps. No trimming or re-sizing is done.

-odpi 200: Set pixels per inch of output screen.

-ocr t: Attempt to use optical character recognition (OCR) in order to embed searchable text into the output PDF document. If followed by t or g, specifies the ocr engine to use (tesseract or gocr).

-ocrlang <set language>: Select the Tesseract OCR Engine language. […] The default language is whatever is in your Tesseract trained data folder. […] Use -ocrlang ? to see the list of Tesseract language files in your Tesseract data folder.

-ocrd p: Set OCR detection type for k2pdfopt and Tesseract. […] For -ocrd p, k2pdfopt passes the entire output page of text to Tesseract and lets Tesseract parse it for word positions.
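
For anyone who wants to script this, the command willus posted and the option notes above can be wrapped in a few lines of Python. This is only a sketch: the function name `build_ocr_copy_command` and its default values are made up for illustration, and it assumes a `k2pdfopt` binary compiled with OCR support is on the PATH. The flags themselves are the ones from the command in this thread.

```python
def build_ocr_copy_command(pdf_path, dpi=200, lang="fra", binary="k2pdfopt"):
    """Return the argument list for the OCR-in-copy-mode command from this thread.

    The function name and defaults are illustrative, not part of k2pdfopt.
    """
    return [
        binary,
        "-mode", "copy",    # copy pages as bitmaps, no trimming or re-sizing
        "-odpi", str(dpi),  # output resolution in pixels per inch
        "-ocr", "t",        # embed searchable text using the Tesseract engine
        "-ocrlang", lang,   # Tesseract language file, e.g. "fra" for French
        "-ocrd", "p",       # hand whole pages to Tesseract for word positions
        pdf_path,
    ]


# Print the command line instead of running it, so nothing is executed by accident.
print(" ".join(build_ocr_copy_command("protected.pdf")))
```

To actually run the conversion, the list can be passed to `subprocess.run(cmd, check=True)`. Keeping the arguments in a list avoids quoting problems with file names that contain spaces.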

roger64 · 06-06-2020, 11:56 PM

Thanks for this interesting tip.

I am an Arch Linux user.

I have been using Tesseract extensively for over a year. Usually, when I have to deal with a PDF, I batch-convert it to PNG with ImageMagick, then run the pages through Scan Tailor before performing the OCR.

I installed k2pdfopt from the AUR, which compiles it from source. However, something was missing, because when I tried it I got this message:

Code:

[...]
k2pdfopt v2.51 (w/DjVuLibre) (c) 2020, GPLv3, http://willus.com
    Compiled Jun  7 2020 with Gnu C v10.1.0 for Linux on x64.

** No OCR capability in this compile of k2pdfopt! **

I have seen in the comments here that this package has some trouble in this regard (OCR). Using a Windows version would be overkill for me, so I regrettably give up on this attempt for now.
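
The manual workflow described at the top of this post (ImageMagick, then Scan Tailor, then Tesseract) does not depend on k2pdfopt's OCR support, and its two command-line steps could be scripted roughly as follows. This is a sketch under assumptions: the helper names, the 300 dpi density and the `page-%03d.png` naming are illustrative choices, and it presumes ImageMagick's `convert` (with PDF support enabled) and `tesseract` are installed.

```python
from pathlib import Path


def rasterize_command(pdf_path, out_dir, dpi=300):
    """ImageMagick: render every PDF page to a numbered PNG file."""
    pattern = str(Path(out_dir) / "page-%03d.png")
    return ["convert", "-density", str(dpi), str(pdf_path), pattern]


def ocr_command(image_path, lang="fra"):
    """Tesseract: OCR one page image into a searchable PDF next to it."""
    image_path = Path(image_path)
    output_base = str(image_path.with_suffix(""))  # Tesseract appends ".pdf"
    return ["tesseract", str(image_path), output_base, "-l", lang, "pdf"]


# Show the two command lines without running anything.
print(" ".join(rasterize_command("scan.pdf", "pages")))
print(" ".join(ocr_command("pages/page-000.png")))
```

Each list can be handed to `subprocess.run(cmd, check=True)`. The Scan Tailor clean-up would sit between the two steps and is left out here because it is usually done in the GUI.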

willus · 06-07-2020, 10:13 AM

Do the Linux binaries not work on your Linux distro?

roger64 · 06-07-2020, 10:37 AM

Hi

I'll try it. It probably will, if I manage to download it (two failed attempts so far). I see that this version is from the 5th of January 2019.

I have a more recent and improved version of Tesseract installed on my computer (neural engine). Will k2pdfopt make use of it?

Code:

[roger@lenovo ~]$ tesseract -v
tesseract 4.1.1
 leptonica-1.79.0
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.0.4) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
[roger@lenovo ~]$

EDIT: I downloaded the x64 binary. I can launch k2pdfopt and let it record the options (in green), but I fail to point it to "protected.pdf". I'll check again tomorrow.

willus · 06-07-2020, 03:52 PM

k2pdfopt has the Tesseract engine compiled in, so it will use the version it was compiled with, e.g. v4.0.0 for the latest k2pdfopt release. The only support files it needs are the Tesseract language training files.
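
Since the language training files are the only external dependency, it can save time to check that they are where the OCR engine will look. The sketch below assumes the files are located through the `TESSDATA_PREFIX` environment variable (the variable that comes up in the next post); the function name `find_traineddata` is hypothetical. It searches both the folder itself and a `tessdata` subfolder, because the expected layout differs between Tesseract versions.

```python
import os
from pathlib import Path


def find_traineddata(prefix=None):
    """Return the sorted language codes found under a Tesseract data folder.

    `prefix` defaults to the TESSDATA_PREFIX environment variable.
    """
    if prefix is None:
        prefix = os.environ.get("TESSDATA_PREFIX")
    if not prefix:
        return []
    root = Path(prefix)
    languages = set()
    for folder in (root, root / "tessdata"):
        if folder.is_dir():
            # "fra.traineddata" -> "fra"
            languages.update(p.stem for p in folder.glob("*.traineddata"))
    return sorted(languages)


found = find_traineddata()
print("Languages found:", ", ".join(found) or "none")
```

If `fra` is not in the list, `-ocrlang fra` has nothing to load. k2pdfopt's own `-ocrlang ?` option, quoted earlier in the thread, lists the language files it sees.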

roger64 · 06-07-2020, 11:37 PM

@willus

Thanks for your explanations and patience...

So I set up TESSDATA_PREFIX in /etc/environment and resumed testing. I thought I had succeeded, but...

Please look at the attached files: do you have any idea what went wrong? In the file "exemple", you'll find a copy of the terminal commands I used to process Parquin.pdf.

I can search the text in the _k2opt file, but I do not know how to select or extract it. Is this normal?

willus · 06-10-2020, 10:31 PM

You ran OCR correctly with Tesseract, but a couple of things. First, you don't need to do OCR: the original document already has selectable text. Second, both documents you attached allow me to select the text with my PDF viewer (Sumatra PDF running on Windows 10).

Note that there is a bug in how k2pdfopt sizes the selection box for the French accented "a". This will be resolved in the next release, which I hope to get out reasonably soon.
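
Whether a PDF already carries a text layer, as was the case for Parquin.pdf, can be checked before spending time on OCR. One way, not mentioned in this thread and therefore an assumption, is Poppler's `pdftotext`: if it extracts a reasonable amount of text from the first pages, OCR is probably unnecessary. The function names and the 20-character threshold below are arbitrary illustrative choices.

```python
import subprocess


def has_text_layer(extracted_text, min_chars=20):
    """Heuristic: enough non-blank characters means a usable text layer."""
    visible = "".join(extracted_text.split())  # drops spaces, newlines, form feeds
    return len(visible) >= min_chars


def extract_text(pdf_path, first_page=1, last_page=3):
    """Run Poppler's pdftotext on a few pages and return what it extracts."""
    result = subprocess.run(
        ["pdftotext", "-f", str(first_page), "-l", str(last_page), pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def report(pdf_path):
    """Print a one-line verdict for the given PDF (requires pdftotext)."""
    if has_text_layer(extract_text(pdf_path)):
        print(pdf_path, "already has selectable text")
    else:
        print(pdf_path, "has no usable text layer; OCR is needed")
```

Calling `report("Parquin.pdf")` would print the verdict for that file. A viewer that cannot select the text, as reported above, is a separate problem that this check does not detect.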

willus · 06-10-2020, 11:00 PM

Just as an example of some more involved processing, I've attached a conversion with the command below. I ended up running OCR on it because the placement of the original OCR layer was not very good. I included the marked-up version to show how k2pdfopt is parsing the document.

k2pdfopt -cbox1-4,6-7 1.246in,1.428in,11.62in,14.23in -cbox5 1.608in,1.372in,9.792in,16.11in -as -rt 0 -g .2 -col 2 -cgr .6 -ch 2.5 -jfc- -odpi 110 -dev k2 -ocr t -ocrlang fra parquin.pdf -sm
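
Long command lines like this one are easier to maintain when the per-page crop boxes are kept apart from the other options. The sketch below rebuilds exactly the command above from a small table; the helper names are hypothetical, and every value is copied verbatim from the post rather than derived. Of the flags used, only `-as`, `-g`, `-odpi`, `-ocr` and `-ocrlang` are explained elsewhere in this thread; for the rest, the options page linked earlier is the reference.

```python
def cbox_args(boxes):
    """Build -cbox arguments from a {page_list: four_values} mapping.

    Each key is a page list such as "1-4,6-7"; each value holds the four
    measurements exactly as they appear in the post.
    """
    args = []
    for page_list, values in boxes.items():
        args.append("-cbox" + page_list)
        args.append(",".join(values))
    return args


def build_parquin_command(pdf_path="parquin.pdf"):
    """Reassemble the command from the post above, piece by piece."""
    crop = cbox_args({
        "1-4,6-7": ("1.246in", "1.428in", "11.62in", "14.23in"),
        "5": ("1.608in", "1.372in", "9.792in", "16.11in"),
    })
    layout = ["-as", "-rt", "0", "-g", ".2", "-col", "2", "-cgr", ".6",
              "-ch", "2.5", "-jfc-", "-odpi", "110", "-dev", "k2"]
    ocr = ["-ocr", "t", "-ocrlang", "fra"]
    return ["k2pdfopt"] + crop + layout + ocr + [pdf_path, "-sm"]


print(" ".join(build_parquin_command()))
```

Adding a crop box for another page range then means adding one entry to the mapping instead of editing a long string.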

roger64 · 06-11-2020, 10:43 AM

@willus

Thanks for your reply. I still have to learn how to use k2pdfopt properly and shall study your example.

I shall look for a better viewer on Linux... Sumatra works well with Wine.

As far as Tesseract is concerned, I get consistently better OCR results when the file is first processed with Scan Tailor (which does not work with PDF). Tesseract is a small piece of software (about 1/30 the size of ABBYY FineReader) which needs to be complemented with pre- and post-processing to optimize its results.

Pre-processing: I have noticed, for example, that straightening the pages, selecting black-and-white mode, and darkening a little with Scan Tailor very often improves the result (of course, it depends on the quality of the scan).

Post-processing: many "obvious" mistakes can be corrected, for example when only one letter is missing, but Tesseract does not do post-analysis. True, this also opens the door to some false positives.
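
The "one missing letter" correction mentioned above can be illustrated with a small dictionary lookup. This is a minimal sketch, not a description of any existing tool: the function name is made up, and the word list in the example is a tiny stand-in for a real French dictionary. To limit the false positives mentioned above, a word is only changed when exactly one dictionary word matches.

```python
def restore_missing_letter(word, vocabulary):
    """Return the unique vocabulary word obtained by inserting one letter.

    If `word` is already known, or if zero or several candidates match,
    the word is returned unchanged.
    """
    if word in vocabulary:
        return word
    candidates = set()
    for known in vocabulary:
        if len(known) != len(word) + 1:
            continue
        # Deleting one character from `known` must give back `word`.
        for i in range(len(known)):
            if known[:i] + known[i + 1:] == word:
                candidates.add(known)
                break
    return candidates.pop() if len(candidates) == 1 else word


# Tiny illustrative word list; a real run would load a full dictionary.
words = {"bataille", "cheval", "chevaux", "soldat", "port", "pont"}
print(restore_missing_letter("batalle", words))  # one candidate: corrected
print(restore_missing_letter("pot", words))      # two candidates: left alone
```

The ambiguous case ("pot" could be "port" or "pont") is left untouched on purpose; resolving it would need context, which is beyond a per-word check.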

willus · 06-12-2020, 06:23 AM

Just so you know, you can do all of those pre-processing steps directly in k2pdfopt. The -cmax option adjusts contrast, the -as option will auto-straighten / de-skew, the -g option will adjust gamma factor, which can be used to darken the text, and the -bpc option selects bits-per-color. You can set this to 2 for black and white.
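
As a rough illustration, the four options named above can be combined into one call. This is an untested sketch: the function name and parameters are hypothetical, no default values are assumed for any option, and only the options explicitly requested are emitted so that k2pdfopt's own defaults apply to the rest.

```python
def build_preprocess_command(pdf_path, straighten=False, gamma=None,
                             contrast_max=None, bits_per_color=None,
                             binary="k2pdfopt"):
    """Assemble a k2pdfopt call from the pre-processing options named above."""
    cmd = [binary]
    if straighten:
        cmd.append("-as")                     # auto-straighten / de-skew
    if gamma is not None:
        cmd += ["-g", str(gamma)]             # gamma factor, can darken text
    if contrast_max is not None:
        cmd += ["-cmax", str(contrast_max)]   # contrast adjustment
    if bits_per_color is not None:
        cmd += ["-bpc", str(bits_per_color)]  # bits per color
    cmd.append(pdf_path)
    return cmd


# Gamma .2 is the value used in the earlier example; 2 bits per color as suggested above.
print(" ".join(build_preprocess_command("scan.pdf", straighten=True,
                                        gamma=0.2, bits_per_color=2)))
```

Suitable values for `-cmax` and `-g` depend on the scan, so they are left as parameters rather than guessed.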

roger64 · 06-12-2020, 10:19 AM

That's quite impressive and useful, because many "old" PDFs need some sort of pre-processing if we expect to get a suitable result with Tesseract.

My study of k2pdfopt will probably take a bit longer, but it really seems to be worth it.

willus · 06-12-2020, 03:18 PM

I released a new version today. I recommend it especially for French OCR.
