Old 05-28-2020, 08:46 AM   #1
Shohreh
Addict
Posts: 222
Karma: 304158
Join Date: Jan 2016
Location: France
Device: none
[SOLVED] How to unlock this PDF file?

Hello,

This PDF file is built to prevent users from selecting and copying text. I can zoom in and out without it getting blurry, so it doesn't look like an image but like a real (vector) PDF. The only thing it allows is "Select All".

I tried "qpdf.exe --decrypt", to no avail. I don't know if cpdf or mutool can help.

Why is that? Is there a way to remove this restriction?

Thank you.

Last edited by Shohreh; 05-28-2020 at 09:53 AM.
Old 05-28-2020, 09:53 AM   #2
Shohreh
Addict
Turns out the PDF seems to contain images of high-definition text, which explains why it still looks OK even when zooming in.

So the only solution would be to run it through OCR… which is too much work if I want a clean layout.
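
A quick way to test the "images of text" guess is to list the fonts embedded in the file. This is only a sketch: it assumes the `pdffonts` tool from Poppler is installed (it is not mentioned anywhere in this thread), and `count_fonts` and `classify_pdf` are hypothetical helper names. An empty font list usually means the pages are bitmaps, but a file that `pdffonts` cannot read at all would also end up reported as "no fonts".

```shell
#!/bin/sh
# Count the fonts reported by `pdffonts`. Its output starts with two header
# lines (column names and a dashed rule); every line after that is one font.
count_fonts() {
    awk 'NR > 2 && NF > 0 { n++ } END { print n + 0 }'
}

# Classify a PDF: no fonts at all usually means the pages are images.
classify_pdf() {
    fonts=$(pdffonts "$1" 2>/dev/null | count_fonts)
    if [ "$fonts" -eq 0 ]; then
        echo "no fonts: pages are probably images, OCR needed"
    else
        echo "$fonts font(s): text layer present"
    fi
}

# Only run when the tool and the file are actually available.
# "protected.pdf" is the file name used later in this thread.
if command -v pdffonts >/dev/null 2>&1 && [ -f protected.pdf ]; then
    classify_pdf protected.pdf
fi
```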
Old 05-29-2020, 07:45 PM   #3
willus
Fuzzball, the purple cat
Posts: 1,312
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
k2pdfopt -mode copy -odpi 200 -ocr t -ocrlang fra -ocrd p protected.pdf

Result attached.
Attached Files
File Type: pdf protected_k2opt.pdf (3.51 MB, 509 views)
Old 06-05-2020, 12:51 AM   #4
Shohreh
Addict
Thanks very much!

https://willus.com/k2pdfopt/help/options.shtml

-mode copy: source pages are simply copied to the output file, but rendered as bitmaps. No trimming or re-sizing is done.

-odpi 200: Set pixels per inch of output screen.

-ocr t: Attempt to use optical character recognition (OCR) in order to embed searchable text into the output PDF document. If followed by t or g, specifies the ocr engine to use (tesseract or gocr).

-ocrlang <set language>: Select the Tesseract OCR Engine language. […] The default language is whatever is in your Tesseract trained data folder. […] Use -ocrlang ? to see the list of Tesseract language files in your Tesseract data folder.

-ocrd p: Set OCR detection type for k2pdfopt and Tesseract. […] For -ocrd p, k2pdfopt passes the entire output page of text to Tesseract and lets Tesseract parse it for word positions.
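
The options above can be put back together as a small dry-run helper. This is only a sketch: `ocr_copy_cmd` is a hypothetical function name, the flags are exactly the ones from willus's command, and the function prints the command instead of running it.

```shell
#!/bin/sh
# Print (dry run) the k2pdfopt OCR command from this thread for a given
# input file and Tesseract language code.
ocr_copy_cmd() {
    input=$1
    lang=${2:-fra}   # fra = French, as used in the thread
    # -mode copy : copy pages as bitmaps, no trimming or resizing
    # -odpi 200  : output resolution in pixels per inch
    # -ocr t     : OCR with the Tesseract engine
    # -ocrlang   : Tesseract language file to use
    # -ocrd p    : hand whole pages to Tesseract for word positions
    echo k2pdfopt -mode copy -odpi 200 -ocr t -ocrlang "$lang" -ocrd p "$input"
}

ocr_copy_cmd protected.pdf
```

To actually run the conversion, remove the `echo`. File names containing spaces need to stay quoted when the printed command is pasted into a terminal.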
Old 06-06-2020, 11:56 PM   #5
roger64
Wizard
Posts: 2,625
Karma: 3120635
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
Thanks for this interesting tip.

I am an Archlinux user.

I have been using Tesseract extensively for over a year. Usually, when I have to deal with a PDF, I batch-convert it to PNG using ImageMagick, then run the pages through Scan Tailor before performing the OCR.

I installed k2pdfopt from the AUR by compiling it. However, something was missing, because when I tried it I got this message:

Code:
[...]
k2pdfopt v2.51 (w/DjVuLibre) (c) 2020, GPLv3, http://willus.com
    Compiled Jun  7 2020 with Gnu C v10.1.0 for Linux on x64.

** No OCR capability in this compile of k2pdfopt! **
I have seen in the comments that this package has some trouble in this regard (OCR). Using a Windows version would be overkill for me, so I regretfully give up on this attempt for now.

Last edited by roger64; 06-07-2020 at 02:56 AM. Reason: regrets
Old 06-07-2020, 10:13 AM   #6
willus
Fuzzball, the purple cat
Do the linux binaries not work on your Linux distro?
Old 06-07-2020, 10:37 AM   #7
roger64
Wizard
Hi

I'll try it. It probably will work, if I manage to download it (two failed attempts so far). I see that this version is from the 5th of January 2019.

I have a more recent and improved version of Tesseract installed on my computer (neural engine). Will k2pdfopt make use of it?

Code:
[roger@lenovo ~]$ tesseract -v
tesseract 4.1.1
 leptonica-1.79.0
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.0.4) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
[roger@lenovo ~]$
EDIT: I downloaded the x64 binary. I can launch k2pdfopt and let it record the options (in green), but I fail to point it to the "protected.pdf" folder. I'll check again tomorrow.

Last edited by roger64; 06-07-2020 at 11:28 AM.
Old 06-07-2020, 03:52 PM   #8
willus
Fuzzball, the purple cat
k2pdfopt has the tesseract engine compiled in, so it will use what it was compiled with, e.g. v4.0.0 for the latest version. The only support files it needs are the tesseract language training files.
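
Since the language training files are the only external piece, it can help to confirm they are where the OCR engine will look. The sketch below assumes the folder is given by the `TESSDATA_PREFIX` environment variable (the variable Tesseract itself reads). Checking both that folder and a `tessdata` subfolder is my assumption, and `find_traineddata` is a hypothetical helper name.

```shell
#!/bin/sh
# Report whether a Tesseract language file can be found under a directory.
# Checking both DIR and DIR/tessdata is an assumption made for this sketch.
find_traineddata() {
    dir=$1
    lang=$2
    for candidate in "$dir/$lang.traineddata" "$dir/tessdata/$lang.traineddata"; do
        if [ -f "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Example: look for the French data wherever TESSDATA_PREFIX points.
if [ -n "${TESSDATA_PREFIX:-}" ]; then
    find_traineddata "$TESSDATA_PREFIX" fra || echo "fra.traineddata not found" >&2
fi
```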
Old 06-07-2020, 11:37 PM   #9
roger64
Wizard
@willus

Thanks for your explanations and patience...

So I set up TESSDATA_PREFIX in /etc/environment and resumed testing. I thought I had succeeded, but...

Please look at the attached files: do you have any idea what went wrong? In the file "exemple", you'll find a copy of the terminal commands I used to process Parquin.pdf.

I can search the text in the _k2opt file, but I do not know how to select or extract it. Is this normal?
Attached Files
File Type: pdf Parquin.pdf (1.30 MB, 474 views)
File Type: pdf Parquin_k2opt.pdf (13.78 MB, 432 views)
File Type: pdf exemple.pdf (37.4 KB, 405 views)

Last edited by roger64; 06-07-2020 at 11:53 PM.
Old 06-10-2020, 10:31 PM   #10
willus
Fuzzball, the purple cat
You ran OCR correctly with Tesseract, but a couple of things. First, you don't need to do OCR: the original document already has selectable text. Second, both documents you attached allow me to select the text with my PDF viewer (Sumatra PDF running on Windows 10).

Note that there's a bug in how k2pdfopt sizes the selection box for the French accented "a". This will be resolved in the next release, which I hope to get out reasonably soon.
Old 06-10-2020, 11:00 PM   #11
willus
Fuzzball, the purple cat
Just as an example of some more involved processing, I've attached a conversion with the command below. I ended up running OCR on it because the placement of the original OCR layer was not very good. I included the marked-up version to show how k2pdfopt is parsing the document.

k2pdfopt -cbox1-4,6-7 1.246in,1.428in,11.62in,14.23in -cbox5 1.608in,1.372in,9.792in,16.11in -as -rt 0 -g .2 -col 2 -cgr .6 -ch 2.5 -jfc- -odpi 110 -dev k2 -ocr t -ocrlang fra parquin.pdf -sm
Attached Files
File Type: pdf parquin_k2opt.pdf (1.88 MB, 390 views)
File Type: pdf parquin_marked.pdf (4.86 MB, 436 views)
Old 06-11-2020, 10:43 AM   #12
roger64
Wizard
@willus

Thanks for your reply. I still have to learn how to use k2pdfopt properly and shall study your example.

I shall look for a better viewer on Linux... Sumatra works well with Wine.

As far as Tesseract is concerned, I get consistently better OCR results when the file is first processed with Scan Tailor (which does not work with PDF). Tesseract is a small piece of software (about 1/30 the size of ABBYY FineReader) which needs to be complemented with pre- and post-processing to optimize its results.

Pre-processing: I noticed, for example, that straightening the pages, selecting black-and-white mode and darkening a little with Scan Tailor very often improves the result (of course, it depends on the quality of the scan).

Post-processing: many "obvious" mistakes can be corrected, for example when only one letter is missing, but Tesseract does not do post-analysis. True, this also opens the door to some false positives.
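
The workflow described above (PDF to PNG, clean-up in Scan Tailor, then Tesseract) might look roughly like this. It is a sketch under stated assumptions: the 300 dpi density, the `page-NNN.png` naming and the `ocr_page_cmd` helper are mine, not roger64's actual settings. The ImageMagick and Scan Tailor steps are only shown as comments, and the Tesseract commands are printed rather than run.

```shell
#!/bin/sh
# Print the Tesseract command for one page image.
# Output base name = image name without ".png"; the trailing "pdf" asks
# Tesseract for a searchable PDF (it writes <base>.pdf).
ocr_page_cmd() {
    img=$1
    lang=${2:-fra}
    echo tesseract "$img" "${img%.png}" -l "$lang" pdf
}

# Step 1 (not run here): rasterise the PDF, for example with ImageMagick:
#   convert -density 300 input.pdf page-%03d.png
# Step 2 (manual): clean the PNGs in Scan Tailor (deskew, black and white).
# Step 3: OCR each cleaned page.
for img in page-*.png; do
    [ -f "$img" ] || continue   # skip the unexpanded pattern when no files exist
    ocr_page_cmd "$img"
done
```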

Last edited by roger64; 06-11-2020 at 08:08 PM. Reason: optimize
Old 06-12-2020, 06:23 AM   #13
willus
Fuzzball, the purple cat
Just so you know, you can do all of those pre-processing steps directly in k2pdfopt. The -cmax option adjusts contrast, the -as option will auto-straighten / de-skew, the -g option will adjust gamma factor, which can be used to darken the text, and the -bpc option selects bits-per-color. You can set this to 2 for black and white.
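
As a rough illustration, those clean-up options could be folded into a single OCR run like this. It is a sketch, not a tuned recipe: `preprocess_ocr_cmd` is a hypothetical helper, the `-g .2` value is simply borrowed from the earlier parquin.pdf command, `-cmax` is left out because no value for it is given in this thread, and whether these flags combine well with `-mode copy` is untested here. The function prints the command rather than running it.

```shell
#!/bin/sh
# Print a k2pdfopt command that folds the clean-up steps into the OCR run.
# Flag values are taken from this thread, not tuned for any document.
preprocess_ocr_cmd() {
    input=$1
    # -as     : auto-straighten (deskew)
    # -g .2   : gamma; value borrowed from the earlier command in this thread
    # -bpc 2  : bits per color, the value suggested above
    # -cmax is left out because the thread gives no value for it
    echo k2pdfopt -mode copy -as -g .2 -bpc 2 -ocr t -ocrlang fra "$input"
}

preprocess_ocr_cmd parquin.pdf
```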
Old 06-12-2020, 10:19 AM   #14
roger64
Wizard
That's quite impressive and useful, because many "old" PDFs need some sort of pre-processing if we expect to get a suitable result with Tesseract.

My study of k2pdfopt will probably be a bit longer, but that really seems to be worth it.

Old 06-12-2020, 03:18 PM   #15
willus
Fuzzball, the purple cat
I released a new version today. I recommend it especially for French OCR.
All times are GMT -4. The time now is 09:10 AM.