#1
Addict
Posts: 205
Karma: 304158
Join Date: Jan 2016
Location: France
Device: none
Hello,
This PDF file is built to prevent users from selecting and copying text. I can zoom in and out without the text degrading, so it does not look like an image but like a real (vector) PDF. All it allows is "Select All". I tried "qpdf.exe --decrypt", to no avail. I don't know if cpdf or mutool can help. Why is that? Is there a way to remove this restriction? Thank you.
Last edited by Shohreh; 05-28-2020 at 09:53 AM.
#2
Addict
Turns out the PDF seems to contain images of high-definition text, which explains why it still looks OK even when zooming in.
So the only solution would be to run it through OCR… which is too much work to get a clean layout.
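One way to confirm that a PDF holds page images rather than real text is to list its embedded fonts: a scanned document normally reports none. The sketch below is not from the thread. It assumes Poppler's `pdffonts` tool is installed and that its output starts with two header lines followed by one row per font; the file name is passed on the command line.

```python
import subprocess
import sys


def count_fonts(pdffonts_output: str) -> int:
    """Count font rows in pdffonts output (two header lines, then one row per font)."""
    lines = [line for line in pdffonts_output.splitlines() if line.strip()]
    return max(0, len(lines) - 2)


def looks_image_only(pdf_path: str) -> bool:
    """Return True when pdffonts reports no fonts (assumes Poppler is installed)."""
    result = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, check=True
    )
    return count_fonts(result.stdout) == 0


if __name__ == "__main__":
    if len(sys.argv) > 1:
        print(looks_image_only(sys.argv[1]))
```

A PDF that already carries an OCR text layer will list fonts, so an empty list is only a hint that the pages are plain bitmaps.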
#3
Fuzzball, the purple cat
Posts: 1,296
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Result attached.
#4
Addict
Thanks very much!
https://willus.com/k2pdfopt/help/options.shtml
-mode copy: source pages are simply copied to the output file, but rendered as bitmaps. No trimming or re-sizing is done.
-odpi 200: Set pixels per inch of output screen.
-ocr t: Attempt to use optical character recognition (OCR) in order to embed searchable text into the output PDF document. If followed by t or g, specifies the ocr engine to use (tesseract or gocr).
-ocrlang <set language>: Select the Tesseract OCR Engine language. […] The default language is whatever is in your Tesseract trained data folder. […] Use -ocrlang ? to see the list of Tesseract language files in your Tesseract data folder.
-ocrd p: Set OCR detection type for k2pdfopt and Tesseract. […] For -ocrd p, k2pdfopt passes the entire output page of text to Tesseract and lets Tesseract parse it for word positions.
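The exact command behind the attached result is not preserved in this thread, so the following is only a sketch that assembles the options quoted above into a single k2pdfopt call. The function name `build_ocr_command`, the file name `input.pdf` and the default language `fra` (borrowed from the later `-ocrlang fra` example) are assumptions for illustration.

```python
import shlex


def build_ocr_command(pdf_path: str, lang: str = "fra", dpi: int = 200) -> list:
    """Assemble a k2pdfopt argument list from the options quoted in the post."""
    return [
        "k2pdfopt",
        "-mode", "copy",    # copy pages as bitmaps, no trimming or re-sizing
        "-odpi", str(dpi),  # output resolution in pixels per inch
        "-ocr", "t",        # embed searchable text using the Tesseract engine
        "-ocrlang", lang,   # Tesseract language file to use
        "-ocrd", "p",       # let Tesseract parse the whole page for word positions
        pdf_path,
    ]


if __name__ == "__main__":
    # Print the command instead of running it; "input.pdf" is a placeholder.
    print(shlex.join(build_ocr_command("input.pdf")))
```

Keeping the arguments in a list makes it easy to hand them to `subprocess.run` later without worrying about spaces in file names.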
#5
Wizard
Posts: 2,623
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
Thanks for this interesting tip.
I am an Arch Linux user. I have been using Tesseract extensively for over a year. Usually, when I have to deal with a PDF, I batch-convert it to PNG using ImageMagick, then run the pages through Scan Tailor before performing the OCR.
I installed k2pdfopt from the AUR by compiling it. However, something was missing, because when I tried it I got this message:
Code:
[...]
k2pdfopt v2.51 (w/DjVuLibre) (c) 2020, GPLv3, http://willus.com
Compiled Jun 7 2020 with Gnu C v10.1.0 for Linux on x64.
** No OCR capability in this compile of k2pdfopt! **
Last edited by roger64; 06-07-2020 at 02:56 AM. Reason: regrets
#6
Fuzzball, the purple cat
Quote:
#7
Wizard
Hi
I'll try it. It probably will, if I manage to download it (two failed attempts so far).
I see that this version is from the 5th of January 2019. I have a more recent and improved version of Tesseract installed on my computer (neural engine). Will k2pdfopt make use of it?
Code:
[roger@lenovo ~]$ tesseract -v
tesseract 4.1.1
 leptonica-1.79.0
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.0.4) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
[roger@lenovo ~]$
Last edited by roger64; 06-07-2020 at 11:28 AM.
#8
Fuzzball, the purple cat
Quote:
#9
Wizard
@willus
Thanks for your explanations and patience... So I set up TESSDATA_PREFIX in /etc/environment and resumed testing. I thought I had succeeded, but... Please look at the attached files: do you have any idea what went wrong? In the file "exemple", you'll find a copy of the terminal commands I used to process Parquin.pdf. I can search the text in the _k2opt file, but I do not know how to select or extract it. Is this normal?
Last edited by roger64; 06-07-2020 at 11:53 PM.
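Before re-running the OCR, it can help to check that TESSDATA_PREFIX really leads to the language file. The sketch below is not from the thread; `find_traineddata` is a hypothetical helper. It relies only on Tesseract's `<lang>.traineddata` naming and, because the expected location differs between builds, it looks both in the prefix folder itself and in a `tessdata` subfolder.

```python
import os
from pathlib import Path


def find_traineddata(lang, env=None):
    """Return the path of <lang>.traineddata under TESSDATA_PREFIX, or None."""
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        return None
    # Depending on the Tesseract build, the prefix is either the tessdata
    # folder itself or its parent, so both locations are checked.
    for folder in (Path(prefix), Path(prefix) / "tessdata"):
        candidate = folder / f"{lang}.traineddata"
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # Prints the path of the French language file, or None if it is not found.
    print(find_traineddata("fra"))
```

Variables set in /etc/environment are usually picked up only at the next login, so a `None` result right after editing that file does not necessarily mean the path is wrong.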
#10
Fuzzball, the purple cat
Quote:
Note that there's a bug in k2pdfopt for how it does the selection sizes of the French accented "a". This will be resolved in the next release, which I hope to get out reasonably soon.
#11
Fuzzball, the purple cat
Just as an example of some more involved processing, I've attached a conversion with the command below. I ended up running OCR on it because the placement of the original OCR layer was not very good. I included the marked-up version to show how k2pdfopt is parsing the document.
k2pdfopt -cbox1-4,6-7 1.246in,1.428in,11.62in,14.23in -cbox5 1.608in,1.372in,9.792in,16.11in -as -rt 0 -g .2 -col 2 -cgr .6 -ch 2.5 -jfc- -odpi 110 -dev k2 -ocr t -ocrlang fra parquin.pdf -sm
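To make the two crop boxes in that command easier to read, the sketch below (not part of the post) splits the command line and lists each `-cbox` with its values. It assumes that the suffix after `-cbox` is a page list and that the four comma-separated values are left, top, width and height in inches; that reading should be checked against the k2pdfopt options page. The helper names are hypothetical.

```python
import shlex

COMMAND = (
    "k2pdfopt -cbox1-4,6-7 1.246in,1.428in,11.62in,14.23in "
    "-cbox5 1.608in,1.372in,9.792in,16.11in -as -rt 0 -g .2 -col 2 "
    "-cgr .6 -ch 2.5 -jfc- -odpi 110 -dev k2 -ocr t -ocrlang fra "
    "parquin.pdf -sm"
)


def strip_inches(value):
    """Convert a value such as '1.246in' to the float 1.246."""
    return float(value[:-2] if value.endswith("in") else value)


def crop_boxes(command):
    """Map each -cbox page list to its four values (assumed left, top, width, height)."""
    tokens = shlex.split(command)
    boxes = {}
    # Pair every token with the one that follows it, so each -cbox flag
    # is seen together with its comma-separated argument.
    for flag, arg in zip(tokens, tokens[1:]):
        if flag.startswith("-cbox"):
            pages = flag[len("-cbox"):] or "all"
            boxes[pages] = tuple(strip_inches(v) for v in arg.split(","))
    return boxes


if __name__ == "__main__":
    for pages, box in crop_boxes(COMMAND).items():
        print(pages, box)
```

Read this way, page 5 gets its own box while pages 1-4 and 6-7 share one.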
#12
Wizard
@willus
Thanks for your reply. I still have to learn how to use k2pdfopt properly and shall study your example.
I shall look for a better viewer on Linux... Sumatra works well with Wine.
As far as Tesseract is concerned, I get consistently better OCR results when the file is first processed with Scan Tailor (which does not work with PDF). Tesseract is a small piece of software (about 1/30 the size of ABBYY FineReader) which needs to be complemented with pre- and post-processing to optimize its results.
pre-processing: I noticed, for example, that straightening the pages, selecting black-and-white mode and darkening a little with Scan Tailor very often improves the result (of course, it depends on the quality of the scan).
post-processing: many "obvious" mistakes could be corrected, for example when only one letter is missing, but Tesseract does not do post-analysis. True, this would also open the door to some false positives.
Last edited by roger64; 06-11-2020 at 08:08 PM. Reason: optimize
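The post-processing idea, fixing words that are one letter away from a known word, can be sketched with Python's standard `difflib` module. This is an illustration only, not something Tesseract or k2pdfopt does; the function name, the tiny word list and the 0.85 similarity cutoff are assumptions chosen for the example.

```python
import difflib


def correct_words(words, vocabulary, cutoff=0.85):
    """Replace words missing from the vocabulary with their closest known match."""
    known = set(vocabulary)
    corrected = []
    for word in words:
        if word in known:
            corrected.append(word)
            continue
        # Ask for the single best candidate above the similarity cutoff.
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return corrected


if __name__ == "__main__":
    vocabulary = ["bonjour", "monde", "document"]  # stand-in for a real word list
    print(correct_words(["bonjur", "monde", "xyz"], vocabulary))
```

The false positives mentioned above show up here too: a valid word that happens to be absent from the list, such as "mode", could be rewritten to a near neighbour like "monde", so the cutoff and the size of the word list both matter.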
#13
Fuzzball, the purple cat
Quote:
#14
Wizard
That's quite impressive and useful, because many "old" PDFs need some sort of pre-processing if we expect to get a suitable result with Tesseract.
My study of k2pdfopt will probably take a bit longer, but it really seems to be worth it.
#15
Fuzzball, the purple cat
Quote: