#1
Member
Posts: 20
Karma: 1000010
Join Date: Jul 2015
Device: Kindle Paperwhite v1
Some time ago I moved completely to Linux (Ubuntu-based Pop!_OS). Now I have more than 500 scanned pages/images of a book which I'm going to convert to an EPUB. Unfortunately, ABBYY has ceased all support for Linux, and FineReader is Windows-compatible only.
So, any tips or advice on alternative Linux-compatible OCR software that you would recommend for the task? Thank you!
Last edited by simurq; 07-04-2021 at 04:20 PM.

#2
Fanatic
Posts: 516
Karma: 2268308
Join Date: Nov 2015
Device: none
Try installing FineReader in a virtual machine or with Wine. The native solutions are useless for more than a couple of pages.

#3
cosiñeiro
Posts: 1,406
Karma: 2451781
Join Date: Apr 2014
Device: BQ Cervantes 4
Tesseract is the only real alternative that runs natively on Linux. Try it for yourself and see if it works for you. It is a command-line utility, but there are some GUI frontends, like gImageReader.
Since you're on Ubuntu, I think you can install gImageReader from a PPA. FineReader might be anywhere from better to waaaaay better, depending on the source document and the language used. But if you're using English or another widely used language based on the Latin alphabet, you won't lose anything by trying.
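For a job of 500 pages, the command-line route can also be scripted. The following is only a minimal sketch, not a tested workflow: it assumes the `tesseract` binary and the required language data are installed and that the pages are PNG files. The folder name `scans` and the function names are placeholders.

```python
import subprocess
from pathlib import Path


def tesseract_command(image: Path, lang: str = "eng") -> list[str]:
    """Build the command line that OCRs one page image to plain text.

    Tesseract appends ".txt" to the output base itself, so page_001.png
    ends up as page_001.txt next to the image.
    """
    output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_pages(folder: Path, lang: str = "eng", pattern: str = "*.png") -> list[Path]:
    """Run Tesseract on every matching image in folder, in sorted order."""
    text_files = []
    for image in sorted(folder.glob(pattern)):
        # check=True stops the batch as soon as one page fails.
        subprocess.run(tesseract_command(image, lang), check=True)
        text_files.append(image.with_suffix(".txt"))
    return text_files
```

Calling `ocr_pages(Path("scans"), lang="eng")` would leave one `.txt` file beside each image. Proofreading in gImageReader or a text editor still comes afterwards.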

#4
Wizard
Posts: 2,624
Karma: 3120635
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
Quote:
As far as image formats are concerned, if I had not been successfully using Tesseract for 18 months on full fiction scans, I could believe you. Tesseract works well now. I use Tesseract 4.1.1 on Arch Linux with gImageReader, for French and English.
Last edited by roger64; 07-04-2021 at 07:19 PM.

#5
Fanatic
Posts: 516
Karma: 2268308
Join Date: Nov 2015
Device: none

#6
Veteran Linux user
Posts: 150
Karma: 1000000
Join Date: Mar 2017
Location: Barcelona/Spain
Device: Boyue Likebook Note & Mimas, Hisense A5, hopefully soon a PineNote
I'd suggest installing OCRmyPDF on Ubuntu, Debian, etc. I have been happily using it for years to OCR my scanned books and can only recommend it. It relies on Tesseract as the OCR backend and produces excellent PDF documents from either scanned images or existing PDF files.
ocrmypdf.readthedocs.io/en/latest/index.html
EDIT: I wrote about it before here: mobileread.com/forums/showthread.php?t=294101
Last edited by orebmur; 07-05-2021 at 05:36 PM.
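As a rough illustration of the scanned-images case, here is a sketch that first bundles the page images into one PDF with `img2pdf` and then lets `ocrmypdf` add the text layer. It assumes both command-line tools are installed and that the pages are PNG files. The file name `combined.pdf` and the function names are made up for the example.

```python
import subprocess
from pathlib import Path


def img2pdf_command(images: list[Path], combined_pdf: Path) -> list[str]:
    """Wrap the page images into one image-only PDF."""
    return ["img2pdf", *[str(p) for p in images], "-o", str(combined_pdf)]


def ocrmypdf_command(source_pdf: Path, searchable_pdf: Path, lang: str = "eng") -> list[str]:
    """Add a Tesseract text layer; --deskew straightens tilted scans."""
    return ["ocrmypdf", "-l", lang, "--deskew", str(source_pdf), str(searchable_pdf)]


def images_to_searchable_pdf(folder: Path, searchable_pdf: Path, lang: str = "eng") -> None:
    """Turn a folder of PNG pages into a single searchable PDF."""
    images = sorted(folder.glob("*.png"))
    combined = folder / "combined.pdf"  # hypothetical intermediate file
    subprocess.run(img2pdf_command(images, combined), check=True)
    subprocess.run(ocrmypdf_command(combined, searchable_pdf, lang), check=True)
```

For an input that is already a PDF, only the second command is needed. The language code is whatever Tesseract data set is installed, for example `fra` for French.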

#7
Wizard
Posts: 2,624
Karma: 3120635
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
Quote:
In French, it recognizes circumflex accents, diaereses such as ü, and all the other French accents. I have not tried it for languages other than French and English. I know that you can use it for many languages if you install the corresponding data set.
It does not do italics (it did, and may do so again).
It does not strip headers and footers.
In my experience, these two points are only a minor drawback for most fiction books. During the checking phase, after the OCR process, you need a kind of "breathing time" before going on to the next page anyway.
Tesseract is not perfect, but it is perfectly usable now for most fiction books.
Last edited by roger64; 07-06-2021 at 04:03 AM. Reason: set
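Since the headers and footers stay in the text output, one possible way to deal with them afterwards is a small clean-up pass over the per-page text. The sketch below is a heuristic of my own for illustration, not something Tesseract or gImageReader provides: it drops the first and last non-blank line of a page when the same line (ignoring digits, so page numbers match) recurs on at least half of the pages. All names and the 0.5 threshold are assumptions.

```python
import re
from collections import Counter
from typing import Optional


def normalize(line: str) -> str:
    """Replace digit runs with '#' so changing page numbers still match."""
    return re.sub(r"\d+", "#", " ".join(line.split())).lower()


def edge_index(lines: list[str], from_end: bool) -> Optional[int]:
    """Index of the first (or last) non-blank line, or None if all are blank."""
    order = range(len(lines) - 1, -1, -1) if from_end else range(len(lines))
    return next((i for i in order if lines[i].strip()), None)


def strip_running_lines(pages: list[str], min_share: float = 0.5) -> list[str]:
    """Remove top and bottom lines that recur on at least min_share of pages."""
    cleaned = [page.splitlines() for page in pages]
    threshold = max(2, min_share * len(pages))
    for from_end in (False, True):  # first pass: headers, second pass: footers
        counts = Counter()
        for lines in cleaned:
            i = edge_index(lines, from_end)
            if i is not None:
                counts[normalize(lines[i])] += 1
        trimmed = []
        for lines in cleaned:
            i = edge_index(lines, from_end)
            if i is not None and counts[normalize(lines[i])] >= threshold:
                lines = lines[:i] + lines[i + 1:]
            trimmed.append(lines)
        cleaned = trimmed
    return ["\n".join(lines).strip("\n") for lines in cleaned]
```

A heuristic like this can misfire, for instance when many pages happen to end on a bare number, so the result still needs a look during proofreading.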

#8
Fanatic
Posts: 516
Karma: 2268308
Join Date: Nov 2015
Device: none
An average book contains about 1500 italic fragments; adding them manually will take days. Also, without flagging of uncertain characters, spell checking, and interactive control, quality recognition is not possible.
I can't imagine what quality Tesseract produces...

#9
Wizard
Posts: 2,624
Karma: 3120635
Join Date: Jan 2009
Device: Kindle PW3 (wifi)

#10
eReader Wrangler
Posts: 7,888
Karma: 52039845
Join Date: Mar 2013
Location: Boise, ID
Device: PB HD3, GL3, Tolino Vision 4, Voyage, Clara HD
Tesseract with gImageReader is a really good combination. I didn't know about gImageReader until I read this thread; I've been using YAGF. Huge difference. I'm not quite sure what you mean when you say Tesseract won't work with PDFs, though. Do you mean the whole PDF document in one shot?
Last edited by rcentros; 07-06-2021 at 07:12 AM.

#11
eReader Wrangler
Posts: 7,888
Karma: 52039845
Join Date: Mar 2013
Location: Boise, ID
Device: PB HD3, GL3, Tolino Vision 4, Voyage, Clara HD
Quote:
As for headers and footers, just exclude them when you choose your block of text. I'm guessing it's not as sophisticated as FineReader (which I've never seen) but it's still pretty good.

#12
Wizard
Posts: 2,624
Karma: 3120635
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
A PDF with a text layer (a "hybrid PDF") can of course be processed, but an image-only PDF needs to be converted to an image format first, and this is not optimal.
Tesseract will soon reach v5.0 (an alpha version was published some months ago). From v4.0 onward, it has been using a neural-network engine, with improved results. If you output to text, you can quickly process a full book. The hOCR format is heavier to handle. I'll attach some examples of OCR results. I need a bit of time, because where I am at the moment there are no French paper books...
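To make the image-PDF detour concrete, here is a hedged sketch of one way to do it: render the pages with `pdftoppm` (from poppler-utils), then run Tesseract on each image with plain-text output. It assumes both tools are installed. The `page` prefix, the 300 dpi setting, and the function names are choices made for this example only.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf: Path, prefix: Path, dpi: int = 300) -> list[str]:
    """Render every PDF page to PNG files named <prefix>-<page number>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def tesseract_command(image: Path, lang: str = "fra", fmt: str = "txt") -> list[str]:
    """OCR one image; fmt is a Tesseract config name: "txt", "hocr" or "pdf"."""
    if fmt not in {"txt", "hocr", "pdf"}:
        raise ValueError(f"unsupported output format: {fmt}")
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang, fmt]


def ocr_image_pdf(pdf: Path, workdir: Path, lang: str = "fra") -> str:
    """Convert an image-only PDF to PNGs, OCR each page, return the joined text."""
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdftoppm_command(pdf, workdir / "page"), check=True)
    chunks = []
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    for image in sorted(workdir.glob("page-*.png")):
        subprocess.run(tesseract_command(image, lang), check=True)
        chunks.append(image.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(chunks)
```

Passing `fmt="hocr"` to `tesseract_command` would produce the heavier hOCR files instead of plain text; the text route is the quick one for a whole book.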

#13
Wizard
Posts: 2,861
Karma: 10700629
Join Date: May 2016
Location: Canada
Device: Onyx Nova
This sums up why I haven't moved to Linux yet: lack of software choice. Not a criticism, just an observation. Last time I tried Ubuntu I was pleasantly surprised, except for some weird decisions, like no blank, image-free desktop background. Open-source folks seem like Communist committees, which are a good idea in theory, but which come up with crazy decisions.

#14
eReader Wrangler
Posts: 7,888
Karma: 52039845
Join Date: Mar 2013
Location: Boise, ID
Device: PB HD3, GL3, Tolino Vision 4, Voyage, Clara HD
Quote:
Quote:
Quote:
Looking forward to seeing some of the OCR results. Thanks for all the information.

#15
eReader Wrangler
Posts: 7,888
Karma: 52039845
Join Date: Mar 2013
Location: Boise, ID
Device: PB HD3, GL3, Tolino Vision 4, Voyage, Clara HD
Quote: