07-04-2021, 04:10 PM | #1 |
Member
Posts: 20
Karma: 1000010
Join Date: Jul 2015
Device: Kindle Paperwhite v1
|
Any recommended OCR software for Linux?
Some time ago I moved completely to Linux (Ubuntu-based Pop!_OS). Now I have more than 500 scanned pages/images of a book which I'm going to convert to an epub. Unfortunately, ABBYY ceased all support for Linux, and FineReader is now Windows-only. No worries on the "back-end" side, since I can still use Calibre or Sigil, which support Linux out of the box... I've heard about Tesseract but would like to hear from veteran book developers whether they really recommend it, or is it a typical 'beggars can't be choosers' sort of thing?
So, any tips or advice on alternative Linux-compatible OCR software that you would recommend for the task? Thank you! Last edited by simurq; 07-04-2021 at 04:20 PM. |
07-04-2021, 04:21 PM | #2 |
Evangelist
Posts: 485
Karma: 2267928
Join Date: Nov 2015
Device: none
|
Try installing FineReader in a virtual machine or with Wine. The native solutions are useless for more than a couple of pages.
|
07-04-2021, 06:08 PM | #3 |
cosiñeiro
Posts: 1,294
Karma: 2200073
Join Date: Apr 2014
Device: BQ Cervantes 4
|
Tesseract is the only real alternative that runs natively on Linux. Try it for yourself and see if it works for you. It is a command-line utility, but there are some GUI front ends, like gImageReader.
Since you're on Ubuntu, I think you can install gImageReader from a PPA. FineReader might be anywhere from better to waaaaay better, depending on the source document and the language used. But if you're using English or another widely used language based on the Latin alphabet, you won't lose anything by trying. |
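For anyone who would rather script the command-line route than use a GUI, here is a minimal Python sketch of a batch run over a folder of page images. It assumes the scans are PNG files and that the `tesseract` binary is on the PATH; the function names and the folder layout are made up for illustration and are not something anyone in this thread described.

```python
from pathlib import Path
import shutil
import subprocess


def tesseract_command(image: Path, out_dir: Path, lang: str = "eng") -> list[str]:
    """Build the command that OCRs one page image into <out_dir>/<stem>.txt."""
    # Tesseract takes an output *base* name and appends the extension itself.
    out_base = out_dir / image.stem
    return ["tesseract", str(image), str(out_base), "-l", lang]


def ocr_folder(scan_dir: Path, out_dir: Path, lang: str = "eng") -> list[list[str]]:
    """OCR every PNG in scan_dir in filename order; return the commands used."""
    out_dir.mkdir(parents=True, exist_ok=True)
    pages = sorted(scan_dir.glob("*.png"))
    commands = [tesseract_command(page, out_dir, lang) for page in pages]
    if shutil.which("tesseract") is None:
        # Tesseract is not installed: hand back the planned commands only.
        return commands
    for cmd in commands:
        subprocess.run(cmd, check=True)
    return commands
```

Calling `ocr_folder(Path("scans"), Path("text"), lang="fra")` would write one text file per page, provided the matching language data (here French) is installed.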
07-04-2021, 07:16 PM | #4 | |
Wizard
Posts: 2,608
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
|
Quote:
As far as image formats are concerned, if I had not been successfully using Tesseract for 18 months for full fiction scans, I could believe you. Tesseract works well now. I'm using Arch Linux with Tesseract 4.1.1 for French and English, with gImageReader. Last edited by roger64; 07-04-2021 at 07:19 PM. |
|
07-05-2021, 03:18 PM | #5 |
Evangelist
Posts: 485
Karma: 2267928
Join Date: Nov 2015
Device: none
|
|
07-05-2021, 05:31 PM | #6 |
Veteran Linux user
Posts: 144
Karma: 678910
Join Date: Mar 2017
Location: Barcelona/Spain
Device: Boyue Likebook Note & Mimas, Hisense A5, hopefully soon a PineNote
|
I'd suggest installing OCRmyPDF on Ubuntu, Debian, etc. I have happily used it for years for my scanned-book OCR needs and can only recommend it. It relies on Tesseract as the OCR backend and produces excellent PDF documents from either scanned images or existing PDF files as input.
ocrmypdf.readthedocs.io/en/latest/index.html EDIT: I wrote about it before here: mobileread.com/forums/showthread.php?t=294101 Last edited by orebmur; 07-05-2021 at 05:36 PM. |
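A minimal sketch of how an OCRmyPDF call could be scripted from Python, assuming the `ocrmypdf` command is installed and on the PATH. The helper names are hypothetical, and `--deskew` is just one optional clean-up flag picked for illustration, not a setting recommended in this thread.

```python
import shutil
import subprocess


def ocrmypdf_command(src: str, dst: str, languages: tuple[str, ...] = ("eng",)) -> list[str]:
    """Build an ocrmypdf command line that adds a text layer to src."""
    # Several Tesseract language packs are joined with "+", e.g. "fra+eng".
    return ["ocrmypdf", "-l", "+".join(languages), "--deskew", src, dst]


def add_text_layer(src: str, dst: str, languages: tuple[str, ...] = ("eng",)) -> None:
    """Run ocrmypdf on src and write the searchable PDF to dst."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    subprocess.run(ocrmypdf_command(src, dst, languages), check=True)
```

For a French book with some English passages, `add_text_layer("scan.pdf", "book.pdf", ("fra", "eng"))` would be the call, as long as both language packs are installed for Tesseract.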
07-06-2021, 03:53 AM | #7 | |
Wizard
Posts: 2,608
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
|
Quote:
In French, it recognizes circumflex accents ("accents circonflexes"), diaereses like ü, and all the other French accents. I have not tried it with languages other than English. I know that you can use it for many languages if you install the corresponding data set.
It does not do italics (it did, and maybe will again). It does not strip headers and footers. In my experience, these two points are only a minor drawback for most fiction books. During the checking phase, after the OCR process, you need a kind of "breathing time" before going on to the next page anyway.
Tesseract is not perfect, but it is perfectly usable now for most fiction books. Last edited by roger64; 07-06-2021 at 04:03 AM. Reason: set |
|
07-06-2021, 05:36 AM | #8 |
Evangelist
Posts: 485
Karma: 2267928
Join Date: Nov 2015
Device: none
|
An average book contains about 1500 italic fragments; adding them manually will take days. Also, without flagging of uncertain characters, spell checking and interactive control, quality recognition is not possible.
I can't imagine what quality Tesseract produces... |
07-06-2021, 06:09 AM | #9 |
Wizard
Posts: 2,608
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
|
|
07-06-2021, 07:04 AM | #10 |
eReader Wrangler
Posts: 7,614
Karma: 48453107
Join Date: Mar 2013
Location: Boise, ID
Device: PB HD3, GL3, Tolino Vision 4, Voyage, Clara HD
|
Tesseract with gImageReader is a really good combination. I didn't know about gImageReader until I read this thread; I've been using YAGF. Huge difference. I'm not quite sure what you mean when you say Tesseract won't work with PDFs, though. Do you mean the whole PDF document in one shot?
Last edited by rcentros; 07-06-2021 at 07:12 AM. |
07-06-2021, 07:11 AM | #11 | |
eReader Wrangler
Posts: 7,614
Karma: 48453107
Join Date: Mar 2013
Location: Boise, ID
Device: PB HD3, GL3, Tolino Vision 4, Voyage, Clara HD
|
Quote:
As for headers and footers, just exclude them when you choose your block of text. I'm guessing it's not as sophisticated as FineReader (which I've never seen) but it's still pretty good. |
|
07-06-2021, 09:45 AM | #12 |
Wizard
Posts: 2,608
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
|
A PDF with a text layer (a "hybrid PDF") can of course be processed, but image-only PDFs need to be converted to an image format first, which is not optimal.
Tesseract will soon reach v5.0 (an alpha version was published some months ago). From v4.0 onward, it has used a neural-network engine with improved results. If you output to plain text, you can quickly process a full book. The hOCR format is heavier to handle. I'll attach some examples of OCR results. I need a bit of time because there are no French paper books where I am at the moment... |
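To make the two steps concrete, here is a rough Python sketch that first rasterizes an image-only PDF and then picks plain-text or hOCR output. It assumes `pdftoppm` (from poppler-utils) is used for the conversion, which is my choice of tool and not one named above; the 300 dpi default and the function names are also assumptions for illustration.

```python
import shutil
import subprocess


def rasterize_command(pdf: str, prefix: str, dpi: int = 300) -> list[str]:
    """Build a pdftoppm command that writes one PNG per page as <prefix>-N.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf, prefix]


def ocr_command(image: str, out_base: str, lang: str = "eng", hocr: bool = False) -> list[str]:
    """Build a tesseract command; plain text is the default output."""
    cmd = ["tesseract", image, out_base, "-l", lang]
    if hocr:
        # The "hocr" config file goes last and switches the output to <out_base>.hocr.
        cmd.append("hocr")
    return cmd


def run(cmd: list[str]) -> None:
    """Run one command, failing early if the tool is missing."""
    if shutil.which(cmd[0]) is None:
        raise RuntimeError(f"{cmd[0]} is not installed or not on PATH")
    subprocess.run(cmd, check=True)
```

Plain text keeps only the recognized words, while hOCR also stores layout and position data for each word, which is why it is the heavier format to work with.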
07-06-2021, 11:56 AM | #13 |
Wizard
Posts: 2,827
Karma: 10700629
Join Date: May 2016
Location: Canada
Device: Onyx Nova
|
This sums up why I haven't moved to Linux yet: lack of software choice. Not a criticism, just an observation. Last time I tried Ubuntu I was pleasantly surprised, except for some weird decisions, like the lack of a blank, image-free desktop background. Open-source folks seem like Communist committees, which are a good idea in theory but come up with crazy decisions.
|
07-06-2021, 03:15 PM | #14 | |||
eReader Wrangler
Posts: 7,614
Karma: 48453107
Join Date: Mar 2013
Location: Boise, ID
Device: PB HD3, GL3, Tolino Vision 4, Voyage, Clara HD
|
Quote:
Quote:
Quote:
Looking forward to seeing some of the OCR results. Thanks for all the information. |
|||
07-06-2021, 03:22 PM | #15 | |
eReader Wrangler
Posts: 7,614
Karma: 48453107
Join Date: Mar 2013
Location: Boise, ID
Device: PB HD3, GL3, Tolino Vision 4, Voyage, Clara HD
|
Quote:
|
|
|