Any recommended OCR software for Linux?

simurq · 07-04-2021, 04:10 PM

Some time ago I moved completely to Linux (Ubuntu-based Pop!_OS). Now I have more than 500 scanned page images of a book that I'm going to convert to an EPUB. Unfortunately, ABBYY has ceased all support for Linux, and FineReader is Windows-only.

No worries on the "back-end" side, since I can still use Calibre or Sigil, which support Linux out of the box... I've heard about Tesseract, but I'd like to hear from veteran book developers whether they really recommend it, or whether it's a typical 'beggars can't be choosers' sort of thing.

So, any tips or advice on alternative Linux-compatible OCR software that you would recommend for the task?

Thank you!

Sarmat89 · 07-04-2021, 04:21 PM

Try installing FineReader in a virtual machine or with Wine. The native solutions are useless for more than a couple of pages.

pazos · 07-04-2021, 06:08 PM

Tesseract is the only real alternative that runs natively on Linux. Try it for yourself and see if it works for you. It is a command-line utility, but there are some GUI frontends, like gImageReader.

Since you're on Ubuntu, I think you can install gImageReader from a PPA.

FineReader might be anywhere from better to waaaaay better, depending on the source document and the language used. But if you're working with English or another widely used language based on the Latin alphabet, you won't lose anything by trying.

roger64 · 07-04-2021, 07:16 PM

Quote:

Originally Posted by Sarmat89

Try installing FineReader in a virtual machine or with Wine. The native solutions are useless for more than a couple of pages.

Since Tesseract does not accept PDF input directly, it may be easier to use FineReader for that format, because the extra conversion step to images may somewhat degrade the output. If you wish to convert a PDF to an image format, you can use ImageMagick. There are also some online converters.

As far as image formats are concerned, if I had not been using Tesseract successfully for 18 months on full fiction scans, I could believe you. Tesseract works well now. I'm using Arch Linux with Tesseract 4.1.1 for French and English, with gImageReader.
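
For anyone following along, the PDF-to-image step followed by Tesseract could be sketched roughly as below. This is only a sketch under assumptions: the file names (`book.pdf`, `pages/page-%03d.png`) and the 300 dpi density are made up for illustration, and on ImageMagick 7 the command is usually `magick` rather than `convert`. The helpers only build the command lines; `run` hands them to the installed tools.

```python
import subprocess
from pathlib import Path


def pdf_to_images_cmd(pdf, out_pattern="pages/page-%03d.png", dpi=300):
    """Build an ImageMagick command that rasterizes each PDF page to a PNG.

    The output pattern and the 300 dpi default are illustrative assumptions.
    """
    return ["convert", "-density", str(dpi), str(pdf), str(out_pattern)]


def tesseract_cmd(image, lang="fra"):
    """Build a Tesseract command that writes <image stem>.txt next to the image."""
    image = Path(image)
    out_base = image.with_suffix("")  # Tesseract appends ".txt" itself
    return ["tesseract", str(image), str(out_base), "-l", lang]


def run(cmd):
    """Run a command, raising CalledProcessError if the tool reports failure."""
    subprocess.run(cmd, check=True)
```

Something like `run(pdf_to_images_cmd("book.pdf"))` followed by `run(tesseract_cmd(page))` for each generated page would then produce one text file per page, provided the `pages` directory exists and both tools are installed.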

Sarmat89 · 07-05-2021, 03:18 PM

Quote:

Originally Posted by roger64

Tesseract is working well, now.

Does it do diacritics?
Does it do italics?
Does it strip headers/footers?
Does it recognize custom words?

orebmur · 07-05-2021, 05:31 PM

I'd suggest installing OCRmyPDF on Ubuntu, Debian, etc. I have happily used it for years for my scanned-book OCR needs and can only recommend it. It relies on Tesseract as the OCR backend and produces excellent PDF documents from either scanned images or existing PDF files as input.

ocrmypdf.readthedocs.io/en/latest/index.html

EDIT:
I wrote about it before here: mobileread.com/forums/showthread.php?t=294101
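
A typical invocation could be assembled along these lines. Treat it as a sketch, not the definitive way to call the tool: the file names are hypothetical, and the helper only builds an `ocrmypdf` command line using the `-l` (language), `--deskew` and `--sidecar` (plain-text copy of the recognized text) options.

```python
import subprocess


def ocrmypdf_cmd(src, dst, lang="eng", sidecar=None, deskew=False):
    """Build an ocrmypdf command line; all file names are caller-supplied."""
    cmd = ["ocrmypdf", "-l", lang]
    if deskew:
        cmd.append("--deskew")  # straighten slightly rotated scans
    if sidecar:
        cmd += ["--sidecar", str(sidecar)]  # also write the text to a .txt file
    cmd += [str(src), str(dst)]
    return cmd


def run(cmd):
    """Run the command, raising CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)
```

For example, `run(ocrmypdf_cmd("scan.pdf", "book.pdf", lang="fra", sidecar="book.txt"))` would add a French text layer and keep a plain-text copy that is easier to feed into an EPUB workflow.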

roger64 · 07-06-2021, 03:53 AM

Quote:

Originally Posted by Sarmat89

Does it do diacritics?
Does it do italics?
Does it strip headers/footers?
Does it recognize custom words?

I have used Tesseract to OCR about a hundred books and still use it. OCR quality (given a good-quality scan) is now so good that most pages have zero mistakes.

In French, it recognizes circumflex accents and other diacritics such as ü, indeed all the French accents. Apart from French, I have only tried it for English. I know that you can use it for many languages if you install the corresponding data set.

It does not do italics (it used to, and may do so again).
It does not strip headers and footers.

In my experience, these two points are only a minor drawback for most fiction books. During the checking phase, after the OCR process, you need a kind of "breathing time" anyway before moving to the next page.

Tesseract is not perfect. It is perfectly usable now for most fiction books.
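
On the language data sets: a quick way to check which ones are installed is `tesseract --list-langs`. The sketch below parses that output in Python. It assumes the output is a header line ending in a colon followed by one language code per line, which is what recent versions appear to print; the function names are my own.

```python
import subprocess


def parse_langs(output):
    """Extract language codes from `tesseract --list-langs` output."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header (assumed to end with a colon).
        if not line or line.endswith(":"):
            continue
        langs.append(line)
    return langs


def missing_langs(wanted, output):
    """Return the wanted codes that do not appear in the installed list."""
    installed = set(parse_langs(output))
    return [code for code in wanted if code not in installed]


def list_langs_output():
    """Call Tesseract and return its listing (stdout plus stderr, to be safe)."""
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=True
    )
    return result.stdout + result.stderr
```

`missing_langs(["fra", "eng"], list_langs_output())` then tells you which language packages still need installing from your distribution.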

Sarmat89 · 07-06-2021, 05:36 AM

An average book contains about 1500 italic fragments; adding them manually will take days. Also, without flagging of uncertain characters, spell checking and interactive correction, quality recognition is not possible.
I can't imagine what quality Tesseract produces...

roger64 · 07-06-2021, 06:09 AM

Quote:

Originally Posted by Sarmat89

I can't imagine what quality Tesseract produces...

Stop imagining, just use it. It's good.

rcentros · 07-06-2021, 07:04 AM

Quote:

Originally Posted by roger64

As far as image formats are concerned, if I had not been using Tesseract successfully for 18 months on full fiction scans, I could believe you. Tesseract works well now. I'm using Arch Linux with Tesseract 4.1.1 for French and English, with gImageReader.

Tesseract with gImageReader is a really good combination. I didn't know about gImageReader until I read this thread; I'd been using YAGF. Huge difference. I'm not quite sure what you mean when you say Tesseract won't work with PDFs, though. Do you mean the whole PDF document in one shot?

rcentros · 07-06-2021, 07:11 AM

Quote:

Originally Posted by Sarmat89

An average book contains about 1500 italic fragments; adding them manually will take days. Also, without flagging of uncertain characters, spell checking and interactive correction, quality recognition is not possible.
I can't imagine what quality Tesseract produces...

Tesseract works well, especially when used with gImageReader. As for italics, bold, etc., just mark up your text with codes as you go. Then, when you move your text into your word processor, search for the codes and make your changes. If you do it a chapter at a time, it's not that big of a burden, especially with novels, where there are hardly any italics or bold fonts anyhow.

As for headers and footers, just exclude them when you choose your block of text. I'm guessing it's not as sophisticated as FineReader (which I've never seen) but it's still pretty good.
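
If the end target is an EPUB rather than a word processor, the search-for-the-codes step can be scripted. Here is a minimal sketch, assuming you marked italics as `_like this_` and bold as `*like this*` while proofreading. Those markers are just a convention I picked for the example; nothing in Tesseract or gImageReader produces them.

```python
import re

# Hypothetical proofreading markers: _italic_ and *bold*, never spanning lines.
ITALIC = re.compile(r"_([^_\n]+)_")
BOLD = re.compile(r"\*([^*\n]+)\*")


def markers_to_html(text):
    """Replace the hand-typed markers with HTML tags suitable for an EPUB."""
    text = ITALIC.sub(r"<i>\1</i>", text)
    text = BOLD.sub(r"<b>\1</b>", text)
    return text
```

With this, `markers_to_html("She read _Moby-Dick_ *twice*.")` gives `She read <i>Moby-Dick</i> <b>twice</b>.` Text that legitimately contains underscores or asterisks would need a different marker.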

roger64 · 07-06-2021, 09:45 AM

A PDF with a text layer (a "hybrid PDF") can of course be processed, but image-only PDFs need to be converted to an image format first, and this is not optimal.

Tesseract will soon reach v5.0 (an alpha version was published some months ago). From v4.0 onward, it has used a neural-network engine, with improved results.

If you output to plain text, you can process a full book quickly. The hOCR format is heavier to handle.

I'll attach some examples of OCR results. I need a bit of time because there are no French paper books where I am at the moment...

Pajamaman · 07-06-2021, 11:56 AM

This sums up why I haven't moved to Linux yet: lack of software choice. Not a criticism, just an observation. Last time I tried Ubuntu I was pleasantly surprised, except for some weird decisions, like no blank, image-free desktop background. Open-source folks seem like Communist committees, which are a good idea in practice, but which come up with crazy decisions.

rcentros · 07-06-2021, 03:15 PM

Quote:

Originally Posted by roger64

A PDF with a text layer (a "hybrid PDF") can of course be processed, but image-only PDFs need to be converted to an image format first, and this is not optimal.

I guess the book I experimented with must have been "hybrid." Now I understand what you're saying.

Quote:

Originally Posted by roger64

Tesseract will soon reach v5.0 (an alpha version was published some months ago). From v4.0 onward, it has used a neural-network engine, with improved results.

That's good to know. I think it works pretty well now, so I'm looking forward to the improvements.

Quote:

Originally Posted by roger64

If you output to plain text, you can process a full book quickly. The hOCR format is heavier to handle.

I'll attach some examples of OCR results. I need a bit of time because there are no French paper books where I am at the moment...

I haven't really tried the hOCR feature yet. The text output's only drawback is (as one poster mentioned) that it doesn't retain bold and italic. But making the text "flowable" is easy in Jstar. I was able to convert a 7-page Foreword from an older book in about ten minutes (including adding the italics). So a whole book (200 pages or so) would probably take a few hours. That's with clean text.

Looking forward to seeing some of the OCR results. Thanks for all the information.
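
For what it's worth, the "flowable" step can also be scripted instead of done in an editor. A rough sketch in Python, assuming the OCR text separates paragraphs with blank lines and that a line ending in a hyphen is a word split across lines (which will wrongly join a genuine hyphenated compound that happens to fall at a line end):

```python
import re


def reflow(text):
    """Join hard-wrapped OCR lines into one line per paragraph."""
    paragraphs = []
    # Paragraphs are assumed to be separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        joined = ""
        for line in block.splitlines():
            line = line.strip()
            if not line:
                continue
            if joined.endswith("-"):
                joined = joined[:-1] + line  # naive de-hyphenation
            elif joined:
                joined += " " + line
            else:
                joined = line
        paragraphs.append(joined)
    return "\n\n".join(paragraphs)
```

The result still needs proofreading, but it removes most of the manual line joining.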

rcentros · 07-06-2021, 03:22 PM

Quote:

Originally Posted by Pajamaman

This sums up why I haven't moved to Linux yet: lack of software choice. Not a criticism, just an observation. Last time I tried Ubuntu I was pleasantly surprised, except for some weird decisions, like no blank, image-free desktop background. Open-source folks seem like Communist committees, which are a good idea in practice, but which come up with crazy decisions.

I don't like Ubuntu's desktop either. (I wasn't a fan of Unity and I'm not a fan of the GNOME 3 GUI. I could probably get used to it, but I don't want to.) Linux Mint (while based on Ubuntu) uses a "traditional" Windows-like desktop, and this is consistent across all three "flavors": Cinnamon, MATE and Xfce. I don't know what you mean by "no blank image-free desktop background" in Ubuntu, but in Linux Mint you can use whatever image you want (including a blank one).



