|  07-04-2021, 04:10 PM | #1 | 
| Member            Posts: 20 Karma: 1000010 Join Date: Jul 2015 Device: Kindle Paperwhite v1 |  Any recommended OCR software for Linux? 
			
			Some time ago I moved completely to Linux (Ubuntu-based Pop!_OS). Now I have more than 500 scanned pages/images of a book which I'm going to convert to an EPUB. Unfortunately, ABBYY has ceased all support for Linux, with FineReader being Windows-only. No worries on the "back-end" side, since I can still use Calibre or Sigil, which support Linux out of the box... I've heard about Tesseract, but I'd like to hear from veteran book developers whether they really recommend it, or whether it's a typical 'beggars can't be choosers' sort of thing. So, any tips or advice on alternative Linux-compatible OCR software that you would recommend for the task? Thank you! Last edited by simurq; 07-04-2021 at 04:20 PM. | 
|   |   | 
|  07-04-2021, 04:21 PM | #2 | 
| Fanatic            Posts: 531 Karma: 2268308 Join Date: Nov 2015 Device: none | 
			
			Try installing FineReader in a virtual machine or with Wine. The native solutions are useless for more than a couple of pages.
		 | 
|   |   | 
|  07-04-2021, 06:08 PM | #3 | 
| cosiñeiro            Posts: 1,406 Karma: 2451781 Join Date: Apr 2014 Device: BQ Cervantes 4 | 
			
			Tesseract is the only real alternative that runs natively on Linux. Try it for yourself and see if it works for you. It is a command-line utility, but there are some GUI front ends, like gImageReader. Since you're on Ubuntu, I think you can install gImageReader from a PPA. FineReader might be anywhere from better to waaaaay better, depending on the source document and the language used. But if you're using English or another widely used language based on the Latin alphabet, you won't lose anything by trying. | 
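For a job of 500 pages, the Tesseract command line can be scripted instead of run one page at a time. The following is only a minimal Python sketch, assuming the scans are PNG files in a single folder and that the `tesseract` binary is on the PATH. The function names and the folder layout are hypothetical, not something from this thread.

```python
import subprocess
from pathlib import Path


def tesseract_command(image: Path, out_dir: Path, lang: str = "eng") -> list[str]:
    """Build the tesseract call for one page image.

    Tesseract appends the ".txt" extension itself, so the output
    argument is a base name without an extension.
    """
    out_base = out_dir / image.stem
    return ["tesseract", str(image), str(out_base), "-l", lang]


def ocr_pages(scan_dir: Path, out_dir: Path, lang: str = "eng") -> list[Path]:
    """OCR every PNG in scan_dir in file-name order, one text file per page."""
    out_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for image in sorted(scan_dir.glob("*.png")):
        # check=True stops the batch on the first page tesseract rejects.
        subprocess.run(tesseract_command(image, out_dir, lang), check=True)
        results.append(out_dir / f"{image.stem}.txt")
    return results
```

Calling `ocr_pages(Path("scans"), Path("ocr_text"), lang="fra")` would then leave one text file per page in `ocr_text`, provided the French language data is installed. The result still needs proofreading, which is where a front end like gImageReader is more comfortable.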
|   |   | 
|  07-04-2021, 07:16 PM | #4 | |
| Wizard            Posts: 2,625 Karma: 3120635 Join Date: Jan 2009 Device: Kindle PW3 (wifi) | Quote: 
 As far as image formats are concerned, if I had not been using Tesseract successfully for 18 months on full fiction scans, I could believe you. Tesseract works well now. I'm using Arch Linux with Tesseract 4.1.1 and gImageReader, for French and English. Last edited by roger64; 07-04-2021 at 07:19 PM. | |
|   |   | 
|  07-05-2021, 03:18 PM | #5 | 
| Fanatic            Posts: 531 Karma: 2268308 Join Date: Nov 2015 Device: none | |
|   |   | 
|  07-05-2021, 05:31 PM | #6 | 
| Veteran Linux user            Posts: 150 Karma: 1000000 Join Date: Mar 2017 Location: Barcelona/Spain Device: Boyue Likebook Note & Mimas, Hisense A5, hopefully soon a PineNote | 
			
			I'd suggest installing ocrmypdf on Ubuntu, Debian, etc. I have happily used it for years for OCR on my scanned books and can only recommend it. It relies on Tesseract as the OCR backend and produces excellent PDF documents from either scanned images or existing PDF files. ocrmypdf.readthedocs.io/en/latest/index.html EDIT: I wrote about it before here: mobileread.com/forums/showthread.php?t=294101 Last edited by orebmur; 07-05-2021 at 05:36 PM. | 
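As a rough illustration of the ocrmypdf route, here is a small Python sketch that builds and runs one ocrmypdf call. It assumes `ocrmypdf` is installed and on the PATH; the function names and file names are hypothetical. The `--sidecar` option additionally writes the recognized text to a plain text file, which is probably the more useful output when the goal is an EPUB rather than a searchable PDF.

```python
import subprocess
from pathlib import Path
from typing import Optional


def ocrmypdf_command(source: Path, target: Path, lang: str = "eng",
                     sidecar: Optional[Path] = None) -> list[str]:
    """Build an ocrmypdf call that adds a text layer to a scanned PDF."""
    cmd = ["ocrmypdf", "-l", lang, "--deskew"]
    if sidecar is not None:
        # Also dump the recognized text to a separate .txt file.
        cmd += ["--sidecar", str(sidecar)]
    cmd += [str(source), str(target)]
    return cmd


def run_ocrmypdf(source: Path, target: Path, lang: str = "eng",
                 sidecar: Optional[Path] = None) -> None:
    """Run ocrmypdf and raise if it exits with an error."""
    subprocess.run(ocrmypdf_command(source, target, lang, sidecar), check=True)
```

For example, `run_ocrmypdf(Path("scan.pdf"), Path("scan_ocr.pdf"), "fra+eng", Path("scan.txt"))` would process a mixed French and English scan, assuming both language packs are installed.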
|   |   | 
|  07-06-2021, 03:53 AM | #7 | |
| Wizard            Posts: 2,625 Karma: 3120635 Join Date: Jan 2009 Device: Kindle PW3 (wifi) | Quote: 
 In French, it recognizes circumflex accents and other accented characters such as ü, in fact all French accents. I have not tried it for languages other than French and English. I know that you can use it for many languages if you install the corresponding data set. It does not do italics (it did, and maybe will again). It does not strip headers and footers. In my experience, these two points are only a minor drawback for most fiction books. During the checking phase after OCR, you need a kind of "breathing time" before moving to the next page anyway. Tesseract is not perfect, but it is perfectly usable now for most fiction books. Last edited by roger64; 07-06-2021 at 04:03 AM. Reason: set | |
|   |   | 
|  07-06-2021, 05:36 AM | #8 | 
| Fanatic            Posts: 531 Karma: 2268308 Join Date: Nov 2015 Device: none | 
			
			An average book contains about 1500 italic fragments; adding them manually will take days. Also, without uncertain-character marking, spell checking, and interactive control, no quality recognition is possible. I can't imagine what quality Tesseract produces... | 
|   |   | 
|  07-06-2021, 06:09 AM | #9 | 
| Wizard            Posts: 2,625 Karma: 3120635 Join Date: Jan 2009 Device: Kindle PW3 (wifi) | |
|   |   | 
|  07-06-2021, 07:04 AM | #10 | 
| eReader Wrangler            Posts: 7,949 Karma: 53216495 Join Date: Mar 2013 Location: Boise, ID Device: PB HD3, GL3, Voyage | 
			
			Tesseract with gImageReader is a really good combination. I didn't know about gImageReader until I read this thread; I've been using YAGF. Huge difference. Not quite sure what you mean when you say Tesseract won't work with PDFs, though. Do you mean the whole PDF document in one shot?
		 Last edited by rcentros; 07-06-2021 at 07:12 AM. | 
|   |   | 
|  07-06-2021, 07:11 AM | #11 | |
| eReader Wrangler            Posts: 7,949 Karma: 53216495 Join Date: Mar 2013 Location: Boise, ID Device: PB HD3, GL3, Voyage | Quote: 
 As for headers and footers, just exclude them when you choose your block of text. I'm guessing it's not as sophisticated as FineReader (which I've never seen) but it's still pretty good. | |
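If selecting the text block by hand on every page is too slow, running headers and page numbers can also be filtered out of the plain-text output afterwards. This is a heuristic Python sketch, not a Tesseract or gImageReader feature. It assumes one string of OCR text per page, and the function names and the 50% threshold are arbitrary choices for illustration.

```python
import re
from collections import Counter


def _key(line: str) -> str:
    """Normalise a line so that 'MY BOOK 45' and 'MY BOOK 46' compare equal."""
    return re.sub(r"\d+", "#", line).strip().lower()


def strip_running_lines(pages: list[str], min_share: float = 0.5) -> list[str]:
    """Drop first/last lines that repeat across pages, plus bare page numbers."""
    bodies = [page.strip().splitlines() for page in pages]

    # Count how often each normalised line opens or closes a page.
    edges = Counter()
    for lines in bodies:
        if lines:
            edges.update({_key(lines[0]), _key(lines[-1])})
    threshold = max(2, int(len(pages) * min_share))

    def is_running(line: str) -> bool:
        return line.strip().isdigit() or edges[_key(line)] >= threshold

    cleaned = []
    for lines in bodies:
        if lines and is_running(lines[0]):
            lines = lines[1:]
        if lines and is_running(lines[-1]):
            lines = lines[:-1]
        cleaned.append("\n".join(lines).strip())
    return cleaned
```

A body line that happens to open many pages would be removed too, so the output should be spot-checked against the scans.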
|   |   | 
|  07-06-2021, 09:45 AM | #12 | 
| Wizard            Posts: 2,625 Karma: 3120635 Join Date: Jan 2009 Device: Kindle PW3 (wifi) | 
			
			A PDF with a text layer ("hybrid PDF") can of course be processed, but image-only PDFs need to be converted to an image format first, which is not optimal. Tesseract will soon reach v5.0 (an alpha version was published some months ago). From v4.0 onward, it has used a neural-network engine, with improved results. If you output to plain text, you can quickly process a full book. The hOCR format is heavier to handle. I'll attach some examples of OCR results. I need a bit of time, because there are no French paper books where I am at the moment... | 
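The conversion step for image-only PDFs can be sketched as follows. This assumes Poppler's `pdftoppm` and `tesseract` are both installed; the function names, the `page` prefix, and the 300 dpi default are hypothetical choices, not taken from this thread. The last argument to Tesseract selects plain text (`txt`) or hOCR (`hocr`) output.

```python
import subprocess
from pathlib import Path


def pdf_to_images_command(pdf: Path, out_dir: Path, dpi: int = 300) -> list[str]:
    """Build a pdftoppm call that writes one PNG per PDF page."""
    prefix = out_dir / "page"  # files come out as page-<number>.png
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def tesseract_output_command(image: Path, lang: str = "eng",
                             fmt: str = "txt") -> list[str]:
    """Build a tesseract call that writes text or hOCR next to the image."""
    if fmt not in ("txt", "hocr"):
        raise ValueError("fmt must be 'txt' or 'hocr'")
    out_base = image.with_suffix("")  # tesseract adds the extension itself
    return ["tesseract", str(image), str(out_base), "-l", lang, fmt]


def ocr_image_pdf(pdf: Path, out_dir: Path, lang: str = "eng",
                  fmt: str = "txt") -> None:
    """Rasterise an image-only PDF, then OCR each page in page order."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdf_to_images_command(pdf, out_dir), check=True)
    # Sort by the numeric suffix rather than relying on file-name order.
    images = sorted(out_dir.glob("page-*.png"),
                    key=lambda p: int(p.stem.rsplit("-", 1)[1]))
    for image in images:
        subprocess.run(tesseract_output_command(image, lang, fmt), check=True)
```

Plain text is the lighter choice for a whole book; hOCR keeps word positions, which is only worth the extra weight if the layout is needed later.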
|   |   | 
|  07-06-2021, 11:56 AM | #13 | 
| Wizard            Posts: 2,874 Karma: 10700629 Join Date: May 2016 Location: Canada Device: Onyx Nova | 
			
			This sums up why I haven't moved to Linux yet: lack of software choice. Not a criticism, just an observation. Last time I tried Ubuntu I was pleasantly surprised, except for some weird decisions, like no blank image-free desktop background. Weird decision. Open-source folks seem like Communist committees, which are a good idea in theory, but which come up with crazy decisions.
		 | 
|   |   | 
|  07-06-2021, 03:15 PM | #14 | |||
| eReader Wrangler            Posts: 7,949 Karma: 53216495 Join Date: Mar 2013 Location: Boise, ID Device: PB HD3, GL3, Voyage | Quote: 
 Quote: 
 Quote: 
 Looking forward to seeing some of the OCR results. Thanks for all the information. | |||
|   |   | 
|  07-06-2021, 03:22 PM | #15 | |
| eReader Wrangler            Posts: 7,949 Karma: 53216495 Join Date: Mar 2013 Location: Boise, ID Device: PB HD3, GL3, Voyage | Quote: 
 | |
|   |   | 