Old 07-04-2021, 04:10 PM   #1
simurq
Member
Posts: 20
Karma: 1000010
Join Date: Jul 2015
Device: Kindle Paperwhite v1
Any recommended OCR software for Linux?

Some time ago I moved completely to Linux (the Ubuntu-based Pop!_OS). I now have more than 500 scanned page images of a book that I want to convert to an EPUB. Unfortunately, ABBYY has ceased all support for Linux, and FineReader is now Windows-compatible only. No worries on the "back-end" side, since I can still use Calibre or Sigil, which support Linux out of the box... I have heard about Tesseract, but I would like to hear from veteran book developers whether they really recommend it, or whether it is a typical 'beggars can't be choosers' sort of thing.

So, any tips or advice on alternative Linux-compatible OCR software that you would recommend for the task?

Thank you!

Last edited by simurq; 07-04-2021 at 04:20 PM.
Old 07-04-2021, 04:21 PM   #2
Sarmat89
Evangelist
Posts: 482
Karma: 2267928
Join Date: Nov 2015
Device: none
Try installing FineReader in a virtual machine or with Wine. The native solutions are useless for more than a couple of pages.
Old 07-04-2021, 06:08 PM   #3
pazos
cosiñeiro
Posts: 1,271
Karma: 2200049
Join Date: Apr 2014
Device: BQ Cervantes 4
Tesseract is the only real alternative that runs natively on Linux. Try it for yourself and see if it works for you. It is a command-line utility, but there are some GUI front ends, such as gImageReader.

Since you're on Ubuntu, I think you can install gImageReader from a PPA.

FineReader might be anywhere from better to waaaaay better, depending on the source document and the language used. But if you're working with English or another widely used language based on the Latin alphabet, you won't lose anything by trying.
Old 07-04-2021, 07:16 PM   #4
roger64
Wizard
Posts: 2,608
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
Quote:
Originally Posted by Sarmat89
Try installing FineReader in a virtual machine or with Wine. The native solutions are useless for more than a couple of pages.
Since Tesseract does not accept PDF input, it may be easier to use FineReader for that format, because the extra PDF-to-image conversion may somewhat degrade the output. If you wish to convert a PDF to an image format, you can use ImageMagick. There are also some online converters.

As far as image input is concerned, if I had not been using Tesseract successfully for 18 months on full fiction scans, I might believe you. Tesseract is working well now. I use Arch Linux with Tesseract 4.1.1 for French and English, with gImageReader.
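
For anyone who would rather script the PDF-to-image route described above than use a GUI, here is a minimal Python sketch. It assumes ImageMagick's `convert` (named `magick` in ImageMagick 7) and the `tesseract` command-line tool are installed together with the needed language data. The function names, the `page-%03d.png` naming scheme and the 300 dpi density are illustrative choices, not something prescribed in this thread.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def pdf_to_png_cmd(pdf: Path, out_dir: Path, dpi: int = 300) -> list[str]:
    """Build an ImageMagick command that renders each PDF page to a PNG."""
    # -density placed before the input sets the rasterisation resolution.
    return ["convert", "-density", str(dpi), str(pdf), str(out_dir / "page-%03d.png")]


def ocr_cmd(image: Path, lang: str = "eng") -> list[str]:
    """Build a Tesseract command; Tesseract appends .txt to the output base."""
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang]


def ocr_pdf(pdf: Path, out_dir: Path, lang: str = "eng") -> str:
    """Rasterise a PDF, OCR every page image, and return the joined text."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdf_to_png_cmd(pdf, out_dir), check=True)
    pages = sorted(out_dir.glob("page-*.png"))
    for page in pages:
        subprocess.run(ocr_cmd(page, lang), check=True)
    # One .txt file per page now sits next to each PNG.
    return "\n".join(p.with_suffix(".txt").read_text(encoding="utf-8") for p in pages)
```

Something like `ocr_pdf(Path("book.pdf"), Path("out"), lang="fra")` would leave one text file per page in `out/`. If the pages are already image files, as in the original question, the conversion step can be skipped and `ocr_cmd` run on each image directly.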

Last edited by roger64; 07-04-2021 at 07:19 PM.
Old 07-05-2021, 03:18 PM   #5
Sarmat89
Evangelist
Quote:
Originally Posted by roger64
Tesseract is working well now.
Does it do diacritics?
Does it do italics?
Does it strip headers and footers?
Does it recognize custom words?
Old 07-05-2021, 05:31 PM   #6
orebmur
Veteran Linux user
Posts: 144
Karma: 678910
Join Date: Mar 2017
Location: Barcelona/Spain
Device: Boyue Likebook Note & Mimas, Hisense A5, hopefully soon a PineNote
I'd suggest installing OCRmyPDF on Ubuntu, Debian, etc. I have happily used it for years to OCR my scanned books and can only recommend it. It relies on Tesseract as the OCR back end and produces excellent PDF documents from either scanned images or existing PDF files.

ocrmypdf.readthedocs.io/en/latest/index.html

EDIT:
I wrote about it before here: mobileread.com/forums/showthread.php?t=294101
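
As a rough illustration of how OCRmyPDF is typically invoked, the Python sketch below only assembles a command line and hands it to `subprocess`. The helper names and file names are hypothetical. The options used (`-l`, `--deskew`, `--skip-text`, `--sidecar`) are documented OCRmyPDF flags, but check `ocrmypdf --help` for the version you have installed.

```python
from __future__ import annotations

import subprocess


def ocrmypdf_cmd(src: str, dst: str, languages: tuple[str, ...] = ("eng",),
                 deskew: bool = True, skip_text: bool = True,
                 sidecar: str | None = None) -> list[str]:
    """Build an ocrmypdf command line (helper name is illustrative)."""
    cmd = ["ocrmypdf", "-l", "+".join(languages)]  # e.g. "fra+eng"
    if deskew:
        cmd.append("--deskew")      # straighten slightly rotated scans
    if skip_text:
        cmd.append("--skip-text")   # leave pages that already have text alone
    if sidecar:
        cmd += ["--sidecar", sidecar]  # also write the recognised text to a file
    return cmd + [src, dst]


def run_ocrmypdf(src: str, dst: str, **options) -> None:
    """Run OCRmyPDF and raise if it exits with an error."""
    subprocess.run(ocrmypdf_cmd(src, dst, **options), check=True)
```

For example, `run_ocrmypdf("scan.pdf", "scan-ocr.pdf", languages=("fra", "eng"), sidecar="scan.txt")` would produce a searchable PDF plus a plain-text copy that can be taken into Sigil or calibre.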

Last edited by orebmur; 07-05-2021 at 05:36 PM.
Old 07-06-2021, 03:53 AM   #7
roger64
Wizard
Quote:
Originally Posted by Sarmat89
Does it do diacritics?
Does it do italics?
Does it strip headers and footers?
Does it recognize custom words?
I have used Tesseract to OCR about a hundred books and still use it. With a good-quality scan, OCR quality is now high enough that most pages come out with zero mistakes.

In French, it recognizes circumflex accents, diaereses such as ü, and all the other French accents. Apart from French, the only other language I have tried it on is English. I know that you can use it for many languages if you install the corresponding language data.

It does not do italics (it did, and maybe will do so again).
It does not strip headers and footers.

In my experience, these two points are only a minor drawback for most fiction books. During the checking phase after OCR, you need a kind of "breathing time" anyway before going on to the next page.

Tesseract is not perfect, but it is perfectly usable now for most fiction books.
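
Since running headers and footers stay in the OCR output, one rough way to remove them afterwards is to look for first and last lines that recur from page to page. The Python sketch below is a heuristic under stated assumptions, not a feature of Tesseract or gImageReader: it expects one text string per page, treats all digits as interchangeable so that page numbers match each other, and uses a hypothetical `min_share` threshold.

```python
from __future__ import annotations

import re
from collections import Counter


def _key(line: str) -> str:
    """Normalise a line so that 'Page 12' and 'Page 13' compare equal."""
    return re.sub(r"\d+", "#", line.strip().lower())


def _edges(lines: list[str]) -> tuple[int, int] | None:
    """Indices of the first and last non-blank lines, or None if all blank."""
    idx = [i for i, ln in enumerate(lines) if ln.strip()]
    return (idx[0], idx[-1]) if idx else None


def strip_running_lines(pages: list[str], min_share: float = 0.3) -> list[str]:
    """Drop first/last lines whose normalised form recurs on many pages."""
    split = [p.splitlines() for p in pages]
    counts: Counter[str] = Counter()
    for lines in split:
        edges = _edges(lines)
        if edges:
            # A set, so a one-line page does not count its only line twice.
            counts.update({_key(lines[i]) for i in edges})
    threshold = max(2, int(len(pages) * min_share))
    running = {k for k, n in counts.items() if n >= threshold}
    cleaned = []
    for lines in split:
        edges = _edges(lines)
        if edges:
            drop = {i for i in edges if _key(lines[i]) in running}
            lines = [ln for i, ln in enumerate(lines) if i not in drop]
        cleaned.append("\n".join(lines).strip())
    return cleaned
```

Chapter titles that change every few pages, or headers that alternate between left and right pages, may need a lower threshold or a manual check, so treat the result as a first pass before proofreading.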

Last edited by roger64; 07-06-2021 at 04:03 AM. Reason: set
Old 07-06-2021, 05:36 AM   #8
Sarmat89
Evangelist
An average book contains about 1500 italic fragments; adding them manually will take days. Also, without flagging of uncertain characters, spell checking and interactive control, no quality recognition is possible.
I can't imagine what quality Tesseract produces...
Old 07-06-2021, 06:09 AM   #9
roger64
Wizard
Quote:
Originally Posted by Sarmat89
I can't imagine what quality Tesseract produces...
Stop imagining, just use it. It's good.
Old 07-06-2021, 07:04 AM   #10
rcentros
eReader Wrangler
Posts: 7,443
Karma: 48453105
Join Date: Mar 2013
Location: Boise, ID
Device: PB HD3, GL3, Tolino Vision 4, Voyage, Clara HD
Quote:
Originally Posted by roger64
As far as image input is concerned, if I had not been using Tesseract successfully for 18 months on full fiction scans, I might believe you. Tesseract is working well now. I use Arch Linux with Tesseract 4.1.1 for French and English, with gImageReader.
Tesseract with gImageReader is a really good combination. I didn't know about gImageReader until I read this thread; I've been using YAGF. Huge difference. I'm not quite sure what you mean when you say Tesseract won't work with PDFs, though. Do you mean the whole PDF document in one shot?

Last edited by rcentros; 07-06-2021 at 07:12 AM.
Old 07-06-2021, 07:11 AM   #11
rcentros
eReader Wrangler
Quote:
Originally Posted by Sarmat89
An average book contains about 1500 italic fragments; adding them manually will take days. Also, without flagging of uncertain characters, spell checking and interactive control, no quality recognition is possible.
I can't imagine what quality Tesseract produces...
Tesseract works well, especially when used with gImageReader. As for italics, bold, etc., just mark up your text as you go. Then, when you move your text into your word processor, search for the codes and make your changes. If you do it a chapter at a time, it's not that big a burden, especially with novels, where there's hardly any italic or bold text anyhow.

As for headers and footers, just exclude them when you choose your block of text. I'm guessing it's not as sophisticated as FineReader (which I've never seen), but it's still pretty good.
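
The mark-up-as-you-go approach can also be scripted instead of done by search and replace in a word processor. The Python sketch below assumes, purely for illustration, that italics were marked as `_like this_` and bold as `*like this*` during proofreading; the post does not say which codes are actually used. It converts the markers to XHTML tags suitable for pasting into Sigil.

```python
import html
import re

# Marker conventions are an assumption: _italic_ and *bold*.
# The text between markers must start and end with a non-space character
# and may not span a line break (the dot does not match newlines).
ITALIC = re.compile(r"_(\S(?:.*?\S)?)_")
BOLD = re.compile(r"\*(\S(?:.*?\S)?)\*")


def markers_to_xhtml(text: str) -> str:
    """Escape the text for XHTML, then turn markers into <i>/<b> tags."""
    escaped = html.escape(text, quote=False)  # & < > become entities
    escaped = BOLD.sub(r"<b>\1</b>", escaped)
    return ITALIC.sub(r"<i>\1</i>", escaped)


def count_markers(text: str) -> int:
    """How many italic or bold fragments were marked in the text."""
    return len(ITALIC.findall(text)) + len(BOLD.findall(text))
```

Words that legitimately contain underscores or asterisks would be caught by these patterns, so a less common marker may be a safer choice for some books.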
Old 07-06-2021, 09:45 AM   #12
roger64
Wizard
A PDF with a text layer (a "hybrid PDF") can of course be processed, but an image-only PDF needs to be converted to an image format first, and this is not optimal.

Tesseract will soon reach v5.0 (an alpha version was published some months ago). From v4.0 onward, it has used a neural-network engine, with improved results.

If you output to text, you can quickly process a full book. The hOCR format is heavier to handle.
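
One thing the heavier hOCR output offers that plain text does not is a per-word confidence value (`x_wconf`), which can stand in for the "uncertain characters" display raised earlier in the thread. The Python sketch below assumes hOCR produced by something like `tesseract page.png page -l eng hocr`, with words in `ocrx_word` spans; the class name, function name and the threshold of 80 are illustrative.

```python
from __future__ import annotations

import re
from html.parser import HTMLParser


class LowConfidenceWords(HTMLParser):
    """Collect (word, confidence) pairs below a threshold from hOCR."""

    def __init__(self, threshold: float = 80):
        super().__init__()
        self.threshold = threshold
        self.flagged: list[tuple[str, float]] = []
        self._in_word = False
        self._conf: float | None = None
        self._text: list[str] = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            match = re.search(r"x_wconf\s+(\d+(?:\.\d+)?)", attrs.get("title") or "")
            self._conf = float(match.group(1)) if match else None
            self._text = []
            self._in_word = True

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Only the closing tag of a word span ends a word; line spans are ignored.
        if tag == "span" and self._in_word:
            word = "".join(self._text).strip()
            if word and self._conf is not None and self._conf < self.threshold:
                self.flagged.append((word, self._conf))
            self._in_word = False


def low_confidence_words(hocr: str, threshold: float = 80) -> list[tuple[str, float]]:
    parser = LowConfidenceWords(threshold)
    parser.feed(hocr)
    parser.close()
    return parser.flagged
```

The flagged words could then be highlighted or listed per page, so that proofreading concentrates on the places where the engine itself was unsure.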

I'll attach some examples of OCR results. I need a bit of time, because there are no French paper books where I am at the moment...
Old 07-06-2021, 11:56 AM   #13
Pajamaman
Wizard
Posts: 2,827
Karma: 10700629
Join Date: May 2016
Location: Canada
Device: Onyx Nova
This sums up why I haven't moved to Linux yet: lack of software choice. Not a criticism, just an observation. Last time I tried Ubuntu I was pleasantly surprised, except for some weird decisions, like no blank image-free desktop background. Weird decision. Open-source folks seem like Communist committees, which are a good idea in theory, but which come up with crazy decisions.
Old 07-06-2021, 03:15 PM   #14
rcentros
eReader Wrangler
Quote:
Originally Posted by roger64
A PDF with a text layer (a "hybrid PDF") can of course be processed, but an image-only PDF needs to be converted to an image format first, and this is not optimal.
I guess the book I experimented with must have been "hybrid." Now I understand what you're saying.

Quote:
Originally Posted by roger64
Tesseract will soon reach v5.0 (an alpha version was published some months ago). From v4.0 onward, it has used a neural-network engine, with improved results.
That's good to know. I think it works pretty well now, so I'm looking forward to the improvements.

Quote:
Originally Posted by roger64
If you output to text, you can quickly process a full book. The hOCR format is heavier to handle.

I'll attach some examples of OCR results. I need a bit of time, because there are no French paper books where I am at the moment...
I haven't really tried the hOCR feature yet. The text feature's only drawback is (as one poster mentioned) that it doesn't retain bold and italic. But making the text "flowable" is easy in Jstar. I was able to convert a 7-page Foreword from an older book in about ten minutes (including adding the italics). So a whole book (200 pages or so) would probably take a few hours. That's with clean text.
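
The "few hours" figure follows from simple scaling of the 7 pages in ten minutes quoted above. A small Python sketch of the arithmetic, assuming the same rate holds for a whole book with clean text (the function name is made up for illustration):

```python
def cleanup_hours(pages: int, sample_pages: int = 7, sample_minutes: float = 10) -> float:
    """Scale a timed sample linearly to a whole book, in hours."""
    return pages * sample_minutes / sample_pages / 60


print(round(cleanup_hours(200), 1))  # 4.8 hours for a 200-page book
print(round(cleanup_hours(500), 1))  # 11.9 hours for the 500 pages in the first post
```

Messier scans, footnotes or heavy italics would push the real figure up, so this is a lower bound rather than a promise.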

Looking forward to seeing some of the OCR results. Thanks for all the information.
Old 07-06-2021, 03:22 PM   #15
rcentros
eReader Wrangler
Quote:
Originally Posted by Pajamaman
This sums up why I haven't moved to Linux yet: lack of software choice. Not a criticism, just an observation. Last time I tried Ubuntu I was pleasantly surprised, except for some weird decisions, like no blank image-free desktop background. Weird decision. Open-source folks seem like Communist committees, which are a good idea in theory, but which come up with crazy decisions.
I don't like Ubuntu's desktop either. (I wasn't a fan of Unity and I'm not a fan of the GNOME 3 GUI. I could probably get used to it, but I don't want to.) Linux Mint (while based on Ubuntu) uses a "traditional" Windows-like desktop (and this is consistent across all three "flavors": Cinnamon, MATE and Xfce). I don't know what you mean by "no blank image-free desktop background" in Ubuntu, but in Linux Mint you can use whatever image (including a blank image) you want.