12-29-2020, 12:13 PM   #1
ichnilatis
Connoisseur
Posts: 51
Karma: 10
Join Date: Jul 2020
Location: Greece
Device: Pocketbook Touch Lux 5
Problem with OCR

Hi,
I have installed ell.traineddata and grc.traineddata into koreader/data/tessdata, but KOReader doesn't recognize the text of a scanned PDF I have in Ancient Greek, even though I have switched on "Forced OCR".

I would also like to ask why there are only two options for "Document Language", English and Chinese?

Thank you for your help!


P.S.: Let me wish you all a blessed new year. May the light of the newborn Christ illuminate your heart in a dark hopeless world! (sorry if it is not politically correct)
12-29-2020, 01:06 PM   #2
Frenzie
Wizard
Posts: 1,167
Karma: 437844
Join Date: Oct 2014
Location: Antwerp
Device: Kobo Aura H2O
I suspect it was written by a Chinese contributor many years ago. Ideally someone would polish it a bit by making the options depend on what's in that folder, but for the moment you can set the default in persistent.defaults.lua.

Incidentally, is there a document available on Archive.org or some such to test with?
12-29-2020, 01:26 PM   #3
ichnilatis
Quote:
Originally Posted by Frenzie
I suspect it was written by a Chinese contributor many years ago. Ideally someone would polish it a bit by making the options depend on what's in that folder, but for the moment you can set the default in persistent.defaults.lua.

Incidentally, is there a document available on Archive.org or some such to test with?

Which word should replace "Chinese"? I mean, what change should I make in persistent.defaults.lua?

I am uploading a page of a scanned book. I noticed that the book I was reading was in DjVu format, so I converted the page to PDF for you. I believe the problem exists for both PDF and DjVu.
Attached Files
File Type: pdf p0242.pdf (600.6 KB, 31 views)
12-29-2020, 04:35 PM   #4
Frenzie
The text is really meaningless; it's the three-letter code hidden behind it that counts. In your case, grc and ell.
https://github.com/koreader/koreader....lua#L115-L118
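
To illustrate the point about labels versus codes, here is a minimal sketch. It assumes the two settings are the ones named DKOPTREADER_CONFIG_DOC_LANGS_TEXT and DKOPTREADER_CONFIG_DOC_LANGS_CODE (as quoted later in this thread) and that their entries are paired by position. The label strings are free text chosen for illustration.

```lua
-- Sketch only: the labels are arbitrary display text; the codes are what
-- select the training data (grc.traineddata, ell.traineddata in tessdata).
-- Assumption: the two lists are matched entry by entry, in order.
DKOPTREADER_CONFIG_DOC_LANGS_TEXT = {"Ancient Greek", "Modern Greek"}
DKOPTREADER_CONFIG_DOC_LANGS_CODE = {"grc", "ell"}
```

Whether the menu accepts more than two entries is not established in this thread, so the sketch keeps to two.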
12-29-2020, 04:43 PM   #5
Frenzie
It works for me, more or less. The OCR isn't great at spaces in italic.
Attached Thumbnails
Screenshot_2020-12-29_21-42-26.png (616.0 KB)

Last edited by Frenzie; 12-29-2020 at 05:31 PM. Reason: typo
12-30-2020, 04:55 AM   #6
ichnilatis
So, do I have to make this correction?

-- document languages for OCR
DKOPTREADER_CONFIG_DOC_LANGS_TEXT = {"English", "Ancient Greek"}
DKOPTREADER_CONFIG_DOC_LANGS_CODE = {"eng", "grc"} -- language code, make sure you have corresponding training data
DKOPTREADER_CONFIG_DOC_DEFAULT_LANG_CODE = "eng" -- that have filenames starting with the language codes

From the screenshot you sent I conclude that the breathings (᾿ ῾), the circumflex (῀) and the grave accent (`) are not recognized, nor are some letters.

Can this problem be solved?
12-30-2020, 07:02 AM   #7
Frenzie
Quote:
Originally Posted by ichnilatis
So, do I have to make this correction?

-- document languages for OCR
DKOPTREADER_CONFIG_DOC_LANGS_TEXT = {"English", "Ancient Greek"}
DKOPTREADER_CONFIG_DOC_LANGS_CODE = {"eng", "grc"} -- language code, make sure you have corresponding training data
DKOPTREADER_CONFIG_DOC_DEFAULT_LANG_CODE = "eng" -- that have filenames starting with the language codes

Something like that, yes. If you want to keep it, make sure to put it in persistent.defaults.lua.

Quote:
From the screenshot you sent I conclude that the breathings (᾿ ῾), the circumflex (῀) and the grave accent (`) are not recognized, nor are some letters.

Can this problem be solved?

It's probably much less of a problem in non-italic text, but otherwise not really, unless you have an original document with a somewhat higher DPI. A newer version of Tesseract might also do slightly better.
12-30-2020, 07:17 AM   #8
ichnilatis
Quote:
Originally Posted by Frenzie
Something like that, yes. If you want to keep it, make sure to put it in persistent.defaults.lua.

Not just in defaults.lua? Where can I find persistent.defaults.lua?

Quote:
Originally Posted by Frenzie
It's probably much less of a problem in non-italic text, but otherwise not really, unless you have an original document with a somewhat higher DPI. A newer version of Tesseract might also do slightly better.

I use version 3.04, as recommended here. Can I use a newer version of Tesseract?

Thanks for your replies!
12-30-2020, 11:29 AM   #9
Frenzie
Quote:
Originally Posted by ichnilatis
Not just in defaults.lua? Where can I find persistent.defaults.lua?

It's a file you have to create yourself. defaults.lua will be overwritten by updates.
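
As a rough sketch of what such a file might contain: this assumes persistent.defaults.lua accepts the same assignments as defaults.lua and needs only the values being overridden. The variable names are the ones quoted earlier in this thread; choosing "grc" as the default is purely illustrative.

```lua
-- persistent.defaults.lua (hypothetical contents; create it as a plain-text file)
-- Only the overridden OCR settings are listed here.
DKOPTREADER_CONFIG_DOC_LANGS_TEXT = {"English", "Ancient Greek"}
-- Each code needs training data in tessdata whose filename starts with it.
DKOPTREADER_CONFIG_DOC_LANGS_CODE = {"eng", "grc"}
-- Assumed to be one of the codes listed above.
DKOPTREADER_CONFIG_DOC_DEFAULT_LANG_CODE = "grc"
```

Keeping the overrides in a separate file means they are not lost when an update replaces defaults.lua.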



Quote:
Can I use a newer version of Tesseract?

Not in KOReader, but an update to Tesseract 4 is coming. I wouldn't count on any noticeable improvements except in some edge cases, but at the same time it's probably not getting any worse either.
12-30-2020, 11:44 AM   #10
ichnilatis
Frenzie, I have made the change in defaults.lua and individual words are now recognized correctly. (I tried to take a screenshot to show you, but I couldn't; I've just started a thread about that.) But when I select more than one word and then choose the dictionary in the popup menu, nothing happens.

Also, I notice that when I highlight one or more words, the text isn't shown in the bookmark as usual, only the page and the time.

Quote:
Originally Posted by Frenzie
It's a file you have to create yourself. defaults.lua will be overwritten by updates.

You mean I can create the file with Notepad, containing just the OCR text mentioned above?

One more question: why are there only two options for the text language? What should the second option be instead of "Chinese"? Does each user have to make the change manually in defaults.lua?

Thanks again!
12-30-2020, 12:08 PM   #11
ichnilatis
Quote:
Originally Posted by Frenzie
It works for me, more or less. The OCR isn't great at spaces in italic.

Actually the problem isn't with the italics but with Greek (even Modern Greek). When I highlight text in upright letters, both Greek and English, the popup dictionary window shows no spaces between the Greek words, only between the English ones.

It's a pity I can't take a screenshot of this...
12-30-2020, 01:14 PM   #12
rkomar
Wizard
Posts: 2,785
Karma: 11885355
Join Date: Oct 2010
Location: Sudbury, ON, Canada
Device: PRS-505, PB 902, PRS-T1, PB 623, PB 840, PB 633
If you're doing this on your Pocketbook device, you _can_ take a screenshot. You can set some button (e.g. Power Double Press) to capture a screenshot by configuring it in Settings>Personalize>Key Mapping. The screenshots end up as bitmap images in the /screens folder, so you can get them via USB from there.
12-30-2020, 01:24 PM   #13
pazos
cosiñeiro
Posts: 790
Karma: 1357911
Join Date: Apr 2014
Device: BQ Cervantes 4
Quote:
Originally Posted by rkomar
If you're doing this on your Pocketbook device, you _can_ take a screenshot. You can set some button (e.g. Power Double Press) to capture a screenshot by configuring it in Settings>Personalize>Key Mapping. The screenshots end up as bitmap images in the /screens folder, so you can get them via USB from there.

IIRC the native screenshot tool won't work on KOReader. It is intended for apps built against the PBSDK.
12-30-2020, 02:18 PM   #14
ichnilatis
Quote:
Originally Posted by rkomar
If you're doing this on your Pocketbook device, you _can_ take a screenshot. You can set some button (e.g. Power Double Press) to capture a screenshot by configuring it in Settings>Personalize>Key Mapping. The screenshots end up as bitmap images in the /screens folder, so you can get them via USB from there.

Unfortunately the images taken this way from KOReader are blank.

However, thank you for your help.
12-30-2020, 03:53 PM   #15
rkomar
Ah, I assumed that the system was taking it from the framebuffer, since it works with the home screen. Sorry for the misdirection.

All times are GMT -4.