Old 07-27-2020, 01:30 PM   #16
ownedbycats
Wizard
Posts: 1,182
Karma: 345192
Join Date: Oct 2018
Device: Kobo Aura HD
Since a good part of my library is PDFs, using pdftotext sped things up considerably.

I did notice everything else lagging when I used 12 processes. I switched it to six and the lag disappeared.
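The effect described above (12 processes causing lag, six not) is what you would expect when every logical CPU is kept busy. A minimal sketch of a worker cap that leaves some capacity free; `pick_worker_count` and its `reserve` parameter are hypothetical names, not part of the plugin:

```python
import os


def pick_worker_count(logical_cpus=None, reserve=0.5):
    """Return how many worker processes to start.

    `reserve` is the fraction of logical CPUs to leave idle (a made-up knob).
    """
    if logical_cpus is None:
        logical_cpus = os.cpu_count() or 1
    workers = int(logical_cpus * (1.0 - reserve))
    # Always start at least one worker, even on a single-core machine.
    return max(1, workers)


if __name__ == "__main__":
    # With 12 logical CPUs and half of them reserved, this prints 6.
    print(pick_worker_count(12, reserve=0.5))
```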
Old 07-27-2020, 01:35 PM   #17
mapozyan
Enthusiast
Posts: 34
Karma: 1002
Join Date: Jul 2020
Device: android
Quote:
Originally Posted by DNSB View Post
And my test is still running after 8 hours (~11,500 books).
What is the progress after 8 hours? I'm curious if it's comparable to my numbers.

And yes, did you set up pdftotext properly?
Old 07-27-2020, 01:48 PM   #18
DNSB
Bibliophagist
Posts: 8,974
Karma: 42838165
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Forma, Clara HD, Nexus 7 HD, iPad Pro, Tolino epos
Quote:
Originally Posted by mapozyan View Post
What is the progress after 8 hours? I'm curious if it's comparable to my numbers.

And yes, did you set up pdftotext properly?
I did set up pdftotext properly. As I mentioned in the edit comment (though perhaps I should have put it in the body of the message), the run finished about 2 minutes after I wrote the message.

My main reason for checking this out was that I use ElasticSearch with Graylog, which states that part of its reason for existence is to work around the shortcomings of ElasticSearch.

Last edited by DNSB; 07-27-2020 at 01:53 PM.
Old 07-27-2020, 02:09 PM   #19
mapozyan
Enthusiast
Posts: 34
Karma: 1002
Join Date: Jul 2020
Device: android
Quote:
Originally Posted by DNSB View Post
My main reason for checking this out was that I use ElasticSearch with Graylog, which states that part of its reason for existence is to work around the shortcomings of ElasticSearch.
Elasticsearch is a super advanced search engine. I guess one will hardly notice any shortcomings until they run some web-scale projects on top of it.

That said, I hope it's more than enough for local library management.
Old 07-27-2020, 04:06 PM   #20
ownedbycats
Wizard
Posts: 1,182
Karma: 345192
Join Date: Oct 2018
Device: Kobo Aura HD
My library is about 4000 books and 20 GB, though the bulk of that is image-heavy PDF files (lots of video game strategy guides).

Last edited by ownedbycats; 07-27-2020 at 04:09 PM.
Old 07-27-2020, 04:17 PM   #21
thiago.eec
Evangelist
Posts: 449
Karma: 155084
Join Date: Dec 2016
Location: Goiânia - Brazil
Device: iPad, Kindle Paperwhite
Thanks for the plugin! Awesome idea.

I did not install pdftotext, but I only have 18 PDFs in my library.
I'm testing it on a library with 1130 books (many with multiple formats: EPUB, AZW3/KFX) and about 3GB.

Info about the initial indexing: It took only 16 minutes to go from 0 to 99%. But now it is stuck at 99% for about 3h45min. My system is an i7 7700HQ (16GB of RAM). My processor has 4 cores (8 threads). The plugin chose a maximum of 8 parallel processes. Now, the strange part: according to Task Manager (Windows), my CPU is only running at 20% of its total capacity.

While writing this post, it finished, after 4h05min. Now it searches instantly! Nice!


------ My first impressions and questions ------


1) Question: When you have multiple formats for one book, does it look up all the formats or just one?

2) Question: In caps.json, it only shows EPUB, MOBI, PDF and TXT files. According to this, and other tests I have done, it does not index AZW3/KFX files. Is this correct?

3) Question: How does the index work for new additions? Are the new files automatically indexed when I run ElasticSearch?

4) Suggestion: It would be really important to have more options for search. Right now, it searches word by word. So, I can't look for phrases or compound words (Ex: coffee table. It will search for books with "coffee" OR "table"). Also, accented characters are distinguished from non-accented.

5) Info: According to ElasticSearch Reference, to have more options for search, you would need to change your query from "match" to "query_string". This would allow operators, wild cards and regular expressions. P.S.: "match" queries can use operators too, but you would have to code that.

6) Info: The ZIP file attached to first post has another ZIP inside (with the full plugin).
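To make points 4 and 5 concrete, here is a minimal sketch of the request bodies involved, written as Python dicts. The field name `content` is an assumption; the plugin's real field name and query code may differ.

```python
import json

FIELD = "content"  # assumption: name of the indexed text field


def match_query(text, operator="or"):
    """Word-by-word search; operator="and" requires every word to be present."""
    return {"query": {"match": {FIELD: {"query": text, "operator": operator}}}}


def query_string_query(expression):
    """Search whose query text may carry operators, wildcards and regexes."""
    return {"query": {"query_string": {"default_field": FIELD, "query": expression}}}


if __name__ == "__main__":
    # Default behaviour: books containing "coffee" OR "table".
    print(json.dumps(match_query("coffee table"), indent=2))
    # Same query type, but both words are required.
    print(json.dumps(match_query("coffee table", operator="and"), indent=2))
    # query_string lets the user type the operators themselves.
    print(json.dumps(query_string_query("coffee AND tabl*"), indent=2))
```

The trade-off is that `query_string` exposes its syntax to the user, so a malformed expression is rejected by the server, while `match` accepts any text.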
Old 07-27-2020, 05:26 PM   #22
DNSB
Bibliophagist
Posts: 8,974
Karma: 42838165
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Forma, Clara HD, Nexus 7 HD, iPad Pro, Tolino epos
Quote:
Originally Posted by thiago.eec View Post
------ My first impressions and questions ------


1) Question: When you have multiple formats for one book, does it look up all the formats or just one?
On my setup, it seems to have searched both epub and pdf format files. I don't keep other formats in my library.

Quote:
Originally Posted by thiago.eec View Post
3) Question: How does the index work for new additions? Are the new files automatically indexed when I run ElasticSearch?
When I added a few files, the search seems to have imported them and searched them.

Quote:
Originally Posted by thiago.eec View Post
4) Suggestion: It would be really important to have more options for search. Right now, it searches word by word. So, I can't look for phrases or compound words (Ex: coffee table. It will search for books with "coffee" OR "table"). Also, accented characters are distinguished from non-accented.

5) Info: According to ElasticSearch Reference, to have more options for search, you would need to change your query from "match" to "query_string". This would allow operators, wild cards and regular expressions. P.S.: "match" queries can use operators too, but you would have to code that.

6) Info: The ZIP file attached to first post has another ZIP inside (with the full plugin).
A wild card or regex search capability would be great!

I used the 3rd zip file from message #9 in this thread. Note that my setup is on Windows x64.

I also restarted the full search after deleting the old setup and moving my computer related ebooks out of my calibre library. I also realized that I had not pointed to pdftotext properly and corrected that. Was a heck of a lot faster with those mostly oversized pdf files removed. 2 hours down to 5 minutes.
Old 07-27-2020, 05:51 PM   #23
mapozyan
Enthusiast
Posts: 34
Karma: 1002
Join Date: Jul 2020
Device: android
Thanks thiago.eec for feedback!

Quote:
Originally Posted by thiago.eec View Post
But now it is stuck at 99% for about 3h45min.
You have probably hit one of the "bad" PDF files, which take ages to process without pdftotext.
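One way to keep a single slow PDF from holding up a whole run is to call pdftotext with a timeout. This is a standalone sketch, not the plugin's code; the function names and the 120-second limit are assumptions. It relies on pdftotext writing to stdout when the output file is given as `-`.

```python
import subprocess


def build_pdftotext_command(pdf_path, executable="pdftotext"):
    """Build the command line; "-" sends the extracted text to stdout."""
    return [executable, "-enc", "UTF-8", pdf_path, "-"]


def extract_pdf_text(pdf_path, executable="pdftotext", timeout_seconds=120):
    """Return the text of a PDF, or None if extraction fails or times out."""
    command = build_pdftotext_command(pdf_path, executable)
    try:
        result = subprocess.run(
            command, capture_output=True, timeout=timeout_seconds, check=True
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        # Timed out, pdftotext reported an error, or the executable is missing.
        return None
    return result.stdout.decode("utf-8", errors="replace")
```

A caller could log the files that return `None` and skip them, so the progress bar does not sit at 99% waiting for one book.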



Quote:
Originally Posted by thiago.eec View Post
1) Question: When you have multiple formats for one book, does it look up all the formats or just one?
It will look up all formats.

Quote:
Originally Posted by thiago.eec View Post
2) Question: In caps.json, it only shows EPUB, MOBI, PDF and TXT files. According to this, and other tests I have done, it does not index AZW3/KFX files. Is this correct?
Correct. The plugin currently supports the following formats: CHM, CBZ, FB2, PDB, DJVU, EPUB, MOBI, DOCX, PDF, TXT, RTF. I will try to add AZW3/KFX support in the next release.


Quote:
Originally Posted by thiago.eec View Post
3) Question: How does the index work for new additions? Are the new files automatically indexed when I run ElasticSearch?
Yes, the plugin will detect all files that were added to the library since the last search and will add them to index.

Quote:
Originally Posted by thiago.eec View Post
4) Suggestion: It would be really important to have more options for search. Right now, it searches word by word. So, I can't look for phrases or compound words (Ex: coffee table. It will search for books with "coffee" OR "table"). Also, accented characters are distinguished from non-accented.
Thanks, I am thinking about it. I will try to address it in the next release.
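On the accent part of that suggestion: Elasticsearch ships an `asciifolding` token filter that can be added to a custom analyzer so that accented and unaccented spellings match. A sketch of such index settings as a Python dict follows; the analyzer name `folded_text` is made up, and `fold_accents` is only a rough local approximation of the effect, not Elasticsearch's implementation.

```python
import unicodedata

# Index settings with a custom analyzer; "folded_text" is a hypothetical name.
INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "analyzer": {
                "folded_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        }
    }
}


def fold_accents(text):
    """Rough local approximation of accent folding: drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(fold_accents("Goiânia"))  # prints "Goiania"
```

Changing the analyzer of an existing field generally means re-indexing the library, so this would fit best at the start of a release cycle.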

Quote:
Originally Posted by thiago.eec View Post
6) Info: The ZIP file attached to first post has another ZIP inside (with the full plugin).
Haha indeed! Thanks for noticing!
Old 07-27-2020, 06:19 PM   #24
mapozyan
Enthusiast
Posts: 34
Karma: 1002
Join Date: Jul 2020
Device: android
Quote:
Originally Posted by mapozyan View Post
Yes, the plugin will detect all files that were added to the library since the last search and will add them to index.
Sorry, I was not specific enough.

The new files won't be automatically indexed instantly when you add new books. But they will be indexed once you run Power Search again.
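The "index on the next run" behaviour can be pictured as a comparison between the library and a record of what was already indexed. This is a sketch under assumptions: both dictionaries below are hypothetical, and the plugin may track changes in a different way.

```python
def files_needing_index(library_files, indexed_state):
    """Return paths that are new or modified since they were last indexed.

    library_files: dict of path -> last-modified timestamp (e.g. from os.stat).
    indexed_state: dict of path -> timestamp recorded at the last indexing run.
    """
    pending = []
    for path, modified in sorted(library_files.items()):
        last_indexed = indexed_state.get(path)
        # New files have no record; changed files have a newer timestamp.
        if last_indexed is None or modified > last_indexed:
            pending.append(path)
    return pending


if __name__ == "__main__":
    library = {"a.epub": 100.0, "b.pdf": 250.0, "c.epub": 300.0}
    state = {"a.epub": 100.0, "b.pdf": 200.0}
    print(files_needing_index(library, state))  # ['b.pdf', 'c.epub']
```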
Old 07-27-2020, 07:20 PM   #25
DNSB
Bibliophagist
Posts: 8,974
Karma: 42838165
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Forma, Clara HD, Nexus 7 HD, iPad Pro, Tolino epos
Quote:
Originally Posted by mapozyan View Post
Sorry, I was not specific enough.

The new files won't be automatically indexed instantly when you add new books. But they will be indexed once you run Power Search again.
Matches with what I saw. The search with the 3 added books went 33%, 66%, done.
Old 07-27-2020, 08:44 PM   #26
ownedbycats
Wizard
Posts: 1,182
Karma: 345192
Join Date: Oct 2018
Device: Kobo Aura HD
Quote:
Originally Posted by mapozyan View Post
Sorry, I was not specific enough.

The new files won't be automatically indexed instantly when you add new books. But they will be indexed once you run Power Search again.
While testing, I also had FanFicFare update some of my fanfics and then did a search for random words that appeared only in the newest chapters. It did re-index the ePubs that had changed.
Old 07-28-2020, 12:37 PM   #27
thiago.eec
Evangelist
Posts: 449
Karma: 155084
Join Date: Dec 2016
Location: Goiânia - Brazil
Device: iPad, Kindle Paperwhite
Quote:
Originally Posted by mapozyan View Post
You have probably hit one of the "bad" PDF files, which take ages to process without pdftotext.
The time consuming files are image PDFs and Fixed Layout books. But now I installed pdftotext and it is way faster!

Quote:
Originally Posted by mapozyan View Post
Correct. The plugin currently supports the following formats: CHM, CBZ, FB2, PDB, DJVU, EPUB, MOBI, DOCX, PDF, TXT, RTF. I will try to add AZW3/KFX support in the next release.
I took a deeper look at the plugin. All files are converted to TXT, then passed to ElasticSearch. So, the allowed formats can be any of those handled by calibre. I changed the supported list to include DOC, AZW3, AZW4 and KFX and they all got indexed.
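The "convert to TXT, then index" flow could look roughly like this when done with calibre's `ebook-convert` command-line tool. Whether the plugin calls that tool or calibre's internal conversion API is not stated here, so treat the helper names and the format set as assumptions.

```python
import os

# Formats named in this thread; whether each converts cleanly depends on the
# calibre installation.
SUPPORTED_FORMATS = {
    "CHM", "CBZ", "FB2", "PDB", "DJVU", "EPUB", "MOBI", "DOCX", "PDF", "TXT", "RTF",
    "DOC", "AZW3", "AZW4", "KFX",
}


def is_supported(book_path):
    """Check the file extension against the supported list (case-insensitive)."""
    extension = os.path.splitext(book_path)[1].lstrip(".").upper()
    return extension in SUPPORTED_FORMATS


def build_convert_command(book_path, output_dir):
    """Build an ebook-convert command that writes <name>.txt into output_dir."""
    name = os.path.splitext(os.path.basename(book_path))[0]
    return ["ebook-convert", book_path, os.path.join(output_dir, name + ".txt")]
```

Keeping the format list in one place makes the kind of local change described above a one-line edit.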

Quote:
Originally Posted by mapozyan View Post
Thanks, I am thinking about it. I will try to address it in the next release.
I did a simple test here, and changing query from "match" to "match_phrase" did the job allowing phrases and compound words. Using "query_string" isn't that easy, though.
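For reference, the change described above amounts to swapping the query type in the request body. A minimal sketch, assuming the text is indexed under a field called `content`:

```python
FIELD = "content"  # assumption: name of the indexed text field


def phrase_query(phrase, slop=0):
    """Match the words of `phrase` in order; slop loosens how strictly adjacent they must be."""
    return {"query": {"match_phrase": {FIELD: {"query": phrase, "slop": slop}}}}


if __name__ == "__main__":
    # Finds "coffee table" as a phrase rather than either word alone.
    print(phrase_query("coffee table"))
```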
Old 07-28-2020, 06:32 PM   #28
mapozyan
Enthusiast
Posts: 34
Karma: 1002
Join Date: Jul 2020
Device: android
Quote:
Originally Posted by thiago.eec View Post
I took a deeper look at the plugin. All files are converted to TXT, then passed to ElasticSearch. So, the allowed formats can be any of those handled by calibre. I changed the supported list to include DOC, AZW3, AZW4 and KFX and they all got indexed.
Good to know, thanks!

I decided to be conservative in the first releases and support only those book formats that I could test well enough. I will extend this list in the next release once I have made sure it works well.

Quote:
Originally Posted by thiago.eec View Post
I did a simple test here, and changing query from "match" to "match_phrase" did the job allowing phrases and compound words. Using "query_string" isn't that easy, though.
My final goal is to implement search syntax and semantics similar to Google search. After all, when searching in Google we aren't using regular expressions, are we?

This, however, means that I would need to find a way of sorting results by relevance. I don't know how easy that is to do in Calibre, but it seems to me the right way to go.
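Elasticsearch already returns a relevance score (`_score`) with each hit, so the ordering itself can come from the search response. A sketch of pulling that out; it assumes, hypothetically, that each document's `_id` is the calibre book id, and how the ordered list would then be shown in the calibre GUI is left open.

```python
def ranked_book_ids(response):
    """Return (book id, score) pairs from a search response, best match first."""
    hits = response.get("hits", {}).get("hits", [])
    pairs = [(hit["_id"], hit.get("_score") or 0.0) for hit in hits]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    sample = {"hits": {"hits": [
        {"_id": "12", "_score": 1.5},
        {"_id": "7", "_score": 3.2},
    ]}}
    print(ranked_book_ids(sample))  # [('7', 3.2), ('12', 1.5)]
```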
Old 07-31-2020, 01:05 PM   #29
mapozyan
Enthusiast
Posts: 34
Karma: 1002
Join Date: Jul 2020
Device: android
Version 1.2.0 released.

Contains the following usability improvements:
  • Phrase search when using quotes (example query: "Mark Twain" poetry)
  • Details section showing book conversion status

Adds support for the DOC, AZW3 and KFX file formats.
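A query such as `"Mark Twain" poetry` mixes a quoted phrase with a loose word. One possible way to handle that, shown here as a sketch rather than the plugin's actual code, is to split the input and combine `match_phrase` and `match` clauses in a `bool` query. The field name `content` and the choice to require every clause (`must`) are assumptions.

```python
import re

FIELD = "content"  # assumption: name of the indexed text field


def parse_query(text):
    """Split a query into quoted phrases and the remaining single words."""
    phrases = re.findall(r'"([^"]+)"', text)
    remainder = re.sub(r'"[^"]*"', " ", text)
    words = remainder.split()
    return phrases, words


def build_bool_query(text):
    """Require every quoted phrase and every loose word to match."""
    phrases, words = parse_query(text)
    must = [{"match_phrase": {FIELD: phrase}} for phrase in phrases]
    must += [{"match": {FIELD: word}} for word in words]
    return {"query": {"bool": {"must": must}}}


if __name__ == "__main__":
    print(parse_query('"Mark Twain" poetry'))  # (['Mark Twain'], ['poetry'])
    print(build_bool_query('"Mark Twain" poetry'))
```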
Attached image: caps.1.2.0.png

Last edited by mapozyan; 08-08-2020 at 02:00 PM. Reason: Removed attached version 1.2.0
Old 07-31-2020, 01:18 PM   #30
mapozyan
Enthusiast
Posts: 34
Karma: 1002
Join Date: Jul 2020
Device: android
Quote:
Originally Posted by ownedbycats View Post
Searching for a compound word that uses a hyphen (e.g. up-to-date) also returns results for the separate words.
I tried searching for "up-to-date" as a phrase. It still doesn't work correctly: "up-to-date" brings the same results as "up to date". Still, it's much more useful than searching for individual words.
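A likely explanation, assuming the index uses Elasticsearch's default standard analyzer, is that the hyphen is treated as a word boundary, so both spellings produce the same tokens. The regex below is only a rough stand-in for that tokenizer, enough to show the effect:

```python
import re


def rough_tokens(text):
    """Approximate word tokenization: lowercase runs of letters and digits."""
    return re.findall(r"[^\W_]+", text.lower())


if __name__ == "__main__":
    print(rough_tokens("up-to-date"))  # ['up', 'to', 'date']
    print(rough_tokens("up to date"))  # ['up', 'to', 'date']
```

Since the phrase query compares token sequences, the two inputs are indistinguishable once the hyphens are gone.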
All times are GMT -4.