Indexing - please consider this
I have Calibre. It's a lovely thing. I have Dropout because Calibre doesn't search inside documents. I am NOT a programmer. I have been known to hire programmers (in fact I have two working for me now on other projects) and if I had the money I'd plunk it down instantement (as they say in Quebec) for someone to do the following:
Plug the Lucene search engine into Calibre as a core feature. Imagine the sheer power of that for every student and scholar on the freakin' planet. Not only would they be able to assemble books by metadata (in Calibre), but they would be able to search INSIDE the documents themselves, enabling a query-based research practice that would have mind-blowing implications for the humanities. Seriously.

Dropout is free (as in beer) and is based on the Apache Lucene search engine. You can DL it here: https://dropout.codeplex.com/

What's super cool is that not only does it index the contents of your books, it's TRANSPORTABLE, so you put all your books on a portable drive, index it and then carry it with you. You then have your own Personal Portable Research Library that you can use wherever you go. And given how title-centric Calibre is, imagine if you could call up a title, search inside that specific title itself, and find what you need...

Gates:
1. Lucene works on Linux and Windows. I don't know if it works on Mac, much less iOS or Android. (So it would have to be ported or run in a virtual machine - testing that would suck. A lot. yadda yadda)
2. It would require a significant rewrite of Calibre and Calibre's UI, or some kind of a branch of Calibre. Branching software is a precarious journey, so it would be better if Calibre itself absorbed this functionality.
3. Portability - I don't know if Calibre is portable, even within a platform. (Yeah, I should know that, but Julian gave me another rum and Coke...)

Benefits: THINK ABOUT IT for more than 5 seconds. *Mind. Blown.* There would be no reason to use any other ereading / indexing app ever, for anyone, anywhere. It would be a complete ereading solution. If I had the money, I'd pay someone tomorrow to build it. Seriously.

Possible Plan B:
1. A Lucene plugin? That could be ugly as home made sin, but it *could* work. It would not be as elegant or as useful as being built into Calibre itself.
Seriously, folks: imagine if Calibre indexed the contents of your books. If you don't care, I can assure you there are MILLIONS of students and scholars who would pee their pants (well, they would be "overjoyed") at such a thing. I look forward to this conversation. Warm regards to you wonderful people! Stuart Studebaker |
Moderator Notice Please follow the 'before posting' directions in the sticky in development. Moved out to Calibre. |
Lucene's inability to search the major ebook formats, and its dependence on the effectively Windows-only .NET framework, probably means there's very little chance that the authors of Calibre would provide any support for it in the core product.

BTW, Windows Search uses IFilters (that's why MS 'invented' them) - I have 11 installed. But sadly none for the popular ebook formats. If you know where one can get IFilters for EPUB, MOBI, etc., then I too would really like to know where :)

If Lucene can produce a list of the files that meet the search criteria, then you could push that list into the Import List plugin and get it to create a Reading List.

And have a look at the Recoll Full Text Search Plugin - maybe you could use it as a model for developing your own plugin for Lucene. Recoll is a search tool that runs on Linux and OS X; it also indexes and searches EPUB files amongst the usual suspects.

On what basis do you make the judgement that using a plugin would be as "ugly as home made sin"?

BR |
Calibre is pyqt isn't it? There are python ports of lucene, have been for years: http://lucene.apache.org/pylucene/ and Qt itself uses clucene internally (or used to, back when I last used it, which admittedly is 5-6 years and two major versions ago) so it was like a freebie anyway. There's no .net dependency specific to lucene itself, it was originally written in Java and it's been ported to practically everything (both in terms of operating system and programming language).
Ebook formats are also nothing special: most of them are a bundle of XML files wrapped in compression, and there are mountains of code already in Calibre for taking apart, parsing and repackaging them, so that part isn't really a big deal, I would think. There are a lot of things Calibre can do that MS doesn't :) Indexing PDFs is probably harder, but there are utilities for that too. This is built into some bibliographic software like Zotero, but most of them simply run a command-line pdftotext and index the results. They aren't fancy at all, but they do the job. |
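To make the "bundle of XML files wrapped in compression" point concrete, here is a rough sketch using only the Python standard library. It opens an EPUB as a ZIP archive, strips the markup from its (X)HTML files, and builds a toy inverted index over the words. The function names (`epub_text`, `build_index`, `search`) are made up for this illustration and are not Calibre's or Lucene's API; a real indexer would also need stemming, ranking, and on-disk storage, which this leaves out entirely.

```python
import re
import zipfile
from collections import defaultdict
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an (X)HTML document, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def epub_text(epub_file):
    """Return the concatenated text of every (X)HTML member in an EPUB.

    `epub_file` is a path or a file-like object. An EPUB is a ZIP
    container, so the standard zipfile module can open it directly.
    """
    chunks = []
    with zipfile.ZipFile(epub_file) as zf:
        for name in zf.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = _TextExtractor()
                parser.feed(zf.read(name).decode("utf-8", errors="replace"))
                parser.close()
                chunks.append(" ".join(parser.parts))
    return " ".join(chunks)


def build_index(books):
    """Map each lowercase word to the set of book ids that contain it."""
    index = defaultdict(set)
    for book_id, text in books.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(book_id)
    return index


def search(index, query):
    """Return the ids of books containing every word in the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

Usage would be along the lines of `index = build_index({"some-id": epub_text("book.epub")})` followed by `search(index, "two words")`, where the id and path are placeholders. The same `build_index` step would work on whatever text pdftotext writes out for a PDF.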
That's not a Python port, that's a Python wrapper that uses JNI, which would need the JVM to be distributed with Calibre, which is absolutely never going to happen. If you want to use Lucene, use CLucene or Lucy. However, I would suggest using Xapian instead.
And doing a full text search is on my TODO list. |
Ah, didn't realise it was just a wrapper, thanks for the correction.
Still, good to hear it's on the TODO list, however it's done. |