Quote:
Originally Posted by BetterRed
And that's why in the context of calibre I interpret the word 'search' as 'query' - because that's what it does, it interrogates a relational database using Structured Query Language.
FWIW: calibre searches neither use SQL nor depend on the db structure. Search expressions are "compiled" into an abstract syntax tree (AST). The tree form of the expression is evaluated on a book-by-book basis, using set arithmetic to avoid evaluating any sub-expression that has no chance of matching. Set arithmetic is also used to filter the results by restriction (virtual library etc).
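To make the set arithmetic concrete, here is a minimal sketch of the idea. It is not calibre's actual code: the node classes, `evaluate`, `matches` and the sample books are all made up for illustration, and the matching is a plain lowercase substring test.

```python
from dataclasses import dataclass

# Hypothetical AST node types; the field name "all" stands for a naked term.
@dataclass
class Term:
    field: str
    text: str

@dataclass
class And:
    left: object
    right: object

@dataclass
class Or:
    left: object
    right: object

@dataclass
class Not:
    child: object

def matches(book, term):
    """Plain case-insensitive substring test on one book (a dict of fields)."""
    needle = term.text.lower()
    if term.field == "all":
        return any(needle in str(value).lower() for value in book.values())
    return needle in str(book.get(term.field, "")).lower()

def evaluate(node, books, candidates):
    """Return the ids in `candidates` that match `node`."""
    if isinstance(node, Term):
        return {book_id for book_id in candidates if matches(books[book_id], node)}
    if isinstance(node, And):
        # The right side only sees books that already passed the left side.
        survivors = evaluate(node.left, books, candidates)
        return evaluate(node.right, books, survivors)
    if isinstance(node, Or):
        # Books already matched on the left never need the right side.
        left = evaluate(node.left, books, candidates)
        return left | evaluate(node.right, books, candidates - left)
    if isinstance(node, Not):
        return candidates - evaluate(node.child, books, candidates)
    raise TypeError(f"unknown node: {node!r}")

books = {
    1: {"title": "Cancer Ward", "authors": "Aleksandr Solzhenitsyn"},
    2: {"title": "Foo Fighters", "authors": "Bar Author"},
    3: {"title": "Bar Guide", "authors": "Foo Writer"},
}
query = And(Term("all", "foo"), Term("title", "bar"))

everything = evaluate(query, books, set(books))
# A restriction is just a smaller starting set of candidate ids.
restricted = evaluate(query, books, set(books) & {1, 2})
```

In this toy version a restriction such as a virtual library costs nothing extra: it only shrinks the candidate set the tree is evaluated against.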
Unanchored non-regexp text matches are determined using ICU (International Components for Unicode) equivalency rules. The ICU package "compiles" the query text into a canonical form that takes into consideration accented character equivalences in the locale being used, then scans the text being examined for that form. That is why searches for "solzen" will find "Solženicyn, Aleksandr", "stepa" will find "Petr Štepánek", and "strasse" will find "Straße", at least in the English locale. BTW: this process also explains why searching can fail quite badly on OS X in some locales: Apple ships an old, broken ICU package.
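For anyone who wants to play with the effect, the following is a rough stand-in that uses only the Python standard library. It is not ICU and not what calibre does: it ignores the locale entirely and simply strips combining marks and case-folds, which happens to reproduce the three examples above. The names `fold` and `contains` are invented for this sketch.

```python
import unicodedata

def fold(text):
    """Approximate accent- and case-insensitive form (assumption: no locale rules)."""
    # NFKD splits "ž" into "z" plus a combining caron, and so on.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() lowercases and also maps "ß" to "ss".
    return stripped.casefold()

def contains(haystack, needle):
    """Unanchored match: does the folded needle occur in the folded haystack?"""
    return fold(needle) in fold(haystack)

found_author = contains("Solženicyn, Aleksandr", "solzen")
found_name = contains("Petr Štepánek", "stepa")
found_street = contains("Straße", "strasse")
```

Real ICU matching is driven by per-locale collation rules, so its results can differ from this approximation in locales where an accented letter counts as a distinct letter.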
One optimization I have considered is to reorder the AST so that expressions that have a better chance of being restrictive are evaluated first. This optimization could improve the performance of expressions like "foo and title:bar" because the naked search term "foo" would be checked only if the title contains "bar". Evaluated as written, every field that naked search terms are permitted to check would be examined before the title is checked. I haven't bothered yet because I don't have strong evidence that any improvement merits the work.
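A sketch of what such a reordering pass might look like, purely as an illustration. Nothing like this exists in calibre as far as this post says; the `cost` heuristic, the `NAKED_FIELDS` figure and the node classes are assumptions chosen to keep the example self-contained. It treats a field-specific term as cheaper and more restrictive than a naked one and swaps the operands of "and" accordingly.

```python
from dataclasses import dataclass

# Hypothetical AST node types; the field name "all" stands for a naked term.
@dataclass
class Term:
    field: str
    text: str

@dataclass
class And:
    left: object
    right: object

# Assumed number of fields a naked term has to scan; the real figure varies.
NAKED_FIELDS = 10

def cost(node):
    """Crude estimate of how much work it takes to test one book against node."""
    if isinstance(node, Term):
        return NAKED_FIELDS if node.field == "all" else 1
    return cost(node.left) + cost(node.right)

def reorder(node):
    """Put the cheaper operand of every And first; And is commutative, so results are unchanged."""
    if isinstance(node, And):
        left, right = reorder(node.left), reorder(node.right)
        if cost(right) < cost(left):
            left, right = right, left
        return And(left, right)
    return node

as_written = And(Term("all", "foo"), Term("title", "bar"))
optimized = reorder(as_written)
```

Here `optimized` tests `title:bar` first, so the naked term "foo" is only checked on books whose title already matched. Whether that saves enough time to matter is exactly the open question.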
Attempting to do stemming and sound equivalency in a product that runs in hundreds of languages is far beyond anything I would want to try. And given that, historically, Kovid and I are the only two people who have shown interest in working on calibre's search expression analyzer, it probably isn't going to happen any time soon.