Old 07-28-2022, 09:26 PM   #9
kovidgoyal
creator of calibre
Posts: 45,481
Karma: 28000000
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Remember that an index file is always going to be much larger than the text file, because it's not just text: it also records which records contain every word, and at what offset and word count. This is what powers the NEAR operator. So essentially every word has some number of extra fields associated with it. And there will be page size overhead for efficient lookup. And of course the full text is also actually stored so that snippets can be shown.
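
To make the size argument concrete, here is a minimal sketch of a positional index in Python. It is only an illustration, not calibre's actual storage format: the names `tokenize`, `build_index` and `near` are made up for this example, and the `near` check uses a simplified notion of distance (difference in word positions).

```python
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase and split into simple alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to {record_id: [word positions]}.

    Every occurrence of every word costs one stored position,
    which is why the index outgrows the plain text.
    """
    index = defaultdict(lambda: defaultdict(list))
    for rec_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word][rec_id].append(pos)
    return index


def near(index, a, b, distance):
    """Record ids where a and b occur within `distance` words of each other."""
    hits = set()
    postings_a = index.get(a, {})
    postings_b = index.get(b, {})
    # Only records containing both words can match.
    for rec_id in postings_a.keys() & postings_b.keys():
        if any(abs(pa - pb) <= distance
               for pa in postings_a[rec_id]
               for pb in postings_b[rec_id]):
            hits.add(rec_id)
    return hits


docs = {
    1: "the quick brown fox jumps over the lazy dog",
    2: "a fox is quick but the dog is asleep far away from here",
}
index = build_index(docs)
```

With this toy data, `near(index, "quick", "dog", 3)` matches only record 2, because the stored positions show the two words are three apart there and seven apart in record 1. Without the per-word positions there would be nothing to compare.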

And calibre actually indexes all text twice for the "find related words" functionality, which works by stemming all tokens.
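
As a rough sketch of why stemming means a second index, the Python below builds one index over the exact tokens and another over stemmed tokens. The `toy_stem` function is a deliberately crude, hypothetical suffix stripper used only for illustration; it is not the stemmer calibre uses, and the layout of the two indexes is an assumption as well.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase and split into simple alphabetic tokens.
    return re.findall(r"[a-z]+", text.lower())


def toy_stem(word):
    # Hypothetical, crude suffix stripper; real stemmers are far more careful.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_indexes(docs):
    """Index every token twice: once as written, once stemmed."""
    exact = defaultdict(set)
    stemmed = defaultdict(set)
    for rec_id, text in docs.items():
        for word in tokenize(text):
            exact[word].add(rec_id)
            stemmed[toy_stem(word)].add(rec_id)
    return exact, stemmed


docs = {
    1: "she walks to work",
    2: "he walked home",
    3: "walking is healthy",
}
exact, stemmed = build_indexes(docs)
```

Here a lookup for "walking" in the exact index finds only record 3, while looking up `toy_stem("walking")` in the stemmed index finds all three records. Both indexes have to be kept, so the text is effectively indexed twice.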