MobileRead Forums - View Single Post

kovidgoyal · 07-28-2022, 10:26 PM

Remember that an index file is always going to be much larger than the text file, because it's not just text: it also contains information about which records contain every word, and at what offset and word count. This is what powers the NEAR operator. So essentially every word has some number of extra fields associated with it. There will also be page size overhead for efficient lookup. And of course the full text itself is also stored, so that snippets can be shown.
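
To illustrate why per-word position data adds up, here is a minimal sketch of a positional index in plain Python. It is not calibre's actual on-disk format; `build_index`, `near` and the sample `records` are hypothetical names made up for this example. Every word is stored together with the records it occurs in and its word positions there, which is the kind of extra data a NEAR-style query needs.

```python
from collections import defaultdict

def build_index(records):
    """Map each word to {record_id: [word positions]} (illustrative layout only)."""
    index = defaultdict(dict)
    for rec_id, text in records.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(rec_id, []).append(pos)
    return index

def near(index, a, b, max_gap):
    """Return sorted ids of records where a and b occur with at most
    max_gap other words between them, in either order."""
    postings_a = index.get(a, {})
    postings_b = index.get(b, {})
    hits = set()
    # Only records containing both words can match.
    for rec_id in postings_a.keys() & postings_b.keys():
        if any(pa != pb and abs(pa - pb) - 1 <= max_gap
               for pa in postings_a[rec_id]
               for pb in postings_b[rec_id]):
            hits.add(rec_id)
    return sorted(hits)

# Hypothetical sample data
records = {
    1: "the quick brown fox jumps over the lazy dog",
    2: "a fox is quick but the dog is lazy and slow",
}
index = build_index(records)
```

Even in this toy version, a three-letter word like "the" carries a record id and a position for each occurrence, so the index holds more data than the word itself, and that is before any lookup structure or stored copy of the text is counted.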

And calibre actually indexes all text twice for the "find related words" functionality, which works by stemming all tokens.
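
A rough sketch of the "index twice" idea, assuming one index keyed on the exact tokens and a second keyed on their stems. The `toy_stem` function is a deliberately crude, made-up suffix stripper, not the stemmer calibre uses, and the other names and sample data are hypothetical too.

```python
from collections import defaultdict

def toy_stem(word):
    """Crude suffix stripper for illustration only; real stemmers are far more careful."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def build_indexes(records):
    """Index every token twice: once as written, once under its stem."""
    exact = defaultdict(set)
    stemmed = defaultdict(set)
    for rec_id, text in records.items():
        for word in text.lower().split():
            exact[word].add(rec_id)
            stemmed[toy_stem(word)].add(rec_id)
    return exact, stemmed

# Hypothetical sample data
records = {
    1: "she indexed the books",
    2: "indexing a book takes time",
    3: "the index grows",
}
exact, stemmed = build_indexes(records)

def search(word, related=False):
    """Exact lookup by default; with related=True, match any word sharing the stem."""
    word = word.lower()
    if related:
        return sorted(stemmed.get(toy_stem(word), set()))
    return sorted(exact.get(word, set()))
```

Here `search("index")` only finds record 3, while `search("index", related=True)` finds all three records. The cost is a second set of entries for the same text, which is why the stemmed pass roughly doubles the indexing work.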
