MobileRead Forums - View Single Post

sbenz · 11-10-2014, 11:08 AM

Quote:

Originally Posted by BetterRed

The calibre editor provides a facility to maintain and import user-defined 'dictionaries'; they are stored in %calibre_config_directory%\dictionaries\prefs.json.

I wonder if sbenz's Bad Words PI could leverage that feature. Assuming different Bad Word 'dictionaries' would be used for different physical libraries then only the dictionary name need be stored in the library database.

Another advantage of leveraging the existing user dictionary facility might be that it is language aware.

I was hoping this capability existed since I would need to share the Bad Word dictionary between a FileTypePlugin and a UserInterfacePlugin per Kovid's comments.

Using multiple dictionaries occurred to me also for other reasons. I will explore this more after it is running.

Quote:

Originally Posted by BetterRed

I doubt that storing the bad word list in the database would have any noticeable impact on performance. As for cluttering, one of my databases has 31 tables, 61 indices, 13 views and 46 triggers. I assume a bad_word_list would be a row in the preferences table and an update to the preferences autoindex, I can't see how that could be regarded as 'cluttering' the database.

One of my user dictionaries has ~6800 words, it weighs-in at 105MB, I'm pretty certain that SQLite can handle something of that size and much larger.

Apparently I have not been very clear regarding my data storage concerns - and they may not be a problem. The issue is not storing the BW dictionary, but rather the results for each book, BW_BookStats. Consider the following:

BW_Dict = list of bad words; stored only once in preferences
BW_BookStats = dict of (bw, count) pairs where count is number of occurrences of bw in the book.

BW_BookStats is per book. If BW_BookStats averages 10KB, a 10,000 book library now has 100MB of data. Since the user controls the size of the BW_Dict and the number of books, I don't know how large this might grow.
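
To make the arithmetic concrete, here is a rough sketch of the estimate. The function name and the decimal reading of "10KB" (10,000 bytes) are my own assumptions for illustration; real sizes depend on how the stats are serialized.

```python
def estimate_stats_storage(num_books, avg_stats_bytes):
    """Total bytes needed if every book stores a full BW_BookStats."""
    return num_books * avg_stats_bytes


# Assumption: 10KB is taken as 10,000 bytes, 100MB as 100,000,000 bytes.
total_bytes = estimate_stats_storage(num_books=10_000, avg_stats_bytes=10_000)
total_mb = total_bytes / 1_000_000
```

Both inputs are under the user's control, so the total scales linearly with either one.
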
Reducing total BW_BookStats storage:
- Zero counts are not stored
- Only a few-byte ActionStatus is stored for 'Clean' books and books marked 'DoNotScan'
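
The two reductions above might look something like the sketch below. The status codes, the function name, and the simple word-splitting are all assumptions on my part, not the plugin's actual code. It only matches single words, not phrases.

```python
import re
from collections import Counter

# Hypothetical one-character status codes standing in for ActionStatus.
CLEAN = "C"
DO_NOT_SCAN = "X"
FLAGGED = "F"


def build_book_stats(text, bw_dict, do_not_scan=False):
    """Return (status, stats); stats holds only words that actually occur."""
    if do_not_scan:
        # Book is skipped entirely, so only the status is kept.
        return DO_NOT_SCAN, {}
    wanted = {w.lower() for w in bw_dict}
    words = re.findall(r"\w+", text.lower())
    # Counter only gets entries for words that were seen,
    # so zero counts are never stored.
    counts = Counter(w for w in words if w in wanted)
    if not counts:
        return CLEAN, {}
    return FLAGGED, dict(counts)
```

With this shape, a 'Clean' or 'DoNotScan' book costs only the status code plus an empty dict, which could be dropped entirely when the record is saved.
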
BW_BookStats is likely used only once per book per library user to decide a course of action. Perhaps only a small bw summary should be provided and the BW_BookStats should just be rebuilt on demand. However, scanning can take a lot of time, especially on large books with a large bw list.
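
A small summary could be as simple as the sketch below. The field names and the choice of "top N words" are just one possibility I am considering, not a settled design.

```python
def summarize_stats(stats, top_n=3):
    """Reduce a full BW_BookStats dict to a small, fixed-size summary."""
    total = sum(stats.values())
    # Sort by count (highest first), then alphabetically to break ties.
    top = sorted(stats.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    return {"total": total, "distinct": len(stats), "top": top}
```

The summary stays roughly the same size no matter how large the bw list grows, so it is cheap to store per book; the full BW_BookStats would then be rebuilt by rescanning only when the user asks for the detail.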

I really appreciate your input.
sb