04-29-2011, 12:36 PM | #211 | |
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
|
Quote:
Even well-designed databases can become slow(er), but then they are TBs in size. So you're right that well-designed databases are not slow. But there is no entrance test for plugin builders to prove they can build good databases. Your db may be good, but if a plugin builder adds their own info to the db, it can slow down the whole process. And if it is a plugin that runs all the time, calibre is the one that gets blamed (slow program). This combination is why it would be nice to have a second db that can be used by plugin builders and can be queried through the calibre API. |
|
04-29-2011, 01:18 PM | #212 |
Grand Sorcerer
Posts: 11,742
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
@kiwidude: couldn't get to this until now.
The problem is as Kovid alluded to -- SQLite sucks at database writes, at least on Windows. For each transaction, it creates a new journal file, then deletes it. This process is *slow*. I built a version of the code you gave me that reads all the data at startup and then writes it all back, using 2 new API methods. Running it on my production DB, I get the following times. Note that I set the size to 1 to force a collision group containing all formats. Code:
DO THE HASHING
Pass 1: 2100 formats created 1 size collisions
Analysed 50 after 0.25
Analysed 100 after 0.438000202179
Analysed 150 after 0.648000001907
...
Analysed 2000 after 9.30000019073
Analysed 2050 after 9.51600003242
Analysed 2100 after 9.96700000763
Completed duplicate analysis in: 10.7790000439
Found 0 duplicate groups covering 0 books

RUN AGAIN, USING MAP
Pass 1: 2100 formats created 1 size collisions
Analysed 50 after 0.0350000858307
Analysed 100 after 0.0360000133514
Analysed 150 after 0.0380001068115
...
Analysed 2000 after 0.0520000457764
Analysed 2050 after 0.0520000457764
Analysed 2100 after 0.0529999732971
Completed duplicate analysis in: 1.18799996376

QUIT CALIBRE AND RUN AGAIN WITH MAP TO REDUCE CACHE EFFECTS
Pass 1: 2100 formats created 1 size collisions
Analysed 50 after 0.0409998893738
Analysed 100 after 0.0429999828339
Analysed 150 after 0.0439999103546
...
Analysed 2000 after 0.0569999217987
Analysed 2050 after 0.0569999217987
Analysed 2100 after 0.0579998493195
Completed duplicate analysis in: 1.10799980164
Found 0 duplicate groups covering 0 books
I have submitted the calibre API changes.
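Roughly, the pattern those changes enable looks like the sketch below. This is a minimal illustration only: it assumes db is the calibre database object, that get_all_custom_book_data(name) returns a dict mapping book id to the stored value, and that add_multiple_custom_book_data accepts such a dict; the namespace string, helper names, and use of sha256 are made up for illustration, and a real scan would track per-format data rather than one hash per book. Code:
import hashlib

NAMESPACE = 'find_duplicates:binary_hashes'  # made-up name for illustration

def hash_file(path, chunk_size=1024 * 1024):
    # Hash the format file in chunks so big files need not fit in memory.
    sha = hashlib.sha256()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk_size), b''):
            sha.update(block)
    return sha.hexdigest()

def hashes_for_candidates(db, candidates):
    # candidates: dict mapping book_id -> path of the format to check.
    # One read for the whole cache instead of one SQLite read per book.
    cache = db.get_all_custom_book_data(NAMESPACE) or {}
    results = {}
    for book_id, path in candidates.items():
        if book_id not in cache:
            cache[book_id] = hash_file(path)
        results[book_id] = cache[book_id]
    # One batched write-back instead of one transaction (and one journal
    # file created and deleted) per record.
    db.add_multiple_custom_book_data(NAMESPACE, cache, delete_first=True)
    return results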
|
04-29-2011, 01:39 PM | #213 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Thx very much for the API changes Charles, I'll give it another attempt when Kovid merges them - looks like you missed the 0.7.58 cutoff. So depending on how long it takes me to sort out all the exemption stuff, I may save it for another release in a week's time.
Obviously I want to give it a thrash on the large library to see the impact. As per my last post on this, the biggest hit in my results is by far the os.stat pass, which this change can't help with. Out of curiosity - did you try restarting your PC and running the check again? Do the numbers come back a lot higher than the 1.1 seconds (i.e. with the os.stat cache effect removed)?
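(For context: the os.stat pass referred to here is the first pass of the scan, which stats every format file to group them by size; only groups where two or more files share a size go on to be hashed, which is why the hash-caching change can't help that pass. A minimal sketch of the idea, not the plugin's actual code - the find_size_collisions helper and its inputs are invented for illustration.) Code:
import os
from collections import defaultdict

def find_size_collisions(format_paths):
    # format_paths: iterable of (book_id, path) pairs, one per format to scan.
    # This is the os.stat pass: every file is stat'ed once per run.
    by_size = defaultdict(list)
    for book_id, path in format_paths:
        try:
            size = os.stat(path).st_size
        except OSError:
            continue  # format file missing on disk; skip it
        by_size[size].append((book_id, path))
    # Only size groups containing more than one file can possibly hold
    # binary duplicates, so only those groups need to be hashed.
    return [group for group in by_size.values() if len(group) > 1]
|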
04-29-2011, 01:56 PM | #214 | ||
Grand Sorcerer
Posts: 11,742
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Quote:
Quote:
I will rerun everything with booting before each run. |
||
04-29-2011, 02:27 PM | #215 |
Grand Sorcerer
Posts: 11,742
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Results of test including booting.
Test sequence: empty the plugin data table, then do the following twice: boot and wait for quiescence, start calibre, push the dup check button, stop calibre. The test is three times faster with the cache than without it. Code:
First run. Requires hashing all books.

Starting up...
Started up in 2.7619998455
Pass 1: 2100 formats created 1 size collisions
Analysed 50 after 1.2009999752
Analysed 100 after 2.23099994659
Analysed 150 after 3.25999999046
...
Analysed 2000 after 44.8190000057
Analysed 2050 after 45.9270000458
Analysed 2100 after 47.4549999237
Completed duplicate analysis in: 70.2000000477
Found 0 duplicate groups covering 0 books

Second run. No books are hashed.

Started up in 2.77799987793
Pass 1: 2100 formats created 1 size collisions
Analysed 50 after 0.202000141144
Analysed 100 after 0.202000141144
Analysed 150 after 0.202000141144
...
Analysed 2000 after 0.249000072479
Analysed 2050 after 0.249000072479
Analysed 2100 after 0.249000072479
Completed duplicate analysis in: 23.368999958
Found 0 duplicate groups covering 0 books
|
04-29-2011, 02:55 PM | #216 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Thx for that Charles, those numbers look much more representative, which is great. That os.stat time is a killer, isn't it?
Multiplying your numbers out over the 7500 formats I was scanning, roughly 3 minutes of hashing time for the first run is what I should expect, plus the os.stat scan time. Hopefully much better than the 24 minutes total from before... I see Kovid has merged the changes now, so I shall have an experiment later. |
04-29-2011, 03:21 PM | #217 | |
Grand Sorcerer
Posts: 11,742
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Quote:
Code:
add_multiple_custom_book_data(self, name, vals, delete_first=False)

The get_all_custom_book_data method does what you would expect. For completeness I also added delete_all_custom_book_data(self, name); it removes all records with 'name', regardless of book_id.
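Put together, a plugin might use the three calls along these lines. This is just a sketch: the namespace string is made up, and the assumption that vals is a dict mapping book_id to a stored value is illustrative. Code:
# db is the calibre database object available to a plugin.
name = 'my_plugin:cache'  # made-up namespace for illustration

# Store values for several books in one call (a single transaction).
db.add_multiple_custom_book_data(name, {12: 'abc123', 34: 'def456'},
                                 delete_first=True)

# Read back everything stored under that name as one dict.
cached = db.get_all_custom_book_data(name)  # e.g. {12: 'abc123', 34: 'def456'}

# Remove every record stored under that name, regardless of book_id.
db.delete_all_custom_book_data(name)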
|
04-29-2011, 08:29 PM | #218 |
e-Bibliophile
Posts: 60
Karma: 10
Join Date: Jun 2009
Location: California
Device: Paperwhite 1-3, Kobo AuraHD, Boox Afterglow2
|
I had another thought concerning finding duplicates, something I just ran across while doing some normal cleanup. Some books, when imported, have the series and title information merged together.
So a book such as Dragon Champion by EE Knight, Age of Fire Book [1], can show up as "Age of Fire 1: Dragon Champion" or "Dragon Champion Book 1 in the Age of Fire saga". I know for a fact that the first variant does not show up in the duplicate detector when it exists.

Now, as with the earlier thought about the title/author switch, this is partly a matter of metadata not being clean. However, here the title may not look wrong in a general scan, and it would be difficult for the QC plugin to find something like this. The point is that at some point in the future it may be a good addition to somehow match using both title and series in a scan for duplicates. While not even close to a top priority, it is a way for duplicates to go undetected, and I figured I'd mention it. Of course, anyone can use 'ignore title' to scan through this sort of stuff, but when I personally do that, I come up with over 700 matches.

I'm probably just going overboard with thoughts, considering all the plugins you work on, kiwi, but I'm more than willing to throw out stupid ideas. I am a little OCD about things, and I want my database to be so damn clean that it's impossible without spending extreme time checking the 50,000 books line by line. Up until a few days ago I had managed not to talk on the boards, even though I've been a member of MobileRead for years; now... I can't keep my mouth shut, it seems.

Last edited by mehetabelo; 04-29-2011 at 10:57 PM. Reason: Fixing note
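(To make the suggestion concrete, a series-aware duplicate pass would need some way to strip series decoration out of titles before comparing them. A rough, hypothetical sketch follows - the helper name and the regular expressions are invented and would certainly need tuning.) Code:
import re

def strip_series_from_title(title, series=None):
    # Remove common 'series in the title' decorations so that
    # 'Age of Fire 1: Dragon Champion' and
    # 'Dragon Champion Book 1 in the Age of Fire saga'
    # both reduce to 'dragon champion' before comparison.
    t = title.lower().strip()
    if series:
        s = re.escape(series.lower())
        # Leading 'Series N:' style prefix
        t = re.sub(r'^%s\s*\d*\s*[:\-]\s*' % s, '', t)
        # Trailing 'Book N in the Series saga' style suffix
        t = re.sub(r'\s*(book\s*\d+\s*)?(in|of)\s+the\s+%s(\s+saga)?$' % s, '', t)
    return t.strip()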
04-30-2011, 05:38 AM | #219 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
@Mehetabelo - the duplicate finder is not the right plugin for finding books with series in the title. Remember, it is about finding duplicates. What if you only had one book called "Age of Fire 1: Dragon Champion"? None of the duplicate check algorithms could find that.
Quality Check or direct searches are your best bet. Quality Check already has a "Check titles with series" option which would find both of the books above (you can customise it further in the QC options dialog). Yes, it will also bring back other false positives, and it has no exclusion capability (though I might consider adding that in the future). Once you have your titles cleaned up, then run your duplicate check...
04-30-2011, 04:28 PM | #220 |
Junior Member
Posts: 1
Karma: 10
Join Date: Nov 2010
Device: iPad
|
Technically, you're way ahead of me, but I'd like to applaud your effort. Duplicates are a big problem for me and eliminating them is time-consuming. Your work will be very much appreciated here!
|
04-30-2011, 04:54 PM | #221 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Hi Paul, welcome to MobileRead. Your thanks are appreciated. There will be another new version released on the "official" thread here very soon so keep an eye out...
|
04-30-2011, 05:11 PM | #222 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
v1.0.3 Beta
This is a preview before an official 1.1 release. I would greatly appreciate it if a few people could give it a sanity check before I release it in the plugins forum thread. There have been a significant number of internal changes, so I would rather get any problems found here first.
You will need Calibre 0.7.59 as I have incorporated Chaley's additions to the API to speed up subsequent binary comparison scans. Changes in this release compared to 1.0:
Note that you will lose any exemptions you had set previously - I decided it wasn't worth the effort to try to migrate them. If you know from previous runs that a particular algorithm produced no real duplicates in your library, you can just run it again and mark all groups as exempt to store the new exemptions.

Particular thanks to chaley for his code snippets and ideas behind many of the changes. Any feedback is much appreciated.

Last edited by kiwidude; 05-02-2011 at 05:26 AM. Reason: Removed attachment as later version
04-30-2011, 06:25 PM | #223 |
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
|
When I did a binary search on my books,
I had set a filter for 218 books (the books that, as you know, are my non-duplicates with the same name). Surprise! Instead of scanning just those books, I got the message: Scanning 8000 books for duplicates.

The script is fast, but it runs against my (heavily loaded) NAS, so this is not a nice bug to have if you hit it often. I just let it run on and will retest with other options and report any new errors if I find them.

By the way, nice GUI change!! |
04-30-2011, 06:30 PM | #224 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Ahh, thx drMerry. The ISBN compare will have that same bug. I'll put a new version up when it's fixed - I know what's causing it.
|
04-30-2011, 06:37 PM | #225 |
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
|
On the other hand, I'm going to sleep now; tomorrow I will (hopefully) have a list of all binary dups. So that is one thing I do not have to do again.
(Surprise 2: this bug occurs with 2 options, both of them new. Hmmm, could that be a coincidence?) Status: fixed?
|