#196
Calibre Plugins Developer
Posts: 4,730
Karma: 2197770
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
Right - I suspected the read above... that's what you get for copying code off the web.
This is where I stole it from... http://www.gossamer-threads.com/list.../python/739198
Last edited by kiwidude; 04-28-2011 at 02:32 PM.

#197
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Cool! It's amazing how subtle something like this can be. He's probably getting a ^Z near a standard file header (possibly in the .doc files?), so the read stops there, the files appear to have identical content, and the hashes match.
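A minimal sketch of how to avoid that trap, assuming the cause was files opened in text mode (under Python 2 on Windows, a ^Z byte, 0x1A, is treated as end-of-file in text mode). The function name `hash_file`, the chunk size and the choice of SHA-256 are illustrative assumptions, not taken from the plugin:

```python
import hashlib


def hash_file(path, chunk_size=64 * 1024):
    """Return the SHA-256 hex digest of a file's complete byte content.

    Hypothetical helper for illustration only.
    """
    digest = hashlib.sha256()
    # "rb" is the important part: binary mode returns every byte,
    # including 0x1A (^Z), instead of stopping early.
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks also keeps memory use flat for large formats, which matters if every file in a big library gets hashed.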

#198
Grand Sorcerer
Posts: 12,447
Karma: 8012886
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
Quote:
Code:
$ python
Python 2.5.4 (r254:67916, Feb 17 2009, 20:16:45) [GCC 4.3.3] on linux2

#199
Calibre Plugins Developer
Posts: 4,730
Karma: 2197770
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
Phew, all fixed now, thx guys.
I guess the next decision is whether to make this a background job or just make the user wait. On a 1500 book/4000 format library it takes around 15 seconds.

#200
Grand Sorcerer
Posts: 12,447
Karma: 8012886
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
Quote:
On a more serious note, as you are almost certainly using os.stat, you have the mtime as well as the size. You may consider storing those two values and the hash (I imagine you got rid of the double hash) of each format with the book, using the plugin storage facility. Check the size+mtime before recomputing the hash, and use the stored hash if the values haven't changed.
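The size+mtime check could look roughly like the sketch below. It uses a plain dict as a stand-in for the plugin storage facility, and the name `cached_hash` and the tuple layout are assumptions made for illustration, not the plugin's actual code:

```python
import hashlib
import os


def cached_hash(path, cache):
    """Return the file's hash, recomputing only if size or mtime changed.

    `cache` is a plain dict standing in for per-book plugin storage
    (an assumption for this sketch): path -> (size, mtime, hexdigest).
    """
    st = os.stat(path)
    key = (st.st_size, st.st_mtime)
    entry = cache.get(path)
    if entry is not None and entry[:2] == key:
        return entry[2]  # size and mtime unchanged: reuse the stored hash
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    cache[path] = (st.st_size, st.st_mtime, digest)
    return digest
```

The os.stat call is still made for every file, so this only saves the cost of reading and hashing unchanged files, not the cost of the stat pass itself.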

#201
Calibre Plugins Developer
Posts: 4,730
Karma: 2197770
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
Bonza idea.

#202
Calibre Plugins Developer
Posts: 4,730
Karma: 2197770
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
Ok, maybe not such a good idea.
Having the extra calls to add_custom_book_data and get_custom_book_data in there means it is averaging 10 seconds per 50 books. So instead of taking about 4 minutes on that first run, it takes 23 minutes. And subsequent runs are still around the 4 minute mark. I commented all the code out and ran the analysis again - 2.5 minutes. Ahh well - users can just go make themselves a cuppa on their large libraries.

#203
Grand Sorcerer
Posts: 12,447
Karma: 8012886
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
Quote:
Could you post the code you are using? Or give me a 'broken' copy of the plugin? I want to figure out what is going on.

#204
Calibre Plugins Developer
Posts: 4,730
Karma: 2197770
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
Ok, here is a version for you to play with - I haven't started work on the exemption group changes yet, but the new GUI around ISBN/binary comparisons etc. is done. The code you will be interested in is in algorithms.py around line 500 or so. I have made no attempt to optimise (there weren't a whole lot of options to do so with the current API for plugin book data, as we've talked about previously when I used it in the Goodreads plugin). It does seem freakishly slow adding those lines in, though.
Last edited by kiwidude; 04-30-2011 at 04:55 PM.

#205
Grand Sorcerer
Posts: 12,447
Karma: 8012886
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
@kiwidude: did you run the create experiment more than once? There seems to be a first-time problem. I ran the test, and it created a few duplicate groups, taking 22 seconds! I then deleted the records from the DB using SQLiteSpy and ran it again. This time it took 1.3 seconds. Around 15 more runs all produced the same number.
I am thinking that the first time it runs, it needs to auto-create the indices or some such. Do you have evidence one way or the other?

#206
Calibre Plugins Developer
Posts: 4,730
Karma: 2197770
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
I have run it many, many times - and killed it many times.

#207
Calibre Plugins Developer
Posts: 4,730
Karma: 2197770
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
@Charles. One further thought to throw into the mix on the performance stuff. I don't know how big your database was, but I have found that for smaller databases there is a certain amount of (OS?) caching taking place that can significantly affect things.
To explain what I mean: with a 1500 book (4200 format) database, the first scan takes around 13-15 seconds. The majority of that time is spent in the first pass doing os.stat on those files to get the file size. If I then run the check again, it completes in 1.5 seconds. Which approach I use for analysing size duplicates (always, or via book plugin data) is pretty immaterial in this situation, as there were only about 22 books or so that had size collisions. The same number of files have had os.stat run on them, but presumably due to some lower-level OS caching that check completed extremely quickly.
However, for my large test database, with 75000 formats to get the file size of, the caching appears to have negligible effect. So the first pass of os.stat takes about the same time when you run it repeatedly.
My point is that with the numbers you had above, your dramatically improved performance 15 times in a row could just be because of the caching effect. I'm going to disable using book plugin data unless we can nail down its exact problem; the performance cost is orders of magnitude too high at this point.
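For readers following along, the two-pass idea described here (os.stat everything first, then hash only the files whose sizes collide) can be sketched like this. The function name `find_binary_duplicates` and the use of SHA-256 are assumptions for illustration; the plugin's real code in algorithms.py may be organised quite differently:

```python
import hashlib
import os
from collections import defaultdict


def find_binary_duplicates(paths):
    """Group files with identical content, hashing only same-size files."""
    by_size = defaultdict(list)
    for path in paths:
        # First pass: one os.stat per file, no file contents read.
        by_size[os.stat(path).st_size].append(path)

    duplicates = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # unique size, so it cannot be a binary duplicate
        by_hash = defaultdict(list)
        for path in same_size:
            with open(path, "rb") as f:
                by_hash[hashlib.sha256(f.read()).hexdigest()].append(path)
        duplicates.extend(g for g in by_hash.values() if len(g) > 1)
    return duplicates
```

With only a handful of size collisions, the second pass is cheap, which is consistent with the stat pass dominating the timings quoted above.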

#208
creator of calibre
Posts: 45,374
Karma: 27230406
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
I would suggest minimizing writes to the db, i.e. keep your lists in memory until the end of the search and only then write to the database, preferably with a single executemany call. You probably need to add another API method to database2 for that.
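As a rough illustration of the batching pattern, here is a sketch using Python's standard sqlite3 module directly. The table `format_hashes`, its columns and the helper `save_hashes` are hypothetical; this is not calibre's schema, and it does not use database2, which per the post above would need a new API method for this:

```python
import sqlite3


def save_hashes(conn, rows):
    """Write all (book_id, fmt, size, mtime, digest) rows in one batch."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO format_hashes "
            "(book_id, fmt, size, mtime, digest) VALUES (?, ?, ?, ?, ?)",
            rows,
        )


# Hypothetical table, created in memory purely for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE format_hashes ("
    "book_id INTEGER, fmt TEXT, size INTEGER, mtime REAL, digest TEXT, "
    "PRIMARY KEY (book_id, fmt))"
)

# Results are collected in memory during the scan and written once at the end.
results = [
    (1, "EPUB", 1024, 1304150400.0, "aa"),
    (1, "MOBI", 2048, 1304150400.0, "bb"),
]
save_hashes(conn, results)
```

The point of the pattern is that the whole scan costs one transaction rather than one write per book.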

#209
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
Quote:
Maybe a second db that is accessible by the Calibre API. Reason: (other topic but related to your remark) Spoiler:

#210
creator of calibre
Posts: 45,374
Karma: 27230406
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Having lots of data in a db does not make it slow, unless the db is very poorly designed. And there is nothing preventing a plugin from using its own db if it feels like it.