Squeezing as much performance as possible out of calibre (container)
Hey folks-
TL;DR: Make calibre faster by putting /tmp on a ramdisk and throwing more resources at it.
I'm in the process of validating a few hundred thousand ebooks by restructuring my library, and figured I'd take some time to run some performance tests. I'm not sure how much this will help anyone (the tests are very tailored to my setup), but here are my results. If anyone has further ideas on how to go even further (e.g. custom compiling), I'm all ears. See below for my results, and sorry, but I have no clue how to make tables in this forum.
My setup:
Host machine: kubuntu 18.10
Virtualization: libvirt-lxc
Root FS: ssd-backed XFS
Library location: local RAID6 mdadm mount
Guest: ubuntu 18.10 LXC
RAM: 1G
CPU Cores: 1
Library location: direct mount of folder
So the majority of what I am doing in the calibre GUI is the metadata download, which as most of you know will probably finish a few hundred thousand books by the time the glaciers have melted. That's all I'm really using it for, and I'm containerizing it because I don't like the idea of the Python 2 dependency (but that's neither here nor there).
Plugins installed:
Barnes & Noble
Count Pages
Find Duplicates
Goodreads
Quality Check
So now for the data. I ran a combined metadata + covers download after each change I made, on the same 63 eBooks. Below are the changes, and the resulting completion time after each change. Note that a lot of these are specific to metadata downloading.
vanilla (no changes): 7:10
with CALIBRE_TEMP_DIR as ramdisk: 6:48
with only Amazon, GoodReads and Google as sources: 6:11
with no tag download in metadata download: 7:23
download only metadata: 3:48
download only covers: 5:01
Amazon only with Amazon servers: 4:34 (51 found)
Amazon only with Google cache: 4:56 (49 found)
Amazon only with Bing cache: 6:05 (50 found)
with debug mode turned on: 6:25
with db in ramdisk (symlink): 7:22
Author, comment, rating title metadata (plugin): 9:21
Author, comment, rating title metadata (plugin+global): 7:22
Expand RAM from 1g to 4g: 6:28
Mount all of /tmp under a ramdisk (loop driver): 6:58
Mount all of /tmp under a ramdisk (nbd/raw driver): 6:34
Mount all of /tmp under a ramdisk (mount): 7:09
Expand core count from 1 to 8: 7:02
Expand core count from 1 to 8 and jobs from 3 to 16: 7:12
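For reference, here's roughly how I set up the ramdisk variants above. This is just a sketch; the mount point and sizes are examples, so adjust to taste:

```
# Option 1: point only calibre's temp files at a dedicated ramdisk
mkdir -p /mnt/calibre-tmp
mount -t tmpfs -o size=1g tmpfs /mnt/calibre-tmp
export CALIBRE_TEMP_DIR=/mnt/calibre-tmp

# Option 2: put all of /tmp on a ramdisk (the variant that ended up winning)
mount -t tmpfs -o size=4g tmpfs /tmp
```

Note that CALIBRE_TEMP_DIR has to be set in the environment calibre is launched from, and mounting over /tmp hides anything already in it until you unmount.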
Honestly, a lot of these tweaks had little to no effect. Then I started looking more at the logs for each metadata download and started parsing them to see if there were any bottlenecks with any of the plugins/providers:
Source     Type      Total(s)  Avg(s)  Median(s)
------------------------------------------------
Amazon     Metadata     200.9     3.2       1.9
Amazon     Covers       252.3     4.0       3.1
B&N        Metadata     126.6     2.0       1.6
B&N        Covers       104.8     1.7       1.3
Goodreads  Metadata      46.4     0.8       0.6
Goodreads  Covers        79.6     1.2       0.7
Google     Metadata      32.5     0.5       0.4
Google     Covers        59.3     0.9       0.7
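In case anyone wants to reproduce the table: once you've grepped the per-request elapsed times for a source out of the debug log into a file with one number (seconds) per line (the exact grep depends on your log, so I'm not showing it here), a sort + awk one-liner gets you total/average/median:

```shell
# times.txt: one elapsed time in seconds per line, extracted from the log
sort -n times.txt | awk '
  { a[NR] = $1; total += $1 }
  END {
    avg = total / NR
    median = (NR % 2) ? a[(NR + 1) / 2] : (a[NR / 2] + a[NR / 2 + 1]) / 2
    printf "total=%.1f avg=%.2f median=%.2f\n", total, avg, median
  }'
```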
The Amazon plugin is disgustingly slow compared to the other three. Changing the default "automatic" server selection to the Amazon servers helped tremendously, but it still wasn't good enough, since every enabled source is queried no matter what.
With this in mind, here is what I ended up with as a final state:
End spec: 1:55 to complete!
4096M /tmp (tmpfs)
4096M RAM
8 CPUs
removed Amazon as a source
... and that is pretty much it. As suggested in a different thread, sticking all of /tmp on a ramdisk is clearly superior to just setting CALIBRE_TEMP_DIR. I'm sure in the future I'll throw the database on there as well with a copy/symlink, but for now it doesn't help enough to be worth the hassle.
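If you want the /tmp ramdisk to survive container reboots, a tmpfs line in the guest's /etc/fstab does it (size matching the spec above; mode 1777 is the standard sticky permissions for /tmp):

```
# /etc/fstab
tmpfs  /tmp  tmpfs  size=4096m,mode=1777  0  0
```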