Kindle source code observations
On March 29, I downloaded all the (eink) Kindle_src*.* GPL tarballs from the Amazon "Source Code Notice" web page, after doing a "sort|uniq" on the URL list (because the same URLs are used for multiple Kindle models).
My script took more than half a day to complete, even with my 40Mbps internet connection. The result was a 20GB directory full of .tar.gz (and some .tar) files. The early Kindle tarballs were not compressed, which is unimportant because all the tarballs inside them ARE compressed (mostly .tar.bz2, or .tar.gz in older firmwares). Interestingly, most source code tarballs also contain a .ipk file, which expands into a (mostly) empty root filesystem (with a few small "default" files in /etc).
Last night (May 29), I downloaded all the eink Kindle GPL tarballs again, but this time I made a script to extract the URL list from the web page and download all the files in that list (keeping a log file this time). Before the "sort|uniq" there were 131 URLs, which left 89 unique URLs (files) to download.
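For anyone wanting to reproduce the URL-list step, here is a minimal Python sketch of the "extract links, then sort|uniq" idea. It is not the script described above; the function name and the assumption that every wanted link ends in a filename starting with "Kindle_src" are mine.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect every href attribute found on <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def unique_source_urls(html_text):
    """Return (all_matches, unique_sorted) for Kindle_src* links in a page.

    Assumption: the wanted links end in a filename starting with
    "Kindle_src"; adjust the pattern if the page uses other names.
    """
    parser = LinkCollector()
    parser.feed(html_text)
    pattern = re.compile(r"/Kindle_src[^/]*$")
    matches = [h for h in parser.hrefs if pattern.search(h)]
    # sorted(set(...)) is the equivalent of piping through sort|uniq
    return matches, sorted(set(matches))
```

Comparing len(matches) with len(unique) gives the "before and after sort|uniq" counts for a saved copy of the page.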
Even though I avoided downloading files with duplicate URLs, doing "md5sum" on my downloads shows a bunch of duplicate files (16 of them) with different "version number" names.
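The same duplicate check can be sketched in Python instead of "md5sum" plus sorting by hand. This is only an illustration; the function names are hypothetical, and it assumes all downloads sit in one flat directory.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-GB tarballs do not fill memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def duplicate_groups(directory):
    """Map md5 -> list of filenames, keeping only hashes seen more than once."""
    groups = defaultdict(list)
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            groups[md5_of(path)].append(path.name)
    return {h: names for h, names in groups.items() if len(names) > 1}
```

Each value in the returned dict is a set of differently named files with identical content.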
Obviously, there were additional firmware versions (more files) to download in the recent batch, as was to be expected.
Even more interesting is when I compare my recent GPL source code set with that from two months ago. First off, I see that some filenames changed -- previously they contained FW version (the numbers with all the dots in them) and OTA version (that long string of trailing digits), but now some of them have had their OTA version string stripped off (giving a different download URL for the same FW version).
Another interesting thing is that SOME files that have the SAME URL as they did two months ago now have a DIFFERENT md5sum than in the previous download set (but the SAME md5sum as a slightly newer firmware version with a different URL, and are therefore duplicate downloads).
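That old-versus-new comparison can be sketched as follows, assuming each download set has already been reduced to a filename-to-md5 mapping (for example by parsing saved "md5sum" output). The function name and the dict layout are assumptions for illustration only.

```python
def compare_download_sets(old, new):
    """Find files whose name is unchanged but whose content changed.

    old and new map filename -> md5 hex digest. The result maps each
    changed filename to the OTHER filenames in the new set that share
    its new digest (an empty list means no twin was found).
    """
    by_hash = {}
    for name, digest in new.items():
        by_hash.setdefault(digest, []).append(name)

    report = {}
    for name in sorted(old.keys() & new.keys()):
        if old[name] != new[name]:
            report[name] = sorted(n for n in by_hash[new[name]] if n != name)
    return report
```

A non-empty list for a filename is the "same URL, new content, identical to a newer version" case described above.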
So, it seems that Amazon likes to "rewrite history" for their source code, just like they do for old firmware update downloads. Same URL, but different content from previous archived downloads.
Now, when I expand all those meta-tarballs (tarballs full of tarballs), I expect to have a HUGE amount of duplication, because the inner tarballs also contain version numbers in their filenames, and the filenames have a huge amount of duplication between firmware versions. HOWEVER, I would NOT be surprised to see different contents for some of these identically-named tarballs.
One additional point of interest -- these tarball files typically contain a folder. The outer tarballs call this inner folder either "gplresults" or "gplrelease", with no mention of the firmware version on them. So my "unpacker" script renames the "gpl*" folder to the "Kindle_src*" base filename (with .tar* stripped off the end). In general, the inner tarballs expand into a folder matching their base filename, except for the "linux*" kernel folder, which has "-lab126" stripped off the inner folder name (no longer matching the rootfs folder where kernel modules are stored).
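The rename step could look roughly like this in Python. It is a sketch of the idea, not the "unpacker" script itself: the function names are hypothetical, it handles only the outer tarball, and it extracts with tarfile's defaults, so it should only be pointed at archives you trust.

```python
import re
import tarfile
from pathlib import Path


def base_name(archive_name):
    """Strip a trailing .tar, .tar.gz, .tar.bz2, etc. from a filename."""
    return re.sub(r"\.tar(\.\w+)?$", "", archive_name)


def unpack_outer(archive_path, dest_dir):
    """Extract an outer tarball and rename its gpl* folder after the archive."""
    archive_path = Path(archive_path)
    dest_dir = Path(dest_dir)
    target = dest_dir / base_name(archive_path.name)

    # tarfile.open() with the default mode detects gzip/bzip2/xz/plain tar
    with tarfile.open(archive_path) as tar:
        tar.extractall(dest_dir)

    # Assumption: exactly one top-level folder named gplresults or gplrelease
    for candidate in dest_dir.glob("gpl*"):
        if candidate.is_dir() and not target.exists():
            candidate.rename(target)
            break
    return target
```

Extracting each outer tarball into its own empty scratch directory avoids two archives fighting over the same "gpl*" folder name.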
I will post my scripts and such later, after I clean them up a bit, in this first post. To save bandwidth for others wishing to study ALL the source code, I plan to replace duplicates in the entire set with symlinks, then stuff it all into a big .xz tarball (or perhaps a multi-part download). Because it is GPL, I can reupload my de-duplicated code set to a site that can support big files (preferably with download-resume capability).
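One way the symlink de-duplication might be done is sketched below. It assumes a POSIX filesystem, treats the first file seen (in sorted path order) as the copy to keep, and uses relative links so the tree still works after being moved or re-packed. The names are hypothetical, and it is worth trying on a copy first, since it deletes files.

```python
import hashlib
import os
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def symlink_duplicates(root):
    """Replace later copies of identical files with relative symlinks."""
    root = Path(root)
    first_seen = {}
    replaced = []
    # Build the full file list up front so new symlinks are never rescanned
    files = sorted(p for p in root.rglob("*")
                   if p.is_file() and not p.is_symlink())
    for path in files:
        original = first_seen.setdefault(file_md5(path), path)
        if original is not path:
            path.unlink()
            path.symlink_to(os.path.relpath(original, start=path.parent))
            replaced.append(path)
    return replaced
```

A plain tar run stores symlinks as links rather than following them, so the archive only carries each unique file once.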
I wonder how big my 20GB of (inner) .tar.gz and .tar.bz2 files will get after I unpack them?
Last edited by geekmaster; 05-30-2016 at 03:03 PM.