Quote:
Originally Posted by nrapallo
A good place to start analyzing is our ebook listing in html or txt format.
After removing the most obvious duplicates from the txt (delete the format, delete the date, delete the "IMP" and "epub" labels, delete any identical consecutive lines), I get some 4540 books (less than 50% of the total). That's still an upper bound, since some duplicates remain: titles that differ only in capitalization or punctuation, and different versions uploaded in different threads.
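
For anyone who wants to repeat this, here is a rough Python sketch of the cleanup steps. The layout of the txt listing isn't shown here, so the date pattern (ISO-style `YYYY-MM-DD`), the label list, and the function names are all assumptions for illustration; adjust them to whatever the real file looks like. `normalize_key` is an extra step, not part of the count above, that would catch titles differing only in capitalization or punctuation.

```python
import re
from itertools import groupby

# Assumption: dates look like 2008-10-01. Change this to match the real listing.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def clean_line(line, labels=("IMP", "epub")):
    """Strip the date and the format labels from one listing line.

    Labels are matched case-sensitively as whole words, so a title such as
    "The Imp" is left alone. Add other format names to `labels` as needed.
    """
    label_re = re.compile(r"\b(?:" + "|".join(map(re.escape, labels)) + r")\b")
    line = DATE_RE.sub("", line)
    line = label_re.sub("", line)
    # Collapse the whitespace left behind by the removals.
    return " ".join(line.split())


def dedupe_consecutive(lines, labels=("IMP", "epub")):
    """Clean every line, drop empty ones, and collapse identical neighbours."""
    cleaned = [clean_line(line, labels) for line in lines]
    non_empty = [line for line in cleaned if line]
    return [title for title, _ in groupby(non_empty)]


def normalize_key(title):
    """Key that ignores capitalization and punctuation (optional extra step)."""
    no_punct = re.sub(r"[^\w\s]", "", title.casefold())
    return " ".join(no_punct.split())
```

Calling `len(dedupe_consecutive(open("listing.txt", encoding="utf-8")))` (with a hypothetical file name) gives the kind of upper bound described above, and `len({normalize_key(t) for t in titles})` gives a tighter one. Neither catches different versions of the same book posted in different threads under different titles.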