Old 05-21-2015, 12:01 PM   #710
DaltonST
Deviser
Posts: 2,265
Karma: 2090983
Join Date: Aug 2013
Location: Texas
Device: none
Author Metadata

Quote:
Originally Posted by Ethanaul View Post
I'd like to store information about authors, such as:
  • nicknames (or alternative author names),
  • country of origin,
  • writing language,
  • website,
  • other ideas?

Some explanations
Nicknames
Nicknames could be useful for finding all the books by the same author, even when they were written under another name, e.g. J.K. Rowling (aka Kennilworthy Whisp, Newt Scamander and Robert Galbraith) or Robin Hobb (aka Megan Lindholm).

Original country
I had trouble with Robin Cook: two authors use this name, one from the UK (who also wrote as Derek Raymond) and one from the US.

@Ethanaul:

For my own personal amusement, a few months ago I created a wxPython/Python/SQLite app with its own database that I called "Author Metadata".

The good news: The database was populated entirely from web scraping.

The bad news: The database was populated entirely from web scraping.

See the attached example of Anne Rice. You can imagine the data quality issues across the web. For example, some URLs call her "Ann Rice" instead of "Anne Rice". Some say that her birth name was "Anne Rice", which is not true. Some web sources have a book's published date, and some do not, or they have it but it is incorrect. And so on. And so on.

The author images scraped from the web were automatically resized to a standard 200x200 and stored in the database as zip-compressed Base64 text. Even so, the database was about 50MB when I decided to stop populating it, because of the data consistency and quality problems that come with pure web scraping.
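
For anyone curious about the storage scheme, here is a minimal sketch of the idea, not the code from the app. It assumes the image bytes have already been resized to 200x200 by some other tool, uses zlib as a stand-in for the "zip" compression, and the table and column names (author_image, image_b64) are made up for illustration.

```python
import base64
import sqlite3
import zlib


def encode_image(raw: bytes) -> str:
    """Compress image bytes with zlib and return them as Base64 text."""
    return base64.b64encode(zlib.compress(raw)).decode("ascii")


def decode_image(text: str) -> bytes:
    """Reverse encode_image: Base64 text back to the original bytes."""
    return zlib.decompress(base64.b64decode(text))


def store_author_image(conn: sqlite3.Connection, author: str, raw: bytes) -> None:
    """Store one image per author in a hypothetical author_image table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS author_image ("
        "author TEXT PRIMARY KEY, image_b64 TEXT NOT NULL)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO author_image (author, image_b64) VALUES (?, ?)",
        (author, encode_image(raw)),
    )
    conn.commit()


def load_author_image(conn: sqlite3.Connection, author: str):
    """Return the original image bytes, or None if the author has no image."""
    row = conn.execute(
        "SELECT image_b64 FROM author_image WHERE author = ?", (author,)
    ).fetchone()
    return decode_image(row[0]) if row else None
```

One thing to keep in mind with this approach: Base64 makes the stored text about a third larger than the bytes it encodes, and JPEG data is already compressed, so the compression step usually recovers little of that.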

Manually cleaning, standardizing and maintaining such a thing would require too much lifespan to accomplish.

I had a lot of fun doing what I did, and accomplished what I set out to do, which was to have a lot of fun. From this point forward, it would require really unpleasant manual work to clean up after the web scraping. I have better things to do. Life is too short.

I was able to use what clean data I already had to create a large "reference validation data service pack" for QuarantineAndScrub, my add-on to Calibre. It added the Global Authors, Global Series and Web Source Series Validation Data from the "Author Metadata" database to the special Q&S metadata.db.

Finally, I do have something that may interest you. You mentioned you would like a list of Author Nicknames. Attached is a .zip file that contains a .csv file and a .sql file. Both were exported from the above database, and came from its global pseudonym table that was built from web scraping. Use them if and as you wish.
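
If you want to try the .csv for the "same author, other name" search, something along these lines might be a starting point. It is only a sketch: the column names (author, pseudonym) and the table name are assumptions, so check the header row of the real file and adjust. The sample rows are just the names from your post.

```python
import csv
import io
import sqlite3

# Stand-in for the real file; the column names are assumed, not taken from it.
SAMPLE_CSV = """author,pseudonym
J.K. Rowling,Robert Galbraith
J.K. Rowling,Newt Scamander
Robin Hobb,Megan Lindholm
"""


def load_pseudonyms(conn: sqlite3.Connection, csv_file) -> int:
    """Load (author, pseudonym) rows from an open CSV file; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pseudonym ("
        "author TEXT NOT NULL, pseudonym TEXT NOT NULL)"
    )
    rows = [
        (r["author"].strip(), r["pseudonym"].strip())
        for r in csv.DictReader(csv_file)
    ]
    conn.executemany("INSERT INTO pseudonym (author, pseudonym) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


def known_names(conn: sqlite3.Connection, name: str) -> list:
    """Return the main author name followed by its pseudonyms, sorted.

    The lookup works from either side: pass the main name or any pseudonym.
    A name that is not in the table is returned on its own.
    """
    row = conn.execute(
        "SELECT author FROM pseudonym WHERE author = ? OR pseudonym = ? LIMIT 1",
        (name, name),
    ).fetchone()
    if row is None:
        return [name]
    author = row[0]
    aliases = [
        r[0]
        for r in conn.execute(
            "SELECT pseudonym FROM pseudonym WHERE author = ? ORDER BY pseudonym",
            (author,),
        )
    ]
    return [author] + aliases


conn = sqlite3.connect(":memory:")
load_pseudonyms(conn, io.StringIO(SAMPLE_CSV))
print(known_names(conn, "Robert Galbraith"))
```

For the real file, replace io.StringIO(SAMPLE_CSV) with open("your_file.csv", newline="", encoding="utf-8"). Since the data came from web scraping, expect spelling variants of the same name, so an exact-match lookup like this will miss some.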


DaltonST
Attached Thumbnails
Name:	anne_rice.JPG
Views:	297
Size:	304.3 KB
ID:	138518  
Attached Files
File Type: zip global pseudonyms.zip (25.7 KB, 309 views)