Old 11-16-2015, 09:30 AM   #797
odradek
[Plugin Request] scraper for Library of Congress Subject Headings

I was glad to see a request for a BISAC scraper. I'd like to see that and something similar for Library of Congress Subject Headings.


I have a hierarchical column called "LoC Subject Headings" (#locsh) of type "Comma separated text, like tags, shown in the tag browser". I'd like a button that populates it automatically.

The data can be scraped from the WorldCat and LoC websites. For example, for a search on "The Andy Warhol Diaries", WorldCat returns:

"Warhol, Andy, -- 1928-1987 -- Diaries.
Artists -- United States -- Diaries.
Artists -- United States -- Biography.
Warhol, Andy, -- 1928-1987.
Artists.
United States."

The Library of Congress returns:

"Warhol, Andy, 1928-1987 --Diaries.
Artists --United States --Diaries."

Some regex could massage these into:

"Warhol/ Andy 1928-1987.Diaries,Artists.United States.Diaries,Artists.United States.Biography,Warhol/ Andy 1928-1987,Artists,United States"

and

"Warhol/ Andy 1928-1987.Diaries,Artists.United States.Diaries"

(Note how a ',' within a tag has to be replaced, since ',' separates tags in the column, and how the tag for a person is formatted.) These strings could then be sent to the #locsh column.
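A rough sketch of that massaging in Python, checked only against the two samples above. The function names are made up for illustration, and the rules (split on "--", fold a comma before a four-digit date into a space, turn any other comma into "/") are my guess at a general pattern, not anything WorldCat or the LoC documents. It does not fetch anything and does not touch calibre; it only converts text.

```python
import re

def heading_to_tag(heading):
    """Turn one subject heading into a '.'-separated hierarchical tag.

    Hypothetical helper: the rules are inferred from two sample records only.
    """
    # Drop surrounding whitespace/quotes and the trailing full stop.
    h = heading.strip().strip('"').strip().rstrip('.')
    # Personal names: "Andy, -- 1928-1987" or "Andy, 1928-1987" -> "Andy 1928-1987"
    h = re.sub(r',\s*(?:--\s*)?(?=\d{4}-)', ' ', h)
    # Subdivisions are separated by "--", with or without spaces around it.
    parts = [p.strip() for p in re.split(r'\s*--\s*', h)]
    # ',' separates tags in the column, so replace it inside a tag.
    parts = [p.replace(',', '/') for p in parts if p]
    return '.'.join(parts)

def headings_to_tags(text):
    """Convert a block of headings (one per line) to a de-duplicated tag list."""
    tags = []
    for line in text.splitlines():
        tag = heading_to_tag(line)
        if tag and tag not in tags:
            tags.append(tag)
    return tags

loc = """Warhol, Andy, 1928-1987 --Diaries.
Artists --United States --Diaries."""
print(','.join(headings_to_tags(loc)))
```

One known gap: a heading that itself contains a full stop (an abbreviation, say) would be split into extra hierarchy levels by calibre, so a real plugin would need a rule for that too.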


Similarly, for BISAC, an Amazon search (there may be other sources besides Amazon) returns:

"#52 in Kindle Store > Kindle eBooks > Biographies & Memoirs > Arts & Literature > Artists, Architects & Photographers
#246 in Books > Biographies & Memoirs > Arts & Literature > Artists, Architects & Photographers
#924 in Kindle Store > Kindle eBooks > Biographies & Memoirs > Professionals & Academics"

These could be processed into:

"Biographies & Memoirs.Arts & Literature.Artists/ Architects & Photographers,Biographies & Memoirs.Arts & Literature.Artists/ Architects & Photographers,Biographies & Memoirs.Professionals & Academics"

and added to a BISAC (#bisac) custom column.
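The same idea for the Amazon category lines, again only a sketch tried against the three lines quoted above. The function names and the list of store prefixes to drop ("Kindle Store", "Kindle eBooks", "Books") are assumptions on my part; other listings may well use different ones. Unlike the hand-made string above, this version drops the repeated entry, since a tag column would hold it only once anyway.

```python
import re

# Assumption: leading levels that name the store rather than a subject.
STORE_PREFIXES = {"Kindle Store", "Kindle eBooks", "Books"}

def rank_line_to_tag(line):
    """Turn '#52 in A > B > C' into a '.'-separated hierarchical tag."""
    # Strip surrounding quotes and the leading sales rank, e.g. "#1,246 in ".
    line = line.strip().strip('"').strip()
    line = re.sub(r'^#[\d,]+\s+in\s+', '', line)
    parts = [p.strip() for p in line.split('>') if p.strip()]
    # Drop the store levels at the front of the path.
    while parts and parts[0] in STORE_PREFIXES:
        parts.pop(0)
    # ',' separates tags in the column, so replace it inside a tag.
    return '.'.join(p.replace(',', '/') for p in parts)

def rank_lines_to_tags(text):
    """Convert several rank lines to a de-duplicated list of tags."""
    tags = []
    for line in text.splitlines():
        tag = rank_line_to_tag(line)
        if tag and tag not in tags:
            tags.append(tag)
    return tags

sample = """#52 in Kindle Store > Kindle eBooks > Biographies & Memoirs > Arts & Literature > Artists, Architects & Photographers
#246 in Books > Biographies & Memoirs > Arts & Literature > Artists, Architects & Photographers
#924 in Kindle Store > Kindle eBooks > Biographies & Memoirs > Professionals & Academics"""
print(','.join(rank_lines_to_tags(sample)))
```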


Incidentally, the LoC also has a similar field called "Genre/Form Terms", but these terms haven't been widely worked out yet, and the field is usually empty. News on them is here.

I think there are similar plugins and this shouldn't be too hard for a good Python programmer. How about it?