MobileRead Forums - View Single Post

odradek · 11-16-2015, 09:30 AM

I was glad to see a request for a BISAC scraper. I'd like to see that and something similar for Library of Congress Subject Headings.

I have a hierarchical column called "LoC Subject Headings" (#locsh) of type "Comma separated text, like tags, shown in the tag browser". I'd like a button that populates it automatically.

The data can be scraped from the WorldCat and LoC websites. For example, searching for "The Andy Warhol Diaries", WorldCat returns:

"Warhol, Andy, -- 1928-1987 -- Diaries.
Artists -- United States -- Diaries.
Artists -- United States -- Biography.
Warhol, Andy, -- 1928-1987.
Artists.
United States."

The Library of Congress returns:

"Warhol, Andy, 1928-1987 --Diaries.
Artists --United States --Diaries."

Some regex could massage these into:

"Warhol/ Andy 1928-1987.Diaries,Artists.United States.Diaries,Artists.United States.Biography,Warhol/ Andy 1928-1987,Artists,United States"

and

"Warhol/ Andy 1928-1987.Diaries,Artists.United States.Diaries"

(Note how a ',' inside a heading must be handled, since the column itself is comma separated, and note the format of tags for a person.) These strings could then be sent to the #locsh column.
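A rough Python sketch of that massaging follows. The function name lcsh_to_tags is hypothetical, and the rules are only inferred from the two Warhol samples above: drop the comma (and any "--") in front of a date range, turn the remaining commas into "/", and turn "--" into ".". Real headings will almost certainly need more cases, for example a "." inside a heading would create a spurious hierarchy level.

```python
import re

# A comma, optionally followed by "--", in front of a date range
# such as "1928-1987" or an open-ended "1950-".
_DATE_COMMA = re.compile(r",\s*(?:--\s*)?(\d{4}-(?:\d{4})?)")
# WorldCat prints " -- " and the LoC site prints " --"; accept both.
_SUBDIVISION = re.compile(r"\s*--\s*")


def lcsh_to_tags(raw):
    """Turn scraped headings (one per line) into one hierarchical tag string.

    Hypothetical helper: the rules are guessed from the Warhol example only.
    """
    tags = []
    for line in raw.splitlines():
        # Drop surrounding whitespace and the trailing full stop.
        heading = line.strip().rstrip(".").strip()
        if not heading:
            continue
        # "Warhol, Andy, -- 1928-1987" -> "Warhol, Andy 1928-1987"
        heading = _DATE_COMMA.sub(r" \1", heading)
        # Split into subdivisions, then protect any remaining commas,
        # because "," separates tags in the column.
        parts = [p.replace(",", "/") for p in _SUBDIVISION.split(heading) if p]
        # "." is the hierarchy separator in the column.
        tags.append(".".join(parts))
    return ",".join(tags)


loc_sample = "Warhol, Andy, 1928-1987 --Diaries.\nArtists --United States --Diaries."
print(lcsh_to_tags(loc_sample))
```

Fed the WorldCat and LoC samples above, one heading per line, this yields the two target strings quoted in this post. Writing the result into #locsh is left out, since that part depends on the plugin machinery.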

Similarly, for BISAC, an Amazon search returns the following (there may be other sources besides Amazon):

"#52 in Kindle Store > Kindle eBooks > Biographies & Memoirs > Arts & Literature > Artists, Architects & Photographers
#246 in Books > Biographies & Memoirs > Arts & Literature > Artists, Architects & Photographers
#924 in Kindle Store > Kindle eBooks > Biographies & Memoirs > Professionals & Academics"

Which could be processed into:

"Biographies & Memoirs.Arts & Literature.Artists/ Architects & Photographers,Biographies & Memoirs.Arts & Literature.Artists/ Architects & Photographers,Biographies & Memoirs.Professionals & Academics"

and added to a BISAC (#bisac) custom column.
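The Amazon case could be sketched the same way. Again, the function name amazon_ranks_to_tags is made up, and the list of storefront levels to discard ("Kindle Store", "Kindle eBooks", "Books") is an assumption based only on the three lines above. Duplicates are kept so the result matches the string shown; dropping them would be a small addition.

```python
import re

# Leading sales rank such as "#52 in " or "#1,234 in ".
_RANK = re.compile(r"^#[\d,]+\s+in\s+")
# Storefront levels that are not subject categories (assumed list).
_STORE_LEVELS = {"Kindle Store", "Kindle eBooks", "Books"}


def amazon_ranks_to_tags(raw):
    """Turn Amazon rank lines (one per line) into one hierarchical tag string.

    Hypothetical helper: the rules are guessed from the example above only.
    """
    tags = []
    for line in raw.splitlines():
        # Remove the "#52 in " prefix.
        line = _RANK.sub("", line.strip())
        if not line:
            continue
        levels = [lvl.strip() for lvl in line.split(">")]
        # Drop storefront levels and protect commas inside a category name,
        # because "," separates tags in the column.
        levels = [
            lvl.replace(",", "/")
            for lvl in levels
            if lvl and lvl not in _STORE_LEVELS
        ]
        if levels:
            # "." is the hierarchy separator in the column.
            tags.append(".".join(levels))
    return ",".join(tags)


sample = "#924 in Kindle Store > Kindle eBooks > Biographies & Memoirs > Professionals & Academics"
print(amazon_ranks_to_tags(sample))
```

On the three rank lines quoted above this gives the string shown, including the repeated "Artists/ Architects & Photographers" entry.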

Incidentally, the LoC also has a similar field called "Genre/Form Terms", but these terms haven't been widely worked out, and the field is usually empty. News on them is here.

I think there are similar plugins and this shouldn't be too hard for a good Python programmer. How about it?

11-16-2015, 09:30 AM	#797
odradek Member Posts: 12 Karma: 10 Join Date: Apr 2011 Device: odradek	[Plugin Request] scraper for Library of Congress Subject Headings