MobileRead Forums - View Single Post - Pruning redundant and partially redundant tags

Sidetrack · 03-01-2013, 05:05 PM

I've been using the Goodreads metadata download plugin to map tags to a hierarchy I like, and now I'd like to prune the redundant information out of the rest of my tags. I'm getting there with regex search and replace, but any more elegant solution would be appreciated. I'm also a little stumped on how to search for books that have redundant tags on anything better than a case-by-case basis.

example:

foo.fie, foo.fie.fum, foo, fum fie would become simply: foo.fie.fum
or
fiction, genre.crime, genre.mystery, genre.mystery.hard-boiled, crime, mystery, mystery & detective, hardboiled mystery
would become
genre.crime, genre.mystery.hard-boiled

My regex is similar to this, though I've got a bit of a mishmash of special cases going:
template {tags} (\.[^\.,]+)(.*, )?([^,\.]*)\1; \1\2

I have to use separate search terms if the offending tags sort alphabetically before the genre tags, since the backreference only catches a repeat that comes after the hierarchical tag.

Any ideas on how to search for or otherwise identify books with partially redundant tags? Maybe a calculated column? How about some cleaner, more robust replacement terms?
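To pin down what I mean by "redundant", here is a rough Python sketch of the rule. It is plain Python, not calibre's template language or plugin API, and the names prune_tags and has_redundant_tags are made up for illustration. It assumes a tag is redundant if it is a proper prefix of another dotted tag, or if it is a flat tag whose words all already appear as components of some dotted tag (ignoring case and hyphens).

```python
import re


def _norm(word):
    # Lowercase and drop hyphens so "hard-boiled" matches "hardboiled".
    return word.lower().replace("-", "")


def prune_tags(tags):
    """Return the tags with redundant ones removed, original order kept."""
    hier = [t for t in tags if "." in t]
    # Every normalized component that appears in some hierarchical tag.
    components = {_norm(p) for t in hier for p in t.split(".")}
    kept = []
    for t in tags:
        if "." in t:
            # Drop a hierarchical tag that is a proper prefix of another.
            if any(o != t and o.startswith(t + ".") for o in hier):
                continue
        else:
            words = [_norm(w) for w in re.split(r"\s+", t.strip()) if w]
            # Drop a flat tag whose words are all covered by components.
            if words and all(w in components for w in words):
                continue
        kept.append(t)
    return kept


def has_redundant_tags(tags):
    # True when pruning would change the tag list, i.e. the book needs work.
    return prune_tags(tags) != list(tags)
```

On my first example this leaves just foo.fie.fum. On the second it leaves fiction, genre.crime, genre.mystery.hard-boiled and mystery & detective, so tags like "fiction" and "mystery & detective" would still need special-case handling, which is about where my regex approach ends up too.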

One other thing that bugs me is when I get info like the author's name or publisher mixed in as a tag when I've already got that information in its appropriate column.
