MobileRead Forums

jon_joy_1999 06-15-2015 08:14 PM

Automated tag association
 
Hi, I have a bunch of (over 500) text documents that I've imported to the Calibre library, but they don't have any metadata associated with them (they are plain .txt files).
I would like to assign tags to documents based on their content. Basically if a document talks about bridges it is given a bridge tag, if it talks about roadways it is given a roadways tag, and if it talks about bridges and roadways it is given both tags.
How would I do this in Calibre?

BetterRed 06-15-2015 10:54 PM

Quote:

Originally Posted by jon_joy_1999 (Post 3117724)
Hi, I have a bunch of (over 500) text documents that I've imported to the Calibre library, but they don't have any metadata associated with them (they are plain .txt files).
I would like to assign tags to documents based on their content. Basically if a document talks about bridges it is given a bridge tag, if it talks about roadways it is given a roadways tag, and if it talks about bridges and roadways it is given both tags.
How would I do this in Calibre?

@jon_joy_1999 - As far as I know, there's no automatic way to do this based on analysis of book contents.

You could try downloading the metadata from one or more of the metadata source sites (Amazon, B&N, Goodreads etc) - but that probably only works for commercial publications

To do it manually, enter the tags in Metadata Edit (press E), with commas between tags, e.g. 'bridges, roadways'. To the left of the tag field in Metadata Edit there's a button; if you click that, you get a specialised Tag Editor that makes it easy to select previously defined tags, which helps avoid ending up with both 'roads' and 'roadways'.

You can also edit them directly in the book list by highlighting the cell and pressing F2, or press Shift+F2 on a Tags cell to get the Tags Editor.

BR

DaltonST 06-16-2015 12:17 AM

QuarantineAndScrub
 
The subject add-on has a Tags By Comments capability. Peruse its user guide for more info.

DaltonST

jon_joy_1999 06-16-2015 01:31 PM

Hello,
BetterRed, these aren't commercial publications unfortunately. I may have to go the manual route if I can't work out DaltonST's plugin

DaltonST, thanks, I'll take a look at that and post back my results.

BetterRed 06-16-2015 05:24 PM

1 Attachment(s)
Quote:

Originally Posted by jon_joy_1999 (Post 3118144)
Hello,
BetterRed, these aren't commercial publications unfortunately. I may have to go the manual route if I can't work out DaltonST's plugin

@jon_joy_1999 - had a feeling that was the case. If it's not too late, and you had the originals organised around subject -- e.g. in separate directories -- you could re-add them in batches and make use of

Attachment 139361

BR

jon_joy_1999 06-16-2015 10:19 PM

hello, they're not sorted by subject. right now the directory listing is like

Documents
+Roadkill
|`Roadkill.txt
+Fezfez
|`Fezfez.txt
+Apricots and Bonds
|`Apricots and Bonds.txt

DaltonST, I've read through the manual for Q&S and I see how it works with pre-existing metadata (title, tags, etc), but how do I have it use the contents of the file instead of the metadata as indicated in the manual?

BetterRed 06-17-2015 12:40 AM

Looks like you'll have to do the tagging based on your knowledge of the contents.

When I first created my main library it was with about 8000 'texts', and like you I had no downloadable metadata sources. I worked on the tagging progressively over a couple of months and ended up with about 30 tags.

BR

jon_joy_1999 06-17-2015 12:18 PM

alright, thanks BetterRed, if I worked at your rate I'd have these done in about a week. Do you know anything about the aforementioned addon Quarantine & Scrub?

eschwartz 06-17-2015 12:51 PM

Quarantine&Scrub has a long and complex user guide. It seems to have niche appeal and TBH I am not sure how many people understand it.

BetterRed 06-17-2015 06:37 PM

Quote:

Originally Posted by jon_joy_1999 (Post 3118731)
alright, thanks BetterRed, if I worked at your rate I'd have these done in about a week. Do you know anything about the aforementioned addon Quarantine & Scrub?

Only what I've read in the manual - the nearest thing is probably tags from comments. As I understand it, you define word pairs: if the first word is in the comments, then the second word is used as a tag. So you might have it set up such that -- track, street, turnpike, motorway etc -- result in the book being tagged as 'Roads'.

I have reservations about whether that approach would work for contents without contextual analysis - Kerouac's On the Road ain't about "Roads". My initial inclination is that it would require human intervention. But here's a patent aimed at automation; its citations might lead to some implementations

Patent US6199081 - Automatic tagging of documents and exclusion by content

And here's an interesting PDF paper from a taxonomy consultant: Taxonomies for Auto-Tagging Unstructured Content

They might inspire someone to write something
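
For anyone tempted to write something: the word-pair idea described above can be sketched in a few lines of Python. This is only an illustration, not how Quarantine & Scrub works internally. The WORD_PAIRS table and the tags_for_text function are made-up names, and the trigger words are just the examples from this post plus a couple of invented ones.

```python
import re

# Hypothetical word pairs: trigger word -> tag to apply.
# These entries are illustrative only; build your own list.
WORD_PAIRS = {
    "track": "Roads",
    "street": "Roads",
    "turnpike": "Roads",
    "motorway": "Roads",
    "bridge": "Bridges",
    "viaduct": "Bridges",
}


def tags_for_text(text, word_pairs=WORD_PAIRS):
    """Return the sorted list of tags whose trigger word occurs in text."""
    # Lower-case and split into whole words, so that 'tracking'
    # does not count as a match for 'track'.
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted({tag for trigger, tag in word_pairs.items() if trigger in words})


print(tags_for_text("The turnpike crosses the river on a steel bridge."))
# prints ['Bridges', 'Roads']
```

Plain matching like this would still tag On the Road as 'Roads' if it mentions a street, and it misses plurals such as 'bridges' unless they are listed as triggers too, so the results would need checking by hand.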

BR

jon_joy_1999 06-18-2015 03:31 PM

Quote:

Originally Posted by eschwartz (Post 3118755)
Quarantine&Scrub has an incredibly long user guide and the creator expects you to read it all, doesn't like having to explain it. It seems to have niche appeal and TBH I am not sure how many people understand it.

I read the user guide provided and didn't see any functions that looked like they would do what I wanted; that's why I asked DaltonST how I would use it with the contents of the file.
I'm new here, but I'm already disappointed that a developer would suggest a plugin that doesn't do what he said.
Quote:

Originally Posted by BetterRed (Post 3118950)
Only what I've read in the manual - the nearest thing is probably tags from comments. As I understand it, you define word pairs: if the first word is in the comments, then the second word is used as a tag. So you might have it set up such that -- track, street, turnpike, motorway etc -- result in the book being tagged as 'Roads'.

I have reservations about whether that approach would work for contents without contextual analysis - Kerouac's On the Road ain't about "Roads". My initial inclination is that it would require human intervention. But here's a patent aimed at automation; its citations might lead to some implementations

Patent US6199081 - Automatic tagging of documents and exclusion by content

And here's an interesting PDF paper from a taxonomy consultant: Taxonomies for Auto-Tagging Unstructured Content

They might inspire someone to write something

BR

These files don't even have comments associated with them. As of right now I'm using Agent Ransack to search through the files for keywords and then manually applying tags within Calibre.

That patent seems to be outside the scope of my needs. I'm not using a network to store these files.

The Hedden document is a slideshow presentation of talking points, not something I could use

BetterRed 06-18-2015 06:27 PM

Quote:

Originally Posted by jon_joy_1999 (Post 3119565)
I read the user guide provided and didn't see any functions that looked like they would do what I wanted; that's why I asked DaltonST how I would use it with the contents of the file.
I'm new here, but I'm already disappointed that a developer would suggest a plugin that doesn't do what he said.

These files don't even have comments associated with them.

Given your documents are text documents - why don't you try pasting a couple of them (yes the whole document) into the corresponding Comments column and then experiment with the Q&S Tags from Comments facility - after which you can remove the text from Comments.

I've no idea if DaltonST had this in mind when he suggested you take a look at his PI. If you look through the version history of the PI you'll see there have been many enhancements - many of which stemmed from posts such as yours.

Quote:

Originally Posted by jon_joy_1999 (Post 3119565)
As of right now I'm using Agent Ransack to search through the files for keywords and then manually applying tags within Calibre.

I use Windows Search in much the same way as you're using Ransack. When I get interested in something I do the relevant searches, save the result paths to the clipboard, paste them into Notepad++ and make them into a CSV that I read with the Import List PI to create a Reading List, add a Tag etc.
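
The search-then-CSV routine above could be scripted roughly as follows. This is a sketch under assumptions: the find_matches and write_import_csv helpers are hypothetical names, the 'title' and 'tags' columns are an assumed layout (check how your Import List settings map columns), and a plain substring search stands in for Windows Search or Agent Ransack.

```python
import csv
from pathlib import Path


def find_matches(root, keyword):
    """Yield .txt files under root whose text contains keyword (case-insensitive)."""
    needle = keyword.lower()
    for path in sorted(Path(root).rglob("*.txt")):
        # errors="ignore" skips undecodable bytes instead of stopping the scan.
        text = path.read_text(encoding="utf-8", errors="ignore")
        if needle in text.lower():
            yield path


def write_import_csv(paths, tag, out_file):
    """Write one CSV row per file: title (file name without extension) and tag."""
    with open(out_file, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["title", "tags"])  # assumed column layout
        for path in paths:
            writer.writerow([path.stem, tag])


# Example call (folder name from this thread; keyword, tag and file name are made up):
# write_import_csv(find_matches("Documents", "bridge"), "bridges", "bridges.csv")
```

Using the file name as the title only works if the books were added under their file names, as in the directory listing earlier in the thread.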

Calibre (GUI and command line) and its PIs provide 'canned' solutions to many problems. But they also provide a rich set of tools which, with a bit of lateral thinking, enable the user to fashion their own solutions.

BR

DaltonST 08-06-2015 10:45 AM

1 Attachment(s)
@jon_joy_1999:


If you haven't finished manually creating Tags within Calibre for your 500 text files, this might help you by creating Comments and Tags automatically using a list of the 'Top N Nouns' in each text file, sorted by frequency in descending order:


[GUI Plugin] English Noun Frequency : https://www.mobileread.com/forums/sho...d.php?t=263684


A typical example for a Factual/Non-fiction book is attached just below.





DaltonST
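
To give a feel for what a 'Top N by frequency' list looks like, here is a rough sketch. It is not the plugin's code and makes no attempt to pick out nouns; it simply counts every word that is not in a small, hand-picked stop list. STOP_WORDS and top_words are illustrative names.

```python
import re
from collections import Counter

# A tiny, hand-picked stop list for illustration; a usable one would be much longer.
STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "on", "is", "are",
    "was", "it", "that", "this", "with", "for", "as", "by", "at",
}


def top_words(text, n=5):
    """Return the n most frequent words in text, ignoring stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    # Skip stop words and very short fragments such as the 't' in "don't".
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(n)]


sample = ("The bridge spans the river. The bridge carries a road, "
          "and the road meets the river road.")
print(top_words(sample, n=3))
# prints ['road', 'bridge', 'river']
```

The resulting words could then be reviewed and applied as tags by hand, or fed into a keyword-to-tag table.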


All times are GMT -4.
