MobileRead Forums - View Single Post

Krazykiwi · 07-21-2017, 06:34 PM

I ran a large Calibre install hosting a pretty large library as a text corpus for a research project, which isn't a million miles from this kind of usage. I pretty much agree with the above method.

I'm a little unclear whether you mean things like storing textbooks and user manuals centrally, like an internal reference library, or, as Harry suggests, more like company-generated documents such as reports, so this is probably directed more towards the latter.

A couple of things you might want to think about:
Think about the metadata you need now, before you set up, especially if it's likely to differ from standard book metadata, because it's a lot easier to do that from scratch than retrofit it later. For instance, will you need to keep multiple revisions of documents? Consider adding a "Revision" or "Edition" column. It's better to collect metadata you don't need than to figure out 6 months and several thousand docs down the line that you really ought to have X stored, because people keep asking for docs that fit X.

Try to have rigorous requirements for incoming file naming, because it will greatly speed up the librarian's work if you can just parse most of the useful metadata straight out of the filename. We did have a central incoming folder that anyone could put docs in, but I also ran a cron job that pulled out anything that didn't meet the requirements so that it wasn't imported until it was fixed. For instance, if date or department is a requirement in your metadata for the library to be useful, then require it in the incoming filenames, and specify the format clearly and simply so people will use it.
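To give a rough idea of what that filename check could look like, here is a minimal sketch in Python. The naming convention (`YYYY-MM-DD_department_title.ext`), the allowed extensions, and the function names `parse_filename` and `quarantine_invalid` are all made up for illustration; this is not the script I actually ran, and you would swap in whatever format your own library needs.

```python
import re
import shutil
from datetime import date
from pathlib import Path

# Hypothetical convention: YYYY-MM-DD_department_title.ext
# e.g. 2017-07-21_finance_quarterly-report.pdf
FILENAME_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})"
    r"_(?P<department>[a-z]+)"
    r"_(?P<title>[a-z0-9-]+)"
    r"\.(?P<ext>pdf|docx|epub)$"
)


def parse_filename(name):
    """Return a metadata dict for a conforming filename, or None otherwise."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    try:
        # Rejects things like 2017-13-40 that fit the pattern but are not dates.
        doc_date = date.fromisoformat(match["date"])
    except ValueError:
        return None
    return {
        "date": doc_date,
        "department": match["department"],
        "title": match["title"].replace("-", " "),
        "ext": match["ext"],
    }


def quarantine_invalid(incoming, rejected):
    """Move non-conforming files out of the incoming folder.

    Returns the names of the files that were moved, so they can be
    reported back to whoever dropped them there.
    """
    incoming, rejected = Path(incoming), Path(rejected)
    rejected.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(incoming.iterdir()):
        if path.is_file() and parse_filename(path.name) is None:
            shutil.move(str(path), str(rejected / path.name))
            moved.append(path.name)
    return moved
```

Run from cron, something like `quarantine_invalid("/srv/incoming", "/srv/rejected")` (both paths are placeholders) leaves only importable files behind, and the dict from `parse_filename` is what you would then map onto your metadata columns.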

The calibre web interface (particularly the new one!) is great for one file at a time, but if you are talking about documents like business reports, be prepared to pull multiple documents at a time. Since that was pretty much the reason for our project, we set up a separate web form so the research team could request a subset of documents based on arbitrary criteria ("This tag or this tag and published between this date and that date and more than 10k words but not more than 100k words and in English"). The request got mailed to the librarian (i.e., me), and I could then spit out the results in the desired format via save to disk, back to the network server. A web form might be overkill for your case, but an email template for people to fill in, stuck up on your intranet somewhere, will make this task a lot easier and faster.
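For what it's worth, a request like the quoted one boils down to a handful of structured fields, which is why a form or template helps so much. Here is a small Python sketch of that idea. The `Doc` and `Request` classes and their field names are hypothetical and have nothing to do with calibre's own API or database; the sketch only shows the criteria applied to plain records.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    # Hypothetical record; in practice these would come from library metadata.
    title: str
    tags: set
    pubdate: date
    words: int
    language: str


@dataclass
class Request:
    # One field per blank on the (hypothetical) request form.
    any_tags: set          # "this tag or this tag"
    published_from: date   # "between this date..."
    published_to: date     # "...and that date"
    min_words: int         # "more than" (exclusive)
    max_words: int         # "not more than" (inclusive)
    language: str


def matches(doc, req):
    """True if the document satisfies every criterion in the request."""
    return (
        bool(doc.tags & req.any_tags)
        and req.published_from <= doc.pubdate <= req.published_to
        and req.min_words < doc.words <= req.max_words
        and doc.language == req.language
    )


def select(docs, req):
    """Return the documents that match, in their original order."""
    return [doc for doc in docs if matches(doc, req)]
```

The point is less the code than the shape of `Request`: if the form or email template forces people to fill in exactly those fields, the librarian never has to guess what "recent longish reports" was supposed to mean.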

I can probably think of more things, but those are the ones that can really be a time-suck if you haven't planned sufficiently, so I hope that helps.
