#1 |
Member
Posts: 17
Karma: 10
Join Date: Mar 2012
Device: nook
Ebook tagging service (idea phase)
I wanted to get the community's feedback on an idea, and maybe gather some suggestions, before working out how to set it up and looking for volunteers.

I have a very large ebook collection and am having a hard time tagging every book. I find that sometimes the author's name has part of the series name in it, and then the metadata will not download correctly. Manually fixing them all would take hundreds of hours (a conservative estimate). I wanted the equivalent of freedb (for music), but for ebooks.

Here is the idea. We build a web service that can do a lookup based on a sentence in the book. We compute the hash code for that sentence and store it in a database along with a link to the associated metadata for that book. Since we don't want to store every sentence of the book in the database, we look for common markers like "Chapter 1" or "Part 1", or other keywords. If we can't find those, maybe we just take sentences over 10 characters long from the first 5-10 pages?

To populate this database we would build a plugin and get volunteers to run it on their collections. For books that contain an ISBN and have a cover, description, and tags, we check whether the book is already in the database; if not, we add their data. Very quickly we could probably get hundreds of thousands of books into the database. I would also like to find out whether there is a way to set up my own data in the ISBN field. For books like short stories, where there is no ISBN, if someone manually tags one we would like to share it.

Interface:
addBook(sentence, cover, metadata)
mergeBook(sentence, cover, metadata): used for merging two sets of metadata; makes sure that everything is populated
containsBook(filename)
containsBook(sentence)
lookupBook(sentence, filename)

First, has anyone tried anything like this yet? It seems that the content of the book is the only truly unique way to associate your copy of a book with mine as the same book, if we can't both find the ISBN. Would people be willing to help populate this service by running it on their collections? Are any developers interested in helping?

I am more of a Java/C# guy and would probably be better suited for the backend, but I would figure out how to write some Python if necessary. Feedback appreciated.

Last edited by jorm; 03-13-2012 at 10:18 AM.
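The proposed interface can be sketched quickly. This is a minimal in-memory mock, not a real web service: the function names come from the post, while the normalization step and the choice of SHA-1 are assumptions for illustration (the filename variants are omitted).

```python
import hashlib
import re

_db = {}  # sentence hash -> stored record

def _sentence_key(sentence):
    # Normalize whitespace and case so minor formatting differences
    # between two copies of the same book still hash identically.
    normalized = re.sub(r"\s+", " ", sentence).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def addBook(sentence, cover, metadata):
    _db[_sentence_key(sentence)] = {"cover": cover, "metadata": metadata}

def containsBook(sentence):
    return _sentence_key(sentence) in _db

def lookupBook(sentence):
    return _db.get(_sentence_key(sentence))
```

With this normalization, a copy that differs only in spacing or capitalization still resolves to the same record.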
#2 |
creator of calibre
Posts: 45,311
Karma: 27111242
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Setting up some kind of book fingerprint algorithm would be an interesting challenge. Off the top of my head, you could use:

The set of all proper nouns (defined as words with the first letter capitalized that are not at the start of a sentence). There would need to be some metric over the space of such sets that allows for close but not perfect matches.

I don't think you would have much success with a random sentence, as picking the same sentence in different formats of the book will be difficult. For example, the MOBI format could have a table of contents embedded at the beginning, or a calibre conversion of the book could have an embedded metadata jacket.
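The proper-noun idea above can be sketched in a few lines: collect capitalized words that don't start a sentence, then compare two books with Jaccard similarity so close-but-imperfect matches still score highly. The sentence splitting is deliberately crude and the 0.8 threshold is an arbitrary assumption, not a tuned value.

```python
import re

def proper_nouns(text):
    nouns = set()
    for sentence in re.split(r"[.!?]+\s*", text):
        words = sentence.split()
        for word in words[1:]:  # skip the sentence-initial word
            # Capitalized first letter, rest lowercase -> treat as proper noun
            if word[0:1].isupper() and word[1:].islower():
                nouns.add(word.strip(",;:\"'"))
    return nouns

def jaccard(a, b):
    # Similarity in [0, 1]: 1.0 means identical sets
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def same_book(text1, text2, threshold=0.8):
    return jaccard(proper_nouns(text1), proper_nouns(text2)) >= threshold
```

The Jaccard metric is exactly the kind of "metric over the space of such sets" mentioned: extra front matter adds a few spurious nouns but leaves the score close to 1.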
#3 |
Member
Posts: 17
Karma: 10
Join Date: Mar 2012
Device: nook
Good point. Using a proper noun as an indicator for the sentence to check might be more effective.

The key is being able to get to the start of the contents of the book, past the table of contents, copyright, etc. You mentioned the set of proper nouns: are you thinking of doing a lookup with a list of proper nouns found at the beginning of the book? I think using that to identify a sentence might work well, but if we just used the nouns without the sentence, we would probably get confused by multiple books in a series about the same characters (nouns).
#4 |
creator of calibre
Posts: 45,311
Karma: 27111242
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
I doubt we could come up with any reasonable scheme that would work across all books 100% of the time. The idea would be to come up with something that is 1) computationally cheap, 2) much smaller than the book itself, and 3) fairly robust.

I vaguely recall reading about fingerprinting for audio tracks, which suggests some kind of statistical analysis of the text: the set of proper nouns, a histogram of word frequencies (keep only the bottom 20 or so), average sentence length, number of punctuation marks, that kind of thing.
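A sketch of that statistical signature: a rarest-word set, average sentence length, and a punctuation count. The cutoff of 20 follows the rough number in the post; everything else (which punctuation to count, the tokenizer) is an untested assumption.

```python
import re
from collections import Counter

def fingerprint(text):
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter(words)
    # Keep only the ~20 least frequent words, as suggested; ties are
    # broken alphabetically so the result is deterministic.
    rarest = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))[:20]
    return {
        "rare_words": frozenset(w for w, _ in rarest),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "punctuation": sum(text.count(p) for p in ",;:!?"),
    }
```

Each component is cheap to compute in one pass and orders of magnitude smaller than the book itself, which matches the three requirements above.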
#5 |
Member
Posts: 17
Karma: 10
Join Date: Mar 2012
Device: nook
Do you think we could identify a paragraph in a book by looking for several sentences that are placed together with punctuation, using simple rules for the number of words per sentence and counts of proper nouns? I might try to devise an algorithm, run it on a sample of books, and see if I can extract a real sentence. I can do PDF, EPUB, HTML, and text, since I can read those directly. I know calibre can read MOBI, but I have not figured out how to read it programmatically yet.

If not, we can move into the fuzzier logic of word frequencies and proper names. That approach would be more interesting but would require a lot more design and programming. In that approach, do we try to process the header info as well, or do we still try to make our way to the content? Perhaps if we get the first option functional we can capture some metrics, then set up the service to store the metrics and do some analysis to determine how much fuzzy variance to allow.

I do want it to be computationally cheap, because I want to encourage people to help populate the database with the data they have already sorted and tagged, so others can benefit. I can do the backend service and database. If someone familiar with Python and plugin development is willing to help, that would be great.
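The "find a real paragraph past the front matter" rule could look something like this: take the first blank-line-separated block with several sentences of plausible length. The thresholds (3 sentences, 5-40 words each) are guesses for illustration, not values anyone has tested.

```python
import re

def first_content_paragraph(text, min_sentences=3, min_words=5, max_words=40):
    for block in re.split(r"\n\s*\n", text):
        sentences = [s.strip() for s in re.split(r"[.!?]+", block) if s.strip()]
        if len(sentences) < min_sentences:
            continue  # likely a heading, TOC line, or copyright notice
        lengths = [len(s.split()) for s in sentences]
        if all(min_words <= n <= max_words for n in lengths):
            # Collapse internal whitespace so the result hashes consistently
            return " ".join(block.split())
    return None
```

This only works on plain extracted text; a format without blank-line paragraph breaks would need a different splitter.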
#6 |
creator of calibre
Posts: 45,311
Karma: 27111242
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Look at the word count calibre plugin; that will show you how to extract text from books of any format in calibre.

The problem with using a paragraph is once again one of identification. The algorithm is going to come up with a "signature" for the book, and that signature has to be calculated independently against every instance of the book. How are you going to ensure that the algorithm picks the same paragraph in every instance? In other words, your algorithm picks paragraph number 23 in the EPUB version of the book and sends it as the signature to the server. Now the algorithm is running on another computer, where it has no access to what happened on the first computer. How will it know to pick the same paragraph of the same book to send as the signature to the server?
#7 |
Well trained by Cats
Posts: 31,016
Karma: 60358908
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Why not try the NCX first, to possibly locate the start via the guide, then try for the standards (Prologue, Chapter..., or TOC lines that start with digits), and resort to a more complicated fallback?
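For EPUBs this guide lookup is straightforward to sketch. In EPUB 2 the OPF file's `<guide>` often contains a `<reference type="text">` pointing at where the real content starts; the element names below follow that spec, but plenty of real files simply omit the guide, so this is only the first rung of the fallback ladder.

```python
import zipfile
import xml.etree.ElementTree as ET

OPF_NS = "{http://www.idpf.org/2007/opf}"
CNT_NS = "{urn:oasis:names:tc:opendocument:xmlns:container}"

def find_text_start(epub_file):
    """Return the href of the guide's text-start reference, or None."""
    with zipfile.ZipFile(epub_file) as z:
        # container.xml tells us where the OPF lives
        container = ET.fromstring(z.read("META-INF/container.xml"))
        opf_path = container.find(".//" + CNT_NS + "rootfile").get("full-path")
        opf = ET.fromstring(z.read(opf_path))
        for ref in opf.iter(OPF_NS + "reference"):
            if ref.get("type") == "text":
                return ref.get("href")  # e.g. "chapter01.html"
    return None  # no guide: fall back to Prologue/Chapter heuristics
```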
#8 |
Member
Posts: 17
Karma: 10
Join Date: Mar 2012
Device: nook
I agree that finding a consistent sentence or paragraph is the hard part. If we could do that consistently, or at least 90% of the time, we could probably complete this in a few days. Sweeping through the entire book counting proper names and tracking those counts may work, but how fuzzy would we have to make it? What if there was an error in conversion and we lost the space between FirstName and "said", giving FirstNamesaid? The count could be off.

So we have two approaches:

1. Try to get the first paragraph or sentence. Challenge: we might have a difficult time finding the first paragraph, since we have the TOC, headers, copyrights, etc., and it might not be 100% accurate.

2. Count proper nouns and maybe a couple of keyword frequencies. Challenge: how much leeway do we allow? If a conversion was not perfect, would the count be off so that we would not find a match? Do we only process the first x pages, to limit the processing power needed?

I can see the benefits of both approaches. The second is cleaner in that we can process the header as well. However, if one copy of a book has a header and another copy does not, we might not match. In that case, though, someone else might have tagged the other copy, and we can find a match using that pattern as well. So if 80% of the time we capture the consistent magic sentence or paragraph, then for the 20% of the time we don't, if someone has tagged one of those books we will have that sentence in our database too. I am open to both approaches; I just want feedback on the best one so we can move forward.
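The "how much leeway" question for approach 2 can be made concrete by treating the proper-noun counts as a vector and comparing with cosine similarity: a conversion glitch that drops a few occurrences only nudges the score, while a different book scores near zero. The 0.95 threshold here is a placeholder assumption that would need tuning against real collections.

```python
import math
from collections import Counter

def cosine(a, b):
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def counts_match(counts_a, counts_b, threshold=0.95):
    # Counter returns 0 for missing names, so the vectors line up
    return cosine(Counter(counts_a), Counter(counts_b)) >= threshold
```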
#9 |
Wizard
Posts: 1,258
Karma: 3439432
Join Date: Feb 2008
Device: Amazon Kindle Paperwhite (300ppi), Samsung Galaxy Book 12
Punctuation will fail if you're using quotes since quotation styles differ by region.
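One way around the regional-quotes problem is to strip every quote variant before computing any hash or sentence boundary. The character list below covers the common English, German, and French styles, but it is surely incomplete; a fuller solution would use Unicode categories.

```python
# Straight, curly, low-9, and guillemet quote characters
QUOTE_CHARS = "\"'\u2018\u2019\u201c\u201d\u00ab\u00bb\u201a\u201e"

def strip_quotes(text):
    # Mapping a code point to None deletes it
    return text.translate({ord(c): None for c in QUOTE_CHARS})
```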
#10 |
Connoisseur
Posts: 99
Karma: 15776
Join Date: Dec 2011
Device: PB912 Matt White
Speaking mathematically, the mapping of containers to contents is a surjective function and is generally not reversible, i.e. the container/book is not always distinct. With the sentence approach you could identify a closed unit (story, novel, poem), but not which container (book/anthology/collection) it is in, since the same story can be contained in more than one book.

To identify a container, one has to consider the hash values of all the items in it (that is how the hash of e.g. Java's List is computed). The problem is: how can you split a container's content into its elements? Perhaps there would always be a blank page as a separator between the items, but maybe not. Also, you can't know a priori whether it is a collection of different stories or a collection of chapters belonging to the same story. I think it would be better to somehow process the TOC.

There could also be "foreign content" in a book, like quotes or proverbs, so taking a sentence might lead you to identify a different book.

Last edited by Backi; 03-13-2012 at 01:21 PM.
#11 |
Member
Posts: 17
Karma: 10
Join Date: Mar 2012
Device: nook
It seems that if they can make applications that listen to songs and identify them based on wave frequencies, we should have an easier time doing it for books, since we are working with data that we can logically see and read. Punctuation would be used in the identification of a sentence or paragraph, not in the hash code. It is true that if two books contain the same sentence, as in your anthology/story example, we might have an issue. And some books, like plain text files, might not have a TOC.
#12 |
Connoisseur
Posts: 99
Karma: 15776
Join Date: Dec 2011
Device: PB912 Matt White
You couldn't say which album or compilation it is from, except the very first container it was released in.
#13 |
eBook Enthusiast
Posts: 85,544
Karma: 93383099
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
Perhaps I'm being unduly cynical here, but it sounds to me as though this is something that would primarily be of use to people who download pirated books. The overwhelming majority of commercial eBooks that I've bought have had pretty good metadata.
Or have I misunderstood what's being requested?
#14 |
Member
Posts: 17
Karma: 10
Join Date: Mar 2012
Device: nook
OK, that makes sense for songs. So if we want to handle anthologies, we have to go with approach 2: proper names, taking the count for the entire book. That seems like it would solve it. In my case, most of the books that are not being picked up are not anthologies, just ones with really poor tagging.

Does approach 2 sound workable? Are there any obvious logic failures we can foresee? Next question: can we get people to run it on their books and populate our database?
#15 |
Member
Posts: 17
Karma: 10
Join Date: Mar 2012
Device: nook
Harry, most of the books I am referring to are either short stories (no ISBN) or free books from Smashwords or other Creative Commons or public domain sources. For most of those I have to fix the data myself, since they may not have been tagged properly when created.