Old 03-13-2012, 10:36 AM   #1
jorm
Member
Posts: 17
Karma: 10
Join Date: Mar 2012
Device: nook
Ebook tagging service (idea phase)

I wanted to get the community's feedback on an idea, and maybe some suggestions, before working out how to set it up and looking for volunteers.

I have a very large ebook collection and am having a hard time tagging every book. Sometimes the author's name contains part of the series name, so the metadata does not download correctly. Fixing them all manually would take hundreds of hours (and that is a conservative estimate).

I wanted the equivalent of freedb (music) for ebooks.

Here was the idea.

We build a web service that can do a lookup based on a sentence in the book. We compute the hash code for that sentence and store it in a database with a link to the associated metadata for that book. Since we don't want to store every sentence of the book in the database, we would look for common markers such as

"Chapter 1", "Part 1" or other keywords. If we cannot find those, maybe just take the sentences over 10 characters long from the first 5-10 pages?
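A minimal sketch of the sentence-hashing step, assuming the sentence is normalized (case, punctuation, whitespace) before hashing so that small formatting differences between copies do not change the key. The function names, the choice of SHA-1 and the dictionary standing in for the database are all illustrative assumptions, not part of the proposal.

```python
import hashlib
import re
import unicodedata


def normalize_sentence(sentence):
    """Reduce a sentence to lowercase words separated by single spaces."""
    # NFKC folds compatibility characters such as ligatures (assumed step).
    text = unicodedata.normalize("NFKC", sentence).lower()
    # Drop punctuation so curly and straight quotes produce the same key.
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())


def sentence_key(sentence):
    """Return a hex digest usable as a database key for the sentence."""
    normalized = normalize_sentence(sentence)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


# Hypothetical in-memory stand-in for the database table.
database = {}
database[sentence_key("It was a dark and stormy night.")] = {"title": "Example Title"}

# A copy with different spacing and no final period still finds the record.
print(sentence_key("  It was a dark and   stormy night  ") in database)  # prints True
```

Normalizing first trades a small risk of collisions between near-identical sentences for tolerance of conversion differences.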

To populate this database we would have to build a plugin and get volunteers to run it on their collections. For each book that has an ISBN, a cover, a description and tags, we check whether it is already in the database; if not, we add its data. We could probably get hundreds of thousands of books into the database very quickly.


I would also like to find out whether there is a way to set up my own data in the ISBN field. For books like short stories, where there is no ISBN, we would like to share the tags if someone enters them manually.

Interface

addBook(sentence, cover, metadata)
mergeBook(sentence, cover, metadata)
Used for merging two sets of metadata. Makes sure that everything is populated.

containsBook(filename)
containsBook(sentence)
lookupBook(sentence, filename)
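The interface above could be prototyped along these lines. This is only a sketch under assumptions: the class name, the in-memory dictionary and the merge rule (existing values win, gaps are filled) are hypothetical, the sentence argument is taken to be an already computed key, and the filename-based variants are left out.

```python
class TaggingService:
    """In-memory stand-in for the proposed web service (hypothetical design)."""

    def __init__(self):
        # sentence key -> {"cover": ..., "metadata": {...}}
        self._books = {}

    def add_book(self, sentence, cover, metadata):
        """Store a record unless one already exists for this sentence."""
        if sentence in self._books:
            return False
        self._books[sentence] = {"cover": cover, "metadata": dict(metadata)}
        return True

    def merge_book(self, sentence, cover, metadata):
        """Fill in fields missing from the stored record; keep existing values."""
        record = self._books.setdefault(sentence, {"cover": None, "metadata": {}})
        if record["cover"] is None:
            record["cover"] = cover
        for field, value in metadata.items():
            if not record["metadata"].get(field):
                record["metadata"][field] = value
        return record

    def contains_book(self, sentence):
        """Report whether a record exists for this sentence."""
        return sentence in self._books

    def lookup_book(self, sentence):
        """Return the stored record, or None if the sentence is unknown."""
        return self._books.get(sentence)


service = TaggingService()
service.add_book("key-1", None, {"title": "Example Title"})
service.merge_book("key-1", b"cover-bytes", {"title": "Other", "series": "Example Series"})
print(service.lookup_book("key-1")["metadata"])
```

Letting existing values win in a merge is one possible policy; a real service would need a rule for conflicting submissions.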

First, has anyone tried anything like this yet? It seems that the content of the book is the only truly unique way to associate your copy of a book with mine if we can't both find the ISBN.


Would people be willing to help populate this service by running it on their collections? Are any developers interested in helping? I am more of a Java/C# guy and would probably be better suited to the backend, but I would figure out how to write some Python if necessary.

Feedback appreciated.

Last edited by jorm; 03-13-2012 at 11:18 AM.
Old 03-13-2012, 12:22 PM   #2
kovidgoyal
creator of calibre
Posts: 26,436
Karma: 5383257
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Setting up some kind of book fingerprint algorithm would be an interesting challenge. Off the top of my head, you could use:

Set of all proper nouns (defined as words with the first letter capitalized that are not at the start of a sentence). There would need to be some metric over the space of such sets that allows for close but not perfect matches.
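One way to read this suggestion, as a rough sketch: collect the capitalized words that are not sentence-initial, then compare two such sets with Jaccard similarity. Jaccard is only one possible metric, chosen here as an assumption, and the sentence splitting is deliberately naive.

```python
import re


def proper_nouns(text):
    """Collect capitalized words that do not start a sentence."""
    nouns = set()
    # Naive sentence split: ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = re.findall(r"[A-Za-z]+", sentence)
        for word in words[1:]:  # skip the sentence-initial word
            # Ignore one-letter words such as the pronoun "I".
            if word[0].isupper() and len(word) > 1:
                nouns.add(word)
    return nouns


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


first = proper_nouns("Then Alice met Bob in Paris. She said hello to Carol.")
second = proper_nouns("Then Alice met Bob in Paris. She said hello.")
print(jaccard(first, second))  # prints 0.75
```

A match would then be declared above some threshold, which would have to be tuned on real books.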

I don't think you would have much success with a random sentence, as picking the same sentence in different formats of the book will be difficult. For example, the MOBI format could have a table of contents embedded at the beginning, or a calibre conversion of the book could have an embedded metadata jacket.
Old 03-13-2012, 12:40 PM   #3
jorm
Member
Posts: 17
Karma: 10
Join Date: Mar 2012
Device: nook
Good point. Using a proper noun as an indicator for the sentence to check might be more effective.

The key is being able to get to the start of the book's contents, past the table of contents, copyright page, etc.

You mentioned the set of proper nouns. Are you thinking of doing a lookup with a list of proper nouns found at the beginning of the book? I think using that to identify a sentence might work well, but if we just used the nouns without the sentence we would probably confuse multiple books in a series about the same characters (nouns).
Old 03-13-2012, 01:24 PM   #4
kovidgoyal
creator of calibre
Posts: 26,436
Karma: 5383257
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Quote:
Originally Posted by jorm
You mentioned the set of proper nouns. Are you thinking of doing a lookup with a list of proper nouns found at the beginning of the book? I think using that to identify a sentence might work well, but if we just used the nouns without the sentence we would probably confuse multiple books in a series about the same characters (nouns).
The problem with sentences is: how can you be sure you're past the front matter? Most ebooks do not mark their front matter; it's just part of the body text.

I doubt we could come up with any reasonable scheme that would work across all books 100% of the time. The idea would be to come up with something that is 1) computationally cheap, 2) much smaller than the book itself, and 3) fairly robust.

I vaguely recall reading about fingerprinting for audio tracks, which suggests some kind of statistical analysis of the text: set of proper nouns, histogram of word frequencies (keep only the bottom 20 or so), average sentence length, number of punctuation marks, that kind of thing.
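A small sketch of such a statistical summary, under assumptions: "bottom 20" is read as the 20 least frequent words, ties are broken alphabetically so that the result is deterministic, and the tokenizing rules are simplistic. The function and field names are illustrative.

```python
import re
from collections import Counter


def text_statistics(text, rare_count=20):
    """Compute a few cheap statistics that together act as a fingerprint."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter(words)
    # Least frequent words first; alphabetical order breaks ties.
    rarest = sorted(counts.items(), key=lambda item: (item[1], item[0]))[:rare_count]
    return {
        "rare_words": [word for word, _ in rarest],
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "punctuation_marks": sum(1 for ch in text if ch in ".,;:!?"),
    }


stats = text_statistics("The cat sat. The dog ran, fast!", rare_count=3)
print(stats)
```

The result is far smaller than the book and cheap to compute, but how robust each statistic is across conversions would need to be measured on real files.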
Old 03-13-2012, 01:39 PM   #5
jorm
Member
Posts: 17
Karma: 10
Join Date: Mar 2012
Device: nook
Do you think we could identify a paragraph in a book by looking for several sentences that are placed together with punctuation, using simple rules for the number of words per sentence and counts of proper nouns? I might try to devise an algorithm, run it on a sample of books, and see whether it extracts a real sentence. I can do PDF, EPUB, HTML and text, since I can read those directly. I know calibre can read MOBI, but I have not figured out how to read it programmatically yet.

If not, we can move more into the fuzzy logic of word frequencies and proper names. This approach would be more interesting but would require a lot more design and programming. With this approach, do we try to process the header info as well, or do we still try to make our way to the content?

Perhaps if we get the first option working we can capture some metrics, then set up the service to store them and do some analysis to determine how much fuzzy variance to allow.

I do want it to be computationally cheap, because I want to encourage people to help populate the database with books they have already sorted and tagged so that others can benefit.

I can do the backend service and database. If someone familiar with Python and plugin development is willing to help, that would be great.
Old 03-13-2012, 01:48 PM   #6
kovidgoyal
creator of calibre
Posts: 26,436
Karma: 5383257
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Look at the word count calibre plugin; that will show you how to extract text from books of any format in calibre.

The problem with using a paragraph is once again one of identification. The algorithm is going to come up with a "signature" for the book, and that signature has to be calculated independently against every instance of the book. How are you going to ensure that the algorithm picks the same paragraph in every instance of the book? In other words, your algorithm picks paragraph number 23 in the EPUB version of the book and sends it as the signature to the server. Now the algorithm is running on another computer, where it has no access to what happened on the first computer. How will it know to pick the same paragraph for the same book to send the signature to the server?
Old 03-13-2012, 01:48 PM   #7
theducks
Grand Sorcerer
Posts: 15,268
Karma: 6020309
Join Date: Aug 2009
Location: (The original) Silicon Valley, USA
Device: Galaxy Tab 2, Astak Pocket Pro, K4NT
Why not first try the book's own navigation data (the NCX, or the start reference in the Guide) to locate the start, then try the standard markers (Prologue, Chapter..., or TOC lines that start with digits), and only then resort to a more complicated fallback?
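The heading fallback might look like the sketch below. It skips the NCX/Guide step entirely and assumes the text has already been extracted as a list of lines. The pattern and the rule "a heading followed by prose marks the start" are assumptions; the rule is there because table of contents entries tend to be headings followed by more headings.

```python
import re

# Lines that look like a heading: "Prologue", "Chapter ...", "Part 2" or
# "Part IV", or a line starting with digits (hypothetical pattern).
HEADING = re.compile(
    r"^\s*(prologue\b|chapter\b|part\s+(\d+|[ivxlc]+)\b|\d+\b)",
    re.IGNORECASE,
)


def find_body_start(lines):
    """Return the index of the first heading followed by prose, or None."""
    for i, line in enumerate(lines):
        if not HEADING.match(line):
            continue
        # Next non-empty line after this heading, if there is one.
        following = next((l for l in lines[i + 1:] if l.strip()), None)
        if following is not None and not HEADING.match(following):
            return i
    return None


book = [
    "Copyright 2012",
    "Contents",
    "Chapter 1",
    "Chapter 2",
    "",
    "Chapter 1",
    "It was a dark night.",
]
print(find_body_start(book))  # prints 5
```

Prose lines that happen to start with a number would fool this, so it can only be one step in a chain of fallbacks.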
Old 03-13-2012, 02:06 PM   #8
jorm
Member
Posts: 17
Karma: 10
Join Date: Mar 2012
Device: nook
I agree, finding a consistent sentence or paragraph is the hard part. If we could do that consistently, or at least 90% of the time, we could probably complete this in a few days. Sweeping through the entire book, counting proper names and tracking those counts may work, but how fuzzy would we have to make the match? What if there was a conversion error and we lost the space between FirstName and said, giving FirstNamesaid? The count would be off.

So we have two approaches.

1. Try to get the first paragraph or sentence.

Challenge: We might have a difficult time finding the first paragraph since we have the TOC, headers, copyright notices, etc. It might not be 100% accurate.

2. Count of proper nouns and maybe a couple of keyword frequencies.
Challenge: How much leeway do we allow here? If a conversion was not perfect, would the count be off so that we would not find a match? Do we only do this for the first x pages to limit the processing power needed?
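For the leeway question in approach 2, one option is to compare the proper-noun counts of two copies with a similarity score instead of requiring equal counts. The sketch below uses a weighted Jaccard score and a threshold of 0.9; both the metric and the threshold are assumptions that would need tuning on real data.

```python
from collections import Counter


def count_similarity(a, b):
    """Weighted Jaccard: shared occurrences divided by total occurrences."""
    keys = set(a) | set(b)
    shared = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    total = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return shared / total if total else 1.0


def same_book(a, b, threshold=0.9):
    """Treat two count tables as the same book above the threshold."""
    return count_similarity(a, b) >= threshold


clean = Counter({"Anna": 100, "Boris": 80})
# One lost space turned an "Anna said" into "Annasaid" in this copy.
damaged = Counter({"Anna": 99, "Boris": 80, "Annasaid": 1})

print(round(count_similarity(clean, damaged), 3))  # prints 0.989
print(same_book(clean, damaged))  # prints True
```

A single conversion error barely moves the score, while two different books in a series with the same characters would still need a second signal to tell them apart.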



I can see the benefits of both approaches. The second is cleaner in that we can process the header as well. However, if one copy of a book has a header and the other does not, we might not match.


However, in that case someone else might have tagged that variant, and we could find a match using its pattern as well. So if we capture a consistent magic sentence or paragraph 80% of the time, then for the other 20% we can still match whenever someone has tagged a copy that yields the same sentence, because that sentence will be in our database too.

I am open to both approaches; I just want to get feedback on the best one and move forward.
Old 03-13-2012, 02:11 PM   #9
WillAdams
Guru
Posts: 979
Karma: 1915000
Join Date: Feb 2008
Device: Sony PRS-600, Fujitsu Stylistic ST-4121
Punctuation will fail if you're using quotes since quotation styles differ by region.
Old 03-13-2012, 02:18 PM   #10
Backi
Connoisseur
Posts: 99
Karma: 15776
Join Date: Dec 2011
Device: PB912 Matt White
Quote:
Originally Posted by jorm
It seems that the content of the book is the only truly unique way to associate your copy of a book with mine if we can't both find the ISBN.
The problem is that you can't always map contents to the container (here, a book).

Speaking mathematically, the mapping of containers to contents is surjective but not injective, so it is generally not reversible, i.e. the container/book is not always uniquely determined:
With the sentence approach you could identify a closed unit (story, romance, poem), but not which container (book/anthology/collection) it is in, as the same story can be contained in more than one book.

To identify a container one has to consider the hash values of all the items in it (that's how the hash of e.g. Java's List is computed). The problem is: how can you split a container's content into its elements? Perhaps there would always be a blank page as a separator between the items, but maybe not always. Also, you can't know a priori whether it is a collection of different stories or a collection of chapters belonging to the same story. I think it would be better to process the TOC somehow.
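As a sketch of the container idea: hash each item, then combine the item hashes in order with the same 31 * h + e recurrence that Java's List.hashCode() uses. Using the first four bytes of SHA-1 as the item hash and masking to an unsigned 32-bit value are assumptions made to keep the example stable in Python, and it presumes the items have already been split out, which is the hard part described above.

```python
import hashlib


def item_hash(text):
    """Stable 32-bit hash of one item (story, chapter) from its text."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def container_hash(items):
    """Combine item hashes in order, like Java's List.hashCode() recurrence."""
    h = 1
    for item in items:
        # Keep the running value within 32 bits (unsigned, unlike Java's int).
        h = (31 * h + item_hash(item)) & 0xFFFFFFFF
    return h


anthology_a = ["First story text", "Second story text"]
anthology_b = ["Second story text", "First story text"]

# The same stories in a different order give a different container hash.
print(container_hash(anthology_a) == container_hash(anthology_b))
print(container_hash(anthology_a) == container_hash(list(anthology_a)))
```

Because the combination is order-sensitive, two anthologies with the same stories in a different order count as different containers, which may or may not be what is wanted.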

There could also be "foreign content" in a book, like quotes or proverbs, so taking a sentence might lead you to identify a different book.

Last edited by Backi; 03-13-2012 at 02:21 PM.
Old 03-13-2012, 02:25 PM   #11
jorm
Member
Posts: 17
Karma: 10
Join Date: Mar 2012
Device: nook
It seems that if they can make applications that listen to a song and identify it based on wave frequencies, we should have an easier time doing it for books, since we are working with data that we can see and read directly. Punctuation would be used to identify a sentence or paragraph, not in the hash code. It is true that we might have an issue if two books contain the same sentence, as with an anthology and a standalone story in your example. Also, some books, such as a text file, might not have a TOC.
Old 03-13-2012, 02:33 PM   #12
Backi
Connoisseur
Posts: 99
Karma: 15776
Join Date: Dec 2011
Device: PB912 Matt White
Quote:
Originally Posted by jorm
It seems that if they can make applications that listen to a song and identify it based on wave frequencies, we should have an easier time doing it for books, since we are working with data that we can see and read directly.
Yes, but a song is a closed unit (as I wrote for poem, story etc.).
You couldn't say which album or compilation it is from, except for the very first container it was released in.
Old 03-13-2012, 02:37 PM   #13
HarryT
eBook Enthusiast
Posts: 65,492
Karma: 43935573
Join Date: Nov 2006
Location: UK
Device: Kindle Voyage, iPad Mini, iPhone 4, MS Surface Pro, N7
Perhaps I'm being unduly cynical here, but it sounds to me as though this is something that would primarily be of use to people who download pirated books. The overwhelming majority of commercial eBooks that I've bought have had pretty good metadata.

Or have I misunderstood what's being requested?
Old 03-13-2012, 02:40 PM   #14
jorm
Member
Posts: 17
Karma: 10
Join Date: Mar 2012
Device: nook
OK, that makes sense for songs. So if we want to handle anthologies we have to go with approach 2, proper names, and take the count for the entire book. That seems like it would solve it. In my case, most of the books that are not being picked up are not anthologies, just ones with really poor tagging.

Does approach 2 sound workable? Are there any obvious logic failures that we can foresee? The next question is whether we can get people to run it on their books and populate our database.
Old 03-13-2012, 02:43 PM   #15
jorm
Member
Posts: 17
Karma: 10
Join Date: Mar 2012
Device: nook
Harry, most of the books I am referring to are either short stories (no ISBN) or free books from Smashwords or other Creative Commons or public domain sources. For most of those I have to fix the data myself, since they may not have been tagged properly when created.
All times are GMT -4.