Ebook tagging service (idea phase) - Page 2

HarryT · 03-13-2012, 01:45 PM

Quote:

Originally Posted by jorm

Harry. Most of the books I am referring to are either short stories (no ISBN) or free books from Smashwords, Creative Commons, or public domain sources. I have to fix the data for most of those myself, since they may not have been tagged properly when created.

Do sources like Smashwords not include decent metadata in their books? Or is it just that authors don't bother to include it? I've no experience of it myself.

Backi · 03-13-2012, 01:45 PM

Quote:

Originally Posted by HarryT

Perhaps I'm being unduly cynical here, but it sounds to me as though this is something that would primarily be of use to people who download pirated books.

I think these people would know what they searched for.
[- Hey, dude, I pirated a very thick book, yeah, must be very good!
- What's that?
- Dunno. Let me check on bookdb ... <calculating for a long time> ... Oh, it's the Bible, dude!]

HarryT · 03-13-2012, 01:53 PM

Quote:

Originally Posted by Backi

I think these people would know what they searched for.
[- Hey, dude, I pirated a very thick book, yeah, must be very good!
- What's that?
- Dunno. Let me check on bookdb ... <calculating for a long time> ... Oh, it's the Bible, dude!]

I was thinking more of the case of the person who's downloaded thousands of random pirated books from (say) Usenet newsgroups, and is looking for a way to load them all into Calibre and get good metadata for them. I can certainly see the use of a tool for looking up data for an individual book, but I'd urge caution in setting up a system which would be a boon for pirates.

jorm · 03-13-2012, 02:10 PM

As for Smashwords, some authors do and some don't. I was just trying to come up with a good way for all of us who manually add metadata to share our work. I see the point about piracy, and I don't want to encourage it. But is it fair not to build a system that may have a legitimate use because pirates may use it? Perhaps we could limit the number of lookups per day. For the average user that is not a big deal, but for a pirate with thousands of books it might be a headache.
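
The per-day lookup limit suggested above could be sketched roughly as follows. This is only an illustration: the class name `DailyLookupLimiter`, the `client_id` key, and the default of 50 lookups are assumptions, not anything agreed in the thread.

```python
from collections import defaultdict
from datetime import date


class DailyLookupLimiter:
    """Allow each client at most `max_per_day` lookups per calendar day."""

    def __init__(self, max_per_day=50):  # 50 is an arbitrary placeholder
        self.max_per_day = max_per_day
        self._counts = defaultdict(int)  # (client_id, day) -> lookups used

    def allow(self, client_id, today=None):
        """Return True and count the lookup, or False if the quota is spent."""
        day = today or date.today()
        key = (client_id, day)
        if self._counts[key] >= self.max_per_day:
            return False
        self._counts[key] += 1
        return True
```

A real service would also need to decide what identifies a client (account, API key, IP address), which is left open here.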

Backi · 03-13-2012, 02:31 PM

Quote:

Originally Posted by jorm

OK, that makes sense for songs. So if we want to handle anthologies, we have to go with approach 2, proper names, and take the count for the entire book. That seems like it would solve it. In my case, most of the books it is not picking up are not anthologies, just ones with really poor tagging.

Does approach 2 sound workable? Any obvious logic failures that we can foresee? The next question is whether we can get people to run it on their books and populate our database.

I think one must choose data from a book that is free of formatting and layout. I am thinking of a single string of characters, with everything else, such as white space, stripped off.
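
A minimal sketch of that single-string idea, assuming we keep only letters and digits. Lowercasing and the accent folding are extra assumptions made here so that differently encoded copies compare equal; the thread itself only asks for white space (and, later, punctuation) to be removed.

```python
import unicodedata


def normalize_text(text):
    """Return only the letters and digits of `text`, lowercased."""
    # NFKD splits an accented letter into a base letter plus a combining
    # mark; the mark is not alphanumeric, so the filter below drops it.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if ch.isalnum()).casefold()
```

For example, `normalize_text("They didn't believe me.")` gives `"theydidntbelieveme"`.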

...

OK, I browsed for a while. The single-string idea brought me to DNA sequence analysis, then to string-searching algorithms like the Rabin–Karp algorithm, and then to what seems to me to be the solution (it might be too complex to implement oneself, but there are perhaps some open source frameworks), because it represents the same use case:

Plagiarism detection

What we are talking about here, like finding the first sentence, taking nouns, etc., is the document model.
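
For readers who have not met the Rabin–Karp algorithm mentioned above, here is a minimal sketch of its rolling-hash search. The `base` and `mod` values are common illustrative choices, not anything proposed in the thread, and a plagiarism-detection framework would build considerably more on top of this.

```python
def rabin_karp_find(text, pattern, base=256, mod=1_000_000_007):
    """Return the index of the first occurrence of `pattern` in `text`, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(base, m - 1, mod)  # weight of the window's leading character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    for i in range(n - m + 1):
        # Compare the real characters only when the hashes agree, which
        # guards against hash collisions.
        if p_hash == t_hash and text[i:i + m] == pattern:
            return i
        if i < n - m:
            # Slide the window: drop text[i], append text[i + m].
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return -1
```

The point of the rolling hash is that each window's hash is derived from the previous one in constant time, so a long text is scanned without rehashing every window from scratch.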

jorm · 03-13-2012, 02:44 PM

That is very interesting that it is the same use case. If plagiarism is detected (i.e. the book is found in our database), then download the metadata. Otherwise, if the book contains metadata, add it so we can match on it in the future.

I am thinking white space and all punctuation would be stripped off.

Any other developers want to help?

jorm · 03-13-2012, 03:01 PM

The issue with using a plagiarism detection tool is that I don't want to transfer the book to determine whether it is there. That could be construed as piracy. By computing metrics or hash keys, the work is distributed and we don't transfer the file.
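
One way to picture "hash keys instead of the file": the client hashes its own copy and sends only the digests. The sketch below is an assumption-laden illustration; the function name `text_fingerprint`, the use of SHA-256, and the 1000-character chunk size are all choices made here, not part of the proposal.

```python
import hashlib


def text_fingerprint(normalized_text, chunk_size=1000):
    """Hash the text in fixed-size chunks; only the digests leave the machine."""
    digests = []
    for start in range(0, len(normalized_text), chunk_size):
        chunk = normalized_text[start:start + chunk_size]
        digests.append(hashlib.sha256(chunk.encode("utf-8")).hexdigest())
    return digests
```

Exact hashes like these only match when the chunks are identical character for character, so an extra preface or a single corrected typo changes the digests. That limitation is why the rest of the thread looks at fuzzier metrics.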

Backi · 03-13-2012, 03:52 PM

Quote:

Originally Posted by jorm

The issue with using a plagiarism detection tool is that I don't want to transfer the book to determine whether it is there. That could be construed as piracy. By computing metrics or hash keys, the work is distributed and we don't transfer the file.

Yes, it would work with document models as abstractions. An appropriate document model must be found.
Document models of books (closed units) would be generated and stored, and then compared with the model of a given book for equality or similarity to a certain degree.
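
As one possible example of such a document model, the sketch below uses character shingles (overlapping substrings) compared with Jaccard similarity. This is a standard near-duplicate technique swapped in for illustration; the thread does not settle on any particular model, and the shingle length `k=8` is an arbitrary assumption.

```python
def shingles(normalized_text, k=8):
    """Return the set of all k-character substrings of the text."""
    return {normalized_text[i:i + k] for i in range(len(normalized_text) - k + 1)}


def jaccard(a, b):
    """Similarity of two sets: size of the intersection over size of the union."""
    union = a | b
    if not union:
        return 0.0  # two empty models give no evidence of a match
    return len(a & b) / len(union)
```

Two copies of the same book with small differences would share most of their shingles and score close to 1.0, while unrelated books would score close to 0.0. To avoid storing readable fragments of text, each shingle could be hashed before it is stored.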

jorm · 03-13-2012, 04:55 PM

Backi, any chance you have some development skills and want to help out?

Backi · 03-13-2012, 06:15 PM

Quote:

Originally Posted by jorm

Backi, any chance you have some development skills

I have development skills, but I have to refuse your offer: I don't know much about information retrieval, and after work I want to spend my spare time on other things. Good luck to you and whoever joins you!

jorm · 03-13-2012, 07:41 PM

I understand. I would prefer to spend my time elsewhere. However, I seem to get stuck with an idea in my head from time to time and need to let it vent. Still hoping to get someone else to help with the project.

transmitthis · 03-14-2012, 08:13 AM

Quote:

Originally Posted by jorm

It seems if they can make applications that can listen to songs and identify a song based on wave frequencies we should have an easier time doing it for books

What about using that? Have a computer "read" a portion, which would get rid of any white space, formatting, and punctuation mistakes. Just a thought.

Anyhoo, I like your idea. Many calibre users have tagged their books, so helping others to not have to do the same is a great idea.

Damn, it's got my brain going round in circles as to how you could reliably fingerprint a book. You may have to use two-stage algorithms, along with some fuzzy logic. Good luck.

jorm · 03-14-2012, 09:31 AM

I am looking for assistance from at least one other developer who is interested in helping with this project, preferably one familiar with Python, so we can use calibre for reading MOBI and LIT.

jorm · 03-14-2012, 09:51 AM

I think the key is that we use 3-4 options at once; if one fails, we fall through to the next. We capture metrics in different ways and populate a database.

We try to find the first chapter by looking for a lot of sentences of a certain length with punctuation. We strip off the white space and punctuation.

We try to find the last chapter by looking for a lot of sentences of a certain length with punctuation. We strip off the white space and punctuation.

We take the proper noun set from the first 20 pages and use a bit of fuzzy logic.
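
A very rough sketch of the proper-noun idea, under loud assumptions: "proper noun" is approximated here as a capitalized word that directly follows a lowercase word (so it is probably not a sentence start), and the "fuzzy logic" is replaced by a plain set-overlap score. Neither heuristic comes from the thread, and both would miss or misjudge plenty of real cases.

```python
import re


def proper_noun_candidates(text):
    """Crude heuristic: capitalized words that directly follow a lowercase word."""
    # The lookbehind requires a lowercase letter and a space before the
    # word, which skips words that merely start a sentence.
    return set(re.findall(r"(?<=[a-z] )[A-Z][a-z]+", text))


def name_overlap(a, b):
    """Share of the smaller name set that also appears in the larger one."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))
```

Two copies of the same book should yield nearly the same name set even when their formatting differs, so a high overlap would be a hint, not proof, that they match.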

The idea is that we build a system that does all of these and, in the case of an add, captures the metrics for all of them.

When someone searches, we will try all of these until we get a success.
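
The add-everything, search-until-success flow described above might look something like this. The names `add_book` and `lookup_metadata`, the `(name, function)` extractor pairs, and the plain dictionary standing in for the database are all hypothetical.

```python
def add_book(book_text, metadata, extractors, database):
    """On an add, store the metadata under every fingerprint we can compute."""
    for name, extract in extractors:
        key = extract(book_text)
        if key is not None:  # None means this method found nothing usable
            database[(name, key)] = metadata


def lookup_metadata(book_text, extractors, database):
    """On a search, try each method in order and return the first hit, else None."""
    for name, extract in extractors:
        key = extract(book_text)
        if key is None:
            continue  # fall through to the next method
        hit = database.get((name, key))
        if hit is not None:
            return hit
    return None
```

Each extractor would be one of the metrics listed above (first chapter, last chapter, proper noun set); the order of the list decides which one is tried first.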

As for playing part of the book, that would be interesting, but it might not be fast enough.

Fanas · 03-15-2012, 11:23 AM

I wonder if this would work. I have no idea how feasible it would be, so don't be mad:

1. Choose three random words which appear at least a few times, excluding "a", "the", "in" and such.
2. Remove all punctuation and spacing.
3. Count the letters between occurrences of those words.
4. Generate a hash containing those three words, the number of times each word appears, and the number of letters between each appearance.
5. When identifying a book, to prevent headers and formatting from interfering, just set a certain threshold at which identification counts as positive. If the occurrences of a word are almost always nearly the same distance apart, count it as a match.

Example (random quote from one book):

Quote:

Lawrence was holding the next to last sheet up to Prime Intellect's TV eye when the phone rang. "They didn't believe me. I'm shitcanned," Stebbins said.
"Didn't believe you about what?"
"The papers man, the goddamn Correlation Effect papers. I'm gonna kill you for this, I really am."
"The papers are right here. I just got through showing them to Prime Intellect. You need them back?"
"It don't matter now, I don't work here any more." There was a pause. "I bet they're gonna put you in jail for this."
Prime Intellect's face disappeared from the TV, and words began to scroll across the screen:

* JOHN TAYLOR IS IN THE ROOM WITH HIM. HE IS DIRECTING STEBBINS.

Lawrence read this as he talked. "Jail for what? I just borrowed the papers to see if Prime Intellect could expand on them."
Another pause. "What? It didn't come up with anything, did it?"
"Well, it's..." (Why do you care if you've just been fired? Lawrence wondered.)

* STEBBINS IS LYING. HE WENT TO TAYLOR AS SOON YOU LEFT AND TOLD HIM THAT YOU BROUGHT THEM TO ME.

"...too early..."

* TELL HIM YES.

"Actually, I think it's just noticed something. Hang on."

* TELL HIM IT POINTS TO A NEW FORM OF COSMOLOGY WHICH THEY DID NOT CONSIDER. INFINITE RANGE IS PROBABLY POSSIBLE WITH EXISTING HARDWARE. TELEPORTATION OF MATTER IS PROBABLY POSSIBLE.

Prime Intellect paused a moment, and the words PROBABLY were replaced with DEFINITELY.
Lawrence blinked, then typed into the little-used keyboard of his console,

> Is this true?
* YES.

"It says it will give you the stars," Lawrence said flatly.
"What? You been eating mushrooms, Lawrence? Lawrence?"

> What will it take to implement this?
* LET ME TRY SOMETHING.

"It says it will give you the stars. It says your faster than light chips can be made to work at infinite range. It says you can teleport matter."
Now there was a long, long pause. "That's bullshit," Stebbins finally said. "We tried everything."
Lawrence heard a small uproar through the phone, an uproar that would have been very loud on Stebbins' end. Men were arguing. A loud voice (Military Mitchell's, Lawrence thought) bellowed, "WHAT THE FUCK DO YOU MEAN?" Then there was the faint pop of a door slamming in the background.

* I'VE GOT IT. HANG ON.

None of them knew it at the time, but that was really the moment the world changed.

Same quote with everything that's not needed stripped off:

Quote:

LawrencewasholdingthenexttolastsheetuptoPrimeIntellectsTVeyewhenthephonerangTheydidntbelievemeImshitcannedStebbinssaidDidntbelieveyouaboutwhatThepapersmanthegoddamnCorrelationEffectpapersImgonnakillyouforthisIreallyamThepapersarerighthereIjustgotthroughshowingthemtoPrimeIntellectYouneedthembackItdontmatternowIdontworkhereanymoreTherewasapauseIbettheyregonnaputyouinjailforthisPrimeIntellectsfacedisappearedfromtheTVandwordsbegantoscrollacrossthescreenJOHNTAYLORISINTHEROOMWITHHIMHEISDIRECTINGSTEBBINSLawrencereadthisashetalkedJailforwhatIjustborrowedthepaperstoseeifPrimeIntellectcouldexpandonthemAnotherpauseWhatItdidntcomeupwithanythingdiditWellitsWhydoyoucareifyouvejustbeenfiredLawrencewonderedSTEBBINSISLYINGHEWENTTOTAYLORASSOONYOULEFTANDTOLDHIMTHATYOUBROUGHTTHEMTOMEtooearlyTELLHIMYESActuallyIthinkitsjustnoticedsomethingHangonTELLHIMITPOINTSTOANEWFORMOFCOSMOLOGYWHICHTHEYDIDNOTCONSIDERINFINITERANGEISPROBABLYPOSSIBLEWITHEXISTINGHARDWARETELEPORTATIONOFMATTERISPROBABLYPOSSIBLEPrimeIntellectpausedamomentandthewordsPROBABLYwerereplacedwithDEFINITELYLawrenceblinkedthentypedintothelittleusedkeyboardofhisconsoleIsthistrueYESItsaysitwillgiveyouthestarsLawrencesaidflatlyWhatYoubeeneatingmushroomsLawrenceLawrenceWhatwillittaketoimplementthisLETMETRYSOMETHINGItsaysitwillgiveyouthestarsItsaysyourfasterthanlightchipscanbemadetoworkatinfiniterangeItsaysyoucanteleportmatterNowtherewasalonglongpauseThatsbullshitStebbinsfinallysaidWetriedeverythingLawrenceheardasmalluproarthroughthephoneanuproarthatwouldhavebeenveryloudonStebbinsendMenwerearguingAloudvoiceMilitaryMitchellsLawrencethoughtbellowedWHATTHEFUCKDOYOUMEANThentherewasthefaintpopofadoorslamminginthebackgroundIVEGOTITHANGONNoneofthemknewitatthetimebutthatwasreallythemomenttheworldchanged

Say we make the 3 words: tell (7 occurrences), said (3 occurrences), just (4 occurrences)

Tell: 1-222-2-108-3-186-4-203-5-49-6-152-7
Said: 1-1047-2-260-3
Just: 1-297-2-127-3-134-4

So the 1st "tell" is 222 letters from the 2nd, the 2nd is 108 letters from the 3rd, and so on.
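
The gap lists above can be computed along these lines. The sketch assumes two conventions that seem to fit the example but are not spelled out in the post: matching is a case-insensitive substring search (the count of 7 for "tell" only works if "Intellect" counts), and a gap is the number of letters strictly between the end of one occurrence and the start of the next, which appears consistent with the 222 reported for the first pair. The function name `word_gaps` is made up for this illustration.

```python
def word_gaps(stripped_text, word):
    """Return the letter counts between successive occurrences of `word`."""
    haystack = stripped_text.lower()
    needle = word.lower()
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        # Continue searching after the end of this occurrence.
        start = haystack.find(needle, start + len(needle))
    # Gap = start of the next occurrence minus end of the previous one.
    return [b - (a + len(needle)) for a, b in zip(positions, positions[1:])]
```

A word that occurs n times yields n - 1 gaps, so the occurrence count can be recovered from the gap list whenever the word appears at least once.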

So once the hash is generated, it's up to statistics to decide whether the book in question is the same.
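
What "up to statistics" could mean in the simplest case: count how many stored gaps reappear, within a small tolerance, in the book being identified, and accept the match above some threshold. The tolerance of 2 letters and the threshold of 0.8 are placeholder assumptions, and `gap_match_ratio` and `same_book` are hypothetical names.

```python
def gap_match_ratio(reference_gaps, candidate_gaps, tolerance=2):
    """Fraction of reference gaps that have a candidate gap within `tolerance`."""
    if not reference_gaps:
        return 0.0
    remaining = list(candidate_gaps)
    matched = 0
    for gap in reference_gaps:
        for i, other in enumerate(remaining):
            if abs(gap - other) <= tolerance:
                matched += 1
                del remaining[i]  # each candidate gap may be used only once
                break
    return matched / len(reference_gaps)


def same_book(reference_gaps, candidate_gaps, threshold=0.8):
    """Treat the books as the same when enough gaps agree."""
    return gap_match_ratio(reference_gaps, candidate_gaps) >= threshold
```

An inserted header or a removed paragraph would disturb only the gaps that span it, so most of the gaps should still agree for two copies of the same book.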


