Old 03-13-2012, 01:45 PM   #16
HarryT
eBook Enthusiast
Posts: 85,544
Karma: 93383099
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
Quote:
Originally Posted by jorm
Harry, most of the books I am referring to are either short stories (no ISBN) or free books from Smashwords or other Creative Commons or public domain sources. For most of those I have to fix the data myself, since they may not have been tagged properly when created.
Do sources like Smashwords not include decent metadata in their books? Or is it just that authors don't bother to include it? I've no experience of it myself.
Old 03-13-2012, 01:45 PM   #17
Backi
Connoisseur
Posts: 99
Karma: 15776
Join Date: Dec 2011
Device: PB912 Matt White
Quote:
Originally Posted by HarryT
Perhaps I'm being unduly cynical here, but it sounds to me as though this is something that would primarily be of use to people who download pirated books.

I think these people would know what they've searched for.
[- Hey, dude, I pirated a very thick book, yeah, must be very good!
- What's that?
- Dunno. Let me check on bookdb ... <calculating for a long time> ... Oh, it's the Bible, dude!]
Old 03-13-2012, 01:53 PM   #18
HarryT
eBook Enthusiast
Posts: 85,544
Karma: 93383099
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
Quote:
Originally Posted by Backi
I think these people would know what they've searched for.
[- Hey, dude, I pirated a very thick book, yeah, must be very good!
- What's that?
- Dunno. Let me check on bookdb ... <calculating for a long time> ... Oh, it's the Bible, dude!]
I was thinking more of the case of the person who's downloaded thousands of random pirated books from (say) Usenet newsgroups, and is looking for a way to load them all into Calibre and get good metadata for them. I can certainly see the use of a tool for looking up data for an individual book, but I'd urge caution in setting up a system which would be a boon for pirates.
Old 03-13-2012, 02:10 PM   #19
jorm
Member
Posts: 17
Karma: 10
Join Date: Mar 2012
Device: nook
As for Smashwords, some authors do and some don't. I was just trying to come up with a good way for all of us who manually add metadata to share our work. I see the point about piracy, and I don't want to encourage it. But is it fair not to build a system that may have a legitimate use because pirates may use it? Perhaps we could limit the number of lookups per day. For the average user that is not a big deal, but for a pirate with thousands of books it might be a headache.
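
A per-day lookup cap like the one suggested here could be sketched roughly as follows. This is only an illustration in Python: the class name `DailyLookupLimiter`, the `client_id` key, and the default of 50 lookups per day are all assumptions, not anything the thread settled on.

```python
from collections import defaultdict
from datetime import date


class DailyLookupLimiter:
    """Allow at most `max_per_day` lookups per client per calendar day."""

    def __init__(self, max_per_day=50):
        self.max_per_day = max_per_day
        # (client_id, day) -> number of lookups already served that day
        self._counts = defaultdict(int)

    def allow(self, client_id, today=None):
        """Return True and count the lookup if the client is under quota."""
        day = today or date.today()
        key = (client_id, day)
        if self._counts[key] >= self.max_per_day:
            return False
        self._counts[key] += 1
        return True
```

A real service would also need to decide what identifies a client (account, API key, IP address), which the sketch leaves open.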
Old 03-13-2012, 02:31 PM   #20
Backi
Connoisseur
Posts: 99
Karma: 15776
Join Date: Dec 2011
Device: PB912 Matt White
Quote:
Originally Posted by jorm
OK, that makes sense for songs. So if we want to handle anthologies, we have to go with approach 2, proper names, and take the count for the entire book. That seems like it would solve it. In my case, most of the books it is not picking up are not anthologies, just ones with really poor tagging.

Does approach 2 sound workable? Are there any obvious logic failures that we can foresee? Next is whether we can get people to run it on their books and populate our database.
I think one must choose data from a book that is free of formatting and layout. I am thinking of a single string of characters, with all other characters, such as whitespace, stripped off.

...

OK, I browsed a while. The single-string idea brought me to DNA sequence analysis, then to string-searching algorithms like the Rabin–Karp algorithm, and then to what seems to me to be the solution, because it represents the same use case (it might be too complex to implement oneself, but there are perhaps some open-source frameworks):

Plagiarism detection

What we are talking about here, such as finding the first sentence, taking the nouns, etc., is the document model.
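
For reference, the rolling hash at the heart of Rabin–Karp can be sketched in a few lines of Python. The function name `rolling_hashes`, the window length `k`, and the base and modulus are illustrative assumptions; the input is assumed to be text already reduced to a single string.

```python
def rolling_hashes(s, k, base=256, mod=1_000_000_007):
    """Yield the Rabin-Karp hash of every length-k window of s, left to right."""
    if k <= 0 or len(s) < k:
        return
    # Weight of the character that leaves the window on each step.
    high = pow(base, k - 1, mod)
    h = 0
    for ch in s[:k]:
        h = (h * base + ord(ch)) % mod
    yield h
    for i in range(k, len(s)):
        h = (h - ord(s[i - k]) * high) % mod  # drop the oldest character
        h = (h * base + ord(s[i])) % mod      # append the newest character
        yield h
```

Each window hash is derived from the previous one in constant time, which is what makes it cheap to hash every window of a whole book. Two copies of the same text that differ only in front matter would still share most of their window hashes.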
Old 03-13-2012, 02:44 PM   #21
jorm
Member
Posts: 17
Karma: 10
Join Date: Mar 2012
Device: nook
It is very interesting that it is the same use case. If plagiarism is detected (i.e., the book is found in our database), then download the metadata. Otherwise, if the book contains metadata, add it so we can match on it in the future.

I am thinking whitespace and all punctuation would be stripped off.
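
Stripping whitespace and punctuation could look something like the sketch below. The function name `strip_layout` is made up for illustration, and treating "punctuation" as the Unicode `P*` categories is an assumption; symbols such as `$` or `+` are kept, and case is left unchanged.

```python
import unicodedata


def strip_layout(text):
    """Remove whitespace and punctuation, keeping everything else in order."""
    kept = []
    for ch in text:
        if ch.isspace():
            continue
        # Unicode general categories starting with "P" are punctuation
        # (this includes curly quotes and dashes, not only ASCII marks).
        if unicodedata.category(ch).startswith("P"):
            continue
        kept.append(ch)
    return "".join(kept)
```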

Any other developers want to help?
Old 03-13-2012, 03:01 PM   #22
jorm
Member
Posts: 17
Karma: 10
Join Date: Mar 2012
Device: nook
The issue with using a plagiarism detection tool is that I don't want to transfer the book to determine whether it is there. That could be construed as piracy. By computing metrics or hash keys, the work is distributed and we don't transfer the file.
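
As a rough illustration of computing a hash key locally, so that only a short digest (never the book text) would leave the user's machine: the function name `book_key` and the choice of SHA-256 are assumptions made for this sketch.

```python
import hashlib


def book_key(text):
    """Return a hex digest of the text with everything but letters and digits removed."""
    stripped = "".join(ch for ch in text if ch.isalnum())
    return hashlib.sha256(stripped.encode("utf-8")).hexdigest()
```

An exact digest like this only matches when the stripped text is identical character for character, so a single corrected typo would produce a different key. That limitation is one reason the thread goes on to discuss similarity-based approaches.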
Old 03-13-2012, 03:52 PM   #23
Backi
Connoisseur
Posts: 99
Karma: 15776
Join Date: Dec 2011
Device: PB912 Matt White
Quote:
Originally Posted by jorm
The issue with using a plagiarism detection tool is that I don't want to transfer the book to determine whether it is there. That could be construed as piracy. By computing metrics or hash keys, the work is distributed and we don't transfer the file.
Yes, it would work with document models as abstractions. An appropriate document model must be found.
Document models of books (or other closed units) would be generated, stored, and then compared with the model of the given book for equality, or for similarity to a certain degree.
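
One possible document model, offered purely as an assumption since the thread does not pick one, is the set of all length-k substrings ("shingles") of the stripped text, compared with Jaccard similarity. The names `shingles`, `jaccard`, and `same_book`, and the 0.9 threshold, are illustrative.

```python
def shingles(s, k=5):
    """All length-k substrings of an already stripped string, as a set."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def jaccard(a, b):
    """Size of the intersection over size of the union; 1.0 means identical sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def same_book(model_a, model_b, threshold=0.9):
    """Treat two models as the same book if they are similar enough."""
    return jaccard(model_a, model_b) >= threshold
```

A threshold of 1.0 corresponds to the "equality" case, and anything lower to "similarity to a certain degree".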
Old 03-13-2012, 04:55 PM   #24
jorm
Member
Posts: 17
Karma: 10
Join Date: Mar 2012
Device: nook
Backi, any chance you have some development skills and want to help out?
Old 03-13-2012, 06:15 PM   #25
Backi
Connoisseur
Posts: 99
Karma: 15776
Join Date: Dec 2011
Device: PB912 Matt White
Quote:
Originally Posted by jorm
Backi, any chance you have some development skills
I have development skills, but I have to decline your offer: I don't know much about information retrieval, and after work I want to spend my spare time on other things. Good luck to you and whoever joins you!
Old 03-13-2012, 07:41 PM   #26
jorm
Member
Posts: 17
Karma: 10
Join Date: Mar 2012
Device: nook
I understand. I would prefer to spend my time elsewhere too. However, I seem to get stuck with an idea in my head from time to time and need to let it vent. I am still hoping to get someone else to help with the project.
Old 03-14-2012, 08:13 AM   #27
transmitthis
Addict
Posts: 288
Karma: 1003542
Join Date: May 2011
Device: Google Nexus 7 16GB
Quote:
Originally Posted by jorm
It seems that if they can make applications that can listen to songs and identify a song based on wave frequencies, we should have an easier time doing it for books.
What about using that: have a computer "read" a portion, which would get rid of any whitespace, formatting, and punctuation mistakes? Just a thought.

Anyhoo, I like your idea. Many calibre users have tagged their books, so saving others from having to do the same is a great idea.

Damn, it's got my brain going round in circles as to how you could reliably fingerprint a book. You may have to use two-stage algorithms, along with some fuzzy matching. Good luck.
Old 03-14-2012, 09:31 AM   #28
jorm
Member
Posts: 17
Karma: 10
Join Date: Mar 2012
Device: nook
I am looking for assistance from at least one other developer who is interested in helping with this project, preferably one familiar with Python, so we can use calibre for reading MOBI and LIT.
Old 03-14-2012, 09:51 AM   #29
jorm
Member
Posts: 17
Karma: 10
Join Date: Mar 2012
Device: nook
I think the key is that we use 3-4 options at once; if one fails, we fall through to the next. We capture metrics in different ways and populate a database.

We try to find the first chapter by looking for a run of sentences of a certain length with punctuation. We strip off the whitespace and punctuation.

We try to find the last chapter the same way, by looking for a run of sentences of a certain length with punctuation, and again strip off the whitespace and punctuation.

We take the set of proper nouns from the first 20 pages and use a bit of fuzzy logic.

The idea is that we build a system that does all of these and, when a book is added, captures the metrics for all of them.

When someone searches, we try each of these until we get a success.

As for playing part of the book, that would be interesting, but it might not be fast enough.
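
The add-everything, search-with-fall-through idea could be sketched as below. This is a simplified illustration, not a worked-out design: fixed-size chunks from the start and end of the stripped text stand in for "find the first/last chapter", the in-memory dict stands in for the shared database, and every name here (`first_chunk_key`, `last_chunk_key`, `EXTRACTORS`, `add_book`, `find_book`) is hypothetical. The proper-noun metric is left out because it would need fuzzy comparison rather than an exact key.

```python
import hashlib


def _strip(text):
    # Keep letters and digits only, so layout differences do not matter.
    return "".join(ch for ch in text if ch.isalnum())


def _digest(label, s):
    # Prefix the digest with the metric name so keys from different metrics never collide.
    return label + ":" + hashlib.sha256(s.encode("utf-8")).hexdigest()


def first_chunk_key(text, size=2000):
    # Stand-in for "first chapter": the first `size` stripped characters.
    return _digest("first", _strip(text)[:size])


def last_chunk_key(text, size=2000):
    # Stand-in for "last chapter": the last `size` stripped characters.
    return _digest("last", _strip(text)[-size:])


EXTRACTORS = [first_chunk_key, last_chunk_key]


def add_book(db, text, metadata):
    """On add, record the metadata under every key we can compute."""
    for extract in EXTRACTORS:
        db[extract(text)] = metadata


def find_book(db, text):
    """On search, try each key in turn and return the first hit, else None."""
    for extract in EXTRACTORS:
        hit = db.get(extract(text))
        if hit is not None:
            return hit
    return None
```

With this arrangement, a copy that has extra front matter misses on the first key but can still be found through the last one.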
Old 03-15-2012, 11:23 AM   #30
Fanas
Member
Posts: 21
Karma: 12
Join Date: Aug 2009
Device: none
I wonder if this would work. I have no idea how feasible it would be, so don't be mad:

1. Choose three random words which each appear at least a few times, excluding "a", "the", "in" and such.
2. Remove all punctuation and spacing.
3. Count the letters between successive occurrences of each of those words.
4. Generate a hash which would contain information about those three words, the number of times each word appears, and the number of letters between each appearance.
5. When identifying a book, to prevent headers and formatting from interfering, just set a certain threshold at which the identification would be considered positive. If the occurrences of a word are almost always nearly the same distance apart as in the stored hash, then count it as a positive match.

Example (random quote from one book):
Quote:
Lawrence was holding the next to last sheet up to Prime Intellect's TV eye when the phone rang. "They didn't believe me. I'm shitcanned," Stebbins said.
"Didn't believe you about what?"
"The papers man, the goddamn Correlation Effect papers. I'm gonna kill you for this, I really am."
"The papers are right here. I just got through showing them to Prime Intellect. You need them back?"
"It don't matter now, I don't work here any more." There was a pause. "I bet they're gonna put you in jail for this."
Prime Intellect's face disappeared from the TV, and words began to scroll across the screen:

* JOHN TAYLOR IS IN THE ROOM WITH HIM. HE IS DIRECTING STEBBINS.

Lawrence read this as he talked. "Jail for what? I just borrowed the papers to see if Prime Intellect could expand on them."
Another pause. "What? It didn't come up with anything, did it?"
"Well, it's..." (Why do you care if you've just been fired? Lawrence wondered.)

* STEBBINS IS LYING. HE WENT TO TAYLOR AS SOON YOU LEFT AND TOLD HIM THAT YOU BROUGHT THEM TO ME.

"...too early..."

* TELL HIM YES.

"Actually, I think it's just noticed something. Hang on."

* TELL HIM IT POINTS TO A NEW FORM OF COSMOLOGY WHICH THEY DID NOT CONSIDER. INFINITE RANGE IS PROBABLY POSSIBLE WITH EXISTING HARDWARE. TELEPORTATION OF MATTER IS PROBABLY POSSIBLE.

Prime Intellect paused a moment, and the words PROBABLY were replaced with DEFINITELY.
Lawrence blinked, then typed into the little-used keyboard of his console,

> Is this true?
* YES.

"It says it will give you the stars," Lawrence said flatly.
"What? You been eating mushrooms, Lawrence? Lawrence?"

> What will it take to implement this?
* LET ME TRY SOMETHING.

"It says it will give you the stars. It says your faster than light chips can be made to work at infinite range. It says you can teleport matter."
Now there was a long, long pause. "That's bullshit," Stebbins finally said. "We tried everything."
Lawrence heard a small uproar through the phone, an uproar that would have been very loud on Stebbins' end. Men were arguing. A loud voice (Military Mitchell's, Lawrence thought) bellowed, "WHAT THE FUCK DO YOU MEAN?" Then there was the faint pop of a door slamming in the background.

* I'VE GOT IT. HANG ON.

None of them knew it at the time, but that was really the moment the world changed.
Same quote with everything that's not needed stripped off:
Quote:
LawrencewasholdingthenexttolastsheetuptoPrimeIntellectsTVeyewhenthephonerangTheydidntbelievemeImshitcannedStebbinssaidDidntbelieveyouaboutwhatThepapersmanthegoddamnCorrelationEffectpapersImgonnakillyouforthisIreallyamThepapersarerighthereIjustgotthroughshowingthemtoPrimeIntellectYouneedthembackItdontmatternowIdontworkhereanymoreTherewasapauseIbettheyregonnaputyouinjailforthisPrimeIntellectsfacedisappearedfromtheTVandwordsbegantoscrollacrossthescreenJOHNTAYLORISINTHEROOMWITHHIMHEISDIRECTINGSTEBBINSLawrencereadthisashetalkedJailforwhatIjustborrowedthepaperstoseeifPrimeIntellectcouldexpandonthemAnotherpauseWhatItdidntcomeupwithanythingdiditWellitsWhydoyoucareifyouvejustbeenfiredLawrencewonderedSTEBBINSISLYINGHEWENTTOTAYLORASSOONYOULEFTANDTOLDHIMTHATYOUBROUGHTTHEMTOMEtooearlyTELLHIMYESActuallyIthinkitsjustnoticedsomethingHangonTELLHIMITPOINTSTOANEWFORMOFCOSMOLOGYWHICHTHEYDIDNOTCONSIDERINFINITERANGEISPROBABLYPOSSIBLEWITHEXISTINGHARDWARETELEPORTATIONOFMATTERISPROBABLYPOSSIBLEPrimeIntellectpausedamomentandthewordsPROBABLYwerereplacedwithDEFINITELYLawrenceblinkedthentypedintothelittleusedkeyboardofhisconsoleIsthistrueYESItsaysitwillgiveyouthestarsLawrencesaidflatlyWhatYoubeeneatingmushroomsLawrenceLawrenceWhatwillittaketoimplementthisLETMETRYSOMETHINGItsaysitwillgiveyouthestarsItsaysyourfasterthanlightchipscanbemadetoworkatinfiniterangeItsaysyoucanteleportmatterNowtherewasalonglongpauseThatsbullshitStebbinsfinallysaidWetriedeverythingLawrenceheardasmalluproarthroughthephoneanuproarthatwouldhavebeenveryloudonStebbinsendMenwerearguingAloudvoiceMilitaryMitchellsLawrencethoughtbellowedWHATTHEFUCKDOYOUMEANThentherewasthefaintpopofadoorslamminginthebackgroundIVEGOTITHANGONNoneofthemknewitatthetimebutthatwasreallythemomenttheworldchanged

Say we make the 3 words: tell (7 occurrences), said (3 occurrences), just (4 occurrences)

Tell: 1-222-2-108-3-186-4-203-5-49-6-152-7
Said: 1-1047-2-260-3
Just: 1-297-2-127-3-134-4

That is, the 1st "tell" and the 2nd are 222 letters apart, the 2nd and the 3rd are 108 letters apart, and so on.

So once the hash is generated, it's up to statistics to decide whether the book in question is the same.
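
A minimal Python sketch of this scheme, under several stated assumptions: words are matched case-insensitively as substrings of the stripped text, a "gap" is the number of letters between the end of one occurrence and the start of the next, and the names `strip_text`, `gaps`, `signature`, and `similar` as well as the tolerance and threshold values are invented for illustration. The gaps are kept as plain lists rather than fed into a cryptographic hash, because a digest would change completely on any small difference and so could not be compared against a threshold.

```python
def strip_text(text):
    """Lowercase the text and drop everything except letters and digits."""
    return "".join(ch.lower() for ch in text if ch.isalnum())


def gaps(stripped, word):
    """Letters between consecutive non-overlapping occurrences of word."""
    positions = []
    start = stripped.find(word)
    while start != -1:
        positions.append(start)
        start = stripped.find(word, start + len(word))
    return [b - (a + len(word)) for a, b in zip(positions, positions[1:])]


def signature(text, words):
    """Map each chosen word to its list of gaps in the stripped text."""
    stripped = strip_text(text)
    return {w: gaps(stripped, w) for w in words}


def similar(sig_a, sig_b, tolerance=2, threshold=0.8):
    """True if enough gaps agree to within `tolerance` letters."""
    total = matched = 0
    for word, gaps_a in sig_a.items():
        gaps_b = sig_b.get(word, [])
        total += max(len(gaps_a), len(gaps_b))
        matched += sum(1 for x, y in zip(gaps_a, gaps_b) if abs(x - y) <= tolerance)
    return total > 0 and matched / total >= threshold
```

For example, `signature("Tell me. Tell him! I will TELL you.", ["tell"])` gives `{"tell": [2, 8]}`, since "me" (2 letters) and "himiwill" (8 letters) separate the three occurrences. Note that a single inserted header shifts only one gap, while comparing gaps position by position means a missed or extra occurrence would misalign everything after it, so a more forgiving alignment would likely be needed in practice.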
All times are GMT -4.