#16 |
eBook Enthusiast
Posts: 85,544
Karma: 93383099
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
|
Do sources like Smashwords not include decent metadata in their books? Or is it just that authors don't bother to include it? I've no experience of it myself.
|
#17 | |
Connoisseur
Posts: 99
Karma: 15776
Join Date: Dec 2011
Device: PB912 Matt White
|
Quote:
I think these people would know what they've searched for. [- Hey, dude, I pirated a very thick book, yeah, must be very good! - What's that? - Dunno. Let me check on bookdb ... <calculating for a long time> ... Oh, it's the Bible, dude!] |
|
#18 |
eBook Enthusiast
Posts: 85,544
Karma: 93383099
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
|
I was thinking more of the case of the person who's downloaded thousands of random pirated books from (say) Usenet newsgroups, and is looking for a way to load them all into Calibre and get good metadata for them. I can certainly see the use of a tool for looking up data for an individual book, but I'd urge caution in setting up a system which would be a boon for pirates.
|
#19 |
Member
Posts: 17
Karma: 10
Join Date: Mar 2012
Device: nook
|
As for Smashwords, some authors do and some don't. I was just trying to come up with a good way for all of us who manually add metadata to share our work. I see the point about piracy, and I don't want to encourage pirates. But is it fair to not build a system that may have a legitimate use because pirates may use it? Perhaps we could limit the number of lookups per day. For the average user that is not a big deal, but for a pirate with thousands of books it might be a headache.
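A minimal sketch of the per-day lookup limit, assuming the service can tell users apart by some ID. The class name `DailyLookupQuota` and the default cap of 50 are made up for illustration; nothing in this thread fixes those details.

```python
from collections import defaultdict
from datetime import date


class DailyLookupQuota:
    """Hypothetical per-user daily cap on metadata lookups."""

    def __init__(self, max_per_day=50):
        self.max_per_day = max_per_day
        # (user_id, day) -> number of lookups already made on that day
        self._counts = defaultdict(int)

    def allow(self, user_id, today=None):
        """Return True and count the lookup if the user is still under the cap."""
        day = today or date.today()
        key = (user_id, day)
        if self._counts[key] >= self.max_per_day:
            return False
        self._counts[key] += 1
        return True
```

Because the counter is keyed by day, each user's quota starts fresh the next day without any explicit reset step.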
|
#20 | |
Connoisseur
Posts: 99
Karma: 15776
Join Date: Dec 2011
Device: PB912 Matt White
|
Quote:
... OK, I browsed a while. The single string idea brought me to DNA sequence analysis, then to string searching algorithms like the Rabin–Karp algorithm, and then to what seems to me the solution (which might be too complex to implement oneself, but there are perhaps some open source frameworks), because it represents the same use case: plagiarism detection. What we are talking about here, like finding the first sentence, taking nouns etc., is the document model. |
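For readers who have not met Rabin–Karp, here is a small sketch of the rolling-hash search it is built on. The `base` and `mod` values are arbitrary illustrative choices, and this is the textbook single-pattern form, not a book-matching system.

```python
def rabin_karp(text, pattern, base=256, mod=1_000_000_007):
    """Return the start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the window's leading character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    matches = []
    for i in range(n - m + 1):
        # Compare the actual strings only when the hashes agree (rules out collisions).
        if p_hash == t_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:
            # Slide the window: drop text[i], append text[i + m].
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return matches
```

The point of the rolling update is that each window's hash is derived from the previous one in constant time, so long texts can be scanned without rehashing every window from scratch.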
|
#21 |
Member
Posts: 17
Karma: 10
Join Date: Mar 2012
Device: nook
|
That is very interesting that it is the same use case. If plagiarism is detected (i.e. the book is found in our database), then download the metadata. Otherwise, if the book contains metadata, add it so we can match on it in the future.
I am thinking whitespace and all punctuation would be stripped off. Any other developers want to help? |
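A rough sketch of that flow, assuming the "database" is just a Python dict and the key is a SHA-256 digest of the stripped text. The names `normalize`, `fingerprint` and `lookup_or_add` are invented here, and only ASCII punctuation is stripped.

```python
import hashlib
import string


def normalize(text):
    """Drop all whitespace and ASCII punctuation from the text."""
    drop = set(string.whitespace) | set(string.punctuation)
    return "".join(ch for ch in text if ch not in drop)


def fingerprint(text):
    """SHA-256 hex digest of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def lookup_or_add(db, text, local_metadata=None):
    """Return stored metadata if the book is known; otherwise store ours, if any."""
    key = fingerprint(text)
    if key in db:
        return db[key]               # found: the caller can download this metadata
    if local_metadata:
        db[key] = local_metadata     # not found: contribute what the book already has
    return None
```

An exact digest only matches when the stripped text is identical, so a single changed word gives a different key. Only the digest is needed for a lookup, not the text itself.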
#22 |
Member
Posts: 17
Karma: 10
Join Date: Mar 2012
Device: nook
|
The issue with using a plagiarism detection tool is that I don't want to transfer the book to determine whether it is there. That could be construed as piracy. By computing metrics or hash keys, the work is distributed and we don't transfer the file.
|
#23 | |
Connoisseur
Posts: 99
Karma: 15776
Join Date: Dec 2011
Device: PB912 Matt White
|
Quote:
Document models of books/closed units would be generated, stored and then compared to the model of the given one for equality/similarity to a certain degree. |
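One common way to compare document models "to a certain degree" is Jaccard similarity over word shingles. The sketch below assumes that approach; the shingle size `k=3` and the `0.8` threshold are illustrative guesses, not values anyone in this thread proposed.

```python
import re


def shingles(text, k=3):
    """Set of k-word shingles (overlapping word windows) taken from the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets, a value in [0, 1]."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def same_book(text_a, text_b, threshold=0.8, k=3):
    """Treat two texts as the same book if their shingle sets overlap enough."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold
```

In a real system the shingle sets would be hashed and sampled rather than stored whole, but the comparison step keeps the same shape.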
|
#24 |
Member
Posts: 17
Karma: 10
Join Date: Mar 2012
Device: nook
|
Backi, any chance you have some development skills and want to help out?
#25 |
Connoisseur
Posts: 99
Karma: 15776
Join Date: Dec 2011
Device: PB912 Matt White
|
|
#26 |
Member
Posts: 17
Karma: 10
Join Date: Mar 2012
Device: nook
|
I understand. I would prefer to spend my time elsewhere. However, I seem to get stuck with an idea in my head from time to time and need to let it vent. Still hoping to get someone else to help with the project.
|
#27 | |
Addict
Posts: 288
Karma: 1003542
Join Date: May 2011
Device: Google Nexus 7 16GB
|
Quote:
Anyhoo, I like your idea. Many calibre users have tagged their books, so helping others to not have to do the same is a great idea. Damn, it's got my brain going round in circles as to how you could reliably fingerprint a book. May have to use 2-stage algorithms, along with some fuzzy matching. Good luck. |
|
#28 |
Member
Posts: 17
Karma: 10
Join Date: Mar 2012
Device: nook
|
I am looking for assistance from at least 1 other developer who is interested in helping with this project. Preferably one familiar with Python, so we can use calibre for reading MOBI and LIT.
|
#29 |
Member
Posts: 17
Karma: 10
Join Date: Mar 2012
Device: nook
|
I think the key is that we use 3-4 options at once; if one fails, we fall through. We capture metrics in different ways and populate a database.
We try to find the first chapter by looking for a lot of sentences of a certain length with punctuation, then strip off the whitespace and punctuation.
We try to find the last chapter the same way, then strip off the whitespace and punctuation.
We take the proper noun set from the first 20 pages and use a bit of fuzzy logic.
The idea is that we build a system that does all of these and, in the case of an add, captures the metrics for all of them. When someone searches, we try each of these until we get a success.
As for playing part of the book, that would be interesting. Interesting, but it might not be fast enough. |
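The fall-through idea could be sketched roughly as below. Everything here is a simplifying assumption: the three strategy functions are crude stand-ins (fixed-size chunks of the stripped text instead of real chapter detection, and a capitalized-word regex instead of real proper-noun extraction with fuzzy logic), and the database is a plain dict.

```python
import hashlib
import re


def _digest(s):
    return hashlib.sha256(s.encode("utf-8")).hexdigest()


def _strip(s):
    """Remove whitespace and punctuation, keeping letters and digits."""
    return re.sub(r"[\W_]+", "", s)


def first_chunk_key(text, size=2000):
    # Stand-in for "first chapter": the first `size` stripped characters.
    stripped = _strip(text)
    return "first:" + _digest(stripped[:size]) if stripped else None


def last_chunk_key(text, size=2000):
    # Stand-in for "last chapter": the last `size` stripped characters.
    stripped = _strip(text)
    return "last:" + _digest(stripped[-size:]) if stripped else None


def proper_noun_key(text, size=20000):
    # Crude stand-in for proper nouns: capitalized words in mid-sentence.
    names = set(re.findall(r"(?<=[a-z,] )[A-Z][a-z]+", text[:size]))
    return "nouns:" + _digest(" ".join(sorted(names))) if names else None


STRATEGIES = [first_chunk_key, last_chunk_key, proper_noun_key]


def add_book(db, text, metadata):
    """On add, record the key from every strategy that produces one."""
    for strategy in STRATEGIES:
        key = strategy(text)
        if key is not None:
            db[key] = metadata


def find_book(db, text):
    """On search, try each strategy in order until one key is known."""
    for strategy in STRATEGIES:
        key = strategy(text)
        if key is not None and key in db:
            return db[key]
    return None
```

With this layout, a copy that has an extra header page fails the first-chunk lookup but can still be found through the last-chunk or proper-noun key.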
#30 | ||
Member
Posts: 21
Karma: 12
Join Date: Aug 2009
Device: none
|
I wonder if this would work, I have no idea how feasible it would be, so don't be mad:
1. Choose three random words that each appear at least a few times, excluding "a", "the", "in" and such.
2. Remove all punctuation and spacing.
3. Calculate the number of letters between those words.
4. Generate a hash containing those three words, the number of times each word appears, and the number of letters between each appearance.
5. When identifying a book, to prevent headers and formatting from interfering, just set a certain threshold at which identification counts as positive. If a certain word is almost always nearly the same distance from another word, then count it as positive.
Example (random quote from one book):
Quote:
Quote:
Say we make the 3 words: tell (7 occurrences), said (3 occurrences), just (4 occurrences)
Tell: 1-222-2-108-3-186-4-203-5-49-6-152-7
Said: 1-1047-2-260-3
Just: 1-297-2-127-3-134-4
It appears that the 1st "tell" is 222 letters from the 2nd, the 2nd is 108 letters from the 3rd, and so on. So once the hash is generated, it's up to statistics to decide whether the book in question is the same. |
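Here is one possible reading of the steps above as code. It assumes "letters between" means the letters strictly between the end of one occurrence and the start of the next, counted with spacing and punctuation ignored. The `tolerance` and `min_ratio` thresholds are invented for illustration, and it compares the raw gap lists, leaving out the final hashing step.

```python
import re


def gap_signature(text, words):
    """For each chosen word, list the letter counts between successive occurrences."""
    tokens = re.findall(r"[a-z]+", text.lower())
    wanted = {w.lower() for w in words}
    last_end = {}                    # word -> letter offset just after its previous occurrence
    gaps = {w: [] for w in wanted}
    offset = 0                       # letters seen so far; spacing and punctuation are ignored
    for tok in tokens:
        if tok in wanted:
            if tok in last_end:
                gaps[tok].append(offset - last_end[tok])
            last_end[tok] = offset + len(tok)
        offset += len(tok)
    return gaps


def gaps_match(sig_a, sig_b, tolerance=5, min_ratio=0.8):
    """True if enough corresponding gaps agree to within `tolerance` letters."""
    total = agree = 0
    for word, gaps_a in sig_a.items():
        gaps_b = sig_b.get(word, [])
        for ga, gb in zip(gaps_a, gaps_b):
            total += 1
            agree += abs(ga - gb) <= tolerance
        total += abs(len(gaps_a) - len(gaps_b))  # unmatched gaps count against the match
    return total > 0 and agree / total >= min_ratio
```

Tokenizing before counting sidesteps a problem with step 2 taken literally: once spacing is removed, "tell" would also match inside longer words, which changes the occurrence counts.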
||