MobileRead Forums > E-Book General > General Discussions

Old 06-28-2010, 12:31 PM   #1
dpayment
Connoisseur
Posts: 90
Karma: 618
Join Date: Oct 2007
Location: Ottawa
Device: PocketBook Pro 902, EB-1150, PRS505, PRS700, Jetbook, Hanlin V3, Kobo
Finding and Deleting Duplicate Files of different formats

I posted this question several months ago, and got three or four answers, but they were all things I had already tried. Anyway, here goes again, but I'll try to clarify it a little more:

I've now got a couple of thousand ebooks on my computer, many of which are the same content, but in multiple different formats (lrf, pdf, lit, epub etc.). Also, many of these files either don't have the correct name, or have variations of the name because the sources of the files edited the titles before posting them. I've tried every search combination I can think of, in both basic and advanced search, and I've tried several duplicate finder programs. I've also looked in any FAQ I think might have the answer, but still can't find it. Does anyone know of a way or a piece of software to locate files with similar content but in different file formats, PLEASE???

Surely I'm not the first person to realize this is a problem. There must be some way to search documents and have the search look for text that matches 80% or 90% or 95% of the original document.
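
For what it's worth, here is a rough sketch of what an "80% or 90% or 95%" text match could look like. It assumes the plain text of each book has already been extracted by some converter (that step is not shown), and it compares overlapping runs of words ("shingles") rather than raw bytes. The function names, the 5-word shingle size and the 0.8 default threshold are illustrative choices of mine, not anything taken from an existing tool.

```python
import re


def shingles(text, k=5):
    """Return the set of k-word shingles in text, ignoring case and punctuation."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Very short texts become a single shingle (or nothing at all).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(text_a, text_b, k=5):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    a, b = shingles(text_a, k), shingles(text_b, k)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_probable_duplicate(text_a, text_b, threshold=0.8, k=5):
    """True when the two texts share at least `threshold` of their shingles."""
    return similarity(text_a, text_b, k) >= threshold
```

Because the comparison only looks at words, differences in markup, case and punctuation between formats drop out, as long as the text extraction itself is reasonably clean.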

I know I can do this handraulically, but that's going to be a real pain! Any help would be greatly appreciated.

Thanks,
Dan
Old 06-28-2010, 12:40 PM   #2
susan_cassidy
Wizard
Posts: 2,251
Karma: 3720310
Join Date: Jan 2009
Location: USA
Device: Kindle, iPad (not used much for reading)
If the file names were the same except for the file extension, it would be pretty easy to write a script to do it. If the file names are NOT the same apart from the extension, it is going to be pretty much impossible: because of the differences between formats, the file contents are going to be wildly different, so you can't compare them.
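
A minimal sketch of the easy case, where the names match except for the extension. It assumes all the books sit under one folder; `group_by_stem` is just a name made up for this example, and it compares names case-insensitively, which is my own assumption about what counts as "the same".

```python
from collections import defaultdict
from pathlib import Path


def group_by_stem(root):
    """Group files under root by file name without extension (case-insensitive).

    Returns only the groups that contain more than one file, i.e. the
    candidates for "same book, different format".
    """
    groups = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            groups[path.stem.lower()].append(path)
    return {stem: sorted(paths) for stem, paths in groups.items() if len(paths) > 1}
```

Calling `group_by_stem` on the ebook folder gives a dictionary that can be reviewed by hand before anything is deleted; the sketch itself never removes a file.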

I'm not sure whether by "Also, many of these files either don't have the correct name, or have variations of the name because the sources of the files edited the titles before posting them" you mean the book title in the metadata, the file name, or something else. A program that searches the metadata would have to understand all the formats you are interested in.

I'm not sure if Calibre understands all those formats, but if it does, you can import the books into Calibre, and it will put all the books it thinks are the same, except for format, into the same subdirectory in its library. Don't know if that would be of use to you, or not.
Old 06-28-2010, 01:19 PM   #3
Worldwalker
Curmudgeon
Posts: 3,085
Karma: 722357
Join Date: Feb 2010
Device: PRS-505
Calibre understands just about every ebook format out there. You could import all your ebooks into calibre, delete the original files (calibre sets up its own private files), convert those files that aren't in your preferred format to that format, and then have calibre export books in that format to wherever you want them, with understandable names. Or just leave them all in calibre and deal with them from there. In calibre, you can remove formats on a book-by-book basis, remove all instances of a specific format, or remove all formats except the one you prefer.

Even more important, in calibre you are dealing with your books as books, not as files. It doesn't matter what names their files have; what matters is the metadata of the books themselves. Think of it like fonts: on a Windows system, the font you see as "Somefont Italic" might actually reside in a file named SFI__.TTF down in the depths of a folder you never visit. But you never see that name -- you just install it, use it, and uninstall it, as Somefont Italic. It's an abstraction -- a font -- not a file. That's what calibre does for books. It doesn't matter if the file is 234956234.prc, you see it as "A Tale of Two Cities" in calibre, and deal with it accordingly. You might have that book in three different formats, but you don't have to look at 234956234.prc, 234956234.lrf, and 234956234.epub; you just have "A Tale of Two Cities" which, when selected, lists its available formats as prc, lrf, and epub.
Old 06-28-2010, 01:42 PM   #4
dpayment
Connoisseur
Thanks, Susan & Worldwalker. Both of these answers are excellent. I hadn't thought to use Calibre to do the comparison; it just never occurred to me, but it makes perfect sense. Even if the file titles are different, the metadata should help to identify "most" of the duplicates, if not all.
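
To illustrate why metadata helps even when the titles have been edited, here is a small sketch that compares titles after normalizing them. It assumes the titles have already been read out of the files by some other tool (not shown). The helper names, the list of dropped articles and the 0.9 threshold are hypothetical choices for this example, and this is not a description of how calibre itself matches books.

```python
import re
from difflib import SequenceMatcher

# Leading or trailing articles that often get moved around in edited titles.
ARTICLES = {"a", "an", "the"}


def normalize_title(title):
    """Lowercase, replace punctuation with spaces, and drop articles."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
    return " ".join(word for word in cleaned.split() if word not in ARTICLES)


def titles_match(title_a, title_b, threshold=0.9):
    """True when the normalized titles are at least `threshold` similar."""
    a, b = normalize_title(title_a), normalize_title(title_b)
    if not a or not b:
        return False
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

With this, "A Tale of Two Cities" and "Tale of Two Cities, The" normalize to the same string, while unrelated titles fall well below the threshold.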

Thanks again,
Dan
Old 06-28-2010, 02:26 PM   #5
Mike L
Wizard
Posts: 1,479
Karma: 3846231
Join Date: Apr 2009
Location: Edinburgh, Scotland
Device: Kindle 3, Samsung Galaxy
Dpayment,

I can't add anything to the good advice you've already received. But I'd like to ask you a question. You say you've got a couple of thousand ebooks. Do you actually plan to read all of these? If not, why did you acquire them?

I'm only asking out of curiosity, you understand.
Old 06-28-2010, 03:27 PM   #6
Worldwalker
Curmudgeon
I have thousands of pbooks. I've read all of them. Ebooks are less likely to cause structural damage to my floors.
Old 06-29-2010, 05:48 AM   #7
dpayment
Connoisseur
Quote:
Originally Posted by Mike L View Post
Dpayment,

I can't add anything to the good advice you've already received. But I'd like to ask you a question. You say you've got a couple of thousand ebooks. Do you actually plan to read all of these? If not, why did you acquire them?

I'm only asking out of curiosity, you understand.
Actually, yes, I do plan on reading them all. I recently got rid of a collection of over 10,000 physical books that I'd collected, read & re-read, in some cases numerous times. I figure I'm good for another 25-30 years, and if you work out the numbers at an average of at least five books a week, I've got lots of reading time ahead of me.
Old 06-29-2010, 07:02 AM   #8
Mike L
Wizard
Dpayment, I can only say I'm impressed. I reckon it would take me 75 years to read 2,000 books at my current rate. I admit I'm a slow reader, but even so ....
Old 06-29-2010, 04:12 PM   #9
oggelbe2007
Limited Warranty
Posts: 89
Karma: 576
Join Date: Jul 2007
Location: North Georgia, USA
Device: A sweet PRS-500, DXG
I wouldn't worry about the duplicate files. Just compress your hard drive image and write it out to a Blu-ray disc and you're good for at least a decade. Just keep about 5 years' worth of reading material on your local hard drive(s) for your current reading requirements. Who knows what kind of memory density will be available in ten years? You may have several hundred terabytes local by 2020.

Last edited by oggelbe2007; 06-29-2010 at 07:12 PM.
Old 07-06-2010, 06:25 AM   #10
Baresi
Junior Member
Posts: 2
Karma: 10
Join Date: Jul 2010
Device: none
Quote:
Originally Posted by dpayment View Post
I posted this question several months ago, and got three or four answers, but they were all things I had already tried. Anyway, here goes again, but I'll try to clarify it a little more:

I've now got a couple of thousand ebooks on my computer, many of which are the same content, but in multiple different formats (lrf, pdf, lit, epub etc.). Also, many of these files either don't have the correct name, or have variations of the name because the sources of the files edited the titles before posting them. I've tried every search combination I can think of, in both basic and advanced search, and I've tried several duplicate file finder programs. I've also looked in any FAQ I think might have the answer, but still can't find it. Does anyone know of a way or a piece of software to locate files with similar content but in different file formats, PLEASE???

Surely I'm not the first person to realize this is a problem. There must be some way to search documents and have the search look for text that matches 80% or 90% or 95% of the original document.

I know I can do this handraulically, but that's going to be a real pain! Any help would be greatly appreciated.

Thanks,
Dan
I have thousands of pbooks. I've read all of them. Ebooks are less likely to cause structural damage to my floors.
Old 11-02-2010, 03:10 PM   #11
Hellmark
Wizard
Posts: 2,549
Karma: 3799999
Join Date: Jun 2009
Location: O'Fallon, Missouri, USA
Device: Nokia N800, PRS-505, Nook STR Glowlight, Kindle 3
Those programs work by file size and other info. If the books are in different formats, they won't be picked up as duplicates.
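
As a rough illustration of the point, here is the kind of byte-level check a generic duplicate finder might do. The use of file size plus a SHA-256 digest is my assumption about "other info", not a description of any particular program, and the function names are made up for the example.

```python
import hashlib
from pathlib import Path


def fingerprint(path):
    """Return (size in bytes, SHA-256 hex digest) for a file."""
    data = Path(path).read_bytes()
    return len(data), hashlib.sha256(data).hexdigest()


def is_exact_duplicate(path_a, path_b):
    """True only when both files contain exactly the same bytes."""
    return fingerprint(path_a) == fingerprint(path_b)
```

Two copies of the same epub would match, but the same book saved as epub and as pdf has different bytes, so a check like this reports them as different files.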
Old 11-02-2010, 03:31 PM   #12
Worldwalker
Curmudgeon
I'm suspicious of two people, with a single post each, no devices, no signs of other participation, suddenly showing up here to promote software.
Old 11-02-2010, 03:46 PM   #13
Alexander Turcic
Fully Converged
Posts: 18,163
Karma: 14021202
Join Date: Oct 2002
Location: Switzerland
Device: Too many to count here.
Quote:
Originally Posted by Worldwalker View Post
I'm suspicious of two people, with a single post each, no devices, no signs of other participation, suddenly showing up here to promote software.
Yes, commercial spam. Zapped.
Old 11-02-2010, 07:23 PM   #14
Hellmark
Wizard
Quote:
Originally Posted by oggelbe2007 View Post
I wouldn't worry about the duplicate files. Just compress your hard drive image and write it out to a Blu-ray disc and you're good for at least a decade. Just keep about 5 years' worth of reading material on your local hard drive(s) for your current reading requirements. Who knows what kind of memory density will be available in ten years? You may have several hundred terabytes local by 2020.
It isn't about the space they take up. Most ebooks are so freaking minuscule that you can easily fit thousands on a CD; no need for Blu-ray. Not wanting duplicates is a matter of wanting to easily be able to go through your ebook library. Duplicates simply complicate things by giving you more to wade through. Which is quicker to go through: 1,000 books, or 3,400 books of which only 1,000 are different?
Old 11-02-2010, 07:29 PM   #15
Worldwalker
Curmudgeon
Quote:
Originally Posted by Hellmark View Post
Not wanting duplicates is a matter of wanting to easily be able to go through your ebook library.
That's the kind of thing calibre is good for. It's easy to spot similar books, check formats, merge if needed, etc.
All times are GMT -4.

