MobileRead Forums - View Single Post

ldolse · 04-16-2011, 08:13 PM

Quote:

Originally Posted by telemetrics

I just downloaded Calibre and was just wondering about this feature. Thanks a lot.

Feature 1: OCR
Is it possible to extract the first and last 3 or 4 pages of an eBook and run them through an open-source (or free) OCR engine?
http://code.google.com/p/tesseract-ocr/

Feature 2: Autorun "Download metadata and covers" for all files where ISBN was found.

Feature 3: Detect ISBN in File Name.
ISBN numbers are found in file names in some cases. They may not carry the prefix 'ISBN' and may be just the bare ISBN-10 or ISBN-13 number. However, we need to clean out special characters such as underscores and square brackets.

Feature 4: ReOrder Suggestion based on Name
In case multiple ISBN numbers are found, we could show the options and let the user select one (in just one click). The candidate ISBN numbers can be looked up and their titles and authors displayed next to them.
However, these should be ordered by the distance between each option's title and the file name of the ebook.
http://en.wikipedia.org/wiki/Levenshtein_distance

Adding OCR seems like an inordinate amount of work for a very small return, just to discover the ISBN number in a small handful of books. I doubt that C code can be included in a plugin; it would generally require integration with Calibre and Calibre's build process, which in turn requires the OCR project to be set up for reliable cross-platform compilation. Beyond that, as it currently stands the PDF engine can't be trusted to reliably detect and extract images from an image-based PDF. Not sure if the new PDF engine is any better.

Number 2 can be accomplished by typing ISBN:True in the search box after using the plugin, selecting everything, and pressing Ctrl-D.

Number 3 can be done while importing the book, as Kiwidude noted. There are a number of threads on this in the library management subforum; if you're not sure how to go about it, I suggest searching or asking there.
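
For anyone curious what pulling an ISBN out of a file name involves, here is a minimal Python sketch. It is not how Calibre or the plugin does it. The function names (find_isbns and the two validators) and the set of characters treated as separators are assumptions made purely for illustration. It replaces underscores and brackets with spaces, looks for runs of digits (hyphens allowed), and keeps only those that pass the ISBN-10 or ISBN-13 checksum.

```python
import re


def is_valid_isbn10(digits):
    """Check an ISBN-10: weights 10..1, total must be divisible by 11."""
    if not re.fullmatch(r"\d{9}[\dX]", digits):
        return False
    total = sum(
        (10 - i) * (10 if ch == "X" else int(ch))
        for i, ch in enumerate(digits)
    )
    return total % 11 == 0


def is_valid_isbn13(digits):
    """Check an ISBN-13: alternating weights 1 and 3, total divisible by 10."""
    if not re.fullmatch(r"\d{13}", digits):
        return False
    total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(digits))
    return total % 10 == 0


def find_isbns(file_name):
    """Return checksum-valid ISBNs found in a file name (hypothetical helper)."""
    # Assumption: these characters only act as separators in file names.
    cleaned = re.sub(r"[_\[\](){}.]", " ", file_name.upper())
    found = []
    # A candidate starts with a digit, may contain hyphens, and ends with
    # a digit or the ISBN-10 check character X.
    for match in re.finditer(r"(?<!\d)\d[\d-]{8,15}[\dX](?!\d)", cleaned):
        candidate = match.group().replace("-", "")
        valid = (
            (len(candidate) == 10 and is_valid_isbn10(candidate))
            or (len(candidate) == 13 and is_valid_isbn13(candidate))
        )
        if valid and candidate not in found:
            found.append(candidate)
    return found


if __name__ == "__main__":
    print(find_isbns("Some_Book_[978-0-306-40615-7].pdf"))
```

The checksum test is what keeps arbitrary ten-digit numbers in a file name from being treated as ISBNs, though a random number will still pass it now and then.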

While number 4 could be done, it again seems like a lot of work for little ROI, and the selections would likely include lots of false positives from trying to guess whether there is a title in the vicinity of the ISBN. Kiwidude maintains the plugin, so tackling something like that is up to him, but personally I'd rather see him invest his time in the dup detection plugin or one of the other projects.
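
To make the ordering idea in number 4 concrete, here is a rough Python sketch of ranking candidate titles by Levenshtein distance to the file name. This is only an illustration of the suggestion, not anything the plugin does. The names levenshtein and rank_candidates, and the (isbn, title) pair layout, are assumptions. It uses the standard dynamic-programming formulation and compares lower-cased strings.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def rank_candidates(file_name, candidates):
    """Sort (isbn, title) pairs so the title closest to file_name comes first.

    Hypothetical helper: the pair layout is an assumption for illustration.
    """
    target = file_name.lower()
    return sorted(candidates, key=lambda c: levenshtein(c[1].lower(), target))


if __name__ == "__main__":
    # Placeholder identifiers, not real ISBNs.
    options = [("isbn-a", "Neuromancer"), ("isbn-b", "Dune")]
    for isbn, title in rank_candidates("Dune", options):
        print(isbn, title)
```

Real file names usually carry author names, series info and other noise, so raw edit distance against the whole file name is a crude signal, which is part of why the results would need a human to pick from them.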