|04-12-2014, 01:48 AM||#1|
Join Date: Apr 2014
ISBN Extraction with OCR
Hello, is there a utility to extract the ISBN from an ebook using OCR? The Extract ISBN plugin is good, but it won't work for books whose first pages are images.
If there is none, is anyone interested in such a utility? I can develop it in C# if there is interest.
Last edited by Noobish; 04-12-2014 at 02:45 AM.
|04-13-2014, 02:45 AM||#2|
Join Date: Apr 2014
ISBN Extraction with OCR Utility
After experimenting with the Extract ISBN plugin in Calibre, let me say it's an excellent add-on, but in my experience it also produces many false positives. So I developed a "little" utility that extracts the ISBN by searching the text of the first 10 pages; if that fails, it performs OCR on the first 10 pages. If a valid ISBN is found, you can rename the files or copy them to a chosen directory, with the ISBN added to the filename.
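For anyone curious how the "valid ISBN" check and the text-then-OCR fallback could work, here is a minimal Python sketch. It is not the code of the attached utility (which is written in C#). The candidate regex, the function names, and the two page-reader callables (`read_text_pages`, `read_ocr_pages`) are all assumptions made for illustration. Only the ISBN-10 and ISBN-13 checksum rules are standard.

```python
import re

# Candidate numbers: an ISBN-13 (978/979 prefix) or an ISBN-10, with optional
# single spaces or hyphens between digits. This pattern is an illustrative guess.
CANDIDATE = re.compile(
    r"(?<!\d)(?:97[89](?:[ -]?\d){10}|\d(?:[ -]?\d){8}[ -]?[\dXx])(?!\d)"
)


def is_valid_isbn10(digits):
    """ISBN-10: weights 10..1, final character may be X (=10), sum divisible by 11."""
    if not re.fullmatch(r"\d{9}[\dX]", digits):
        return False
    total = sum((10 - i) * (10 if ch == "X" else int(ch))
                for i, ch in enumerate(digits))
    return total % 11 == 0


def is_valid_isbn13(digits):
    """ISBN-13: alternating weights 1 and 3, sum divisible by 10."""
    if not re.fullmatch(r"97[89]\d{10}", digits):
        return False
    total = sum(int(ch) * (1 if i % 2 == 0 else 3)
                for i, ch in enumerate(digits))
    return total % 10 == 0


def find_isbns(text):
    """Return checksum-valid ISBNs found in text, in order, without duplicates."""
    found = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group()).upper()
        if (is_valid_isbn13(digits) or is_valid_isbn10(digits)) and digits not in found:
            found.append(digits)
    return found


def isbn_with_fallback(read_text_pages, read_ocr_pages, max_pages=10):
    """Try the embedded text first; run the slower OCR pass only if that fails.

    Both arguments are hypothetical callables that take a page limit and
    return an iterable of page strings.
    """
    for read_pages in (read_text_pages, read_ocr_pages):
        for page_text in read_pages(max_pages):
            hits = find_isbns(page_text)
            if hits:
                return hits[0]
    return None
```

Requiring the checksum to pass is what filters out most look-alike numbers (phone numbers, catalog numbers), which is one plausible way to cut down on false positives.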
1- The program is attached. You need http://www.microsoft.com/en-us/downl....aspx?id=40779
2- Download Tesseract: http://tesseract-ocr.googlecode.com/...2-portable.zip
Extract it, then copy the tessdata folder to the "\Release" directory.
3- You might also need the Visual C++ runtime if the program fails to start.
Works only with PDF files that are not encrypted or password-protected.
It should be safe if you have multiple files with the same name in different directories.
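To illustrate how copying can stay safe when two directories contain files with the same name, here is a small Python sketch. It is only an assumption about one possible approach, not the utility's actual behavior: the `Title [ISBN].pdf` naming layout, the numeric suffix, and the function name `copy_with_isbn` are all hypothetical.

```python
import shutil
from pathlib import Path


def copy_with_isbn(src, out_dir, isbn):
    """Copy src into out_dir as 'name [isbn].ext' without overwriting anything.

    The naming layout is a made-up example. If the target name is already
    taken, a numeric suffix is appended until a free name is found.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    stem = f"{src.stem} [{isbn}]"
    target = out_dir / f"{stem}{src.suffix}"
    counter = 1
    while target.exists():
        target = out_dir / f"{stem} ({counter}){src.suffix}"
        counter += 1

    shutil.copy2(src, target)  # copy2 also preserves timestamps
    return target
```

Copying instead of renaming in place also leaves the originals untouched, which fits the advice to keep a backup.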
1- Start the program, choose your output directory in the Settings tab, save the settings, restart the program, and check that the output directory is correct.
2- Click the Scan Folders button and choose the BASE DIRECTORY OF YOUR BOOKS.
3- Once the list is populated, click Start Search.
4- Once the search has finished, click the "Save and CleanUP" button.
To import the recognized books into calibre:
Calibre Preferences-> Adding Books
and paste it at Regular Expression.
Uncheck "Read metadata from file contents rather than file name".
Apply, save, [Restart Calibre].
Import books normally.
Download metadata in bulk for imported books.
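The regular expression the steps above refer to is not shown in this thread, so the following is NOT the author's pattern. It is only a hypothetical Python sketch of how a filename pattern with named groups can pull a title and an ISBN out of a file name, assuming a made-up `Title [ISBN].pdf` layout. To my understanding, Calibre's file name pattern uses Python-style named groups such as `title` and `isbn`, but check this against the Calibre manual and against the names the utility really produces.

```python
import re
from pathlib import Path

# Hypothetical layout: "Some Title [9780306406157].pdf".
# Named groups follow Python's (?P<name>...) syntax.
FILENAME_PATTERN = re.compile(
    r"(?P<title>.+?)\s*\[(?P<isbn>[\dXx]{10}|\d{13})\]$"
)


def parse_filename(name):
    """Return {'title': ..., 'isbn': ...} for a matching file name, else None."""
    stem = Path(name).stem  # drop the extension before matching
    match = FILENAME_PATTERN.search(stem)
    return match.groupdict() if match else None
```

Testing a pattern like this on a few real file names before importing saves a lot of cleanup later.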
To Be Added (have not decided yet):
Support for image enhancement before OCR.
Scanning pages as images: produces much more accurate results at the cost of speed.
A workaround for recognizing the ISBN in text-based books that store information in a different order than displayed, using text objects, etc.
Comments, replies, and bug reports are appreciated.
NOTE: MAKE A COPY FIRST OF YOUR FILES, BEFORE USING IT.
Last edited by Noobish; 04-13-2014 at 04:03 AM.