MobileRead Forums - View Single Post

Noobish · 04-13-2014, 02:45 AM

After experimenting with Extract ISBN in Calibre, let me say it's an excellent add-on, but in my experience it also produces many false positives. So I developed a "little" utility that extracts the ISBN by searching the text of the first 10 pages; if that fails, it performs OCR on the first 10 pages. If a valid ISBN is found, you can rename the files or copy them to a chosen directory, with the ISBN added to the filename.
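To give an idea of what "valid ISBN" means here, below is a minimal Python sketch of the checksum test for ISBN-10 and ISBN-13. It is not the utility's actual code: the pattern CANDIDATE and the helper names are made up for illustration, and the text is assumed to have already been pulled from the PDF (by text extraction or OCR).

```python
import re

# Hypothetical candidate pattern: an "ISBN" label followed by a run of
# digits, hyphens and spaces ending in a digit or X.
CANDIDATE = re.compile(
    r"ISBN(?:-1[03])?:?\s*([0-9][0-9\- ]{8,15}[0-9Xx])", re.IGNORECASE
)

def is_valid_isbn10(s):
    # ISBN-10: weights 10..1, X counts as 10 (last position only), sum % 11 == 0.
    if not re.fullmatch(r"[0-9]{9}[0-9X]", s):
        return False
    total = sum((10 - i) * (10 if ch == "X" else int(ch)) for i, ch in enumerate(s))
    return total % 11 == 0

def is_valid_isbn13(s):
    # ISBN-13: weights alternate 1, 3, 1, 3, ...; sum % 10 == 0.
    if not re.fullmatch(r"[0-9]{13}", s):
        return False
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(s))
    return total % 10 == 0

def find_isbn(text):
    # Return the first candidate that passes a checksum, or None.
    for m in CANDIDATE.finditer(text):
        digits = re.sub(r"[\- ]", "", m.group(1)).upper()
        # The pattern is greedy and may swallow trailing digits, so test
        # the 13- and 10-character prefixes separately.
        if is_valid_isbn13(digits[:13]):
            return digits[:13]
        if is_valid_isbn10(digits[:10]):
            return digits[:10]
    return None

print(find_isbn("Printed in 2014. ISBN 978-0-306-40615-7"))
print(find_isbn("ISBN: 0-306-40615-2"))
```

Checking the checksum is what filters out most false positives, since a random run of digits rarely passes it.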

Requirements:
1- Attached is the program; you also need http://www.microsoft.com/en-us/downl....aspx?id=40779
2- Download tesseract: http://tesseract-ocr.googlecode.com/...2-portable.zip
Extract it, then copy the tessdata folder to the "\Release" directory.
3- You might also need the Visual C++ runtime if the program fails to start.

Features (currently):
Works only with PDF files that are not encrypted or password-protected.
Should be safe if you have multiple files with the same name in different directories.

Usage:
1- Start the program, choose your output directory from the settings tab, save the settings, then restart the program and check that the output directory is correct.
2- Click the scan folders button and choose the BASE DIRECTORY OF YOUR BOOKS.
3- Once the list is populated, click Start Search.
4- Once the search is finished, click the "Save and CleanUP" button.

To import the recognized books into Calibre:
Calibre Preferences -> Adding Books
Copy: (?P<isbn>[0-9xX]+)
and paste it into the Regular Expression field.
Uncheck "Read metadata from file contents rather than file name".
Apply, save, [restart Calibre].
Import books normally.
Download metadata in bulk for the imported books.
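
To see what that regular expression picks out of a file name, here is a small Python sketch. It assumes the pattern is applied as a search over the file name; the file names and the stricter pattern STRICT are made up for illustration and are not part of the utility or of Calibre.

```python
import re

# The pattern from the steps above.
ISBN_IN_NAME = re.compile(r"(?P<isbn>[0-9xX]+)")

# Hypothetical stricter alternative: exactly an ISBN-10 or a 978/979 ISBN-13.
STRICT = re.compile(r"(?P<isbn>(?:97[89])?[0-9]{9}[0-9xX])")

def isbn_from_filename(name, pattern=ISBN_IN_NAME):
    # Return the first run matched by the named group, or None.
    m = pattern.search(name)
    return m.group("isbn") if m else None

# Made-up file names, for illustration only.
print(isbn_from_filename("Linear Algebra 9780306406157.pdf"))
print(isbn_from_filename("Express 9780306406157.pdf"))
print(isbn_from_filename("Express 9780306406157.pdf", STRICT))
```

Note that the simple pattern grabs the first run of digits or x/X, so a title containing a digit or the letter x before the ISBN would match in the wrong place. If that happens with your file names, a stricter pattern like the hypothetical one above may help.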

To be added (haven't decided yet):
Support for image enhancement before OCR.
Scanning pages as images:
produces much more accurate results at the cost of speed.
A workaround for recognizing the ISBN in textbooks whose files store the information in a different order than it appears on the page (using text objects, etc.).

Comments, replies, bug reports, etc. are appreciated.

NOTE: MAKE A COPY FIRST OF YOUR FILES, BEFORE USING IT.