ISBN Extraction with OCR

Noobish · 04-12-2014, 01:48 AM

Hello, is there a utility to extract the ISBN from an ebook using OCR? The Extract ISBN plugin is good, but it won't work for books whose first pages are images.

If there is none, is anyone interested in such a utility? I can develop it in C# if anyone is interested.

Noobish · 04-13-2014, 02:45 AM

After experimenting with the Extract ISBN plugin in calibre, let me say it's an excellent add-on, but in my experience it also produces many false positives. So I developed a "little" utility that extracts the ISBN by searching the text of the first 10 pages; if that fails, it performs OCR on the first 10 pages. If a valid ISBN is found, you can rename the files, or copy them to a chosen directory, with the ISBN added to the filename.
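For anyone curious how the "text first, OCR second" idea and the "valid ISBN" check fit together, here is a rough sketch in Python (not the actual C# code of the utility). The candidate regex, the function names, and the `ocr_page` callable are all assumptions made up for illustration; `ocr_page(i)` stands in for whatever runs OCR (Tesseract, for example) on page `i` and returns its text. Only the two checksum rules are standard ISBN-10 and ISBN-13 arithmetic.

```python
import re
from typing import Callable, Optional, Sequence

# Hypothetical candidate pattern: an optional 978/979 prefix, nine digits,
# then a check character, with optional spaces or hyphens in between.
CANDIDATE = re.compile(r"(?<![0-9])(?:97[89][ -]?)?(?:[0-9][ -]?){9}[0-9Xx](?![0-9])")


def is_valid_isbn10(s: str) -> bool:
    """ISBN-10: weights 10..1, 'X' counts as 10, sum divisible by 11."""
    if not re.fullmatch(r"[0-9]{9}[0-9X]", s):
        return False
    total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(s))
    return total % 11 == 0


def is_valid_isbn13(s: str) -> bool:
    """ISBN-13: alternating weights 1 and 3, sum divisible by 10."""
    if not re.fullmatch(r"[0-9]{13}", s):
        return False
    total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(s))
    return total % 10 == 0


def find_isbn(text: str) -> Optional[str]:
    """Return the first candidate in text that passes its checksum."""
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group()).upper()
        if len(digits) == 13 and is_valid_isbn13(digits):
            return digits
        if len(digits) == 10 and is_valid_isbn10(digits):
            return digits
    return None


def isbn_from_book(
    page_texts: Sequence[str],
    ocr_page: Callable[[int], str],  # hypothetical OCR hook, supplied by the caller
    max_pages: int = 10,
) -> Optional[str]:
    """Search the text layer of the first pages, then fall back to OCR."""
    n = min(max_pages, len(page_texts))
    for text in page_texts[:n]:
        isbn = find_isbn(text)
        if isbn:
            return isbn
    for i in range(n):  # slower path, only reached when the text search fails
        isbn = find_isbn(ocr_page(i))
        if isbn:
            return isbn
    return None
```

Rejecting candidates that fail the checksum is one way to cut down on false positives, since an arbitrary 10- or 13-digit number only rarely passes it.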

Requirements:
1- The program is attached. You also need http://www.microsoft.com/en-us/downl....aspx?id=40779
2- Download Tesseract: http://tesseract-ocr.googlecode.com/...2-portable.zip
Extract it and copy the tessdata folder to the "\Release" directory.
3- You might also need the Visual C++ runtime if the program fails to start.

Features (currently):
Works only with PDF files that are neither encrypted nor password-protected.
Should be safe if you have multiple files with the same name in different directories.
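As an illustration of the same-name case, here is a small Python sketch of one way to build an output filename that carries the ISBN and never overwrites an existing copy. This is an assumption about how such a step could work, not a description of what the utility does; `target_path`, the "ISBN first" naming, and the "(1)" suffix are all invented for the example.

```python
from pathlib import Path


def target_path(src: Path, isbn: str, out_dir: Path) -> Path:
    """Return an unused path in out_dir named '<isbn> <stem><suffix>'.

    Hypothetical naming scheme: if a file with that name already exists
    (for example the same title copied from another folder), a counter
    is appended instead of overwriting it.
    """
    candidate = out_dir / f"{isbn} {src.stem}{src.suffix}"
    counter = 1
    while candidate.exists():
        candidate = out_dir / f"{isbn} {src.stem} ({counter}){src.suffix}"
        counter += 1
    return candidate
```

The function only computes a path; the actual copy or rename would still be done by the caller, ideally on a backup of the files.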

Usage:
1- Start the program, choose your output directory on the Settings tab, save the settings, then restart the program and check that the output directory is correct.
2- Click the Scan Folders button and choose the BASE DIRECTORY OF YOUR BOOKS.
3- Once the list is populated, click Start Search.
4- Once the search has finished, click the "Save and CleanUP" button.

To import the recognized books into calibre:
Go to calibre Preferences -> Adding Books.
Copy: (?P<isbn>[0-9xX]+)
and paste it into the Regular Expression field.
Uncheck "Read metadata from file contents rather than file name".
Apply, save, [restart calibre].
Import the books normally.
Download metadata in bulk for the imported books.
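To see what the pattern picks up from a filename, here is a quick Python check using the standard `re` module. It assumes the pattern is applied as a search over the filename, which may not be exactly what calibre does internally; `isbn_from_filename` and the sample filenames are made up for the example.

```python
import re
from typing import Optional

# The pattern from the steps above, unchanged.
PATTERN = re.compile(r"(?P<isbn>[0-9xX]+)")


def isbn_from_filename(name: str) -> Optional[str]:
    """Return the first run of digits/x/X in name, or None if there is none."""
    match = PATTERN.search(name)
    return match.group("isbn") if match else None


print(isbn_from_filename("9780306406157.pdf"))               # 9780306406157
print(isbn_from_filename("9780306406157 Linux Basics.pdf"))  # 9780306406157
print(isbn_from_filename("Linux Basics 9780306406157.pdf"))  # x
```

Under that assumption, the pattern captures the first run of digits or "x" characters it meets, so a title containing a digit or the letter x ahead of the ISBN would be captured instead of the ISBN. Filenames where the ISBN comes first avoid this.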

To be added (not decided yet):
Support for image enhancement before OCR.
Scanning each page as an image:
produces much more accurate results at the cost of speed.
A workaround for recognizing the ISBN in text-based books that store their content (as text objects, etc.) in a different order from how it appears on the page.

Comments, replies, bug reports, etc. are appreciated.

NOTE: MAKE A COPY FIRST OF YOUR FILES, BEFORE USING IT.

