Old 04-13-2014, 02:45 AM   #2
Noobish
ISBN Extraction with OCR Utility

After experimenting with the Extract ISBN plugin for calibre, let me say it's an excellent add-on, but in my experience it also produces many false positives. So I developed a "little" utility that extracts the ISBN by searching the text of the first 10 pages; if that fails, it performs OCR on the first 10 pages. If a valid ISBN is found, you can rename the files or copy them to a chosen directory, with the ISBN added to the filename.
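
For anyone curious what "a valid ISBN is found" can mean in practice, here is a minimal Python sketch of checksum-based validation. It is not the utility's actual code: the function names and the candidate pattern are my own assumptions, and the text passed in is assumed to come from whatever PDF text extraction or OCR step ran before.

```python
import re

# Hypothetical candidate pattern: 10 or 13 digits, optionally separated
# by hyphens or spaces, with an optional 978/979 prefix.
CANDIDATE = re.compile(
    r"(?<![0-9])(?:97[89][- ]?)?(?:[0-9][- ]?){9}[0-9Xx](?![0-9])"
)

def is_valid_isbn10(s):
    """ISBN-10: weights 10..1, total must be divisible by 11 (X counts as 10)."""
    if not re.fullmatch(r"[0-9]{9}[0-9X]", s):
        return False
    total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(s))
    return total % 11 == 0

def is_valid_isbn13(s):
    """ISBN-13: alternating weights 1 and 3, total must be divisible by 10."""
    if not re.fullmatch(r"[0-9]{13}", s):
        return False
    total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(s))
    return total % 10 == 0

def find_isbn(text):
    """Return the first candidate in text that passes a checksum, else None."""
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"[- ]", "", match.group()).upper()
        if is_valid_isbn13(digits) or is_valid_isbn10(digits):
            return digits
    return None
```

Checking the checksum is what filters out most false positives, since a random 10-digit number (a phone number, for example) only passes the ISBN-10 test about one time in eleven.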



Requirements:
1- The program is attached. You also need http://www.microsoft.com/en-us/downl....aspx?id=40779
2- Download Tesseract: http://tesseract-ocr.googlecode.com/...2-portable.zip
Extract it and copy the tessdata folder to the "\Release" directory.
3- You might also need the Visual C++ runtime if the program fails to start.

Features (currently):
Works only with PDF files that are not encrypted or password-protected.
Should be safe if you have multiple files with the same name in different directories.
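
To illustrate the same-name case, here is one common way to keep two files called book.pdf from overwriting each other in a single output directory. This is only a sketch under my own assumptions: `unique_destination` and the " (1)" suffix scheme are hypothetical and not necessarily how the utility handles it.

```python
from pathlib import Path

def unique_destination(out_dir, filename):
    """Return a path in out_dir that does not exist yet.

    If filename is already taken, append " (1)", " (2)", ... before
    the extension until a free name is found.
    """
    out_dir = Path(out_dir)
    candidate = out_dir / filename
    stem, suffix = candidate.stem, candidate.suffix
    counter = 1
    while candidate.exists():
        candidate = out_dir / f"{stem} ({counter}){suffix}"
        counter += 1
    return candidate
```

The caller would then copy or rename the source file to the returned path.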

Usage:
1- Start the program, choose your output directory in the Settings tab, save the settings, restart the program, and check that the output directory is correct.
2- Click the Scan Folders button and choose the BASE DIRECTORY OF YOUR BOOKS.
3- Once the list is populated, click Start Search.
4- Once the search has finished, click the "Save and CleanUP" button.

To import the recognized books into calibre:
Go to calibre Preferences -> Adding books.
Copy: (?P<isbn>[0-9xX]+)
and paste it into the Regular expression field.
Uncheck "Read metadata from file contents rather than file name".
Apply, save, and restart calibre.
Import the books normally.
Download metadata in bulk for the imported books.
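
If you want to see what the regular expression above picks out of a filename before importing, you can try it in Python, which uses the same named-group syntax. This is a rough sketch: the filenames are made up, I am assuming the pattern is applied as a search over the filename, and `STRICTER` is a hypothetical alternative that I have not tested inside calibre.

```python
import re

# The pattern from the post: the first run of digits or x/X becomes the ISBN.
POSTED = re.compile(r"(?P<isbn>[0-9xX]+)")

# Hypothetical stricter variant: exactly 10 or 13 characters, with the
# 13-character form starting with 978 or 979.
STRICTER = re.compile(
    r"(?<![0-9])(?P<isbn>(?:97[89])?[0-9]{9}[0-9xX])(?![0-9xX])"
)

def isbn_from_filename(name, pattern=POSTED):
    """Return the text captured by the 'isbn' group, or None if no match."""
    match = pattern.search(name)
    return match.group("isbn") if match else None

print(isbn_from_filename("9780306406157.pdf"))
print(isbn_from_filename("Linux Basics 9780306406157.pdf"))
print(isbn_from_filename("Linux Basics 9780306406157.pdf", STRICTER))
```

The posted pattern works when the ISBN is the first run of digits in the name. If the title part contains a digit or the letter x before the ISBN (the "x" in "Linux" here), the simple pattern captures that instead, which is where a stricter pattern may help.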

To be added (have not decided yet):
Support for image enhancement before OCR.
Scanning pages as images: this produces much more accurate results at the cost of speed.
A workaround for recognizing the ISBN in textbooks that store their text in a different order than it appears on the page (using text objects, etc.).


Comments, replies, and bug reports are appreciated.

NOTE: MAKE A COPY FIRST OF YOUR FILES, BEFORE USING IT.
Attached Files
File Type: rar Release.rar (4.34 MB, 465 views)

Last edited by Noobish; 04-13-2014 at 04:03 AM.