07-12-2009, 06:38 PM   #8
myle00
Connoisseur
myle00 has a complete set of Star Wars action figures.
Posts: 71
Karma: 422
Join Date: Jun 2009
Device: Palm Treo
I have some Java code that I wrote that does exactly that. I use pdftk to extract the first 20 pages of every PDF file. Then I use a commercial program to OCR these pages and save the result as text. Once it's in text format, I run the program and it collects all the ISBNs found in the document. Often there are multiple ISBNs, because the book advertises other books or lists references. The program decides which ISBN is the correct one based on its title from Amazon, whether there are duplicates, and other things. If it cannot decide, it lists them all and I can select the correct one. Then it renames the original file to "xxxxxxxxxxxxx.xxx" and I import it into calibre. It was able to extract 5000 out of 6000 ISBNs, and all my chm files. Of course, some of the missing ones didn't have ISBNs.
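To give an idea of the ISBN collection step, here is a minimal sketch. It is not the actual program, just an illustration written for this post: the class name IsbnFinder and its methods are hypothetical, and it only covers finding checksum-valid ISBN-10/ISBN-13 candidates in the OCR text. The Amazon title lookup and the duplicate handling are not shown.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not the original program): finds checksum-valid ISBNs in text. */
final class IsbnFinder {

    // Loose candidate: a digit, 8-15 digits/spaces/hyphens, then a digit or X.
    // The checksum tests below do the real filtering.
    private static final Pattern CANDIDATE =
            Pattern.compile("(?<![0-9])[0-9][0-9 -]{8,15}[0-9Xx](?![0-9Xx])");

    /** ISBN-10: weights 10..1, sum divisible by 11; X (= 10) only as the last character. */
    static boolean isValidIsbn10(String s) {
        if (!s.matches("[0-9]{9}[0-9X]")) return false;
        int sum = 0;
        for (int i = 0; i < 10; i++) {
            char c = s.charAt(i);
            int value = (c == 'X') ? 10 : c - '0';
            sum += value * (10 - i);
        }
        return sum % 11 == 0;
    }

    /** ISBN-13: prefix 978 or 979, alternating weights 1 and 3, sum divisible by 10. */
    static boolean isValidIsbn13(String s) {
        if (!s.matches("97[89][0-9]{10}")) return false;
        int sum = 0;
        for (int i = 0; i < 13; i++) {
            int digit = s.charAt(i) - '0';
            sum += (i % 2 == 0) ? digit : 3 * digit;
        }
        return sum % 10 == 0;
    }

    /** Returns the distinct valid ISBNs in order of first appearance, separators removed. */
    static Set<String> findIsbns(String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = CANDIDATE.matcher(text);
        int pos = 0;
        while (pos < text.length() && m.find(pos)) {
            String digits = m.group().replaceAll("[ -]", "").toUpperCase();
            if (isValidIsbn10(digits) || isValidIsbn13(digits)) {
                found.add(digits);
                pos = m.end();
            } else {
                // Retry one character later, in case a stray number
                // in front of the ISBN was swallowed into the candidate.
                pos = m.start() + 1;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String sample = "ISBN 0-306-40615-2 (hardcover)\nISBN-13: 978-0-306-40615-7";
        for (String isbn : findIsbns(sample)) {
            System.out.println(isbn);
        }
    }
}
```

The checksum test throws out most of the random digit runs that OCR produces, but it cannot tell the book's own ISBN from the ones in the advertisements, so you still need a second step to pick the right one.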

If you want it, I can post the Java code. It doesn't have a GUI, though, and I usually run it in Eclipse. The only problem is the OCR: I couldn't find a good open source command-line OCR program.

Last edited by myle00; 07-12-2009 at 06:41 PM.