MobileRead Forums - View Single Post - [Old Thread] Auto Extract ISBN-Feature request

myle00 · 07-12-2009, 06:38 PM

I have some Java code that I wrote that does exactly that. I use pdftk to extract the first 20 pages of every PDF file. Then I use a commercial program to OCR these pages and save the result as text. Once it's in text format I run the program, and it collects all the ISBNs found in the document. Many times there are multiple ISBNs, because the book advertises other titles or lists references. However, the program decides which ISBN is the correct one based on its title from Amazon, on whether there are duplicates, and on other things. If it cannot decide, then it lists them all and I can select the correct one. Then it renames the original file to "xxxxxxxxxxxxx.xxx" and I import it into calibre. It was able to extract 5000 out of 6000 ISBNs, as well as all my chm files. Of course, some of the missing ones didn't have ISBNs.

If you want it, I can post the Java code. But it doesn't have a GUI, and I usually run it in Eclipse. The only problem is the OCR: I couldn't find a good open-source command-line OCR program.
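
For readers who want a starting point, here is a minimal sketch of the ISBN collection step described above. It is not the poster's code. The class and method names (`IsbnFinder`, `countValidIsbns`, `pickMostFrequent`) are hypothetical, and the "most frequent valid ISBN wins" rule is only an assumed stand-in for the Amazon title check and other heuristics mentioned in the post. The check-digit rules are the standard ISBN-10 and ISBN-13 ones.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IsbnFinder {
    // 10 or 13 digits, optionally separated by single hyphens or spaces.
    // The last character of an ISBN-10 may be X.
    private static final Pattern CANDIDATE = Pattern.compile(
        "(?<![0-9])(?:97[89][- ]?)?(?:[0-9][- ]?){9}[0-9Xx](?![0-9Xx])");

    // ISBN-10: weights 10..1, sum must be divisible by 11; X means 10 (last position only).
    static boolean isValidIsbn10(String s) {
        if (s.length() != 10) return false;
        int sum = 0;
        for (int i = 0; i < 10; i++) {
            char c = s.charAt(i);
            int value;
            if (c >= '0' && c <= '9') value = c - '0';
            else if (c == 'X' && i == 9) value = 10;
            else return false;
            sum += (10 - i) * value;
        }
        return sum % 11 == 0;
    }

    // ISBN-13: weights alternate 1, 3; sum must be divisible by 10.
    static boolean isValidIsbn13(String s) {
        if (s.length() != 13) return false;
        int sum = 0;
        for (int i = 0; i < 13; i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') return false;
            sum += (c - '0') * (i % 2 == 0 ? 1 : 3);
        }
        return sum % 10 == 0;
    }

    // Returns each checksum-valid ISBN (separators removed) with its number of
    // occurrences, in order of first appearance in the text.
    static Map<String, Integer> countValidIsbns(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            String isbn = m.group().replaceAll("[- ]", "").toUpperCase();
            if (isValidIsbn10(isbn) || isValidIsbn13(isbn)) {
                counts.merge(isbn, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Assumed heuristic: pick the ISBN that occurs most often. Returns empty when
    // there is no candidate or a tie, so the caller can list them all instead.
    static Optional<String> pickMostFrequent(Map<String, Integer> counts) {
        String best = null;
        int bestCount = 0;
        boolean tie = false;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
                tie = false;
            } else if (e.getValue() == bestCount) {
                tie = true;
            }
        }
        return (best == null || tie) ? Optional.empty() : Optional.of(best);
    }

    public static void main(String[] args) {
        String ocrText = "ISBN 0-306-40615-2, see also ISBN 978-0-306-40615-7. "
                       + "ISBN-10: 0306406152.";
        Map<String, Integer> counts = countValidIsbns(ocrText);
        System.out.println("Candidates: " + counts);
        System.out.println("Chosen: " + pickMostFrequent(counts).orElse("(undecided)"));
    }
}
```

Validating the check digit first discards most OCR noise and stray digit runs before any choice between candidates is made. When `pickMostFrequent` returns empty, that corresponds to the case in the post where all candidates are listed for manual selection.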

Last edited by myle00; 07-12-2009 at 06:41 PM.