MobileRead Forums > E-Book Software > Calibre
07-11-2009, 12:09 PM   #1
UnraisedArc
[Old Thread] Auto Extract ISBN-Feature request

From the ISBN website:
"Every ISBN consists of ten digits and whenever it is printed it is preceded by the letters ISBN. The ten-digit number is divided into four parts of variable length, each part separated by a hyphen."

I have some e-books in PDF format that are not text but scanned images of paper books. Most include a page that shows the ISBN. Because these PDFs are scanned images, a text search for the letters ISBN comes up empty. Would it be possible to use open-source OCR software to convert the first few pages of a PDF to text, search that text for ISBNs, and use the result to auto-fill the metadata?

Thanks.
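
For reference, the ten-digit form quoted above ends in a check digit, which makes it possible to reject many OCR misreads before looking anything up. Books published since 2007 carry 13-digit ISBNs, so the quote describes the older form. A minimal JavaScript sketch of the ISBN-10 check follows; the function name isValidIsbn10 is only illustrative and is not part of calibre.

```javascript
// Illustrative helper (not calibre code): validate an ISBN-10 check digit.
// Each of the ten characters is weighted 10, 9, ..., 1; "X" in the last
// position stands for the value 10. The weighted sum must be divisible by 11.
function isValidIsbn10(raw) {
  const s = raw.replace(/[-\s]/g, "").toUpperCase();
  if (!/^\d{9}[\dX]$/.test(s)) return false;
  let sum = 0;
  for (let i = 0; i < 10; i++) {
    const value = s[i] === "X" ? 10 : Number(s[i]);
    sum += value * (10 - i);
  }
  return sum % 11 === 0;
}

console.log(isValidIsbn10("0-306-40615-2")); // true
```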

07-11-2009, 01:59 PM   #2
netseeker
I second that request. Some time ago I implemented a similar feature based on regular expressions in a Java application, and it was a rather easy task. It would be really helpful if calibre could detect the ISBN(s) automatically.

Last edited by netseeker; 07-11-2009 at 02:04 PM.

07-11-2009, 02:28 PM   #3
TMF
Today, I opened a ticket with a similar feature request at http://calibre.kovidgoyal.net/ticket/2822:

"When importing PDF e-books, the ISBN is usually not part of the PDF metadata, but can be found on the copyright page (the page with the publication information, printing history, cataloguing information etc.) Most often it is provided on a line of its own in the form of "ISBN: xxxx", "ISBN-13: xxxx", "ISBN (hardcover): xxxx" or similar.

I am proposing an enhancement that would load the text of the first 10 or so pages of a PDF and search it for ISBNs of this type by means of a user-configurable regex. If several matches are found (e.g.: "ISBN-10", "ISBN-13" and "eISBN-10"), the user might be given the opportunity to select one from a dialog. This function would be invoked during the "Download metadata" process for books that don't already have an ISBN in their metadata, or it could be invoked manually from the "Edit metadata" dialog for individual books.

This enhancement would greatly improve the automation of adding metadata to PDF files, because with an ISBN the "Fetch metadata from server" function will always provide the correct result, whereas when it relies only on author and title, it will often yield ambiguous or wrong results."
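
The scan proposed in the ticket could look roughly like the JavaScript sketch below. It assumes the page text has already been extracted into an array of strings, one per page. The names findLabelledIsbns, ISBN_PATTERN and maxPages are made up for illustration, and the default pattern is just one guess at what a user-configurable regex might contain.

```javascript
// Hypothetical default pattern: a label such as "ISBN", "ISBN-10", "ISBN-13"
// or "eISBN-10", an optional "(hardcover)"-style qualifier, an optional colon,
// then a 13- or 10-character number with optional hyphens or spaces.
const ISBN_PATTERN =
  /\b(e?ISBN(?:-1[03])?)\s*(?:\([^)]*\))?\s*:?\s*((?:97[89][- ]?)?(?:\d[- ]?){9}[\dXx])\b/g;

// Scan only the first `maxPages` pages and return every distinct ISBN with
// the label it was first seen under, so a dialog could offer the choices.
function findLabelledIsbns(pageTexts, maxPages = 10, pattern = ISBN_PATTERN) {
  const found = new Map(); // normalized ISBN -> label of first occurrence
  for (const text of pageTexts.slice(0, maxPages)) {
    for (const m of text.matchAll(pattern)) {
      const isbn = m[2].replace(/[- ]/g, "").toUpperCase();
      if (!found.has(isbn)) found.set(isbn, m[1]);
    }
  }
  return [...found].map(([isbn, label]) => ({ label, isbn }));
}
```

If the returned list has exactly one entry it can be used directly; with several entries the user would pick one, as the ticket suggests.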

07-11-2009, 02:42 PM   #4
UnraisedArc
Hey guys,

Thank you for considering this and for opening the ticket. Hopefully the developers can implement it easily.

Thanks again everyone for all your hard work!

Unraised

07-11-2009, 03:01 PM   #5
kovidgoyal
creator of calibre
The problem with this is that it will make metadata reading very slow, since the metadata readers will have to unpack the entire contents of the file. I'm not convinced that doing this is worth the cost.

07-11-2009, 03:01 PM   #6
UnraisedArc
Here is a thought that solves the problem a different way.

In case it is too difficult or complex to implement in calibre, what if a standalone piece of software processed the files (.pdf, .lit, .mobi, etc.) by the method you described BEFORE they are added to calibre:

"load the text of the first 10 or so pages of a PDF and search it for ISBNs of this type by means of a user-configurable regex. If several matches are found (e.g.: "ISBN-10", "ISBN-13" and "eISBN-10"), the user might be given the opportunity to select one from a dialog."

If this software does find an ISBN, it could rename the file to that ISBN. Then, when importing the files into calibre, you could simply use the existing feature that reads metadata from the filename.

For example, suppose this imaginary software finds that Alice_in_Wonderland.pdf has ISBN 0123456789 and renames the file to 0123456789.pdf. In calibre you could then deselect the "Read metadata from files" option and change the filename regex so that the filename goes into the ISBN field. After adding all the files to calibre, you could simply bulk download metadata, and since the ISBN would already be stored with each book, it should come back with good results.
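
The filename half of this idea can be illustrated with a small JavaScript sketch. The pattern and the function name isbnFromFilename are hypothetical; calibre's own filename pattern is configured inside calibre, and this only shows what such a pattern would need to capture.

```javascript
// Hypothetical pattern: the whole file name, minus its extension, is a
// 13-digit ISBN or a 10-character ISBN (nine digits plus a digit or X).
const ISBN_FILENAME = /^(?<isbn>\d{13}|\d{9}[\dXx])\.[A-Za-z0-9]+$/;

// Returns the ISBN encoded in a file name, or null if the name is not an ISBN.
function isbnFromFilename(filename) {
  const m = ISBN_FILENAME.exec(filename);
  return m ? m.groups.isbn.toUpperCase() : null;
}

console.log(isbnFromFilename("0123456789.pdf"));          // "0123456789"
console.log(isbnFromFilename("Alice_in_Wonderland.pdf")); // null
```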

Obviously I like your method better, but if it can't work, maybe this could.

Thanks again.

Last edited by UnraisedArc; 07-11-2009 at 03:05 PM.

07-12-2009, 02:19 PM   #7
UnraisedArc
To automatically extract ISBNs from PDF files that contain text, I've had success using the macro program AutoHotkey.

Still wondering about a reasonable method to find the ISBN of image PDFs...

07-12-2009, 06:38 PM   #8
myle00
I have some Java code that I wrote that does exactly that. I use pdftk to extract the first 20 pages of every PDF file. Then I use a commercial program to OCR these pages and save the result as text. Once it's in text format, I run the program and it collects all the ISBNs found in the document. Often there are multiple ISBNs, because the book advertises other titles or cites references. The program decides which ISBN is the correct one based on the book's title from Amazon, on whether there are duplicates, and on other things. If it cannot decide, it lists them all and I select the correct one. Then it renames the original file to "xxxxxxxxxxxxx.xxx" and I import it into calibre. It was able to extract 5000 out of 6000 ISBNs, and all my CHM files. Of course, some of the missing ones didn't have ISBNs.

If you want it, I can post the Java code. But it doesn't have a GUI, and I usually run it in Eclipse. The only problem is the OCR: I couldn't find a good open-source command-line OCR program.
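
One part of the selection step described above, preferring the ISBN that occurs most often and falling back to a manual choice when nothing stands out, can be sketched in a few lines of JavaScript. This is not the Java program from the post: the Amazon title comparison is left out, and pickIsbn is a hypothetical name.

```javascript
// Pick the ISBN that occurs most often among the candidates found in a book.
// Returns null when there is no candidate or when the top count is shared,
// which is the case where the user would be asked to choose.
function pickIsbn(candidates) {
  const counts = new Map();
  for (const isbn of candidates) counts.set(isbn, (counts.get(isbn) || 0) + 1);
  let best = null;
  let bestCount = 0;
  let tie = false;
  for (const [isbn, n] of counts) {
    if (n > bestCount) {
      best = isbn;
      bestCount = n;
      tie = false;
    } else if (n === bestCount) {
      tie = true;
    }
  }
  return tie ? null : best;
}

console.log(pickIsbn(["0306406152", "080442957X", "0306406152"])); // "0306406152"
```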

Last edited by myle00; 07-12-2009 at 06:41 PM.

07-13-2009, 12:10 AM   #9
UnraisedArc
Hey myle,

I have access to ABBYY FineReader 9 (OCR software) and Adobe Professional through my school. I'm not sure whether that is the commercial OCR software you use, but if it is, I would really like that code: I also have upwards of 7000 books and sometimes get lost trying to find the one I'm looking for.

P.S. I just found calibre, and the more I use it, the better I like it!

07-13-2009, 12:46 AM   #10
myle00
I use ABBYY. I tried most OCR programs and this one seems the most stable. For example, OmniPage crashes if the total number of pages is large, while on my PC ABBYY can do 6000-9000 pages in one go. What you should do is create an automated task that takes a folder and converts its files to text. You'll have to experiment on your computer to see how many pages it can do at once. Since ABBYY doesn't have a command line, that will be the only manual step. It took me almost a week to OCR all my files. If one batch is 6000 pages, you'd have to run 24 batches for 7000 files, using only the first 20 pages of each file.
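
The batch count follows from simple arithmetic, shown here as a tiny JavaScript helper (batchesNeeded is a made-up name) so it can be redone for other library sizes or batch limits.

```javascript
// Number of OCR batches needed when only the first `pagesPerFile` pages of
// each file are processed and one batch can hold `pagesPerBatch` pages.
function batchesNeeded(files, pagesPerFile, pagesPerBatch) {
  return Math.ceil((files * pagesPerFile) / pagesPerBatch);
}

// 7000 files x 20 pages = 140,000 pages; 140,000 / 6000 = 23.3, rounded up.
console.log(batchesNeeded(7000, 20, 6000)); // 24
```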

I have an exam on Tuesday and I want to tidy up the code with comments and such so I should be able to post it on Wednesday.

07-15-2009, 09:10 AM   #11
river
An automatic ISBN import from text-based (perhaps not picture-based) e-books could work well, and would make the calibre metadata lookup very effective.

An option to invoke the ISBN lookup routine only if a title/author match is not found may work well and increase efficiency.

I find that subtitles and minor inconsistencies at the end of the title cause the match to fail. Even just a year in brackets at the end of the title will fail to match a well-known book. So this would definitely help the process.

As long as it's parameter-based, so it can be switched off, the user can decide whether to take the performance hit. Even allowing the user to specify how many pages deep to search would help; I'd say up to 5 pages deep would get most ISBNs.
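
The control flow suggested here, trying title/author first and scanning for an ISBN only as an optional fallback with a page limit, might be sketched as below. Everything in it is an assumption for illustration: resolveMetadata, the option names, and the two callbacks standing in for the metadata lookup and the page scan are not calibre functions.

```javascript
// Hypothetical settings: the scan can be switched off, and its depth is limited.
const defaultOptions = { scanForIsbn: true, maxPages: 5 };

// lookupByTitleAuthor(title, author) -> match object or null  (assumed callback)
// scanPagesForIsbn(book, maxPages)   -> ISBN string or null   (assumed callback)
function resolveMetadata(book, lookupByTitleAuthor, scanPagesForIsbn, options = defaultOptions) {
  // Cheap path first: no file contents are read if title/author already match.
  const match = lookupByTitleAuthor(book.title, book.author);
  if (match) return { source: "title/author", match };

  // The expensive scan runs only when enabled and only for unmatched books.
  if (!options.scanForIsbn) return { source: "none", match: null };
  const isbn = scanPagesForIsbn(book, options.maxPages);
  return isbn ? { source: "isbn", isbn } : { source: "none", match: null };
}
```

With this ordering, the performance cost is paid only for books whose title and author fail to match.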

07-15-2009, 11:54 PM   #12
keenanjparsons
This would definitely save me a lot of time! I never know which ISBN to use, the e-book one or the hardback/paperback one.

07-16-2009, 12:28 AM   #13
myle00
I finally finished redoing it. The attached zip has the Javadoc as well as 3 classes. The main class is BookNames, and you run everything from its main method. The other two classes are helper classes.

Firstly, I tried to make it OS independent so before you use it you'll have to make sure the class variables in all three classes are set for your specific situation. Nonetheless I only tried it on Windows so you should test it before you run it on a whole batch of files.

As I said, you'll need to install pdftk and then run the genPDFTKcat() method over the folder with the PDFs. The method saves a text file with ready-made commands to run in your OS, since pdftk is OS independent. You'll have to turn the text file into a batch file or paste it into your command terminal. Once pdftk has finished, you'll have to run the OCR software on the extracted PDF files.
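
For readers without the zip at hand, the kind of command list this step produces can be approximated with the JavaScript sketch below. It is not the genPDFTKcat() method from the attachment; pdftkCommands and its parameters are hypothetical. It relies only on pdftk's cat operation with a page range, and pdftk may reject the range for documents shorter than the requested page count.

```javascript
// Build one pdftk command per PDF that copies its first `pageCount` pages
// into `outDir` under the same file name. Paths are wrapped in double quotes.
function pdftkCommands(pdfPaths, outDir, pageCount = 20) {
  return pdfPaths.map((src) => {
    const name = src.split(/[\\/]/).pop(); // file name without its folder
    return `pdftk "${src}" cat 1-${pageCount} output "${outDir}/${name}"`;
  });
}

// Printing the list gives text that can be saved as a batch file.
console.log(pdftkCommands(["books/Alice.pdf"], "first20").join("\n"));
// pdftk "books/Alice.pdf" cat 1-20 output "first20/Alice.pdf"
```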

When that is done, run isbnDriver(), which will do the rest. However, to run it you'll need an Amazon associate number as well as an isbndb key, since it has to download info from Amazon and isbndb. Set the corresponding variables in copyURL to these two keys and it should work.

I just tested it on 850 files and it got ~680 of them. For 640 of them the program automatically selected the correct ISBN; I had to select the correct one for only 20. The rest, I suspect, either don't have ISBNs or the OCR wasn't good enough. Also, sometimes it cannot move and rename a file because a file with that ISBN already exists (it's a duplicate), so it saves a text file listing all the files that failed and what they should have been renamed to. It also saves a backup list with all the old and new file names. But you'll see all that in the Javadoc and comments.
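
The duplicate handling described above can be pictured with a short JavaScript sketch that only plans the renames and touches no files. It is an illustration under assumed names (planRenames, renamed, failed), not code from the attached classes.

```javascript
// Plan renames of the form "<isbn><original extension>".
// `found` is a list of { file, isbn } pairs; `existingNames` lists names
// already present in the target folder. Nothing on disk is changed here.
function planRenames(found, existingNames = []) {
  const taken = new Set(existingNames);
  const renamed = []; // backup list: old name and new name
  const failed = [];  // target name already in use, e.g. a duplicate book
  for (const { file, isbn } of found) {
    const dot = file.lastIndexOf(".");
    const target = isbn + (dot >= 0 ? file.slice(dot) : "");
    if (taken.has(target)) {
      failed.push({ file, target });
    } else {
      taken.add(target);
      renamed.push({ file, target });
    }
  }
  return { renamed, failed };
}
```

Writing both lists to text files afterwards gives the backup list and the failure report mentioned in the post.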

There are a bunch of other methods there, most of them I use to manipulate book files and folders so I left them there since I think they'll be useful.

As always I don't take any responsibility blah blah blah. But it should work because it works for me. If there are any questions just pm or post.

Good luck,
Matt

By the way, I would recommend using the print book ISBN, since Amazon doesn't have many e-books, so it'll be harder to get metadata for the e-book. Even isbndb may not have all the e-books, so you're better off with the print ISBN.
Attached Files
File Type: zip BookNames.zip (86.4 KB, 568 views)

Last edited by myle00; 07-16-2009 at 12:31 AM.

07-16-2009, 06:44 PM   #14
UnraisedArc
Thank you! I will give this a try and let you know how it all works out.

05-03-2010, 12:10 PM   #15
tnt85
Hello

You can extract the ISBN from a PDF with Acrobat JavaScript, then simply import the files into calibre.

I used the following Acrobat JavaScript; though it's not perfect, it can be very useful:

/* Extract ISBN and save a copy of the PDF under that name */

ExtractFromDocument(0, 25);

function ExtractFromDocument(start, end)
{
    var Out = new Object();
    // Label ("ISBN", "ISBN-10", "ISBN-13" or the spelled-out form), an optional
    // qualifier, then 13 or 10 digits with optional separators (last may be X).
    var reMatch = /(?:ISBN[ \-–]*(?:10|13)?|International Standard Book Number)[:\s]?(?:, PDF ed\.|, print ed\.|\(pbk\)|\(electronic\))?[:\s]?(?:(?:\d[-– ]?){12}|(?:\d[-– ]?){9})[\dxX]/gi;
    // The number itself, anchored at the end of a match.
    var reNumber = /(?:(?:\d[-– ]?){12}|(?:\d[-– ]?){9})[\dxX]$/;

    // construct filename for output document (original name without ".pdf")
    var pos = this.path.search(/[^:\/]+\.pdf$/);
    var fname = this.path.slice(pos, this.path.length - 4);
    var filename = fname;

    try {
        // pages to scan: pages start..end-1, then the last 5 pages of the document
        var pages = [];
        var i, j;
        for (i = start; i < Math.min(end, this.numPages); i++) pages.push(i);
        for (i = Math.max(end, this.numPages - 5); i < this.numPages; i++) pages.push(i);

        for (i = 0; i < pages.length; i++)
        {
            var numWords = this.getPageNumWords(pages[i]);
            var PageText = "";
            for (j = 0; j < numWords; j++) {
                // bStrip = false keeps punctuation and whitespace
                PageText += this.getPageNthWord(pages[i], j, false);
            }

            var strMatches = PageText.match(reMatch);
            if (strMatches == null) continue;
            for (j = 0; j < strMatches.length; j++) {
                Out[strMatches[j]] = true;
            }
        }

        // use the first match, reduced to its digits (and X), as the file name
        for (var prop in Out)
        {
            var num = prop.match(reNumber);
            if (num == null) continue;
            filename = num[0].replace(/[^\dxX]/g, "");
            break;
        }

        // if no ISBN was found, the copy keeps the original name
        if (filename.length >= 1) this.saveAs("c:\\data\\" + filename + ".pdf");
        if (this.disclosed) this.closeDoc();
    }
    catch(e)
    {
        // print the names of files that caused an error
        //console.println("Processing error: "+e.message+" "+filename);
        console.println(fname);
        if (this.disclosed) this.closeDoc();
    }

} // end of the function
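
A pattern match like the one in this script can also pick up numbers that merely look like ISBNs. One way to filter them, sketched here in plain JavaScript rather than as part of the Acrobat script, is to test the check digit of 13-digit candidates before renaming anything. The function name isValidIsbn13 is illustrative.

```javascript
// Validate an ISBN-13 check digit: digits are weighted 1, 3, 1, 3, ...
// from left to right, and the weighted sum must be divisible by 10.
function isValidIsbn13(raw) {
  const s = raw.replace(/[^\d]/g, ""); // drop hyphens, spaces and other separators
  if (s.length !== 13) return false;
  let sum = 0;
  for (let i = 0; i < 13; i++) {
    sum += Number(s[i]) * (i % 2 === 0 ? 1 : 3);
  }
  return sum % 10 === 0;
}

console.log(isValidIsbn13("978-0-306-40615-7")); // true
```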

Last edited by tnt85; 05-03-2010 at 12:17 PM.

All times are GMT -4.