[GUI Plugin] Extract ISBN - Page 13

drMerry · 07-06-2011, 01:12 PM

I got a book where no ISBN was found.
Copying and pasting the ISBN from the PDF into calibre was no problem.

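Spoiler:

Starting job: Extract ISBN for 1 books
Running scan for isbn query with parameters:
{u'paths': [(u'PDF', u'C:\\local (laptop)\\Onbekend\\Great Book of Puzzles (18786)\\Great Book of Puzzles - Onbekend.pdf')], u'timeout': 30, u'title': u'Great Book of Puzzles'}
-------------------------------
Scanning: C:\local (laptop)\Onbekend\Great Book of Puzzles (18786)\Great Book of Puzzles - Onbekend.pdf
Scan time: 23.503000021
Great Book of Puzzles
The scan failed to find an isbn in 23.50 seconds
Failed to extract ISBN for Great Book of Puzzles
Scan complete, with 1 failures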

I'll send the book by pm

kiwidude · 07-06-2011, 02:09 PM

@drMerry - my guess from looking at the PDF is the text is behind an image. The PDF conversion engine never picks up the text in that situation, so there is no ISBN to find. I would guess that if you tried to convert that PDF to an EPUB you would find that page was rendered as an image in the EPUB.

drMerry · 07-06-2011, 05:55 PM

You're right.
Stupid of me not to think of that.
Thanks for looking.

mobilemax · 08-08-2011, 04:41 PM

Any chance of adding something like a "timeout" option to the script? I have had some books where the script just kept working for hours and never finished. Would it be possible to stop the task on the current book after a specified time, e.g. 5 minutes maximum?

thanks!

kiwidude · 08-08-2011, 04:46 PM

@mobilemax - The problem will not be the time taken to scan, but the time taken to convert to epub (which is calibre code) prior to the scan. You must have a particularly nasty book that Calibre is choking on. As for whether it would be possible to force a timeout, I don't know - I will add it to the list to take a look at one day.
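For anyone wanting to experiment with this outside the plugin, a per-book time limit can be approximated by running the conversion in a separate process and killing it when the limit expires. This is only a minimal sketch under that assumption, not how the plugin or calibre handles jobs internally; `run_with_timeout` is a hypothetical helper name, and the commented usage line assumes calibre's `ebook-convert` command line tool is on your PATH.

```python
import subprocess
import sys

def run_with_timeout(cmd, timeout_secs):
    """Run cmd in a child process.

    Returns True if it finished within timeout_secs with exit code 0,
    False if it failed or had to be killed for running too long.
    """
    try:
        result = subprocess.run(cmd, timeout=timeout_secs,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return False
    return result.returncode == 0

# Hypothetical usage, giving one conversion at most 5 minutes:
# ok = run_with_timeout(["ebook-convert", "book.pdf", "book.epub"], 300)
```

Note that only the direct child process is killed on timeout; anything it spawned itself may need separate cleanup.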

mobilemax · 08-08-2011, 04:53 PM

Quote:

Originally Posted by kiwidude

@mobilemax - The problem will not be the time taken to scan, but the time taken to convert to epub (which is calibre code) prior to the scan. You must have a particularly nasty book that Calibre is choking on. As for whether it would be possible to force a timeout, I don't know - I will add it to the list to take a look at one day.

Yep, had quite a few, and since I decided to run the whole db through ExtractISBN, it's quite tedious to find that it just did not finish "these and those 500 books" and you have to find the bad ones and skip them ;-)

But I still love the script of course! ;-)

Thanks

Btw, is there any way of limiting which formats it will parse? E.g. I have .txt/.epub with the same contents because .epub was created from .txt and it would make sense to skip the .txt to make it quicker...

kiwidude · 08-08-2011, 05:21 PM

No way to limit it, nor would many people want to (since unless you do all your own conversions you wouldn't know they were the exact same content). Besides, I would expect .txt to be pretty quick anyway. It is formats like LRF and graphical PDFs that Calibre chokes on the most.

jlutes · 08-08-2011, 05:31 PM

I do find this script useful but it seems to fail on a pretty regular basis. Perhaps it's a problem on my end so let's start with that.
I routinely get a Windows exception error stating:
AppName: calibre-parallel.exe AppVer: 0.8.13.0 ModName: unknown
ModVer: 0.0.0.0 Offset: 025b80b5
Once I see that message I know I'm done and I might as well kill the job. I have let it sit for over an hour and it never will finish. The real kicker is, unless I'm missing something, the ISBNs it did find aren't applied if you have to kill the job. Scanning 500 books and finding out it crashed at 98% just makes my skin crawl.
My question is, what, if anything, can I provide to help find and squash whatever is causing this?

kiwidude · 08-08-2011, 06:08 PM

@jlutes - you need to figure out which book and format is causing your issue. My guess would be that it is a problem with a PDF, since that sort of crash is likely from C++ unmanaged code (which the PDF conversions use). If you find the book causing the crash, attach it to a bug report for Kovid to take a look at. There is nothing I or the plugin can do about this; it is calling existing Calibre code.

jlutes · 08-08-2011, 08:03 PM

I went to try and figure out if a certain format was causing the problem and found an even more interesting phenomenon. If I highlight a group of 10 books and run Extract ISBN, I get the error I described earlier. However, I can choose each book individually and run Extract ISBN on each one and it never errors. Are we looking at the same thing? Still a call to existing Calibre code causing the problem?

* Update *
I got a virtual memory warning on my machine (first I've ever seen) and found that there were about 50 Dr. Watson processes running, each one of them tied to a calibre-parallel process. After I killed all of them and restarted Calibre, it appears that its attitude has changed. I am still getting Windows Exception errors but they aren't stopping the process.

capnm · 08-09-2011, 12:31 AM

Hmmm...
IIRC, when you run this against just a couple of books it runs as part of the main Calibre process, but select several books (user configurable threshold) and it spawns a background worker process.

Oddly, I had several issues with memory leaks while running as part of the main process, but the spawned background jobs have always been well behaved on my machines. But I'm almost all epub & mobi files.

I wonder what would happen if you raised the threshold in the plugin configuration and tried that same group of 10 as a foreground process instead of as a background process ....

And are your books pdfs? Or ....?

kiwidude · 08-09-2011, 02:54 AM

If you read back through this thread you would understand why it is different behaviour between running one versus multiple. Calibre has a major issue with memory leaks in the conversion process, so to work around this conversions should be done in the background. However if you are just doing a single ad hoc extract ISBN (which is how I usually tend to work) then for speed reasons I don't run it as a background job if you select only a single book.

It sounds like Calibre is crashing on your books when doing the extract when running in the background. No-one else has reported any issues with this, so I am inclined to believe it is something about the books you are scanning. You need to figure out the format that is causing the issue - duplicate the books (create empty books then merge in keeping the original), then one by one remove likely problem formats (starting with PDF) to see if it still errors.
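The one-by-one elimination can also be scripted outside calibre by processing each format file in its own process and noting which ones exit abnormally. The following is a rough sketch of that idea, not something the plugin provides: `find_failing_files` and `make_cmd` are hypothetical names, and the commented usage assumes calibre's `ebook-convert` tool is on your PATH.

```python
import subprocess
import sys

def find_failing_files(paths, make_cmd, timeout_secs=600):
    """Run make_cmd(path) once per file, each in its own process.

    Returns a list of (path, reason) for every file whose process
    timed out or ended with a non-zero exit code.
    """
    failures = []
    for path in paths:
        try:
            proc = subprocess.run(make_cmd(path), timeout=timeout_secs,
                                  stdout=subprocess.DEVNULL,
                                  stderr=subprocess.DEVNULL)
        except subprocess.TimeoutExpired:
            failures.append((path, "timed out"))
            continue
        if proc.returncode != 0:
            failures.append((path, "exit code %d" % proc.returncode))
    return failures

# Hypothetical usage: try converting every file to EPUB and list the bad ones.
# bad = find_failing_files(files, lambda p: ["ebook-convert", p, p + ".epub"])
```

Because every file gets a fresh process, one crash does not take the rest of the batch down with it.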

Ababakar · 08-13-2011, 06:35 AM

As I'm just starting to use calibre and am in the process of editing all my added books, I encountered some things regarding ISBN extraction.

First one:
Tricky: I got some books from an edition with some other volumes. These are mentioned on page 2 (with their ISBNs) - the ISBN of the actual book was on page 3. Don't know if there is a solution for that (maybe a hint if more than one ISBN is found). But anyway: don't trust the extraction blindly.

Not to say that I don't like your work, kiwi - just a reminder that things are never perfect.

Edit: just read the whole thread: this is kind of the same as mentioned in posts #63 and #74 - so I guess this has already been discussed. Just wanted to mention it.

Second one:
In some of my books there is no word "ISBN" + following number, but the whole phrase ("International Standard Book Number" + following number). Maybe this term can be included in later versions.

Third one:
It took me some hours to figure out such a nice search option as isbn:false - so maybe you should put this in the FAQs of the plugin or something. But as I know it now, it doesn't bother me anymore.

Last one:
Great plugin.

The only thing left for me as a new user is to try to find a way to easily add my comic collection (cbr+cbz). But that's another topic.

kiwidude · 08-13-2011, 06:47 AM

@Ababakar - welcome to MobileRead.

Yes, as you will have read repeatedly through this thread, there is no magic bullet for grabbing the ISBNs, and there will always be the odd situation where it either cannot find it or gets the wrong one if there are multiple. However these are the exception rather than the norm.

As for your cbr/cbz files - this plugin does not look for the word "ISBN" - the very first implementation did look for preceding words, however due to so many variations (and to cater for bad quality OCR scan errors) it now just looks for a sequence of numbers that starts with the right prefix for ISBNs and validates as an ISBN. If it cannot find such in your comic books my guess is that they are images rather than text, which the plugin cannot scan. You can see from looking at the log what text numbers it did attempt to match on, and you can always do a conversion to ePub to verify for yourself what "text" the plugin found available to scan (since for all but PDFs that is exactly what the plugin is doing in the background - silently doing a conversion to ePub and then scanning the html pages for text). If your comic shows up as an EPUB containing image files where the ISBN is, then that proves the plugin will be unable to extract it.
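To illustrate the "sequence of numbers that validates as an ISBN" idea, here is a minimal sketch using the standard ISBN-10 and ISBN-13 check digit rules. It is not the plugin's actual code: the regular expression, the function names and the restriction of ISBN-13 to the 978/979 prefixes are all assumptions made for this example, and it only handles digits separated by hyphens, not spaces.

```python
import re

# Assumed candidate pattern: 10 to 17 characters of digits and hyphens,
# optionally ending in X, not embedded in a longer run of digits.
CANDIDATE = re.compile(r"(?<![0-9])[0-9][0-9\-]{8,15}[0-9Xx](?![0-9])")

def is_valid_isbn10(s):
    """Check digit rule: weights 10..1, the sum must be divisible by 11."""
    if len(s) != 10 or not s[:9].isdigit() or s[9] not in "0123456789X":
        return False
    total = sum((10 - i) * (10 if ch == "X" else int(ch))
                for i, ch in enumerate(s))
    return total % 11 == 0

def is_valid_isbn13(s):
    """Check digit rule: weights 1,3,1,3,..., the sum must be divisible by 10."""
    if len(s) != 13 or not s.isdigit() or s[:3] not in ("978", "979"):
        return False
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(s))
    return total % 10 == 0

def find_isbns(text):
    """Return every distinct valid ISBN in text, in order of appearance."""
    found = []
    for match in CANDIDATE.finditer(text):
        digits = match.group().replace("-", "").upper()
        if is_valid_isbn13(digits) or is_valid_isbn10(digits):
            if digits not in found:
                found.append(digits)
    return found
```

Returning all matches rather than just the first makes it possible to flag books where more than one ISBN turns up, which is the multi-volume situation described above.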

Ababakar · 08-13-2011, 10:13 AM

Oh sorry - I didn't notice that it no longer looks for phrases - just for the digits.
Anyway - as I am sorting my collection: I have some books where the ISBN could not be extracted but can easily be found via Okular (KDE PDF viewer - so it is OCR'd and not only an image). Plus they are on the first 10 pages.
By the way - I also have a lot of OCR'd DjVu files where the ISBN could not be extracted but can be found via Ctrl+F in my PDF viewer. Did I read right that DjVu files won't work at all?
Anyway: if you want, I can collect those PDFs (and DjVus) for you (I will simply print out the single page where I find the ISBN with CUPS (Linux PDF printer) to keep the data size small). But as I am doing something like "5 books a day" this may take a while.

As for cbr/cbz - I know - this was more a general comment than one about ISBN extraction. I only wanted to say that I am not yet sure if calibre can help me with those. But as I said - I may discuss this in another topic.

