View Full Version : [GUI Plugin] Extract ISBN



kiwidude
03-23-2011, 09:37 AM
This plugin can be used to try to find the ISBN for a book using the text within a book format. It is intended as an alternative to various script based solutions to this problem posted in this thread (http://www.mobileread.com/forums/showthread.php?t=50691).

Main Features of v1.4.3

Scans all formats for the selected book(s) in preferred input format order until an ISBN-13 or ISBN-10 is found
Runs as a background job in Calibre, prompting you to update when the scanning is completed.
Scans only the book content, excluding HTML tag markup.
For PDF formats, scans only the first 10 pages, then if ISBN not found, the last 5 pages in reverse order.
For other formats, scans files at the front, then a number of end files in reverse order before the remainder of the book.
Restricts valid ISBN-13s to those that start with 977, 978 or 979. You can add additional prefixes in the configuration if required.
Optionally perform a search when completed showing you only the books updated (default is off). Some users may use this to then perform a metadata download.
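The PDF page order described in the feature list can be pictured with a small sketch. This is only an illustration under stated assumptions: `pdf_scan_order` is a hypothetical helper name, the page counts 10 and 5 are taken from the list above, and none of this is the plugin's actual code.

```python
def pdf_scan_order(page_count, first=10, last=5):
    """Return 0-based page indexes in the order they would be scanned.

    Hypothetical helper: front pages in reading order, then the final
    pages in reverse order, never visiting the same page twice.
    """
    front = list(range(min(first, page_count)))
    back_start = max(page_count - last, 0)
    back = [p for p in range(page_count - 1, back_start - 1, -1)
            if p >= len(front)]
    return front + back


print(pdf_scan_order(20))
```

For a 20 page PDF this visits pages 0 to 9, then 19 down to 15; short documents are simply scanned once from the front.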


Special Notes:

Requires calibre v0.8.54 or later.
As this runs in the background, you must be careful not to change the books being scanned while it is running. Changing the metadata such as title or author, deleting a book or performing a conversion will risk causing a problem. Restrict any editing to other books in your library while the scan is running and you will be fine.


Installation Notes:

Download the attached zip file and install the plugin/add to context menu or toolbar/restart calibre as described in the Introduction to plugins thread (http://www.mobileread.com/forums/showthread.php?t=118680).


Paypal Donations:

If you find this or any of my other plugins useful please feel free to show your appreciation. I have spent many hundreds of unpaid hours in their development and support so any encouragement for me to continue is appreciated!
Donate via PayPal: https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=RBHY43BYX9FVA


Version History:
Version 1.4.3 - 01 Aug 2012
Split bulk extraction into batches with size changeable via plugin configuration

Version 1.4.2 - 03 Jun 2012
Minimum version set to calibre 0.8.54 (but preferred version is 0.8.55)
Performance optimisation for epubs for calibre 0.8.51 to reduce unneeded computation
Change to calibre API for deprecated dialog which caused issues that intermittently crashed calibre
Minor fix to ensure HTMLPreProcessor object is initialised correctly
Change to using different pdf engines for pdf processing due to calibre 0.8.53 breaking the one I was using.
Stability improvement will activate with calibre 0.8.55 by running pdf analysis on a forked thread

Version 1.4.1 - 12 Nov 2011
Exclude leading spaces before the ISBN number which prevented some valid ISBNs from being detected.

Version 1.4.0 - 11 Sep 2011
Upgrade to support the centralised keyboard shortcut management in Calibre

Version 1.3.7 - 02 Jul 2011
Fix bug of question dialog when metadata has changed not being displayed

Version 1.3.6 - 12 Jun 2011
Fix bug occurring when same ISBN extracted for a book
For non PDF file types, based on #files in books scan first x files, last y in reverse then rest
When scan fails, still give option to view the log rather than standard error dialog

Version 1.3.5 - 25 May 2011
Add yet another unicode variation of the hyphen separator to the regex

Version 1.3.4 - 21 May 2011
Run the ISBN extraction out of process to get around the memory leak issues

Version 1.3.3 - 19 May 2011
Ensure stripped HTML tags replaced with a ! to prevent ISBN running into another number making it invalid

Version 1.3.2 - 17 May 2011
Strip the <style> tag contents to ensure panose-1 numbers are not picked up as false positives

Version 1.3.1 - 06 May 2011
Strip non-ascii characters from the pdfreflow xml which caused it to be invalid
Support the ^ character being part of the ISBN number
Attempt to minimise any memory leak issues caused by this plugin itself

Version 1.3 - 29 Apr 2011
Do all scanning as a background job to keep the UI responsive
Remove all interactive UI options - it will now always scan all formats in preferred order
Make sure that ISBN-13s start with 977, 978 or 979 (configurable).
Exclude the various repeating digit ISBNs of 1111111111 etc.
Exclude all html markup tags to prevent issues like the svg sizes being picked up as ISBNs
Include endash and other dash variants as possible separators
When scanning PDF documents, scan the last 5 pages in reverse order so it is the last ISBN found
Configuration option for ISBN13 prefixes and option to show updated books when extract completes

Version 1.2.1 - 09 Apr 2011
Support skinning of icons by putting them in a plugin name subfolder of local resources/images

Version 1.2 - 03 Apr 2011
Rewritten for new plugin infrastructure in Calibre 0.7.53
ISBN matching regex replaced using an approach from drMerry
PDFs now processed with new Calibre PDF engine to scan just first 10 and last 5 pages

Version 1.1 - 28 Mar 2011
Add configuration options over the scan behaviour (default + alternate)
The options you have are:
Ask me which format to scan
Scan only the first format in preferred input order
Scan all formats in preferred input order until an ISBN found

Version 1.0.1 - 24 Mar 2011
Skip book formats which we are unable to read, such as djvu
Display progress in the status bar
Ctrl+click or shift+click on the toolbar button to do a non-interactive choice of formats where your book has multiple.
It will use the first found based on your preferred input format order list from Preferences->Behaviour

Version 1.0 - 24 Mar 2011
Initial release of Extract ISBN plugin
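
The markup handling mentioned in the 1.3.2 and 1.3.3 entries (dropping `<style>` contents, and replacing stripped tags with a `!` so that digits on either side of a tag cannot run together) might look roughly like this. It is a sketch only: the regexes and the name `visible_text` are assumptions for illustration, not the plugin's real code.

```python
import re

# Assumed patterns for illustration; real-world HTML may need a proper parser.
STYLE_BLOCK = re.compile(r'<style\b.*?</style\s*>', re.IGNORECASE | re.DOTALL)
TAG = re.compile(r'<[^>]+>')


def visible_text(html):
    """Drop <style> contents, then replace every remaining tag with '!'."""
    html = STYLE_BLOCK.sub('!', html)
    return TAG.sub('!', html)


print(visible_text('<p>ISBN 0306406152</p><p>2011</p>'))
```

Without the `!` placeholder, the ISBN and the year in this example would fuse into one long digit run once the tags are removed.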

talonius
03-23-2011, 09:54 AM
I. Love. You.

Now... if we could add the extraction to the Edit Book Details window (like to the right of the ISBN text box) and then have an option to download metadata if an ISBN is found... I would have your baby.

(Although, yes, I can edit a batch and then download a batch. I tend to edit one at a time so I think one at a time. :))

This has worked beautifully on 480 out of 500 books. And the 20 that didn't work I confirmed were PDFs where the contents were JPG images rather than text -- so no way for the regex to pick up the ISBN.

Oh, some sort of progress indicator would be beneficial. (Dunno if possible.)

kiwidude
03-23-2011, 10:16 AM
Cool, glad it worked for you!

I agree that some sort of progress indicator would be useful. I just wanted to get "something" out there to see what the interest was, how people wanted to approach the multiple format/selection issue etc.

Your point about the edit book details window also confirms why I did not invest a great deal more effort at this point beyond "proving it was possible". As this plugin just wires together and reuses a few bits of Calibre code there really isn't any technical reason why it couldn't be built natively into Calibre. It is entirely down to Kovid and whether he wants to make the functionality available from screens like the Edit Metadata and Bulk Metadata dialogs.

pchrist7
03-23-2011, 11:22 AM
I. Love. You.
.. I would have your baby.

(Although, yes, I can edit a batch and then download a batch. I tend to edit one at a time so I think one at a time. :))


Wow - I'm getting old here.
Being a bachelor, old and all, I haven't been keeping up to date with procreation, I see
:rofl::rofl::rofl::rofl::rofl:

Sorry All, especially talonius & kiwidude !!!
WILL TRY to just read from now on, instead of being "funny"

talonius
03-23-2011, 01:08 PM
Minor issue: If there's a format stored in Calibre that Calibre doesn't know how to handle (DjVu in this instance) the plugin throws an error and aborts processing.

Possible optimization: Abort searching through the book once a certain percentage/amount of text has been searched. This would help speed up the search for 95% of the books.

Building it into Calibre would be fantastic but since this is the major roadblock to me finishing my catalog, I'm going to continue to push it. <g> No worries, I'm looking at how to do all of my suggestions myself as possible improvements. I work in C#/C++ professionally, just not Python/Calibre. I'll just have to buckle down and do some (gasp!) reading.

As for jokes... ha! Trust me, I'm far from serious. One reason I don't participate in projects is because my joking attitude tends to grate on the more serious folks who tend to inhabit the programmer's world.

kiwidude
03-23-2011, 03:32 PM
@Talonius - I will push a 1.0.1 version shortly which will ensure any errors are more gracefully handled. It will also display progress in the status bar.

The optimization stuff is a tough one. The problem is that I have seen books where the copyright/ISBN information has been put at the end of the EPUB. Granted this is the exception rather than the rule, but maybe others have seen it frequently? This is the sort of operation that you will only do once on your books though so performance shouldn't be too much of an issue...

Also, I think most of the slowdown will be in the time taken to convert each book into text, not the bit the plugin does of applying regex expressions on each file in it. I haven't profiled it but I am pretty confident that will be the case.

What I have done is get it to short-circuit gathering ISBNs once it has found an ISBN and finished processing the current internal file of the converted format. The logic I "borrowed" from bazbar scanned the whole book and built up lists of ISBNs should a book have multiple ISBN13s for instance. I don't know enough about when that ever happens (most books I have seen have only either one or both of an ISBN10/ISBN13 but not more than that). Finishing processing a file (hopefully all ISBNs are on the same one) and then stopping should be enough. This won't help speed up books with no ISBN inside though.
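
The short-circuit described above (finish the current internal file, then stop) can be sketched like this. The pattern `ISBN_LIKE` is deliberately simplistic and the function name `first_isbns` is made up for the example; neither is the plugin's actual regex or code.

```python
import re

# Simplistic stand-in pattern for illustration only.
ISBN_LIKE = re.compile(r'\b(?:97[89][- ]?)?\d{9}[\dXx]\b')


def first_isbns(files):
    """files: text of each internal file of the converted book, in scan order.

    Collect every candidate in the first file that contains any, then stop,
    so later files are never scanned once something has been found.
    """
    for text in files:
        found = ISBN_LIKE.findall(text)
        if found:
            return found
    return []


book = ['no numbers here',
        'ISBN 9780306406157 and 0306406152',
        'ISBN 0000000000']
print(first_isbns(book))
```

As noted in the post, this does nothing for books with no ISBN at all: every file still gets scanned.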

I am also about to make it that if you ctrl+click or shift+click on the toolbar button it will do a non-interactive decision of which format to interrogate when you have multiple. This will be based on your preferred input format list in Preferences for now. I'll wait for suggestions for alternatives before doing anything else around that. For people who only have formats produced by converting the same version that will work well. Where it won't is say if they got a PDF from somewhere and an EPUB from somewhere else, and the EPUB has had the ISBN stuff removed. Still, at least you will see in the report which books it failed to find an ISBN for, and you can always then just do a normal toolbar button click to get the interactive choice of format to extract from.

kiwidude
03-23-2011, 04:15 PM
I've mentioned most of this in the previous post but to recap:

Skip book formats which we are unable to read, such as djvu
Display progress in the status bar
Ctrl+click or shift+click on the toolbar button to do a non-interactive choice of formats where your book has multiple. It will use the first found based on your preferred input format order list from Preferences->Behaviour

garcle
03-25-2011, 03:01 AM
Great and very useful plugin, thanks much.

one comment though, I have been able to (inadvertently) "choke" the plugin on a document with 1800 pages and 2million words. It is a text pdf, and as it turns out there is no isbn amongst the 2 million words. Is it possible to have a "fail gracefully after x time" capability?

Thanks again for what is otherwise a very useful plugin.

kiwidude
03-25-2011, 05:11 AM
@garcle - see my comments above in post #6. The way to test this would be to go into convert, choose search & replace and click one of the wizard buttons. That will ask Calibre to convert the document in the exact same way that my ISBN extract does. Check how long it takes for it to do this with your big PDF file to get to a point of text being displayed in the wizard box, versus how long it takes the extract ISBN functionality.

If the times are comparable, there is nothing I can do, at least not without rewriting the text conversion functionality to perhaps say just convert a small % of the document. Which I have no intention of doing myself :)

OTOH if you think the ISBN functionality is still significantly slower than the S&R wizard then I could take a look at it. If you point me at a download somewhere of a PDF typical of the issue I will see what I can do.

Doug-W
03-26-2011, 12:15 AM
I've mentioned most of this in the previous post but to recap:

Skip book formats which we are unable to read, such as djvu
Display progress in the status bar
Ctrl+click or shift+click on the toolbar button to do a non-interactive choice of formats where your book has multiple. It will use the first found based on your preferred input format order list from Preferences->Behaviour


Could you make that last step be two options?
1) Run in non-interactive by default or interactive by default.
2) Follow preferred input format, or continue searching all if not found in first? I format down some of my epubs which is my preferred format.

kiwidude
03-26-2011, 07:35 AM
@Doug-W - thanks for the suggestions. I am applying them at the moment and will push a new version when done.

When searching all formats, do you think that option should be dependent on whether the user has interactively chosen a format? i.e. If I have interactively chosen a specific format, it should always stop after searching just that format. Whereas the "search all formats until found using preferred order" only applies when you are doing a non-interactive search?

Hope I explained myself, it is very difficult to wrap the wording around as per the screenshot - any suggestions for alternate wording welcomed :)

EDIT: Removed the screenshot, came up with a simpler approach...

kiwidude
03-27-2011, 11:16 AM
This release adds some configuration options over the scan behaviour for when there are multiple formats for a book. You can configure both a default behaviour and an alternate behaviour (the latter when you shift+click or ctrl+click on the plugin as a toolbar button).

The options you have are:

Ask me which format to scan
Scan only the first format in preferred input order
Scan all formats in preferred input order until an ISBN found

Note that the last option can be slow, if you care about performance. As I have commented previously on this thread any performance issues I can do very little about - it is down to the performance of the converters and the nature of the conversion where the bulk of the time is spent.

garcle
03-27-2011, 11:16 PM
Any way to force a refresh on the book list?
The ISBNs don't show up in the book list (but do show in the book metadata editor form) after the plugin runs.

kiwidude
03-28-2011, 06:52 AM
This adds two things:

Ensure an ISBN custom column or the Identifiers count in the tag browser is refreshed after retrieving ISBN values
Adds an Abort button to the select format dialog, in case you accidentally started interactively searching a large selection


Thanks @garcle for reporting the refresh issue.

Calibrefan
03-29-2011, 04:23 AM
Thanks kiwidude for this very useful plugin!

sdspieg
03-29-2011, 05:33 PM
Super plugin! Thanks much... Calibre still keeps getting better and better...

Any idea why it does not work on all files though? I have some books in my collection for which I CAN find the isbn number when I open the pdf file and look for it myself, but that the plugin didn't get right... Would you be interested in some books for which it doesn't work?

Cheers,

-Stephan

kiwidude
03-29-2011, 06:01 PM
Hi Stephan - There are two reasons I can think of. The first is that your PDF is actually not containing text and instead just contains images of the content. If that is the case, there is nothing that can be done until the book is OCR'd.

If that is not the case, then perhaps I need to adjust the regular expression that finds the ISBN numbers. If you have a PDF example, drop me a PM with a link to somewhere I can download it from.

drMerry
03-29-2011, 07:10 PM
Hi, great plugin.

Here is some code you can add to your plugin.
It will read an ISBN that uses spaces instead of -
It will also import ISBN codes that start as
lsbn (L)
1sbn (one)
IS8N (eight)
I8BN
and combinations of this.
This in case of scanned books.
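
As a rough illustration of the label variants listed above (not the code attached to this post), a character-class regex could accept the common OCR confusions. The pattern and the name `has_isbn_label` are assumptions made up for this sketch.

```python
import re

# I may be read as l or 1; S and B may each be read as 8 (assumed mapping,
# based only on the variants listed in the post).
OCR_ISBN_LABEL = re.compile(r'[Il1][Ss8][Bb8][Nn]')


def has_isbn_label(text):
    """True if the text contains 'ISBN' or one of its OCR-mangled forms."""
    return OCR_ISBN_LABEL.search(text) is not None


for label in ('ISBN', 'lsbn', '1sbn', 'IS8N', 'I8BN'):
    print(label, has_isbn_label(label + ' 978 0 306 40615 7'))
```

A pattern this loose can also match inside unrelated words, so any number found after it would still need validating.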

Please keep up the good work!

(and to other people, this is an edit of the original plugin, it works in my case but I do not guarantee it works for you. To be safe, please use the original plugin and wait for the original update!)

sdspieg
03-29-2011, 07:28 PM
Hi Stephan - There are two reasons I can think of. The first is that your PDF is actually not containing text and instead just contains images of the content. If that is the case, there is nothing that can be done until the book is OCR'd.

Nope. They're text-based

If that is not the case, then perhaps I need to adjust the regular expression that finds the ISBN numbers. If you have a PDF example, drop me a PM with a link to somewhere I can download it from.
I did.

-Stephan

kiwidude
03-29-2011, 07:31 PM
Hi, great plugin.

Here is some code you can add to your plugin.
It will read an ISBN that uses spaces instead of -
It will also import ISBN codes that start as
lsbn (L)
1sbn (one)
IS8N (eight)
I8BN
and combinations of this.
This in case of scanned books.

Please keep up the good work!

(and to other people, this is an edit of the original plugin, it works in my case but I do not guarantee it works for you. To be safe, please use the original plugin and wait for the original update!)
Hi DrMerry,

I've just taken a quick look at your changes to look to incorporate into the next release this weekend (most of my plugins will get a new release to support some changes Kovid is making). However I am not too sure about that replacement of spaces in the text. Doesn't that cause a problem for the entire rest of the regex, such as "International Standard Book Number" etc? What was the thinking behind that?

EDIT: Never mind, sorry, clearly looked at it *too* quickly, I see you are applying it at the point after the regex has been applied. I will look to include in my next release, thx.

drMerry
03-29-2011, 07:42 PM
Hi DrMerry,

I've just taken a quick look at your changes to look to incorporate into the next release this weekend (most of my plugins will get a new release to support some changes Kovid is making). However I am not too sure about that replacement of spaces in the text. Doesn't that cause a problem for the entire rest of the regex, such as "International Standard Book Number" etc? What was the thinking behind that?

I've got a lot of books that had no ISBN and did not get one after using your plugin.
It occurred to me that 90% of my non-isbn tagged books (total of 2018 have no ISBN in calibre) had an isbn in the text.

Some had 1sbn or IS8N, but most had 978 123 456 x

It does not give problems with isbn you can try some in http://www.regextester.com/
This is because you build a good regex yourself ;)
I strip the spaces out of match(1). This is the match that only contains numbers, spaces and - (At this moment I think, maybe there are also 978.123.456.x numbers).
The last part had to move to avoid a space at the end of the code. This gave some errors.

One thing: You used the match(1), but I'm not sure why? Why did you make 2 matches, when match 0 is never used?

One other thing: I also have books with ISBN on the last, second-last or last page before the footnotes. Is it possible to add an option to start the search at the end of the file?

Hope I've been clear enough. Going to sleep now (it's 1.41 here, 6.00 is wake-up time :()

kiwidude
03-29-2011, 07:54 PM
Thx for the extra info, I had edited my post while you were typing obviously :)

I claim no credit for the regex or that loop of code applying the matches - as per my very first post I took this code from bazbar's script that people were using. I figured they had in turn built it based on all of the earlier versions so just assumed it was "proven". Since you have questioned it I will take a look.

As for your question about searching from the end of the book. I assume this is a performance thing - and my answer remains the same as previously. I am at the mercy of the current implementation of the Calibre input converters. They do not stream the results to me, I cannot control their direction. I give them a path, and when they are "done" they give me a bunch of stuff representing the converted EPUB back.

I hadn't appreciated any real performance issues with piggy-backing off this until just now when I tried a PDF which had graphics in it that Stephan sent me above, and now I see why some of you would like something faster! I will ask in the dev forum if there is any possibility of an overload or something that would support that - all we really want for this functionality is something like the first 10 pages and (maybe) last 5 of a book. I'm not optimistic they will consider this plugin worth the effort if it is anything but trivial to support it but you don't know unless you ask...

kiwidude
03-29-2011, 07:58 PM
@Stephan - Thx for sending me the PDF/EPUB. The EPUB returns an ISBN with the changes made by DrMerry - it is because the ISBN has spaces in it rather than dashes. So my next release will support that for you.

The PDF I don't know why it failed as yet (DrMerry's additions did not find that ISBN either). I'll post back here when I figure out why...

EDIT: Ok, the reason the PDF you sent me failed is as per my first response on this: that particular part of the PDF is being seen by Calibre as an image, not as parseable text. You can see this if you do a conversion to EPUB in Calibre; the first pages of this at least are converted to images.

I also discovered that it is the cover page that causes the Calibre conversion to run so slowly in this instance (I stripped off the leading 5 pages into a new PDF). I will raise a ticket on the bug tracker and see if someone can take a look, perhaps they can both explain why the ISBN page is being treated as an image (when the text is selectable in a PDF reader) and maybe an optimisation to not take so long on the cover page.

user_none
03-29-2011, 09:13 PM
the first pages of this at least are converted to images.

The current PDF engine does not support text under images. This is why you can select it but it's coming through as an image.

I also discovered that it is the cover page that causes the Calibre conversion to run so slowly in this instance

calibre uses pdftohtml to turn the PDF into HTML which it then cleans up and converts. There are easily 100 special processing rules to clean up the HTML from pdftohtml. They are all regular expression based. Most likely those pages are producing very complex and messy output which is causing a large number of rules to be run.

No one has any desire to fix the existing engine. A new PDF engine is in the works but development has stalled. Finishing the new engine would be a better time investment than trying to further work around pdftohtml issues.

drMerry
03-30-2011, 01:35 AM
Thx for the extra info, I had edited my post while you were typing obviously :)

I claim no credit for the regex or that loop of code applying the matches - as per my very first post I took this code from bazbar's script that people were using. I figured they had in turn built it based on all of the earlier versions so just assumed it was "proven". Since you have questioned it I will take a look.

Some new hobby time to spend :D But if you have questions, please mail me, you got my address.

4 new problems occurred:
1. I could not read the ISBN in this case (# stands for a \d, not x). This is the complete line, so the last number is followed by a linefeed.
ISBN 978 ## ### #### #
2. I've seen some really strange ISBNs. I don't know if you even want to support this, but in a few cases the check digit was a letter (not x). The strangest thing is that where the digit should be 4, they used D. As a programmer you know about the problem of starting with 0 or 1. I do not know how often this type of ISBN is used. In my collection of 3000+ books I have only seen it 4 times, but at the moment I have 2000 books without an ISBN (a lot are from Project Gutenberg, so they will never get one).
3. I've also got some books with only an ISBN number and no indication that it is an ISBN. In all these cases the ISBN is on one of the first 3 pages or on one of the last 2. This would be a moment to use the regex without match 0. I could think of a (dropdown/submenu) option to use the regex this way. By default you use the standard way as it is / will be now. If you still get a "no ISBN" error, you could use the second option (in this case it should not update existing ISBNs because it is less certain the number is an ISBN, or you would have to check the ISBN validity before inserting it).
4. I have some ISBNs that were not converted very well by OCR, so I got these letters or signs in place of digits. To correct this, you have to check the validity of the ISBN after replacing the letters, which makes for a much slower script. I would like to have it, but I will build it myself if I turn out to be the only user ;).
Signs i got:
i I (i) ! l (L) | (or) { } (all for the number 1)
o O B 8 (for a bad printed zero but also for a 8)
b (for a 6 or 10)
d (for a 01)

So some new regex to be made. But the question in this case is what is the best way. To get all mentioned cases into isbn, the plugin will be slower (in bad cases) but it is possible the parser will get isbn-numbers in books that do not have an isbn-number mentioned (I think there will be cases...)

As for your question about searching from the end of the book. I assume this is a performance thing - and my answer remains the same as previously. I am at the mercy of the current implementation of the Calibre input converters. They do not stream the results to me, I cannot control their direction. I give them a path, and when they are "done" they give me a bunch of stuff representing the converted EPUB back.

If it is possible and only a matter of performance, you could possibly add it as a submenu option. It should not be the default, as that would be a performance loss, but if someone wants it, he/she could use it.

Thank you for your time and if you want (even more ;)) info, just reply here or mail me.

kiwidude
03-30-2011, 04:17 AM
@user_none - thanks for taking the time to take a look and explain. I really know zero about PDF conversion so appreciate the information you have given here and on the ticket. I fully expected that if it was anything other than a trivial change there would not be any interest in making a change to the code. I will try Kovid's suggestion of taking a look at the reflow.py stuff for processing PDF files.

@drMerry - to be honest I don't have a massive interest in trying to support really badly OCR'd documents. I would much rather support the majority of what users are after which is an ISBN from valid documents (that they are less likely to be binning!). As performance is already an issue I don't intend to compound that.

I have a suggestion from Kovid about some alternative code in Calibre to use for handling PDFs (part of the new in progress PDF engine) so I will give that a go and see what options if any I could introduce around it. I haven't looked at it yet but from what Kovid mentioned scanning the first 10 pages only will be easy to support, however I would guess scanning the last few might not be possible without scanning the whole thing. Still, at least offering that as a config option could help performance for the majority of docs where ISBN is at the front.

In terms of your case (1) above of the ISBN immediately followed by a linefeed. If you can PM me a link to where I can download the doc I will give it a spin once I have changed the PDF handling code and see if we can handle that case.

drMerry
03-30-2011, 04:21 AM
@kiwidude

I made a mistake in my regex.
The new regex will not find ISBN numbers with a length of 10 without any spaces, dots or dashes. This is because it checks 2 groups now, 10-24 positions and 1 last position. This must be 9-24.

After I realized from your post that the regex was used as a proof of concept, I thought about some optimization and came to this conclusion (will test it tonight at home):

isbn will always start with 978 or 979 and will have 10 or 13 digits (or 1 less and x at the end)
http://www.isbn-international.org/faqs/view/5#q_5

So you do not have to test for words like isbn or something like that. Every extra test will consume exponential time (you have to test for isbn AND isbn: AND 1sbn and.....)

So I thought of this implementation:
it will search for digits (not 0-9, this is 10 tests, a digit is 1 test)
it will search for whitespace characters (\s), so spaces are counted but tabs are too
quick search: (97[89](\d{6}|\d{9})[\dxX]) use match(0)
optimal search (97[89][\d\s\-\.]{6,24}[\dxX]) use match(0)
extended search: like optimal, but you also add some of the mentioned characters to get more ISBN numbers. This is a heavy implementation because you have to replace the characters and afterwards have to test whether you got a real ISBN, otherwise you still have to report it as not found.

To be more sure you get isbn-numbers and not phone numbers, you can add some extra info like it may not be prefixed with the word tel or + or 0
But I think this is not needed as far as I know. I did a quick test with implementation 1 and 2 on a textfile (not with calibre).
Both processed the file in a fraction of time compared to the original regex, and I got more numbers than in the original case.
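
For anyone wanting to try the two patterns above outside calibre, here is a small Python sketch. One thing it shows: in Python's `re`, alternatives are tried left to right, so the quick pattern as written stops after 10 characters even when 13 digits are present. `QUICK_LONGEST_FIRST` is a suggested variant introduced here, not part of the original post.

```python
import re

# The two patterns as posted (non-capturing group used for the alternation).
QUICK = re.compile(r'97[89](?:\d{6}|\d{9})[\dxX]')
OPTIMAL = re.compile(r'97[89][\d\s\-.]{6,24}[\dxX]')

# Assumed variant: put the longer alternative first so 13 digits win.
QUICK_LONGEST_FIRST = re.compile(r'97[89](?:\d{9}|\d{6})[\dxX]')

print(QUICK.search('9780306406157').group(0))
print(QUICK_LONGEST_FIRST.search('9780306406157').group(0))
print(OPTIMAL.search('ISBN 978 0 306 40615 7 more text').group(0))
```

Because the optimal pattern is greedy, it can also swallow digits that follow the ISBN on the same line, so each match still needs its separators removed and its check digit validated.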

I post this on the forum because I hope there are people who can think of a (regular) case where my idea would fail but where it would work if you just tested whether the word ISBN was present as a prefix.

drMerry
03-30-2011, 04:24 AM
Again we were working at the same time in this thread

I can see your opinion about bad ocr. That is a valid opinion (of course, opinions are always valid...)
May be it is an idea for another plugin. I will think of that in the (near) future.

kiwidude
03-30-2011, 04:44 AM
@drMerry - yeah we crossed posts again. I have no objection at all to changing the regex or matching algorithm if you have a better one - as I said above I did not write that part of the code nor have I ever investigated all the variations. I just wanted to offer an easy to use wrapper around something that people could use within Calibre rather than running external scripts etc.

So if you are willing to do the investigation and can come up with an improved version that isn't noticeably slower than the existing one then by all means please send it to me. PM me your email address if you like and we can swap info there. My main concern would be false positives as you say from telephone numbers or similar. If you come up with something that you are confident will not suffer from that issue then I'm sure everyone would be grateful for your effort.

theducks
03-30-2011, 11:53 AM
isbn will always start with 978 or 979 and will have 10 or 13 digits (or 1 less and x at the end)
http://www.isbn-international.org/faqs/view/5#q_5
Note:
ISBN-13 can not have an 'X' (EAN13 is digits only), that only applies to ISBN-10
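
The difference is easy to see in the check digit arithmetic. The sketch below uses the standard ISBN-10 (mod 11, where X stands for 10) and ISBN-13 (EAN-13, mod 10) rules; the function names are made up for this example and it expects input with separators already removed.

```python
DIGITS = '0123456789'


def is_valid_isbn10(s):
    """ISBN-10: weights 10..1, sum divisible by 11; 'X' (=10) only at the end."""
    if len(s) != 10 or any(c not in DIGITS for c in s[:9]):
        return False
    if s[9] not in DIGITS + 'Xx':
        return False
    values = [int(c) for c in s[:9]] + [10 if s[9] in 'Xx' else int(s[9])]
    return sum((10 - i) * v for i, v in enumerate(values)) % 11 == 0


def is_valid_isbn13(s):
    """ISBN-13: digits only, alternating weights 1 and 3, sum divisible by 10."""
    if len(s) != 13 or any(c not in DIGITS for c in s):
        return False
    return sum((3 if i % 2 else 1) * int(c) for i, c in enumerate(s)) % 10 == 0


print(is_valid_isbn10('080442957X'), is_valid_isbn13('9780306406157'))
```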

ldolse
03-30-2011, 09:41 PM
My main concern would be false positives as you say from telephone numbers or similar. If you come up with something that you are confident will not suffer from that issue then I'm sure everyone would be grateful for your effort.

There is a 'check_isbn' function that is already in use in the various calibre metadata plugins that do some validation on whether a specific string of numbers is truly an ISBN vs a random string of numbers like a phone number. These get used before the metadata plugins send an ISBN to a metadata provider, but they should be good for this too.

from calibre.ebooks.metadata import check_isbn
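
For anyone wondering what that kind of validation involves, here is a minimal standalone sketch of the standard ISBN-10 and ISBN-13 check-digit arithmetic. It is not calibre's check_isbn source; the names is_valid_isbn10 and is_valid_isbn13 are made up for illustration, and it assumes separators have already been stripped.

```python
DIGITS = '0123456789'


def is_valid_isbn10(isbn):
    """Return True if a 10-character string passes the ISBN-10 check."""
    if len(isbn) != 10:
        return False
    total = 0
    for position, char in enumerate(isbn):
        if char in DIGITS:
            value = int(char)
        elif char in 'xX' and position == 9:
            value = 10  # 'X' means 10 and is only allowed as the check digit
        else:
            return False
        # Weights run 10, 9, ..., 1 from left to right
        total += (10 - position) * value
    return total % 11 == 0


def is_valid_isbn13(isbn):
    """Return True if a 13-digit string passes the ISBN-13 (EAN-13) check."""
    if len(isbn) != 13 or any(char not in DIGITS for char in isbn):
        return False  # no 'X' possible here, digits only
    # Weights alternate 1, 3, 1, 3, ... from left to right
    total = sum(int(char) * (1 if index % 2 == 0 else 3)
                for index, char in enumerate(isbn))
    return total % 10 == 0
```

A phone-like number such as 5551234567 fails the ISBN-10 sum, which is why the checksum filters out most random digit strings. It is not a guarantee though: roughly one in eleven random 10-digit strings will still pass.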

kiwidude
03-31-2011, 03:49 AM
@ldolse - thx, yes I do indeed already make use of that in the plugin, so if that should be a sufficient failsafe then that is good news :)

drMerry
03-31-2011, 08:16 AM
An update.

check_isbn is indeed used and works well, I see.
I have made this version.
It works 2 times faster than the original.
I scanned 600 epubs that had no ISBN set (I did not check whether there was an ISBN inside them).
I got 100 new ISBNs.

Seems nice, BUT:
I had 2 numbers that are valid but are not ISBNs.
There were ISBNs in the file. The numbers I found were there because of a bad epub conversion.

You cannot use \d. You have to use 0-9, because with \d calibre freezes on some files.
I have some trouble with multi-line matches.

I can detect:

NUR 123
ISBN 1234567890

and

NUR 123
ISBN 123 456.78
9

0

and

123 456.789

0

but NOT
NUR 123
1234567890

In this case 1231234567 is returned as a possible ISBN and found to be invalid.
(EDIT: added the 7. Of course I do not get 213123456..)

Maybe someone can find a solution?

I built in some restrictions to avoid some problems.
13 or 10 0's is a valid ISBN, but you don't want to extract that.
I also test whether 13-digit numbers start with 978 or 979. If not, I do not even test validity.

I'm bad at keeping a changelog, but I made some notes:
I changed extract_isbn_code
Added strings at the top of the file
Changed the regex
Changed look_for_isbn_in_text

I'm not a Python programmer, so if someone knows a better way to do the txt.replace (stripping all whitespace, including \n and \r, and removing - and .), please say so.

On the other hand, I have sometimes put an ISBN including - into the metadata and calibre updated the info itself, so maybe only \n and \r need to be removed?
(In this case you don't even have to (and can't) test for 10 / 13 digits, so it should go even faster.)

I also included a pdf with legal isbn-ranges. If you add this check, next to the validity check, you're 99.99999% sure it is an ISBN-number
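
As one possible answer to the txt.replace question, here is a small sketch that strips the separators with a single regular expression and then applies the restrictions described above before any checksum test. The names normalise and worth_validating are hypothetical and this is not the code from the attached version.

```python
import re

# Whitespace (including \n and \r), dots and dashes, in one character class
SEPARATORS = re.compile(r'[\s.\-]+')


def normalise(candidate):
    """Remove separators and upper-case a trailing 'x' check digit."""
    return SEPARATORS.sub('', candidate).upper()


def worth_validating(isbn, prefixes=('978', '979')):
    """Cheap pre-checks to run before the real checksum validation."""
    if len(isbn) not in (10, 13):
        return False
    if set(isbn) == {'0'}:
        return False  # all zeros passes the checksum but is never wanted
    if len(isbn) == 13 and not isbn.startswith(prefixes):
        return False  # 13-digit candidates must carry a known prefix
    return True
```

Stripping newlines this way is presumably also why a line such as "NUR 123" followed by a bare number on the next line runs together into one longer digit string, so a word boundary or an explicit ISBN prefix is still needed to tell the two apart.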

drMerry
03-31-2011, 10:11 AM
I just tested some new ebooks.
PDF is still extremely slow.
The PDF slowness is because of the pdftohtml process. On all my PCs it uses 50% of my CPU (1 complete core). Maybe a bug in calibre?

There will be more errors if you try to index a math book or a technical manual (because of the large number of large numbers).
But that will be a problem for a minority of users (including me).
Maybe you can add an option to only check numbers with an ISBN notation in front (as it is at this moment).

theducks
03-31-2011, 11:17 AM
Some of my really old Dead Tree™ books had SAN or ISBN's without the check digit (9 chars long) (I think there may have been confusion at the time, that the check digit was not part of the ISBN to be printed)

I backed in the check digit by trying [0-9X] until Calibre gave me a Green :thumbsup: ISBN-10 confirmation.
Note: (USA publishers) space or dash were the only separators I saw. never mixed. never interrupted with other characters.
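
The trial-and-error step can also be done by arithmetic. This is a minimal sketch of the standard ISBN-10 check-digit calculation for a 9-digit number; the function name isbn10_check_digit is made up for illustration and is not part of calibre or the plugin.

```python
def isbn10_check_digit(first_nine):
    """Return the check character ('0'-'9' or 'X') for nine ISBN-10 digits."""
    if len(first_nine) != 9 or any(c not in '0123456789' for c in first_nine):
        raise ValueError('expected exactly nine digits')
    # Weights 10 down to 2 for the nine known digits; the check digit has weight 1
    total = sum((10 - index) * int(char) for index, char in enumerate(first_nine))
    # Pick the value that makes the full weighted sum divisible by 11
    check = (11 - total % 11) % 11
    return 'X' if check == 10 else str(check)
```

Appending the returned character to the nine digits gives a number that passes the ISBN-10 check, which is the same result as trying [0-9X] until one is accepted.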

kiwidude
03-31-2011, 11:37 AM
drMerry - I will take a look at your changes once 0.7.53 goes out. Kovid has made some changes which will allow scanning pdfs with the new pdf engine for just a selected number of pages from the front and back. We will have to see whether that significantly improves the performance or not. My initial testing of scanning the whole document actually found that the new engine is currently slower than the existing one, but that should change when not doing the whole document hopefully.

In terms of options to check numbers on technical manuals etc, I don't see where the issue will be. You say there will be more "errors" - do you mean more matches that are rejected? The logic I have will remain the same in terms of stopping searching after finding a valid ISBN. So surely the only issue will be for a manual that does not have an ISBN but does have lots of numbers in it will run slower? If I am able to somehow try to only scan a small front/back portion of all books (not just pdf ones) that shouldn't be an issue. I will look into that.

As for all the variations of ISBN being split across lines - I will be honest with my selfishness and repeat my statement above that I really don't care if there are really badly scanned documents that this fails to pickup an ISBN from. It is just a tool, not a miracle worker :). If your ISBNs are so badly formatted the rest of the content of that document will surely also be dire - not getting an ISBN may force you to open it and see for yourself and perhaps either decide to look for a better copy or edit it.

I don't want to have a whole bunch of options on this plugin, it is why I have resisted putting a menu onto it as there are too many permutations. I think of how I see people using it - they will give it a one click shot at trying to find an ISBN, and after that they will use a metadata download type lookup based on title/author matching. I really don't see them wasting a lot of time bothering making multiple attempts on the same book using different options? If it fails and they believe there "really must" be an ISBN in there, they will view the book and type it in if it means that much to them (which they will have to do for any graphical based PDFs anyways).

However that is just my opinion on how I see people using it. :) If it handles 98% of the book ISBNs out there that is still an improvement over not having it.

drMerry
03-31-2011, 12:39 PM
drMerry - I will take a look at your changes once 0.7.53 goes out. Kovid has made some changes which will allow scanning pdfs with the new pdf engine for just a selected number of pages from the front and back. We will have to see whether that significantly improves the performance or not. My initial testing of scanning the whole document actually found that the new engine is currently slower than the existing one, but that should change when not doing the whole document hopefully.
I think so.
I had a PDF of 700 pages, 163 MB.
It took more than half an hour to learn that your tagger (also with my regex) could not find an ISBN.

In terms of options to check numbers on technical manuals etc, I don't see where the issue will be. You say there will be more "errors" - do you mean more matches that are rejected? The logic I have will remain the same in terms of stopping searching after finding a valid ISBN. So surely the only issue will be for a manual that does not have an ISBN but does have lots of numbers in it will run slower? If I am able to somehow try to only scan a small front/back portion of all books (not just pdf ones) that shouldn't be an issue. I will look into that.
I mean that a book with a lot of numbers has more chance of containing a number that conforms to the ISBN standard. So this could give a false positive.

As for all the variations of ISBN being split across lines - I will be honest with my selfishness and repeat my statement above that I really don't care if there are really badly scanned documents that this fails to pickup an ISBN from. It is just a tool, not a miracle worker :). If your ISBNs are so badly formatted the rest of the content of that document will surely also be dire - not getting an ISBN may force you to open it and see for yourself and perhaps either decide to look for a better copy or edit it.
I often see PDF files with the ISBN split across lines on the front page (because the OCR cannot handle the front page/picture). The rest of the document is good in this case.
This is of course an OCR error and I can understand you do not want to invest in bad OCR. Because I've seen it often in books with the ISBN on the front cover, I myself would add the newline option. Testing ISBN numbers and trying to recover a good ISBN out of iop830l|Ix would be something else.
On the other hand, if I do not add the \s in the regex, I cannot retrieve ISBN numbers with the last digit right before a linefeed.

@your opinion about 98% and a lot of sub-options:
agree

drMerry
03-31-2011, 12:42 PM
Some of my really old Dead Tree™ books had SAN or ISBN's without the check digit (9 chars long) (I think there may have been confusion at the time, that the check digit was not part of the ISBN to be printed)
That's a nasty one. If this is the case for a lot of books of this period, it would be a drawback.

I backed in the check digit by trying [0-9X] until Calibre gave me a Green :thumbsup: ISBN-10 confirmation.
Note: (USA publishers) space or dash were the only separators I saw. never mixed. never interrupted with other characters.

I can confirm I also have never seen it mixed. You have not seen dots between them?

theducks
03-31-2011, 02:08 PM
You have not seen dots between them?
You really expect me to remember :smack: a possible 1 or 2 out of 900+ :rofl:

All I remember, was my attempt to Validate ISBN's in my collection Database (Paradox DOS), returned inconsistent results :eek: (90% passed)

Checking the (failed) entries against the printed books showed it was not a 'fat fingering' problem :chinscratch:

I vaguely remember some non-US published books having more than 1 ('Country specific') ISBN on the copyright page.

drMerry
03-31-2011, 06:37 PM
You really expect me to remember :smack: a possible 1 or 2 out of 900+ :rofl:

You don't?
Not a real e-book reader then :p

All I remember, was my attempt to Validate ISBN's in my collection Database (Paradox DOS), returned inconsistent results :eek: (90% passed)
...
I vaguely remember some non-US published books having more than 1 ('Country specific') ISBN on the copyright page.
I myself am not sure whether I've seen dots.
Spaces and dashes for sure.
Every newly added character will slow down the process a bit (noticeable on a large number of pages to be scanned).

But I think for speeding up the process we will have to wait for the mentioned replacement of pdftohtml.

Loeffel
04-02-2011, 09:36 AM
Hi,
normally the extraction runs fine, but if I try to scan many ebooks at once with the auto feature, the plugin hangs and the only way to go on is to kill Calibre (I don't know at which number of books).
At first I thought it was OK, but then I saw that whenever this happens the plugin doesn't go on; it stops at the first ebook. If I scan only this book or just a few, no problem.
I can scan about 300 books at once without problems.

kiwidude
04-02-2011, 09:45 AM
@Loeffel - I would be surprised if it is "hanging", I think it more likely you are hitting some large PDFs that it is taking a long time to analyse. If you run the plugin in debug mode (Ctrl+Shift+R) you should see it continuing to display output as the input converters do their thing.

Loeffel
04-02-2011, 11:35 AM
I have no real big ebooks, but some in different formats perhaps that's the problem. Is there any way to tell the plugin just to search the first format found?

kiwidude
04-02-2011, 11:58 AM
Go to the customisation for the plugin, and you can set its behaviour. However I think by default the alternate search is set to only check the first format in preferred input order.

It doesn't necessarily have to be a massive PDF, but just PDFs in general will slow it down, by how much depends on the content I think moreso than size. If it has lots of graphics I think that makes it grind rather slowly. There's a few posts in this thread about it if you read back. Now 0.7.53 is out I can start experimenting more seriously with the "first 10 pages/last 5 pages" approach to scanning which hopefully will improve things.

Loeffel
04-02-2011, 08:39 PM
I found it. I will let it run while I'm sleeping. I will see what happened when I come back. If it just looks like or if it really f...s up.

Loeffel
04-03-2011, 04:10 AM
Ok, it just needs that long and the computer says Calibre is not responding and the counter stays on 1.
I just saw that there are books that have an ISBN (like 3-442-04.273-9) but it wasn't found, but in the text are two numbers:

Ungekürzte Ausgabe • Made in Germany Đ 1973 by Sara Woods. Aus dem Englischen übertragen von Tony Wester-mayr. Alle Rechte, auch die der fotomechanischen Wiedergabe, vorbehalten. Jeder Nachdruck bedarf der Genehmigung des Verlages. Umschlag: Foto von Gilles Lagarde. Gesetzt aus der Linotype-Garamond-Antiqua.
Druck: Presse-Druck Augsburg. K 888/KR1MI 4273 • Sch.’Hu Gebundene Ausgabe ISBN 3-442-25.888-X
Taschenbuchausgabe ISBN 3-442-04.273-9

Is that the reason why the plugin doesn't come up with an ISBN number? I can search for it but it always states there is none in the text.

kiwidude
04-03-2011, 04:31 AM
@Loeffel - the reason why it doesn't get that ISBN number is because it contains a mixture of dashes and periods. Previous posts have questioned whether people had seen this situation - obviously you have now found such a case (most likely given your pasted text due to it being a European edition of a book).
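
To illustrate the mixed-separator case with the numbers from the pasted text, here is a hedged sketch of a pattern that allows a space, dot or dash after any digit. It is not the regex used by the plugin; ISBN10_MIXED and find_isbn10_candidates are hypothetical names, and the results are only candidates that still need a checksum test.

```python
import re

# Nine digits, each optionally followed by ONE separator (space, dot or dash),
# then a final digit or X. The lookarounds stop a match from starting or
# ending in the middle of a longer run of digits.
ISBN10_MIXED = re.compile(r'(?<![0-9])(?:[0-9][ .\-]?){9}[0-9Xx](?![0-9])')


def find_isbn10_candidates(text):
    """Return separator-free ISBN-10 candidates in the order they appear."""
    candidates = []
    for match in ISBN10_MIXED.finditer(text):
        stripped = re.sub(r'[ .\-]', '', match.group(0)).upper()
        candidates.append(stripped)
    return candidates
```

Run over the two lines "Gebundene Ausgabe ISBN 3-442-25.888-X" and "Taschenbuchausgabe ISBN 3-442-04.273-9", this yields 344225888X and 3442042739, both of which pass the ISBN-10 check.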

Your "counter stays on 1" comment - do you mean it only finds one ISBN on all your books? In which case yes this is due to the regex.

As I have posted several times I took the regex used by this plugin from someone else's extract ISBN script on the assumption that it was an evolution of many people's attempts over time before it. Clearly it was not as "proven" as we would like given these variations from yourself and drMerry. I'll take a fresh look at that part of it over the next few days, including what drMerry has been experimenting with.

What would be extremely useful is a list of test case ISBNs of variations people have seen - if people could please post these (either stick them in a text file attachment or just post directly in the thread, either will be fine). That way I can make sure the next implementation will cater for your examples.

theducks
04-03-2011, 11:04 AM
A Mixture of dashes and dots or spaces was not in the spec I read a long time ago. Any single method was permitted.

Language-Publisher-Book_number-Check_digit

where Language and Check_digit are single characters and the others add up to 8 characters.
Are they trying to sneak in the publishers 'Imprint' encoding into the Book_number?

kiwidude
04-03-2011, 03:41 PM
Firstly, thanks to drMerry for the suggestions and testing in this thread. It has become obvious from several of you that the original regex used in this plugin was extremely conservative. For this release I have used a variant of what drMerry proposed (no longer looking for textual prefixes like ISBN) which significantly increases the match rate.

I have also replaced the PDF processing with something that is many orders of magnitude faster, by only scanning the first 10 and last 5 pages of a PDF.

Changes in v1.2:

Rewritten for new plugin infrastructure in Calibre 0.7.53
ISBN matching regex replaced
PDFs now processed with new Calibre PDF engine to scan just first 10 and last 5 pages


See the attached text document for my test cases. Note that this release still makes no attempts to catch bad OCR scans (e.g. O instead of 0, I instead of 1 etc). It also will not match numbers split across multiple lines, or text underneath graphics. I have also not as yet optimised scanning non PDF formats.

It should however run significantly faster for PDFs and give you more matches than previously.

Loeffel
04-03-2011, 06:38 PM
I will have a look for other ISBN types different from this I've posted and those in the textfile.

What I meant with the 1 is that if I scan only a few books for an ISBN then the number shows which book is being scanned. If there is a great number to scan it will show 1 until the scan is finished.
I can say exactly how many books have only 1 format. Just 4:
- 1 empty entry (placeholder for a book that will be published in may and which I've already bought)
- 3 dictionaries which I exclude from all conversion and such things as they are large and nothing to be found in them at all

All other books have a minimum of two formats (epub and mobi)

But nevertheless this is a good plugin. I have some suggestions for it, but I need a little bit to write down, what I mean.

ldolse
04-04-2011, 12:46 AM
The new pdf functionality seems to be broken, get this for any pdf:
calibre, version 0.7.53
Extract ISBN complete: Selected 1 books
Found 0 ISBN values
Updated 0 books

See details for more information

[book title here] - ERROR: (<type 'exceptions.TypeError'>, TypeError('function takes exactly 1 argument (3 given)',))

kiwidude
04-04-2011, 04:08 AM
@ldolse - you will need to do a reinstall of the binaries for the 0.7.53 release (I believe you are running from source). Sorry for not mentioning that. Kovid recompiled the reflow C++ app.

ldolse
04-04-2011, 05:27 AM
Ouch - the first pdf I tried caused a segfault (nothing to do with your plugin, more a poppler issue).

Fortunately the other pdfs I've tried have all been ok. Will open a bug on the problem pdf.

kiwidude
04-04-2011, 05:32 AM
Glad you are up and (mostly) running now.

Indeed by taking this approach with this plugin and using the "still work in progress" new PDF engine there was always going to be a risk of hitting some issues. Other than hitting the odd pdf that caused an immense amount of debug messages I hadn't had any crashes like you got but then I haven't exactly thrashed it.

Still, I'm sure Kovid will be "delighted" to find and fix them :)

drMerry
04-05-2011, 02:33 PM
Great performance boost.
Real nice. Faster and more success.
Thanks.

But, 1 problem with the new pdf-parser: Calibre crashes (in my case they were all bigger pdf-files, 10 - 170 MB).

Probleemhandtekening:
Gebeurtenisnaam van probleem: APPCRASH
Naam van de toepassing: calibre.exe
Versie van toepassing: 0.7.53.0
Tijdstempel van toepassing: 4d961400
Naam van foutmodule: pdfreflow.pyd
Versie van foutmodule: 0.0.0.0
Tijdstempel van foutmodule: 4d9613dc
Uitzonderingscode: c0000005
Uitzonderingsmarge: 00005e18
Versie van besturingssysteem: 6.1.7601.2.1.0.256.1
Landinstelling-id: 1043
Aanvullende informatie 1: 0a9e
Aanvullende informatie 2: 0a9e372d3b4ad19135b953a78882e789
Aanvullende informatie 3: 0a9e
Aanvullende informatie 4: 0a9e372d3b4ad19135b953a78882e789

One other question. Is it possible to show progress in an option pane?
At this moment, Calibre is just waiting for the plugin to stop. I cannot (really) use calibre during scans.
If it were in an option pane, I could hit cancel (or run in the background to get the current behaviour, if that is useful for others).

But, thanks again for your work.

kiwidude
04-05-2011, 03:18 PM
@drMerry

I would suggest (unless Kovid says otherwise) that any issues with the new PDF engine you put on the bug tracker, attaching the pdf for Kovid to take a look at. It sounds like the new PDF engine isn't being actively developed right now, but at least Kovid would be able to replicate the issue whenever he does next work on it.

As for running the scan in the background, that isn't going to happen anytime soon I'm afraid. The only background processing mechanism I have seen in Calibre is the "Jobs" stuff that gets used for when you convert books. However the risk with it (and possibly the same reason why stuff like download metadata doesn't run in the background) is that you have all the concurrency issues of two different things updating the same book record at the same time. I don't know if you have ever noticed but it is possible to lose your newly converted book from a job if you happen to be editing metadata for the same book at the same time the job completes.

Now maybe this is something Kovid plans to address in future, such as with some sort of optimistic or pessimistic locking mechanism which would prevent you editing the same book a job was running for. If he does, then I am sure I could look into revisiting it. Right now, I don't want to run the risk of any database corruption by a user being allowed to edit a book manually while the ISBN is being updated in the background.

ldolse
04-05-2011, 07:08 PM
That crash might already be fixed, I'd suggest waiting til the next release and checking again - Kovid already fixed the crash I mentioned a few days ago, it may be the same crash. I can't tell by your error log for sure if it's the same, as it's a different language/OS, but there is a good chance it's the same.

drMerry
04-08-2011, 04:19 AM
@drMerry
As for running the scan in the background, that isn't going to happen anytime soon I'm afraid. The only background processing mechanism I have seen in Calibre is the "Jobs" stuff that gets used for when you convert books. However the risk with it (and possibly the same reason why stuff like download metadata doesn't run in the background) is that you have all the concurrency issues of two different things updating the same book record at the same time. I don't know if you have ever noticed but it is possible to lose your newly converted book from a job if you happen to be editing metadata for the same book at the same time the job completes.

Is it possible to add a dialog box to the process so you could tell calibre to stop the process (directly or after the check of the current pdf is completed)?

I once selected all my books and started the plugin by mistake.
I can tell you, it takes a long time to check all 3250+ books (stored on a network drive in use by other processes, and with the old pdf engine causing all my pdf files to be parsed completely; books of 1400+ pages are no rarity in my lib).

So an abort option would be welcome.

kiwidude
04-08-2011, 04:31 AM
It already has an abort option for interactive usage (i.e. if you are being asked which format to scan when there are multiple).

When it is operating non-interactively there currently is no way to stop it other than killing Calibre. As this is not the sort of thing you would be running repeatedly on your whole book collection I figured I could get away with it for a while.

Putting a dialog up and running the scan in the background is obviously possible, it just involves a lot more development. And some threading, something which is fraught with potential to go horribly wrong in Python/Qt if done badly.

It's on the future wishlist to take a look at.

kiwidude
04-09-2011, 08:50 PM
Changes in this release:

Support skinning of icons by putting them in a plugin name subfolder of local resources/images

drMerry
04-11-2011, 01:46 PM
Hi,

Something new. (I'm not stalking to tear down your product, but just because I really love it!!).
I've got a lot of scientific papers. These papers often do have a last chapter "Recommended further readings" And yes, with ISBN.

In this case there are 3 options:
1. ISBN of document is not available
2. ISBN of document is on first page(s)
3. ISBN is at the end of document (after further readings)
(I've not seen ISBN before further readings.)

Is it possible to change the behavior to look at the first x pages top down
and at the last x pages bottom up?
Or would this decrease speed a lot?

kiwidude
04-11-2011, 01:51 PM
Are these PDFs or other types of documents?

The plugin already does look at the first 10 and last 5 pages of PDFs (and the entire document for any other format).

The only thing that is different that you are asking for is for the last x pages check to work backwards. The question is why - what would you hope to achieve? It certainly wouldn't gain much speed in PDFs.

drMerry
04-11-2011, 05:44 PM
My problem is that in scientific documents / books (mostly PDF) the last chapter looks something like this:

Recommended further readings
Checking ISBN in pdf - K.I. Wi Dude. ISBN:1234567891234

EPub and PDF Pro's and cons. - dr. Merry ISBN: 4321987654321


About the author
Robin Hood is a financial expert.

Calibre Publishing 2011
ISBN: 9786453210235
So as you can see, the first 2 ISBN-numbers are not related to the current document. The last one is.
This is often seen in scientific documents.
The real ISBN is on one of the first pages,
After the further reading section
or not in the document at all.

kiwidude
04-11-2011, 06:17 PM
Ok, thanks, I understand now.

I'll put it on the list to make a change to cater for this for PDF documents, which is a fairly trivial change. But not for other format types as yet (as they currently have no concept of "pages").
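
For illustration only, the scan order being discussed (first pages top down, then last pages bottom up) could be sketched as follows. The function pdf_scan_order and its parameters are hypothetical; the defaults of 10 and 5 are simply the page counts mentioned in this thread, and page numbers are 1-based.

```python
def pdf_scan_order(page_count, front=10, back=5):
    """Return page numbers to scan: front pages ascending, back pages descending."""
    front_pages = list(range(1, min(front, page_count) + 1))
    # Never revisit a page already covered by the front scan
    first_back = max(page_count - back + 1, len(front_pages) + 1)
    back_pages = list(range(page_count, first_back - 1, -1))
    return front_pages + back_pages
```

With the "Recommended further readings" layout above, scanning the back pages in reverse means the publisher's own ISBN on the final page is reached before the ISBNs of the recommended books.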

olandese
04-13-2011, 02:06 PM
After the plugin sets the ISBN i am not able anymore to open the book, i have to restart Calibre and then i can open the book again. I am using Calibre 0.7.54

kiwidude
04-13-2011, 02:23 PM
Hi olandese, welcome to MobileRead.

That is a new behaviour and to be honest I can't think how it could be related to the plugin.

What do you mean "not able" to open the book - exactly what happens?
What action are you doing to open the book?
If you run Calibre in debug mode using ctrl+shift+r can you post any messages that appear when you try to open a book in this scenario.
What book format(s) does this apply to that you are scanning/opening.
Does this apply to all books or just specific ones?

olandese
04-13-2011, 05:15 PM
Hi kiwidude!

I did a little more investigation and the problem occurs only with .chm files.
After the plugin runs and it sets the isbn number i try to open the file (double click on it) from calibre and i get the following error:


calibre, version 0.7.54
ERROR: Unhandled exception: <b>WindowsError</b>:[Error 1223] De bewerking is geannuleerd door de gebruiker: u"R:\\Calibre\\O'Reilly\\Perl 6 & Parrot Essentials 2nd (443)\\Perl 6 & Parrot Essentials 2nd - O'Reilly.chm"

Traceback (most recent call last):
File "site-packages\calibre\gui2\actions\view.py", line 156, in view_triggered
File "site-packages\calibre\gui2\actions\view.py", line 195, in _view_books
File "site-packages\calibre\gui2\actions\view.py", line 52, in view_format
File "site-packages\calibre\gui2\actions\view.py", line 87, in _view_file
File "site-packages\calibre\gui2\actions\view.py", line 78, in _launch_viewer
File "site-packages\calibre\gui2\__init__.py", line 628, in open_local_file
WindowsError: [Error 1223] De bewerking is geannuleerd door de gebruiker: u"R:\\Calibre\\O'Reilly\\Perl 6 & Parrot Essentials 2nd (443)\\Perl 6 & Parrot Essentials 2nd - O'Reilly.chm"

It seems to be that the file is still in use.

kiwidude
04-13-2011, 05:29 PM
Hi olandese,

Thanks for the details. It sounds like something Kovid might have interest in on the Calibre bug tracker. My plugin just calls Calibre code to try to read the ISBN, and I would suspect that in the case of CHM files that input converter isn't releasing resources properly somehow.

I'll confirm it isn't anything to do with the plugin itself and put it on the bug tracker for you if you like.

kiwidude
04-13-2011, 05:39 PM
By the way folks I have found a few "false positive" situations in using this plugin which I don't foresee being able to do anything about.

Here are a couple of examples I came across:

A Wrox book which in one of the leading pages has a list of other Wrox books with their ISBNs, before the actual page containing the ISBN for this book. So it picks up the ISBN for some other Wrox book in that list rather than the one you have. The filth of publishers advertising in their own books I'm afraid.
A book that had 2222222222 somewhere in the leading text. As it turns out by some coincidence that passes the valid ISBN check.


It is rare enough to not avoid using the plugin, but a reminder that this is just a tool trying to automate a human function of reading text and as such can sometimes get it wrong.

olandese
04-14-2011, 03:08 AM
Hi olandese,

Thanks for the details. It sounds like something Kovid might have interest in on the Calibre bug tracker. My plugin just calls Calibre code to try to read the ISBN, and I would suspect that in the case of CHM files that input converter isn't releasing resources properly somehow.

I'll confirm it isn't anything to do with the plugin itself and put it on the bug tracker for you if you like.

Yes please!

the plugin is also much faster with pdf than with chm files.

kiwidude
04-14-2011, 03:20 AM
Yes please!

the plugin is also much faster with pdf than with chm files.

From posts I have seen elsewhere I believe chm is a difficult format for Calibre to handle, and is best avoided in general if possible if you intend to read the book anywhere but on your pc. This plugin just calls the same code to read the book pages in calibre that doing a conversion would do, except for PDFs, for which it was easier to put an optimisation in. So any issue with performance I can do nothing about unless calibre is able to do its part faster.

kiwidude
04-14-2011, 07:44 AM
i did a little more investigation and the problem occours only with .chm files. ...

It seems to be that the file is still in use.

I've confirmed the issue and you are dead right, the chm input reader plugin is not releasing a file handle. I'll add a report to the bug tracker, I've isolated it down enough to prove a fix that "works" but Kovid or someone will need to put it in the right place. So it will require a Calibre release to get the plugin to allow you to open CHM files without restarts.

olandese
04-14-2011, 09:03 AM
I've confirmed the issue and you are dead right, the chm input reader plugin is not releasing a file handle. I'll add a report to the bug tracker, I've isolated it down enough to prove a fix that "works" but Kovid or someone will need to put it in the right place. So it will require a Calibre release to get the plugin to allow you to open CHM files without restarts.

Fine, i will wait for the next Calibre release :)

drMerry
04-14-2011, 03:06 PM
Well, all numbers made of 10 times the same digit pass the check.

I think you can add a check that rejects 10 times the same digit. The plugin will then report that no ISBN was found. That gives at most 10 numbers (not assigned at the moment) that could become false negatives, but there will be a lot fewer false positives.
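
The reason is simple arithmetic: the ISBN-10 weights 10 + 9 + ... + 1 add up to 55, so ten copies of a digit d sum to 55 * d, which is always divisible by 11. A small sketch of that, plus the suggested filter, follows. Both function names are made up for illustration, and the checksum helper handles digits only for brevity.

```python
def isbn10_checksum_ok(isbn):
    """Digits-only ISBN-10 checksum: weights 10 down to 1, sum divisible by 11."""
    weights = range(10, 0, -1)
    return sum(w * int(char) for w, char in zip(weights, isbn)) % 11 == 0


def is_repeated_digit(isbn):
    """True for strings such as 2222222222 that consist of one repeated character."""
    return len(isbn) > 1 and len(set(isbn)) == 1
```

Rejecting anything for which is_repeated_digit returns True, before or after the checksum test, removes the 2222222222 kind of match.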

The problem with the list of ISBN-numbers is a bit like my earlier question about a list at the end of the book.
There are 2 things you can do about this:
1. Let it be
2. Test if there are more ISBN-numbers in the part you're looking at, and then:
a. choose the first or last and insert it.
b. Tell the user more than one ISBN-number was found, so none is entered
c. Tell the user more than one ISBN-number was found, so they can choose the right one out of a list

But the whole of part 2 will slow down the process (extremely), because you have to go on testing even if you have already found one, and you have to create a list of numbers (even if you have only one number), which is slower than just 1 string.
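
A rough sketch of options 2b and 2c, under the assumption that the candidates have already been extracted and that some validator function is passed in. The name pick_isbn and its signature are hypothetical, not anything in the plugin.

```python
def pick_isbn(candidates, is_valid):
    """Return (chosen, valid): chosen is None unless exactly one distinct valid ISBN exists."""
    valid = []
    for candidate in candidates:
        # Keep each valid candidate once, preserving the order of appearance
        if is_valid(candidate) and candidate not in valid:
            valid.append(candidate)
    chosen = valid[0] if len(valid) == 1 else None
    return chosen, valid
```

When chosen is None and the list is not empty, the caller could either skip the book (2b) or show the list to the user (2c). The cost mentioned above is real: the scan can no longer stop at the first valid hit.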

kiwidude
04-14-2011, 03:18 PM
@drMerry - I did not realise all 10 times the same number were "valid" (but not really) ISBNs - as you say that sounds a sensible suggestion to check for that and discard it.

As for the multiple ISBNs, I'm going to let it be. The user wouldn't have a clue which is the right one without actually opening the book and it all just gets too hard.

I have today seen how Kovid is handling the background downloading of metadata in the new code for 0.8. I'm going to steal it and use the same approach. What it will do is use the jobs mechanism to run the extract ISBN on your books, and then pop up a dialog when it is finished to start updating the books at that point. It also looks at the last modified date of the books and asks the user what to do should they have edited the book while the job was running. So that should keep people concerned with either speed or blocking Calibre happy.
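
The last modified idea can be sketched without any calibre code at all. In this illustration plain dictionaries stand in for the library database, and the function split_updates is hypothetical; it only shows the comparison, not how calibre stores or exposes timestamps.

```python
def split_updates(found, modified_at_start, modified_now):
    """Separate results that are safe to apply from those needing a user decision.

    found:             {book_id: isbn} produced by the background job
    modified_at_start: {book_id: timestamp} recorded when the job was queued
    modified_now:      {book_id: timestamp} read when the job finishes
    """
    safe, conflicts = {}, {}
    for book_id, isbn in found.items():
        if modified_now.get(book_id) == modified_at_start.get(book_id):
            safe[book_id] = isbn       # untouched while the job ran
        else:
            conflicts[book_id] = isbn  # edited or deleted in the meantime
    return safe, conflicts
```

Everything in safe can be written straight away, while conflicts is what the finishing dialog would ask the user about.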

That will however mean I need to rethink all those "interactive" options for choosing books. I might make it that you never get asked, and it always just uses a preferred order. Or I could make it that you can actually define your own preferred order in the configuration dialog, rather than using the preferred conversion input order. What do people think?

I will also change that scan last pages logic to look in the reverse direction for you drMerry.

My other thought which I mentioned on another thread was to add an option to allow scanning for an ASIN as well as ISBN. However is that idea flawed - do ASIN only books actually have the ASIN printed inside them? Does anyone have some examples of books with an ASIN they can give me? The search wouldn't have the same flexibility of numbers, that we changed ISBN to - I was thinking it was just search for ASIN: xxxxxxxxxx or similar. But as I say if ASIN is actually not included inside the PDF/EPUB then it is all a silly idea really :)

drMerry
04-14-2011, 08:33 PM
I did not find an official file with an ASIN in it in my library (some handmade ebooks had one, however)

About the 10 times same number, I did not know it either, but you can try some at:
http://www.isbn-check.com/

Background function seems nice. (is it in build 8840?) I myself would prefer a way to modify the order in the settings. Sometimes you just want to work in another way than normal.

Reverse scan for last pages would be great.
If this slows down the process, you could maybe add it as an option in your settings (since it seems you will have to rebuild it all, this could be a good moment if you want to add such an option)

ldolse
04-14-2011, 09:07 PM
I doubt ASIN would be very common, I just checked a few recent Amazon mobis and it's not there, unless Amazon sticks it in some metadata location Calibre doesn't check. I highly doubt it would be in another bookseller's edition...

kiwidude
04-14-2011, 09:27 PM
Thx for the info guys, I will trash the ASIN idea then and leave that up to the metadata download plugins in 0.8

@drMerry - the background metadata download is in the latest source but I would assume it won't be turned on until 0.8 is released. Thx for the ISBN check site. As for the reverse checking, I don't foresee any noticeable overhead for that at all so will just keep it simple as the default behaviour.

drMerry
04-15-2011, 01:28 PM
Another False-Positive ISBN-problem found

At the moment I get a lot more ISBN numbers than when the ISBN-text test was still in place.
I have some new false positives though. But there is a solution for it.
The problem is in the 13-number ISBN.
A 13-ISBN-number needs to start with 978 or 979.
I got some numbers starting with random other numbers. Checksum is all right though.

If there is a check on 978 or 979 start for 13-digit ISBN-numbers, this problem is solved

kiwidude
04-15-2011, 01:36 PM
Another False-Positive ISBN-problem found...

If there is a check on 978 or 979 start for 13-digit ISBN-numbers, this problem is solved

Ahhh, of course - I had forgotten that permutation falling out of the regex. In fact it makes me wonder if the regex is actually a bit nonsensical in its current form. I think all this "(9[\-\. ]*7[\-\. ]*[89])" should get ripped out and just replaced with a simple check once we hit a 13-digit number.

kovidgoyal
04-15-2011, 02:44 PM
IIRC, there's no requirement that 13 digit ISBN start with any fixed set of numbers. The only reason that most (all?) current ISBN 13s do so is because those two spaces haven't been exhausted as yet.

theducks
04-15-2011, 03:11 PM
IIRC, there's no requirement that 13 digit ISBN start with any fixed set of numbers. The only reason that most (all?) current ISBN 13s do so is because those two spaces haven't been exhausted as yet.

The EAN council controls the numbering (bar code)

977 (ISSN) Periodicals
978 and 979 (Bookland, AKA ISBN-13)

98 and 99 are already assigned (to coupons)

I did not see the lower 97[0-6] range on the list

It may have been short-sighted :D of them, when this was started about 20 years ago, not to reserve a larger block.

kiwidude
04-15-2011, 03:15 PM
IIRC, there's no requirement that 13 digit ISBN start with any fixed set of numbers. The only reason that most (all?) current ISBN 13s do so is because those two spaces haven't been exhausted as yet.
I've just looked this up here:
http://www.isbn-international.org/faqs/view/5

According to them:

Prefix element – currently this can only be either 978 or 979 (it is always 3 digits).


I guess the issue is the definition of "currently" :)

I guess what I could do is just keep any ISBN-13 number it finds and keep scanning until it finds one with 978/979. If the latter is present, that will get returned.

Or I could throw it in as a configuration option as to whether to accept things other than 978/979. I would rather just be relying on the check_isbn logic in Calibre though for consistency.

kovidgoyal
04-15-2011, 03:52 PM
I would suggest preferentially using a 13 digit match that starts with 978/9. calibre's check_isbn13 only checks that the check digit is correct. It makes no assumptions about the first 3 digits.
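
A rough sketch of the "prefer 978/979" idea, for illustration only. It assumes the candidates have already been reduced to bare 13-digit strings, and the names `is_valid_isbn13` and `choose_isbn13` are made up for this example; this is not calibre's `check_isbn13`.

```python
# Assumption: the default preferred prefixes discussed in this thread.
PREFERRED_PREFIXES = ("978", "979")


def is_valid_isbn13(digits: str) -> bool:
    """Check only the ISBN-13 check digit (weights alternate 1, 3, 1, 3, ...)."""
    if len(digits) != 13 or any(c not in "0123456789" for c in digits):
        return False
    total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(digits))
    return total % 10 == 0


def choose_isbn13(candidates, prefixes=PREFERRED_PREFIXES):
    """Return the first valid candidate with a preferred prefix.

    If none has a preferred prefix, fall back to the first valid candidate.
    Returns None when no candidate passes the check digit.
    """
    fallback = None
    for candidate in candidates:
        if not is_valid_isbn13(candidate):
            continue
        if candidate.startswith(tuple(prefixes)):
            return candidate
        if fallback is None:
            fallback = candidate
    return fallback


if __name__ == "__main__":
    found = ["1234567890128", "9780306406157"]
    print(choose_isbn13(found))
```

Passing an empty tuple of prefixes is not meaningful here; a strict mode that rejects unknown prefixes outright would simply drop the fallback.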

drMerry
04-16-2011, 10:22 AM
I've just looked this up here:
http://www.isbn-international.org/faqs/view/5

According to them:
....
I guess the issue is the definition of "currently" :)

The problem is indeed the word "currently".
The (c) at the end of the page states 2009.
However, I cannot find any more recent document, so it seems these are the only prefixes available at the moment.

But it would be a good idea to add a settings option for this.
By default you could include 978 and 979. If more prefixes are assigned and the plugin (or calibre) is no longer being developed, you could add them manually.

telemetrics
04-16-2011, 03:36 PM
I just downloaded Calibre and was just wondering about this feature. Thanks a lot.

Feature 1: OCR
Is it possible to extract the first and last 3/4 pages of an eBook and run them through an open source (or free) OCR engine?
http://code.google.com/p/tesseract-ocr/

Feature 2: Autorun "Download metadata and covers" for all files where ISBN was found.

Feature 3: Detect ISBN in File Name.
ISBN numbers are found in file names in some cases. They may not have the string 'ISBN' as a prefix, just the bare ISBN-10 or ISBN-13 number. However, we need to clean out special characters like underscores and square brackets.

Feature 4: ReOrder Suggestion based on Name
In case multiple ISBN numbers are found, we could show the options and let the user select one (in just one click). The candidate ISBN numbers can be looked up and their titles and authors displayed next to them.
However these should be ordered based on the Distance from the Title of the option to the file name of the ebook.
http://en.wikipedia.org/wiki/Levenshtein_distance

kiwidude
04-16-2011, 04:11 PM
@telemetrics - thx for the suggestions.

Anything related to filename won't work, the filename is the name Calibre has given it, not whatever it might have had originally. If you have books with ISBN in the filename then you can use the file pattern at the time you add the book to pick that up.

I'll have a think about your other points when I get some time before I respond - I've got a lot of other changes for both this plugin and others that I have to get sorted first. Thanks for the suggestions though and perhaps others may have feedback on them.

ldolse
04-16-2011, 08:13 PM
I just downloaded Calibre and was just wondering about this feature. Thanks a lot.

Feature 1: OCR
Is it possible to extract the first and last 3/4 pages of an eBook and run them through an open source (or free) OCR engine?
http://code.google.com/p/tesseract-ocr/

Feature 2: Autorun "Download metadata and covers" for all files where ISBN was found.

Feature 3: Detect ISBN in File Name.
ISBN numbers are found in file names in some cases. They may not have the string 'ISBN' as a prefix, just the bare ISBN-10 or ISBN-13 number. However, we need to clean out special characters like underscores and square brackets.

Feature 4: ReOrder Suggestion based on Name
In case multiple ISBN numbers are found, we could show the options and let the user select one (in just one click). The candidate ISBN numbers can be looked up and their titles and authors displayed next to them.
However these should be ordered based on the Distance from the Title of the option to the file name of the ebook.
http://en.wikipedia.org/wiki/Levenshtein_distance

Adding OCR seems like an inordinate amount of work for a very small return just to discover the ISBN number in a small handful of books. I doubt that C code can be included in a plugin, it would generally require integration with Calibre and Calibre's build process, which also requires the OCR project to be set up for reliable cross-platform compilation. Beyond that, as it currently stands the pdf engine can't be trusted to reliably detect/extract images from an image based pdf. Not sure if the new pdf engine is any better.

Number 2 can be accomplished by typing ISBN:True in the search box after using the plugin, highlighting everything, and clicking ctrl-D.

Number 3 can be done while importing the book as Kiwidude noted. There are a number of threads in the library management subforum, if you're not sure how to go about it I suggest searching/asking there.

While number 4 is something that could be done it seems like a lot of work for again little ROI (and the selections would likely include lots of false positives trying to guess if there is a title in the vicinity of the ISBN) - kiwidude maintains the plugin, so tackling something like that is up to him, but personally I'd rather see him investing his time in the dup detection plugin or one of the other projects.

kiwidude
04-19-2011, 06:17 AM
Ok, so here are my plans for the next version of this plugin:


The scan will run as a background job, and popup with a dialog when done, just like the metadata download does in Calibre 0.8
I am removing all of the interactive choices/options. These were put in to get around the fact that a scan could take a long time and have made the code much more complex than I like. I want just a single "Extract ISBN" option that runs in the background and scans all formats for the book in preferred input order until it finds one.
If it finds any new or updated isbns, then it will "mark" those books and issue a search of "marked:new_isbn". So if you want to do a metadata download on just that book subset you can do so. I might make this a config option to turn off if people don't want their library search changed after isbns are found.
I will follow drMerry's suggestion of a configuration option for valid ISBN13 prefixes, defaulting to 978/979. I would rather that than get values that at this point in time we know for sure are not valid ISBNs.
For scanning PDFs, the final 5 pages will be scanned in reverse order. For all other formats it will have the same behaviour as now of scanning the whole book from front to back. The latter is something that might be optimised in future but it is a lower priority imho.


Any objections to the above feel free to comment on.
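
The PDF page order described in the plan above could be sketched roughly as follows. The last-5-pages-in-reverse part comes from the plan; the first-10-pages figure is taken from the feature list in the first post, and the function name and parameters are assumptions made for this example.

```python
def pdf_scan_order(page_count, first=10, last=5):
    """Return 1-based page numbers in the order they would be scanned.

    The first `first` pages are scanned front to back, then the final
    `last` pages are scanned in reverse. Pages already covered by the
    front scan are not repeated for short documents.
    """
    front = list(range(1, min(first, page_count) + 1))
    # First page of the tail that the front scan has not already covered.
    back_start = max(page_count - last, len(front)) + 1
    back = list(range(page_count, back_start - 1, -1))
    return front + back


if __name__ == "__main__":
    print(pdf_scan_order(300))
    print(pdf_scan_order(12))
    print(pdf_scan_order(3))
```

Scanning the tail in reverse means the first ISBN hit in that pass is the one closest to the end of the book.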

theducks
04-19-2011, 07:15 AM
You might want to include ISSN (977) for those that store Periodicals :bulb2:

kiwidude
04-19-2011, 07:19 AM
Thx, will do.

capnm
04-25-2011, 12:54 AM
Many thanks!

Another false positive I encountered in several retail epubs:

0012001600

from the cover page

<div class="centerImage1"><svg:svg height="100%" viewBox="0 0 1200 1600" width="100%">" ........

ldolse
04-25-2011, 04:29 AM
Many thanks!

Another false positive I encountered in several retail epubs:

0012001600

from the cover page

<div class="centerImage1"><svg:svg height="100%" viewBox="0 0 1200 1600" width="100%">" ........

Probably simplest just to strip all the tags with a simple regex sub like <[^>]+> in the plugin to rectify this. Writing regexes that attempt to distinguish between being inside and outside a tag is not a reliable way to go; it is easier to strip the tags when there's no requirement to maintain them.
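
As a small illustration of that suggestion, using the cover markup reported above as input (with closing tags and a text node added purely for the example). Replacing each tag with a space rather than an empty string is a choice made here so that text on either side of a tag does not get glued together; it is not necessarily what the plugin does.

```python
import re

# The pattern from the post: anything between '<' and the next '>'.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(markup: str) -> str:
    """Replace every tag with a single space, leaving only text content."""
    return TAG_PATTERN.sub(" ", markup)


if __name__ == "__main__":
    sample = (
        '<div class="centerImage1">'
        '<svg:svg height="100%" viewBox="0 0 1200 1600" width="100%">'
        "Cover</svg:svg></div>"
    )
    # The viewBox digits live inside a tag, so they disappear with it.
    print(repr(strip_tags(sample)))
```

A pattern this simple will misbehave on a literal '>' inside an attribute value, which is acceptable for a scan that only needs the visible text.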

capnm
04-25-2011, 11:49 AM
And this wasn't found:

(From p4, including cover)
<p class="copyright" id="man0019"><span class="center"><span class="smallcaps">ISBN-13</span>: 978–0–141–97081–3</span></p>

Which isn't a big deal, since it is easy enough for me to pull up the odd one the plugin doesn't extract, but left me curious, since it looks like one it would extract ...

capnm
04-26-2011, 08:02 PM
That embedded SVG in the cover page is really annoying.
It turns out that, for example, an svg of a big fat penguin includes a numeric string that could be an ISBN ....
And for some reason I have a bunch of books with a big fat penguin on the first page :)

kiwidude
04-27-2011, 03:58 AM
The new version of this plugin should cater for the "HTML that looks like an ISBN" problem, as it will implement the suggestion from ldolse. However it is dependent on some new code Kovid has added for 0.7.58 so I am waiting for that to be released first. So just a couple more days.

That other scenario you posted in #94 looks interesting, will have to try to replicate it. If I cannot, any chance you could pm me a link to the document somewhere?

kiwidude
04-27-2011, 05:43 AM
And this wasn't found:

(From p4, including cover)
<p class="copyright" id="man0019"><span class="center"><span class="smallcaps">ISBN-13</span>: 978–0–141–97081–3</span></p>

Which isn't a big deal, since it is easy enough for me to pull up the odd one the plugin doesn't extract, but left me curious, since it looks like one it would extract ...

I found the issue after some head banging - it is a very subtle variation of the "dash". This will be handled in the next version.
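
For anyone curious, the dash problem can be sketched like this. The set of dash-like characters below is an illustrative guess, not the list the plugin uses, and the helper names are made up for the example.

```python
import re

# Assumption: an illustrative set of dash look-alikes (hyphen, non-breaking
# hyphen, figure dash, en dash, em dash, minus sign).
DASH_VARIANTS = "\u2010\u2011\u2012\u2013\u2014\u2212"
_DASH_TABLE = str.maketrans({ch: "-" for ch in DASH_VARIANTS})


def normalise_dashes(text: str) -> str:
    """Map every dash variant onto the plain ASCII hyphen."""
    return text.translate(_DASH_TABLE)


def isbn_digits(text: str) -> str:
    """Normalise dashes, then drop the usual ISBN separators."""
    return re.sub(r"[-. ]", "", normalise_dashes(text))


if __name__ == "__main__":
    # The number from the copyright page above, written with en dashes.
    printed = "978\u20130\u2013141\u201397081\u20133"
    print(isbn_digits(printed))
```

Normalising first keeps the separator pattern itself short, so new dash variants only need adding in one place.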

drMerry
04-27-2011, 02:38 PM
So to answer your Amazon question (in this thread, I think), here is a part of a book in my collection.
The game is open again:

This story and the other stories in the volume are available at:
http://craphound.com/overclocked
You can buy Overclocked at finer bookstores everywhere, including Amazon:
http://www.amazon.com/exec/obidos/ASIN/1560259817/downandoutint-20

kiwidude
04-27-2011, 03:02 PM
drMerry - I have absolutely no idea what your "amazon question" post is in relation to, but it sure doesn't look like anything related to extract isbn?

drMerry
04-27-2011, 03:07 PM
http://www.mobileread.com/forums/showpost.php?p=1493003&postcount=75

Your own idea :D

kiwidude
04-27-2011, 03:12 PM
Ok, at the risk of looking really stupid (again) I still have absolutely no idea of what your post is in connection to - am I alone in being utterly confused? :headscratch:

drMerry
04-27-2011, 03:57 PM
All right,

let me rewrite my previous post:

Hi kd,

Do you remember your question about Amazon (adding Amazon IDs)?
I found a book in my collection that contains the id.
So the game is open again.
Information on first page of the book:

This story and the other stories in the volume are available at:
http://craphound.com/overclocked
You can buy Overclocked at finer bookstores everywhere, including Amazon:
http://www.amazon.com/exec/obidos/ASIN/1560259817/downandoutint-20


So you can now update your tasklist like you wanted in this post (http://www.mobileread.com/forums/showpost.php?p=1493003&postcount=75)

Have I made myself clear, I hope? (I'm thinking faster than my shadow :D) So I do not always type everything that is on my mind ...

kiwidude
04-27-2011, 04:05 PM
Ahh. Right. Yes, sometimes a longer explanation would save much confusion - particularly when I am juggling so many streams of development... :)

In direct answer to this, I already abandoned the idea of extracting ASIN - as there are too few books that have this to justify the effort (particularly given there is no pattern to the number itself). I will leave it up to the Amazon plugin in Calibre 0.8 to populate it.

The new Extract ISBN is ready for release now, much nicer with it running in the background. I will release it when 0.7.58 comes out as mentioned above, anyone running from source is welcome to PM me their email to test it.

kovidgoyal
04-27-2011, 04:21 PM
@kiwidude: Be aware that on windows having a background process that is opening files for reading in the calibre library is not a good idea. For instance, if the user tries to change metadata for a book while the background process has the file open for reading, calibre will try to copy that file to the new location based on the new metadata and windows will barf.

kiwidude
04-27-2011, 04:35 PM
Thx Kovid, yeah I have seen that problem in the past (not with this plugin, with other scenarios where the file was locked like Microsoft LIT reader from memory).

It will just have to be part of the "usage" warnings for this plugin I think as without a pessimistic locking mechanism in place I obviously can't really control what records the user edits while it is running. It should only be if they change the author or title (or delete the book) though right?

kovidgoyal
04-27-2011, 04:42 PM
change author/title/delete/run a conversion.

kiwidude
04-29-2011, 04:48 PM
Changes in this release:

Do all scanning as a background job to keep the UI responsive
Remove all interactive UI options - it will now always scan all formats in preferred order
Make sure that ISBN-13s start with 977, 978 or 979 (configurable).
Exclude the various repeating digit ISBNs of 1111111111 etc.
Exclude all html markup tags to prevent issues like the svg sizes being picked up as ISBNs
Include endash and other dash variants as possible separators
When scanning PDF documents, scan the last 5 pages in reverse order so it is the last ISBN found
Configuration option for ISBN13 prefixes and option to show updated books when extract completes

As has been mentioned several times in this thread - now that this version will run in the background as a Calibre job do not change the metadata or do a conversion for any of the books you selected to extract ISBN from.

drMerry
05-05-2011, 04:18 PM
A new sort of ISBN:

978-xxx-xxx-xxx^C

so a ^ just before the check digit.
Interesting one. Just seen once

ISBN-13: 978-0-451-46121^6 (alk. paper)
ISBN-10: 0-451-46121-5 (alk. paper)

(also interesting, the ISBN-10 is a 'normal' one)

kiwidude
05-05-2011, 04:25 PM
@drMerry - is that a retail book or a scan error? It does look very odd.

drMerry
05-05-2011, 07:36 PM
It IS odd.
It is a scan of a retail book.
I do not think it is an error, but I will see if I can find an image of the book.

One other thing.
Now that you scan in the background, there is a new problem.
I scanned 18000 (yes 18000) books at once.
No problem.
On details I could see a lot of numbers found.
@79% calibre crashed.
And no isbn was saved.
A fix for this would be to save directly (performance issue) or maybe to save to a temp file that you could look for the next time you start calibre / the plugin.

But maybe I'm the only one with this problem.

saddan
05-06-2011, 01:29 AM
Thanks for this plugin!

I had problems with some books. With one of them I would get this exception:

XMLSyntaxError: PCDATA invalid Char value 24, line 159, column 54

After some print statements, I noticed the xml generated in function _read_pdf_text from file scan.py had some invalid characters.

So I modified it to replace most non-printable chars with something else ('_').
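
A sketch of that kind of clean-up, not the attached diff itself. It replaces the control characters that XML 1.0 does not allow (char value 24 from the error is one of them) and uses the standard library parser for the demonstration rather than lxml, which is where the XMLSyntaxError above comes from. The function name is made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Control characters that XML 1.0 forbids; tab, newline and carriage
# return (0x09, 0x0A, 0x0D) are allowed and left alone.
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_for_xml(text: str, replacement: str = "_") -> str:
    """Replace characters that would make the XML unparseable."""
    return _INVALID_XML_CHARS.sub(replacement, text)


if __name__ == "__main__":
    # Hypothetical fragment with a stray 0x18 (decimal 24) in the text.
    raw = "<page>ISBN 0-451-46121-5\x18</page>"
    element = ET.fromstring(clean_for_xml(raw))
    print(element.text)
```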

I'm attaching a diff of the modifications I did.


kiwidude
05-06-2011, 05:31 AM
@Saddan - any chance you could PM me a link to the file that caused this error? I will delete it when done if it is commercial or whatever. I appreciate you taking the time to volunteer a fix, though there are some other functions in Calibre I could possibly use so if I could test with a file or two it would help greatly.

@drMerry - 18,000? Sigh. You wouldn't try to convert 18,000 books at once - Extract ISBN is doing a lot of the same steps underneath so I am not surprised it died. I cannot update ISBNs from the background as the database updates are single threaded, that is why it works the same way as metadata downloads in 0.8. Trying to "resume" a job after a Calibre crash is way more work than I can be bothered with to be honest, I would rather just not have it crash in the first place.

Did you see memory usage climbing before it crashed? The most likely explanation is a memory leak somewhere.

kiwidude
05-06-2011, 08:44 AM
I've had a quick look at the "bulk extract" and there is a memory leak issue going on. However it is inside the Calibre code in the converters (non-pdf) as far as I can tell - certainly extracting from LRF files (which is horribly slow) the leak is pretty nasty and noticeable.

I'm sure at some point Kovid and co may take a look into this - in the meantime stick to extracting ISBN from small batches at a time and you will be fine. The other option I have is to run the extraction in separate worker processes like bulk conversions do. It would be a little slower probably but at least this problem should disappear and the GUI hopefully wouldn't choke every now and then like it does currently.

drMerry
05-06-2011, 09:03 AM
@drMerry - 18,000? Sigh. You wouldn't try to convert 18,000 books at once - Extract ISBN is doing a lot of the same steps underneath so I am not surprised it died. I cannot update ISBNs from the background as the database updates are single threaded, that is why it works the same way as metadata downloads in 0.8. Trying to "resume" a job after a Calibre crash is way more work than I can be bothered with to be honest, I would rather just not have it crash in the first place.

Did you see memory usage climbing before it crashed? The most likely explanation is a memory leak somewhere.

I love to look at (over) the edge of possibilities.
And of course you should convert / run it in small parts, but if I had, this leak would not have been noticed at this point :D.
I did not check the mem, but I see you did it.

A (rather easy?) way to 'catch' this error (partly) would be to do the following:
Create a silent-run function. This function would run a scan and silently apply changes.
You could then use this function to create your own batches: keep a to-do list of files and run them in batches of 100, 1000, or a user-selected number of files at a time. After each batch, apply the changes and remove the files from the to-do list.
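
In code, the batching idea might look something like this sketch. `scan` and `apply_changes` are hypothetical callables standing in for "run the extract on these books" and "write the results back"; nothing here is calibre or plugin API.

```python
def run_in_batches(book_ids, size, scan, apply_changes):
    """Scan books a batch at a time, applying results after each batch.

    Returns the number of batches processed. Work already applied stays
    applied even if a later batch fails.
    """
    if size < 1:
        raise ValueError("batch size must be at least 1")
    todo = list(book_ids)
    processed = 0
    while todo:
        # Take the next batch off the front of the to-do list.
        batch, todo = todo[:size], todo[size:]
        apply_changes(scan(batch))
        processed += 1
    return processed


if __name__ == "__main__":
    saved = {}
    count = run_in_batches(
        range(1, 251),
        100,
        scan=lambda ids: {book_id: None for book_id in ids},  # fake scan
        apply_changes=saved.update,
    )
    print(count, len(saved))
```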

kiwidude
05-06-2011, 10:59 AM
Changes in this release:

Strip non-ascii characters from the pdfreflow xml which caused it to be invalid
Support the ^ character being part of the ISBN number
Attempt to minimise any memory leak issues caused by this plugin itself


Note that as per my post above there are still memory leak issues with some of the "conversions" that get run in the background. I've made sure that the plugin releases all file handles and resources that it creates asap so anything else is in the Calibre code. To try to work around that will require greater changes to this plugin than I want to make at this point but I will likely revisit it in future once I finish another new plugin that will work in a similar way.

drMerry
05-06-2011, 11:22 AM
hmm

error on install

calibre, version 0.7.59
ERROR: Unhandled exception: <b>OSError</b>:[Errno 2] No such file or directory

Traceback (most recent call last):
File "calibre_plugins.plugin_updater.dialogs", line 568, in _install_clicked
File "calibre_plugins.plugin_updater.dialogs", line 717, in _download_zip
File "site-packages\calibre\ptempfile.py", line 60, in __init__
File "tempfile.py", line 293, in mkstemp
File "tempfile.py", line 228, in _mkstemp_inner
OSError: [Errno 2] No such file or directory

kiwidude
05-06-2011, 11:25 AM
Works fine for me... you sure you haven't run out of space or something with all your temp file issues? Try it again.

drMerry
05-06-2011, 11:29 AM
Works fine for me... you sure you haven't run out of space or something with all your temp file issues? Try it again.

I'll keep trying, no success yet.
13.2 GB free, How big is this new version? :rofl:

kiwidude
05-06-2011, 11:38 AM
Do the usual checklist - is antivirus or something else blocking the downloads etc. Do other plugins install ok. Does it work if you install it manually etc.

drMerry
05-06-2011, 11:43 AM
On my second pc, I can now use the plugin.
I can see there is some mem-usage improvement

Thanks for this quick fix!

(first pc has no problem with other plugins, strange....)
EDIT:
Works
Don't know what the issue was..

wolfelric
05-08-2011, 11:09 AM
Awesome plugin, been hoping for this.

crivicris
05-13-2011, 03:39 AM
I have just discovered plugins for my calibre, and this one is a must for me. Thanks

kiwidude
05-13-2011, 12:48 PM
Thanks crivicris and welcome to MobileRead. There will be a new version of this plugin at some point, so if you haven't already I suggest installing the Plugin Updater plugin to make it easier to keep up to date and install other plugins that take your fancy...

xXTGMKXx
05-14-2011, 05:28 AM
Hi, I am new to this forum. I searched far and wide for a better program than Calibre and found none. Despite issues I found troublesome, the steady stream of updates solved them in a timely manner. Kudos to the development team. Now that I have discovered plugins, I thought I would contribute my observations to the development of this key aspect of the program.

First of all thank you to the author of ExtractISBN. I am in full agreement with an earlier poster that the edit metadata window should have a button utilizing this plugin on a piecemeal basis. The plugin performs exceptionally well on a single file and this is how I prefer to update my catalog... since choice of covers [and the alteration of some metadata] is so subjective.

I download collections of books, it is a compulsion. I am branching out as quickly as I can organize them with Calibre. Your plugin has been instrumental in this regard. I would like to share my experience with it (ExtractISBN 1.3.1 - Windows Vista [I know] - 1GB Ram - Calibre 0.8.1).

With the plugin set to run on a small collection of 6700 - the progress seems to slow to a crawl on a linear curve until the UI hangs. For example 1 book extraction is instantaneous - 100 is 8 minutes - 250 is 30 minutes - 500 is one hour 15 minutes - 1000 is 3 hours and beyond that honestly I haven't had the patience... the UI is unresponsive for increasingly long periods of time. If you could fix this issue I would be ever so thankful. I'm not a good programmer by any means... but I have an idea... Is it possible (I could be wrong by a wide margin here so be kind) that you save the results of the search to memory... and that instead a hard-file could be updated after each successful hit... and that at the end of the job the file referenced for application of changes? With that I don't see how there would be any discrepancy between the extremely short runtime of one file and the runtime when deep in to a collection. Like I said... I suck at coding... if it doesn't work... at least I've raised the issue.

Keep up the good work, bibliophiles and digital hoarders everywhere are in your debt!!

kiwidude
05-14-2011, 07:30 AM
Welcome to MR and thanks for your post...

This issue has been discussed recently in this thread. The problem is caused by some nasty memory leaks inside the calibre conversion code that this plugin calls to get a standard format that it can scan for the isbn.

The simplest solution to this is to follow the same approach that doing conversions does and run the conversion and scan in an external worker process executable. So after each conversion the memory contents are completely released. Currently my approach has been to run as a separate thread inside the calibre exe like metadata downloads do, however this means that any leaked memory cannot be reclaimed without restarting calibre.
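
To illustrate the general principle only (this uses the Python standard library, not calibre's worker mechanism, and the worker body is a placeholder): when each scan runs in a short-lived child process, whatever that process leaked is returned to the operating system when it exits.

```python
import subprocess
import sys

# Hypothetical stand-in for "convert the book and scan it for an ISBN".
# A real worker would do the conversion here and print what it found.
WORKER_CODE = """
import sys
path = sys.argv[1]
print("scanned:" + path)
"""


def scan_in_fresh_process(path: str) -> str:
    """Run the placeholder worker in a new interpreter and return its output."""
    completed = subprocess.run(
        [sys.executable, "-c", WORKER_CODE, path],
        capture_output=True,
        text=True,
        check=True,  # raise if the worker exits with an error
    )
    return completed.stdout.strip()


if __name__ == "__main__":
    for book in ("first.epub", "second.lrf"):
        print(scan_in_fresh_process(book))
```

The trade-off is the one described above: starting a process per book costs time, but memory use in the main program stays flat.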

However I cannot make this change without changes to the calibre api. Currently it is not possible from a plugin to create jobs to run on an external process, as the list of known "things to do" that the worker executable understands is hard coded currently. It needs some extra code to allow being passed some info about calling code in a plugin.

I have asked Kovid to make this change, as there are likely other code changes that could be made to give me more reusable code that I could use too. He has only just returned from holiday so hopefully it might get done this week and then I can start rewriting this plugin.

The only other option is to fix the memory leaks. However having helped Kovid track down some memory leak issues in the metadata download over a 5 hour period one Sunday night I know just how painful and difficult this is. Plus it could well be that the issues lie in some library calibre calls or whatever. And multiply that out over the dozens of format converters and you can see why the simplest solution is to use the code the same way calibre does.

Glad you are finding the plugin useful, but in the meantime keep your batches small and use ctrl+R to restart calibre periodically when you see the impact.

xXTGMKXx
05-14-2011, 08:11 AM
The only other option is to fix the memory leaks. However having helped Kovid track down some memory leak issues in the metadata download over a 5 hour period one Sunday night I know just how painful and difficult this is.

Thanks for your timely and well-written reply! I understand fully the problems you outlined. I am completely fine with waiting however long it takes for optimization, after all the plugin already works like a charm - not everyone will find keeping extraction batches under 1000 a problem, lol.

I am definitely a loyal Calibre user... nothing like it... so no complaints from me about waiting. On the same note, contributions from plugin developers are just as significant to my loyalty as the viability of the main platform itself.

In the meantime, I have but one more humble suggestion... which floated from the ether overnight. How about creation of a new tag within/for use with ExtractISBN. Basically the antithesis of identifier_updated; something like extract_failed - to allow marking/sorting ([extract_failed:false & identifier:false] as a sort method to select a new batch for extraction) of documents which ExtractISBN returned negative. I find myself rehashing the same files... with little in the way of keeping track. I suppose it wouldn't have to be persistent. It could have a half-life... or perhaps the value resets when calibre restarts. Or could be batch reset with a command when no longer needed. Hell... as long as the failed files are marked until the next invocation of ExtractISBN... then those files could be called and copy-deleted to a container library to get them out of the way. That would cost time, but would technically be more efficient than the process is at the moment. Just an idea.

Anyway, I must apologize for raising an issue previously discussed. I only skimmed the thread. On the other hand I was taught it never hurts to ask.

Viva Calibre! :thumbsup::2thumbsup:iloveyou::thanks:

xXTGMKXx
05-15-2011, 06:46 AM
How about creation of a new tag within/for use with ExtractISBN
Well I found a quick-fix... user-added column with a yes/no configuration.
I still think it's a good idea... but thought I'd throw my solution out there for those with the same problem.

kiwidude
05-15-2011, 06:53 AM
Well I found a quick-fix... user-added column with a yes/no configuration.
I still think it's a good idea... but thought I'd throw my solution out there for those with the same problem.
Can you not just use a search of:
isbn:false

When I added the ability to temp mark the ids that were updated, I did consider an option in the dropdown to show those that failed and in fact had it coded but ripped it out before release. I didn't include it for two reasons:

The first is that there is overlap with isbn:false. Of course isbn:false is all of your database, and not related to your selection you did the extract on.

The second is the definition of "failure". Does failure mean that it could not find an ISBN by scanning? What if the book already had an ISBN?

Or does it mean that the book was not updated with an ISBN (it might have found one, but it matched an existing value on the book so did nothing)?

It gets a bit murky. If we can agree a definition that would be "useful" then I can put it in a future release in that dropdown of the configuration screen for the plugin. My guess would be that you are only going to be interested in books that still do not have an ISBN from the set that you scanned?

rloveking
05-15-2011, 11:42 PM
I apologize if this has been covered before, I did not read all of the prior posts in this forum before posting. I have been LOVING this plugin. But I'm having problems with some of my .lit files. Here is an example:

isbn found by Extract ISBN: 2360011111

ISBN part of the .lit file:

All rights reserved.

ISBN: 0-425-20743-9

BERKLEY SENSATION®

I don't see how one comes from the other. And I didn't find anywhere in the .lit file that has "236" anywhere in it.

I have 98 books (I believe all .lit) that have this "Wrong" ISBN number. All that I have looked at so far seem to have a correct ISBN within them (like above).

Any ideas?

- Becky

xXTGMKXx
05-16-2011, 12:12 AM
My guess would be that you are only going to be interested in books that still do not have an ISBN from the set that you scanned?

Precisely... instead of only marking books where an isbn was found (isbn:updated) my idea was to mark books where an isbn was scanned for but not found.

Now that I think about it though... it is as murky as you thought. Since search parameters would start to confuse each other. I think my solution of a yes/no column is more elegant... if you could somehow change your plugin to create a yes/no marker... let's call it "Extracted" and mark those updated with a checkmark, and those failed with an x, that would be pretty elegant. By that logic, you could still have the option to view the updated isbns at the end of the job - and you could also leave the user the option to search on their own terms... for example "identifiers:false & extracted:false" would return a clean list of documents yet to be scanned.

If all this is impractical, though... I highly recommend my user-column solution to those that come after me. It's rather easy to search identifiers:false - highlight a selection - extract that selection - bulk metadata change the selection to true (then the documents with identifiers disappear from the list) - bulk metadata change the remaining documents to false - then do a search for customtag:true. That way, a search of identifiers:false can be sorted by the customtag column... the false documents would be easily identifiable.

kiwidude
05-16-2011, 03:40 AM
Becky,

Can you please PM me a link to one or two of your books that have this problem so I can take a look?

dm101
05-17-2011, 10:18 AM
Dear kiwidude,
I have exactly the same problem with many ebooks, but mine are PDFs.
Greets dm101

kiwidude
05-17-2011, 10:24 AM
@dm101 - until becky sends me a link to an example book there is nothing much I can do about it. There was always going to be a risk of this situation arising once I loosened the regex so that it no longer requires one of the many variations of "ISBN" text before the number.

If you want to PM me a link to the pdf then that would help, though this will likely not be exactly the same issue as becky has and so I still would need a file from her.

kiwidude
05-17-2011, 02:48 PM
Thx @dm101 for the files. I can see what the problem is (and indeed this is probably becky's issue as well - you mentioned pdf which is why I thought it may be different but it is an ePub you sent that showed the issue). It is when you have a file with those annoying embedded font-face declarations at the top like this:

<style type="text/css">
@font-face {
font-family: Courier;
panose-1: 2 7 4 9 2 2 5 2 4 4
}

I've never understood the point of these (and rip them out of my own ePubs). Obviously with enough of them in there the chances of hitting a number that coincidentally looks like an ISBN is higher.

I already have some code in there that rips out HTML tags. I will tweak that a bit to make sure these get ignored as well when evaluating.

kiwidude
05-17-2011, 03:23 PM
Changes in this release:

Strip the <style> tag contents to ensure panose-1 numbers are not picked up as false positives


Hopefully this should resolve some of the problems reported above of false positives on ISBNs.
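
As a rough illustration of the fix described above (this is not the plugin's actual code; the regex and the function name `strip_style_blocks` are assumptions made for this sketch), removing whole `<style>` elements before scanning keeps the panose-1 digits out of the text that gets searched:

```python
import re

# Matches a whole <style>...</style> element, including its contents.
# Hypothetical pattern for illustration; the plugin's own regex is not shown in this thread.
STYLE_BLOCK = re.compile(r"<style\b[^>]*>.*?</style\s*>", re.IGNORECASE | re.DOTALL)


def strip_style_blocks(html):
    """Remove <style> elements so numbers inside CSS are never scanned as book text."""
    return STYLE_BLOCK.sub(" ", html)


sample = (
    '<style type="text/css">\n'
    "@font-face { font-family: Courier; panose-1: 2 7 4 9 2 2 5 2 4 4 }\n"
    "</style>\n"
    "<p>ISBN: 0-425-20743-9</p>"
)
cleaned = strip_style_blocks(sample)
print(cleaned)
```

Only the paragraph containing the real ISBN survives, so the digit runs in the font declarations can no longer be picked up.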

dm101
05-17-2011, 05:05 PM
@kiwidude,
thank you very much for this update :-)
greets dm101

kiwidude
05-19-2011, 07:19 AM
Changes in this release:

Ensure stripped HTML tags are replaced with a ! to prevent an ISBN running into another number and becoming invalid


I found a situation where the decision to strip all HTML tags in 1.3 caused some ISBNs not to be detected. The raw HTML had a <br/> tag separating the ISBN from the next line; if that line coincidentally started with a number, the two numbers got merged together once the tag was removed. As the combined length was not valid for an ISBN, the number would get thrown away. This release fixes that problem.
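
The merging problem is easy to reproduce. The sketch below is only an assumption about how such a scan might look (the `TAG` and `DIGIT_RUN` patterns are simplified stand-ins, not the plugin's regex), but it shows why swapping each tag for a `!` rather than an empty string keeps neighbouring numbers apart:

```python
import re

TAG = re.compile(r"<[^>]+>")
# Simplified, hypothetical candidate matcher: a run of digits and hyphens.
DIGIT_RUN = re.compile(r"\d[\d-]*\d")


def candidates(html, replacement):
    """Replace every HTML tag with `replacement`, then list the digit runs that remain."""
    text = TAG.sub(replacement, html)
    return DIGIT_RUN.findall(text)


raw = "ISBN 0-425-20743-9<br/>2005 edition"

merged = candidates(raw, "")      # tag simply removed: the two numbers fuse
separated = candidates(raw, "!")  # tag replaced with "!": the numbers stay apart

print(merged)
print(separated)
```

With an empty replacement the ISBN fuses with the year on the next line into one over-long run, which a length check would then discard; with `!` the ISBN is recovered intact.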

theaccountant
05-20-2011, 05:11 AM
I second xXTGMKXx's idea!


"Now that I think about it though... it is as murky as you thought. Since search parameters would start to confuse each other. I think my solution of a yes/no column is more elegant... if you could somehow change your plugin to create a yes/no marker... let's call it "Extracted" and mark those updated with a checkmark, and those failed with an x, that would be pretty elegant. By that logic, you could still have the option to view the updated isbns at the end of the job - and you could also leave the user the option to search on their own terms... for example "identifiers:false & extracted:false" would return a clean list of documents yet to be scanned"

Is there any way that, after the ISBN is extracted, it could be preserved and not overwritten when downloading metadata? My 10-digit numbers are being replaced by 13-digit codes.

Also, would it be possible to have this as an external tool?

Thanks

kiwidude
05-20-2011, 06:12 AM
I second xXTGMKXx idea!
The problem is that any value I applied could only be temporary and would not survive a calibre restart. And the set of data being marked would give you different results every time you ran the scan against a different selection as it would have no memory of books you have scanned previously. At the moment I only mark the books that were updated from your selection. I could add marking of books that were scanned but it doesn't seem very useful. I could add marking books that were not updated but as I said above that is complicated by the reasons why they weren't updated.

There seem to be several different questions related to this floating around between the various posts. Looking at which books in your library do not have an ISBN and have not yet been scanned is a totally different requirement from looking at books that you just scanned but could get no ISBN from. So until I see some clarity on what it is you are trying to achieve I am not going to change the current behaviour.

My usage of Extract ISBN is pretty simple: I just add a bunch of books, select them, extract, then run download metadata. I don't get hung up on having an exact value from the book; to me it is just a tool to increase the chances of metadata download picking the right book. So if Extract ISBN fails I don't care, so long as the title/author search gives me the right metadata result. And if that fails then I use my Goodreads Sync plugin for its link book feature to search the Goodreads website for a more useful edition, drag and drop the URL back onto the linked book dialog, and have that plugin configured to overwrite the ISBN. Then I can just fire the metadata download again.

Is there any way that, after the ISBN is extracted, it could be preserved and not overwritten when downloading metadata? My 10-digit numbers are being replaced by 13-digit codes.
No. That isn't anything to do with this plugin, that is just the way the metadata download works, and it prefers 13 digit isbns.
Also, would it be possible to have this as an external tool?

Thanks
No. Why would you want to, how are you going to tell it which books to scan?

However you could take a look at the scripts thread linked from the first post in this thread. Of course those scripts have very different internal code to what this plugin now does. I originally based the plugin on logic in one of the scripts but found a number of issues with it, so since then it works very differently in its approach to identifying ISBN values in general, as well as offering better and faster PDF scanning.

theaccountant
05-20-2011, 08:53 AM
Thanks for your response.

The problem is that any value I applied could only be temporary and would not survive a calibre restart.

If the plugin created and used a 2nd column called ISBN Output, the output from the ISBN extractor's "show details" could be posted to the new column:
ISBN matched
ISBN not Found
ISBN Number result

In other words, the results from the plugin would be posted to a new field in the database as well as to the ISBN field.

Then the results would be permanent in the DB and would survive a restart.

kiwidude
05-20-2011, 08:57 AM
Yes I could have mentioned that a custom column would be a "permanent" solution. However I am reluctant to go down that route as we are talking a (in my opinion) extremely niche requirement. Very few users will be bothered with cluttering their view with yet another column that records something as trivial as whether they have run the extract ISBN plugin on a book. And for the sake of a couple of extra clicks, you can do this yourself manually.

kiwidude
05-21-2011, 06:13 PM
Changes in this release:

Run the ISBN extraction out of process to get around the memory leak issues


This release requires Calibre 0.8.2

I have decided to keep both methods of scanning (threaded job versus worker job) optionally available in the plugin. There is now a configurable threshold at which it will switch between them. By default this threshold is set to one selected book. So if you select just one book, the scan will run as a threaded job as per the changes I made for 1.3. This is the fastest way to get an ISBN, but will continue to suffer from the known memory leak issue if you scan hundreds of certain book formats over a long period of keeping Calibre open.

If you select more than one book, then the scan runs as a worker job, just like book conversions do. This would be a little slower for just a single book but faster overall if you select higher numbers of books at once. This method will not suffer from the memory leak issue.

You can adjust the threshold on the plugin configuration screen as per the screenshot.

EDIT: I left some debug code in the plugin which a couple of you had downloaded before I caught it - if you were one of the first two downloaders please just download it again.

drMerry
05-22-2011, 12:24 PM
Just one problem with the reverse lookup function.

One of my books has this info on one of the last pages:

ISBN

9789077740798 (ebook)
9789077740606 (gedrukte uitgave)

You implemented a function to do a reverse lookup on the last pages, in response to my question about books that list "read also" ISBNs first and put the ISBN of the current document last.

The numbers above are two numbers for the same book. The first is the number for the ebook, the second for the printed edition.
So in this case I would like to get the first number. But given that implementation, the plugin should give me the second.

Curiously enough, the first number is returned in this case, which is what I want here but would not be in most other cases.

Starting job: Extract ISBN for 1 books
Running scan for isbn query with parameters:
{u'paths': [(u'EPUB', u'H:\\Local (swart)\\Madelon Schoemaker\\Spanje Voorgoed (7546)\\Spanje Voorgoed - Madelon Schoemaker.epub')], u'timeout': 30, u'title': u'Spanje Voorgoed'}
Scanning: H:\Local (swart)\Madelon Schoemaker\Spanje Voorgoed (7546)\Spanje Voorgoed - Madelon Schoemaker.epub
Valid ISBN13: 9789077740798
Valid ISBN13: 9789077740606
Scan time: 4.80999994278 Spanje Voorgoed
The isbn was found in 4.81 seconds
New ISBN extracted of 9789077740798 for Spanje Voorgoed
Scan complete, with 0 failures

xXTGMKXx
05-22-2011, 12:47 PM
looking at what books in your library that do not have an isbn and that you haven't scanned as yet

First of all, there's your clarification.

Second of all, I understand it's a niche. As a matter of fact I wouldn't even expect to use it much after I've sorted my 100,000 or so. However for any amount over 500... this is useful... and I know for a fact I'm not the only person who downloads large collections.

On the other hand, your argument of cluttering the view is flawed... I actually hid my custom #extracted column. I only need to know it's there for searching purposes.

Finally, as I said before... I've solved the problem from my perspective. I'm not on some sort of crusade to change YOUR EXTREMELY USEFUL AND APPRECIATED PLUGIN. I'm glad someone agreed with me, and I would like to point them back to my assertion that a custom, as needed solution is perfectly suitable. Anyway, I consider the case closed unless you want to contact me further on the issue.

Thanks again for the brilliant automation tool! I wish I had a credit card to activate my paypal account, I'd drop you a fin for your contribution.

All the best,
Matt

kiwidude
05-22-2011, 12:47 PM
Reverse lookups only take place for PDFs.

drMerry
05-22-2011, 02:35 PM
Reverse lookups only take place for PDFs.

That's it. My file is an ePub.
Thanks!

(and it IS a great plugin! ;) )

dm101
05-23-2011, 11:30 AM
Hi kiwidude,

yes it's a very helpful tool, and the newest modification is great, calibre will not break down :-)

could you please add the option:
"delete existing isbn, if no isbn was found"

because I used the old version of your plugin, I have many ISBNs that do not match the books....

thank you
greets
dm101

theducks
05-23-2011, 11:48 AM
Hi kiwidude,

yes it's a very helpful tool, and the newest modification is great, calibre will not break down :-)

could you please add the option:
"delete existing isbn, if no isbn was found"

because I used the old version of your plugin, I have many ISBNs that do not match the books....

thank you
greets
dm101

That is a very bad idea :eek:

Not all books in the library may have had an ISBN included within the document :smack:
That does not make the ISBN you have in the metadata, incorrect. (nor, make correct ;) )

dm101
05-23-2011, 12:43 PM
In my library there are only ISBNs extracted with this plugin (in an older version), and now I have many wrong numbers.
I only want a checkbox for deleting ISBNs that do not exist in the document.
If you don't want to use this option, you just don't activate the checkbox.....
greets
dm101

kiwidude
05-25-2011, 03:00 PM
Changes in this release:

Add yet another unicode variation of the hyphen separator to the regex


Thanks to dm101 for sending me the PDFs to try this on. You would think there are only so many variations of the separator that could be used between numbers that all look the same to the naked eye...
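
One common way to cope with look-alike separators, shown here only as a sketch (Python 3; the character list is my own assumption and not necessarily the set the plugin's regex covers), is to map the Unicode dash variants to a plain hyphen before matching:

```python
# Unicode characters that render like a hyphen. This list is an assumption for
# illustration: hyphen, non-breaking hyphen, figure dash, en dash, em dash,
# horizontal bar, minus sign and soft hyphen.
DASH_VARIANTS = "\u2010\u2011\u2012\u2013\u2014\u2015\u2212\u00ad"
TO_ASCII_HYPHEN = str.maketrans({ch: "-" for ch in DASH_VARIANTS})


def normalize_dashes(text):
    """Replace every dash look-alike with the ASCII hyphen (0x2D)."""
    return text.translate(TO_ASCII_HYPHEN)


# Four different separators that look almost identical on the page.
messy = "978\u20130\u2011375\u201289036\u22129"
print(normalize_dashes(messy))
```

Normalising first means the ISBN pattern itself only ever has to deal with one separator character.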

capnm
06-08-2011, 12:45 AM
1)
When I run this on a single epub, and it finds an isbn that matches the existing metadata I get this error message:

Job: "Extract ISBN for 1 books" failed with error:
Traceback (most recent call last):
File "site-packages\calibre\gui2\threaded_jobs.py", line 83, in start_work
File "calibre_plugins.extract_isbn.jobs", line 81, in extract_threaded
AttributeError: 'set' object has no attribute 'append'

Called with args: ([100], <calibre.library.database2.LibraryDatabase2 object at 0x05186170>) {u'notifications': <Queue.Queue instance at 0x11F8FF80>, u'abort': <threading._Event object at 0x0BD3C9D0>, u'log': <calibre.utils.logging.GUILog object at 0x0BD3C0B0>}



Not a big deal since the extract works ok ....


2)

It's not finding the ISBN in several epubs.
This doesn't seem to be yet another dash variant; they're using plain 0x2D hyphens.
They are in the last couple of lines of the last htm if that might have something to do with it.

Two examples:

<p class="crt">eISBN: 978-0-375-89036-9</p>

<p class="crt">v3.0</p>
</div>
</body>
</html>


<p class="center"><strong>eISBN: 978-0-307-54803-0</strong></p>

<p class="center"><a class="pubhlink" href="http://www.vintagebooks.com">www.vintagebooks.com</a></p>

<p class="center">v3.0</p>
</div>
</body>
</html>



3)
A side question-
Is there an easy way to grab the ISBN (or other fields) from the embedded metadata?
The only things I've come up with are:
Reimport the epub or
Open in Sigil, edit metadata,cut/paste
both pretty cumbersome

Thanks!

kiwidude
06-08-2011, 03:36 AM
@capnm - please pm me a link to your file for #2.

I will look into #1, that looks like a simple bug to fix, strange no one else has noticed it.

Re #3 - the other way without Sigil is to use the ebook viewer which you could similarly copy the data from, not much difference in steps though. I don't know of anything else beyond what you have listed.

Ersatzreifen
06-08-2011, 07:11 AM
:help:
I downloaded this plugin and tried to install it, but got this error:

calibre, version 0.7.45
ERROR: Unhandled exception: <b>InvalidPlugin</b>:No valid plugin found in /home/russ/Downloads/Extract ISBN.zip
Traceback (most recent call last):
File "/usr/lib64/calibre/calibre/gui2/preferences/plugins.py", line 280, in add_plugin
plugin = add_plugin(path)
File "/usr/lib64/calibre/calibre/customize/ui.py", line 377, in add_plugin
plugin = load_plugin(path_to_zip_file)
File "/usr/lib64/calibre/calibre/customize/ui.py", line 93, in load_plugin
raise InvalidPlugin(_('No valid plugin found in ')+path_to_zip_file)
InvalidPlugin: No valid plugin found in /home/russ/Downloads/Extract ISBN.zip
Ok, how to install?

Ersatzreifen
06-08-2011, 07:27 AM
Update:

I just tried to install a different plugin, and got the same type of error.
I restarted Calibre and tried again. No joy.

kiwidude
06-08-2011, 07:35 AM
Look at your calibre version as it is way too old for this plugin. Upgrade calibre then try again.

kiwidude
06-08-2011, 08:38 AM
@capnm - thx for the epub. There is no issue to do with nearness of the ISBN to the bottom of the page. In fact Extract ISBN does return a 10-digit ISBN when I run it. What it doesn't do is return the ISBN that you want - instead it returns the first one it finds, which is a few pages before that and refers to an audio edition of the book.

Were this a PDF, then it would be picking up the correct ISBN, as it checks the final pages of a book in reverse order.

Maybe it is time I try to apply that similar reverse scan logic to formats other than PDF, as you are not the first person to comment on it. The problem is that unlike a PDF, the way I access the text in an ePub is by iterating through the spine (manifest) of the book. So there is no concept of "pages", only of "files". Depending on how well the book is split, the last few "pages" might be in one file or in multiple files, in fact the whole book could be in one file. It is for the same reasons that I cannot apply the same logic of scanning only the first 10 "pages" like I do with PDFs.
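
For readers unfamiliar with why an ePub only has "files" and not "pages", here is a minimal standard-library sketch of walking the spine. It is not how the plugin or calibre does it internally (the function name `spine_files` is made up for this example); it only relies on the EPUB container layout, where `META-INF/container.xml` points at the OPF package file and the OPF `<spine>` lists manifest items in reading order:

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def spine_files(epub):
    """Return the content file paths of an EPUB in reading (spine) order.

    `epub` may be a file path or a file-like object.
    """
    with zipfile.ZipFile(epub) as zf:
        # container.xml tells us where the OPF package document lives.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", CONTAINER_NS).get("full-path")
        opf = ET.fromstring(zf.read(opf_path))
        # The manifest maps item ids to hrefs (relative to the OPF file).
        manifest = {
            item.get("id"): item.get("href")
            for item in opf.findall("opf:manifest/opf:item", OPF_NS)
        }
        base = posixpath.dirname(opf_path)
        # The spine references manifest ids in reading order.
        return [
            posixpath.join(base, manifest[ref.get("idref")])
            for ref in opf.findall("opf:spine/opf:itemref", OPF_NS)
        ]
```

Each returned entry is one HTML file of arbitrary length, which is why "the last few pages" cannot be located reliably: they may be one small file, several files, or the tail end of a single file holding the whole book.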

So it all gets a bit messy and crude. Maybe I shall make it that I scan the very last page in reverse order first, and then scan the rest of the book in normal order.

I fixed the other bug you reported btw, as you said it doesn't really impact the functionality as such which is why no-one else noticed it but nice to get rid of the error nonetheless.

capnm
06-08-2011, 08:55 AM
Oh ....
I guess I was confused by the fact the [desired] ISBN didn't show up in the log.
I thought you listed all the potential ISBNs found.

Rather than attempt messy page parsing, how about just preferring ISBN-13s to ISBN-10s when both are found?

(Actually I thought you were already doing that, looking at the logs ...)

kiwidude
06-08-2011, 09:05 AM
@capnm - it does already do that. But the logic currently is to stop scanning on the page/file in which it finds its first valid ISBN (it parses that whole page before stopping). The assumption is that if a book has both a 10-digit and a 13-digit ISBN, they will both be on the same page/file and hence the ISBN13 will be selected if it exists there. And, as is the case here, that if the ISBNs are spread across different pages/files, they refer to different books. It is not uncommon for books to have ISBNs for other books - just like in this case you have an ISBN for an audio edition.
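
The per-page selection described above could be sketched roughly as follows. This is an illustration under assumptions, not the plugin's code: the `CANDIDATE` pattern and the function names are made up, while the checksum rules are the standard ISBN-10 (mod 11) and ISBN-13 (mod 10) ones and the 977/978/979 prefix restriction is taken from the first post:

```python
import re

# Hypothetical, simplified candidate pattern: 10 to 17 characters of digits and
# hyphens, starting with a digit and ending with a digit or X.
CANDIDATE = re.compile(r"\d[\d-]{8,15}[\dX]")


def is_valid_isbn10(s):
    """Standard ISBN-10 check: weights 10..1, sum divisible by 11, X counts as 10."""
    if not re.fullmatch(r"\d{9}[\dX]", s):
        return False
    total = sum((10 - i) * (10 if ch == "X" else int(ch)) for i, ch in enumerate(s))
    return total % 11 == 0


def is_valid_isbn13(s, prefixes=("977", "978", "979")):
    """Standard ISBN-13 check: alternating weights 1 and 3, sum divisible by 10."""
    if not re.fullmatch(r"\d{13}", s) or not s.startswith(prefixes):
        return False
    total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(s))
    return total % 10 == 0


def pick_isbn(page_text):
    """Scan one whole page/file: prefer the first valid ISBN-13, else the first valid ISBN-10."""
    first10 = None
    for raw in CANDIDATE.findall(page_text):
        digits = raw.replace("-", "")
        if is_valid_isbn13(digits):
            return digits
        if first10 is None and is_valid_isbn10(digits):
            first10 = digits
    return first10


print(pick_isbn("ISBN 0-425-20743-9 and eISBN: 978-0-375-89036-9"))
```

A caller would run `pick_isbn` on each page/file in turn and stop at the first one that returns a value. Note that the spurious value reported earlier in the thread, 2360011111, also passes the ISBN-10 checksum, so a checksum alone cannot filter out that kind of false positive.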

capnm
06-08-2011, 10:21 AM
the logic currently is to stop scanning on the page/file that it finds its first valid ISBN (parse that whole page before stopping)

Ahhh ... that's part I didn't get.


I'm pretty darn happy with the plug-in as is. I'm not sure how much you can start second-guessing the layout :)

You'll never avoid the books with "further reading" isbn lists, etc.

Maybe include an option to search only for 13s?

That would also quench some of the 800 number false positives I've noticed, but not worried much about. (They're easy enough to spot).

edit:
Better - config option to only return 10s if no 13 is found.

kiwidude
06-08-2011, 11:44 AM
I've made the change to scan the last two files of non-PDF books in reverse (if the book has more than two pages). As you say it is a lottery, but I know it is common to put ISBN at the back of the book so sometimes you will get lucky.

I'm making some other changes regarding logging which will be dependent on the next Calibre release, so I won't release it until Friday.

kiwidude
06-12-2011, 02:19 PM
Changes in this release:

Fix bug occurring when same ISBN extracted for a book
For non PDF file types, based on #files in books scan first x files, last y in reverse then rest
When scan fails, still give option to view the log rather than standard error dialog


Note that this requires Calibre 0.8.5 or later.
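
The scan order for non-PDF formats ("first x files, last y in reverse, then the rest") can be pictured with a small sketch. The function name and the default values for `front` and `back` are assumptions for illustration; the real x and y depend on the number of files in the book and are not stated in this thread:

```python
def scan_order(files, front=3, back=2):
    """Return files in the order: first `front`, last `back` reversed, then the remainder."""
    n = len(files)
    front = min(front, n)
    back = min(back, n - front)  # never revisit a file already covered by the front slice
    head = files[:front]
    tail = files[n - back:][::-1]
    middle = files[front:n - back]
    return head + tail + middle


print(scan_order(list("abcdefgh"), front=2, back=2))
```

For a short book the front and back slices are clamped so that every file is still visited exactly once.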

Philosopher
06-13-2011, 03:51 AM
Is it possible to integrate this plug-in with the Jobs indicator to monitor its progress and know that it is still working?

kiwidude
06-13-2011, 05:24 AM
Not any more than it already has, for when you do a batch of books. It cannot report progress within a book.

nynaevelan
06-24-2011, 02:05 PM
Hi Kiwidude:

Since I FINALLY have my library to where I want it to be and all my fabulous plugins are doing what I want/need them to do, it is now time to start playing with some new plugins. :rofl::rofl: I was looking at this one and I was thinking this would be a good tool to check to see if I have the correct isbn assigned to the correct book, however this plugin appears to only download into the isbn identifier field. Is there a way to have it download into a custom column or perhaps you are a regex expert and you could help me with a regex that would take my existing isbn and move it to a custom column which I have for the isbn?? Yes I have now become as obsessed with my ebook library as I am with my digital music library. :eek: :smack:

Nyn

capnm
06-24-2011, 03:52 PM
Is there a way to have it download into a custom column or perhaps you are a regex expert and you could help me with a regex that would take my existing isbn and move it to a custom column which I have for the isbn??


Been doing just that ... it's not really even a regex:

On the search & replace tab of bulk editing metadata -
Search Mode = Regular Expression
Search Field = Identifiers
Identifier Type = isbn
Search For + Replace With = leave these blank
Destination Field = your custom column


And to clear the isbn, as above but:
Search for = .
Destination Field = Identifiers
Identifier type = isbn (here it's not a drop down, but it still works)

nynaevelan
06-24-2011, 05:43 PM
Thanks CapM, I will use that and move my isbn's and then run the script to see if my isbn matches.

Nyn

capnm
06-24-2011, 05:59 PM
Thanks CapM, I will use that and move my isbn's and then run the script to see if my isbn matches.

Which reminds me ... when I was messing around with this I did something like this:

Create 3 custom columns:
#oldisbn
#newisbn
#isbncompare -- a column built from other columns, sort/search by Yes/No
{#isbncompare:'strcmp(field('#oldisbn'), field('#newisbn'), 'No', 'Yes', 'No')'}

Save the existing isbn to #oldisbn
Clear isbn
Run Extract ISBN
Copy the new isbn to #newisbn

browse and observe ...

nynaevelan
06-24-2011, 06:03 PM
And to clear the isbn, as above but:
Search for = .
Destination Field = Identifiers
Identifier type = isbn (here it's not a drop down, but it still works)

I am not sure I understand this part, what do I put in the Replace with field??

capnm
06-24-2011, 06:31 PM
I am not sure I understand this part, what do I put in the Replace with field??

Nothing.

Put a period in the Search for field, and leave the Replace with field blank.

Edit:
Or maybe a space, but I'm pretty sure leaving it empty works.

drMerry
06-24-2011, 06:39 PM
I got an interesting one:

First page of the book:

M.Y.T.H. Inc. Link – Myth 07
By Robert Asprin
1 Another Fine Myth 0-441-02362-2 1978 Ace
2 Myth Conceptions 0-441-55521-7 1980 Ace
3 Myth Directions 0-441-55529-2 1982 Ace
4 Hit or Myth 0-441-33851-8 1983 Ace
5 Myth-ing Persons 0-441-55276-5 1984 Ace
6 Little Myth Marker 0-441-48499-9 1985 Ace
--->M.Y.T.H. Inc. Link 0-441-55277-3 1986 Ace
8 Myth-nomers & Im-pervections 0-441-55279-X 1987 Ace
9 M.Y.T.H. Inc. in Action 0-441-55282-X 1990 Ace
10 Sweet Myth-tery of Life 0-441-00194-7 1994 Ace
11 Something M.Y.T.H. Inc. Not yet released ? Ace

This is taken from book number 7.
If you would process it automatically, the way to go would be to look for the title on the same row as the isbn.

This could be done in case you find more than one ISBN.
At this moment, this is the first book (series) I have found with this kind of identification. Don't know if anyone else has seen it before?

nynaevelan
06-24-2011, 06:39 PM
Thanks the period worked. :)

drMerry
06-24-2011, 06:54 PM
And then there it is, a second series:


#01: The Vanishings ISBN 0-8423-2193-4
Page 64
#02: Second Chance ISBN 0-8423-2194-2
#03: Through the Flames ISBN 0-8423-2195-0
#04: Facing the Future ISBN 0-8423-2196-9
#05: Nicolae High ISBN 0-8423-4325-3
#06: The Underground ISBN 0-8423-4326-1
#07: Busted! ISBN 0-8423-4327-X
#08: Death Strike ISBN 0-8423-4328-8
#09: The Search ISBN 0-8423-4329-6
#10: On The Run ISBN 0-8423-4330-X
_________________________

kiwidude
06-25-2011, 05:20 AM
@drMerry - you are dreaming if you think this plugin is going to ever attempt to detect that situation. Remember this plugin was changed to not even attempt to match the single "ISBN" type prefix due to all the ambiguities, they are far worse trying to match on title.

Consider this the same situation as a publisher advertising another book inside - the plugin can never detect that nor will it attempt to. "Most of the time" it will get things right, and sometimes in situations like this it won't, such is life.

drMerry
06-25-2011, 06:44 AM
@drMerry - you are dreaming if you think this plugin is going to ever attempt to detect that situation.

I love dreaming ;)
At this moment I'm not very active on the forum. So I do not know what you are working on at all.
This was just something I found and thought you might want to have on a wish list for some time. (I found more books that were not indexed correctly, but this is one case that could possibly be solved by computer.)

I think most new functions in this plugin will be 'extended search', or how to choose the most likely correct ISBN if more than one is found.
That's why I mentioned it.

Maybe someday I will try to build some of those functions. But for now I fully understand that/why you are not implementing this function.

capnm
06-25-2011, 06:37 PM
Hi kiwidude!
I'm breaking things again ...

calibre, version 0.8.7
ERROR: Unhandled exception: <b>NameError</b>:global name 'question_dialog' is not defined

Traceback (most recent call last):
File "site-packages\calibre\gui2\dialogs\message_box.py", line 199, in do_proceed
File "calibre_plugins.extract_isbn.action", line 130, in _check_proceed_with_extracted_isbns
NameError: global name 'question_dialog' is not defined

I think this was supposed to be the warning/question box about how I had changed metadata after the operation started.

drMerry
06-27-2011, 02:06 PM
I found a new option.
If there is no isbn available, but you can find a 9-digit number, it is possible this is a pre-isbn. A sbn-number.
Adding a 0 in front of this number creates a valid isbn-number.

The original standard book number (SBN) had no group identifier, but affixing a zero (0) as prefix to a 9-digit SBN creates a valid 10-digit ISBN. Group identifiers form a prefix code; compare with country calling codes.

WIKI-isbn (http://en.wikipedia.org/wiki/International_Standard_Book_Number)

E.g. in my library:
DT book used for version 1.1 reference: SBN 4260 5012 6

To get a valid isbn add 0: 04260 5012 6
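
A small sketch of that conversion, in case it helps anyone doing it by hand (the function name `sbn_to_isbn10` is hypothetical, and this is not something the plugin does): prefix the 9-digit SBN with a zero and confirm the result passes the normal ISBN-10 checksum.

```python
import re


def sbn_to_isbn10(sbn):
    """Convert a 9-digit SBN to an ISBN-10 by prefixing '0'; return None if it does not validate."""
    digits = re.sub(r"[\s-]", "", sbn)
    if not re.fullmatch(r"\d{8}[\dX]", digits):
        return None
    candidate = "0" + digits
    # Standard ISBN-10 checksum: weights 10..1, X counts as 10, sum divisible by 11.
    total = sum((10 - i) * (10 if ch == "X" else int(ch)) for i, ch in enumerate(candidate))
    return candidate if total % 11 == 0 else None


print(sbn_to_isbn10("4260 5012 6"))
```

The checksum step matters because any 9-digit number would otherwise become a "candidate", which is exactly the false-positive risk with scanning for bare SBNs.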

kiwidude
07-01-2011, 10:09 PM
Changes in this release:

Fix bug of question dialog when metadata has changed not being displayed


Thanks @capnm for reporting this.

drMerry
07-03-2011, 06:21 PM
There is a variant on the sbn version:
sbn: 345-24366-8-150
gives a valid isbn by adding a 0 in front and removing the last numbers to gain a 10 digit number
0345-24366-8

kiwidude
07-03-2011, 06:47 PM
@drMerry - I don't want to do that because it increases further the likelihood of false ISBNs. Perhaps it is a specific publisher or country specific thing, but I haven't (knowingly) come across any books yet that fall into that scenario. Supporting such an edge case is only going to weaken the match quality for the majority of cases.

drMerry
07-03-2011, 06:49 PM
Well, at the moment I have about 1000 books having no isbn but do have a sbn.
So I think there will be more people having it.

When I manually add the sbn as isbn, I often get a match downloading the metadata so it could be a nice option.

drMerry
07-06-2011, 01:12 PM
I got a book where no ISBN was found.
Copying and pasting the ISBN from the PDF into calibre was no problem.
Starting job: Extract ISBN for 1 books
Running scan for isbn query with parameters:
{u'paths': [(u'PDF', u'C:\\local (laptop)\\Onbekend\\Great Book of Puzzles (18786)\\Great Book of Puzzles - Onbekend.pdf')], u'timeout': 30, u'title': u'Great Book of Puzzles'}
-------------------------------
Scanning: C:\local (laptop)\Onbekend\Great Book of Puzzles (18786)\Great Book of Puzzles - Onbekend.pdf
Scan time: 23.503000021 Great Book of Puzzles
The scan failed to find an isbn in 23.50 seconds
Failed to extract ISBN for Great Book of Puzzles
Scan complete, with 1 failures

I'll send the book by pm

kiwidude
07-06-2011, 02:09 PM
@drMerry - my guess from looking at the PDF is the text is behind an image. The PDF conversion engine never picks up the text in that situation, so there is no ISBN to find. I would guess that if you tried to convert that PDF to an EPUB you would find that page was rendered as an image in the EPUB.

drMerry
07-06-2011, 05:55 PM
You're right.
Stupid I did not think of that.
Thanks for looking.

mobilemax
08-08-2011, 04:41 PM
Any chance to add something like "timeout" option to the script? I have had some books where the script just stayed working for hours and it never finished. Would it be possible to say stop the task on current book after a specified time? e.g. 5 minutes maximum?

thanks!

kiwidude
08-08-2011, 04:46 PM
@mobilemax - The problem will not be the time taken to scan, but the time taken to convert to epub (which is calibre code) prior to the scan. You must have a particularly nasty book that Calibre is choking on. As for whether it would be possible to force a timeout, I don't know - I will add it to the list to take a look at one day.

mobilemax
08-08-2011, 04:53 PM
Yep, had quite a few and since i decided to run the whole db through ExtractISBN, it's quite boring to find that it just did not finish "these and those 500 books" and you have to find the bad ones and skip them ;-)

But I still love the script of course! ;-)

Thanks

Btw, is there any way of limiting which formats it will parse? E.g. I have .txt/.epub with the same contents because .epub was created from .txt and it would make sense to skip the .txt to make it quicker...

kiwidude
08-08-2011, 05:21 PM
No way to limit it, nor would many people want to (since unless you do all your own conversions you wouldn't know they were the exact same content). Besides, I would expect TXT to be pretty quick anyways. It is formats like LRF and graphical PDFs that Calibre chokes on the most.

jlutes
08-08-2011, 05:31 PM
I do find this script useful but it seems to fail on a pretty regular basis. Perhaps it's a problem on my end so let's start with that.
I routinely get a Windows exception error stating:
AppName: calibre-parallel.exe AppVer: 0.8.13.0 ModName: unknown
ModVer: 0.0.0.0 Offset: 025b80b5
Once I see that message I know I'm done and I might as well kill the job. I have let it sit for over an hour and it never will finish. The real kicker is, unless I'm missing something, the ISBNs it did find aren't applied if you have to kill the job. Scanning 500 books and finding out it crashed at 98% just makes my skin crawl.
My question is, what, if anything, can I provide to help find and squash whatever is causing this?

kiwidude
08-08-2011, 06:08 PM
@jlutes - you need to figure out which book and format is causing your issue. My guess would be that is a problem with a PDF since that sort of crash is likely from C++ unmanaged code (which the PDF conversions use). If you find the book causing the crash, attach it with a bug report for Kovid to take a look at. There is nothing I or the plugin can do about this, it is calling existing Calibre code.

jlutes
08-08-2011, 08:03 PM
I went to try and figure out if a certain format was causing the problem and found an even more interesting phenomenon. If I highlight a group of 10 books and run Extract ISBN, I get the error I described earlier. However, I can choose each book individually and run Extract ISBN on each one and it never errors. Are we looking at the same thing? Still a call to existing Calibre code causing the problem?

* Update *
I got a virtual memory warning on my machine (first I've ever seen) and found that there were about 50 Dr. Watson processes running and each one of them was tied to a calibre-parallel process. After I killed all of them and restarted Calibre it appears that its attitude has changed. I am still getting Windows Exception errors but they aren't stopping the process.

capnm
08-09-2011, 12:31 AM
Hmmm...
IIRC, running this against just a couple of books it is run as part of the main Calibre process, but select several books (user configurable threshold) and it spawns a background worker process.

Oddly, I had several issues with memory leaks while running as part of the main process, but the spawned background jobs have always been well behaved on my machines. But I'm almost all epub & mobi files.

I wonder what would happen if you raised the threshold in the plugin configuration and tried that same group of 10 as a foreground process instead of as a background process ....

And are your books pdfs? Or ....?

kiwidude
08-09-2011, 02:54 AM
If you read back through this thread you would understand why it is different behaviour between running one versus multiple. Calibre has a major issue with memory leaks in the conversion process, so to work around this conversions should be done in the background. However if you are just doing a single ad hoc extract ISBN (which is how I usually tend to work) then for speed reasons I don't run it as a background job if you select only a single book.

It sounds like Calibre is crashing on your books when doing the extract when running in the background. No-one else has reported any issues with this, so I am inclined to believe it is something about the books you are scanning. You need to figure out the format that is causing the issue - duplicate the books (create empty books then merge in keeping the original), then one by one remove likely problem formats (starting with PDF) to see if it still errors.

Ababakar
08-13-2011, 06:35 AM
As I'm currently starting to use calibre and am in the process of editing all added books, I encountered some things regarding ISBN extraction.

First one:
Tricky: I got some books out of an edition with some other volumes. These are mentioned on page 2 (with their ISBN) - the ISBN of the actual book was on page 3. Don't know if there is a solution for that (maybe a hint if more than one ISBN is found). But anyway: Don't trust the extraction blindly ;)
Not to say that I don't like your work kiwi - just to remind that things are never perfect ;)
edit: just read the whole thread: this is kind of the same as mentioned in post #63 and #74 - so I guess this is already discussed. Just wanted to mention it.

Second one:
In some of my books there is no word such as "ISBN" + following number but the whole thing ("International Standard Book Number" + following number). Maybe this term can be included in later versions.

Third one:
It took me some hours to figure out such a nice search option as isbn:false - so maybe you should place this in the FAQs of the plugin or something. But as I know it now it doesn't bother me anymore ;)

Last one:
Great plugin.

The only thing left for me as a new user is to try to find a way to easily add my comic collection (cbr+cbz). But that's another topic.

kiwidude
08-13-2011, 06:47 AM
@Ababakar - welcome to MobileRead.

Yes, as you will have read repeatedly throughout this thread, there is no magic bullet for grabbing the ISBNs, and there will always be the odd situation where it either cannot find it or gets the wrong one if there are multiple. However these are the exception rather than the norm.

As for your cbr/cbz files - this plugin does not look for the word "ISBN" - the very first implementation did look for preceding words, however due to so many variations (and to cater for bad quality OCR scan errors) it now just looks for a sequence of numbers that start with the right prefix for ISBNs and validate as an ISBN. If it cannot find such in your comic books my guess is that they are images rather than text, which the plugin cannot scan. You can see from looking at the log what text numbers it did attempt to match on, and you can always do a conversion to ePub to verify for yourself what "text" the plugin found available to scan (since for all but PDFs that is exactly what the plugin is doing in the background - silently doing a conversion to ePub and then scanning the html pages for text). If your comic shows up as an EPUB containing image files where the ISBN is then that proves the plugin will be unable to extract it.
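
To make the "sequence of numbers that validates" idea concrete, here is a rough sketch in Python. It is not the plugin's actual regex or code; the pattern, the function names and the sample text are all assumptions for illustration. Only the 977/978/979 prefix rule and the checksum formulas are taken as given.

```python
import re

# Assumed pattern: 10 to 13 digits, optionally separated by single spaces or
# hyphens; the last character may be X (ISBN-10 check digit).
CANDIDATE = re.compile(r"(?<![0-9])[0-9](?:[ -]?[0-9]){8,11}[ -]?[0-9Xx](?![0-9])")
ISBN13_PREFIXES = ("977", "978", "979")

def isbn13_ok(raw):
    """ISBN-13: allowed prefix, alternating weights 1 and 3, total divisible by 10."""
    if not raw.isdigit() or not raw.startswith(ISBN13_PREFIXES):
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(raw))
    return total % 10 == 0

def isbn10_ok(raw):
    """ISBN-10: weights 10..1, X counts as 10, total divisible by 11."""
    if not re.fullmatch(r"[0-9]{9}[0-9X]", raw):
        return False
    total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(raw))
    return total % 11 == 0

def find_isbns(text):
    """Return every candidate in the text that validates, in document order."""
    found = []
    for match in CANDIDATE.finditer(text):
        raw = re.sub(r"[ -]", "", match.group()).upper()
        if (len(raw) == 13 and isbn13_ok(raw)) or (len(raw) == 10 and isbn10_ok(raw)):
            found.append(raw)
    return found

print(find_isbns("Order line 1234567890123. ISBN 978-0-306-40615-7 (pbk)"))
```

Because no keyword is required, a number with the wrong prefix or a bad checksum is simply dropped, which is also why an ISBN printed inside an image can never be found this way.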

Ababakar
08-13-2011, 10:13 AM
Oh sorry - I didn't notice that it no longer looks for phrases - just for the digits.
Anyway - as I am sorting my collection: I got some books where the ISBN could not be extracted but can easily be found via Okular (KDE PDF viewer - so it is OCR'd and not only an image). Plus they are on the first 10 pages.
By the way - I also got a lot of OCR'd DjVus where they could not be extracted but found via Ctrl+F in my PDF viewer. Did I read right that DjVus won't work at all?
Anyway: If you want I can collect those PDFs (and DjVus) for you (I will simply print out the single page where I find the ISBN with CUPS (Linux PDF printer) to keep the data size small). But as I am doing like "5 books a day" this may take a while.

As for cbr/cbz - I know ;) - this was more a general comment than regarding ISBN extraction. Only wanted to tell that I am not yet sure if calibre can help me with those ones. But as said - I may discuss this in another topic.

kiwidude
08-13-2011, 11:15 AM
There are PDFs where the text is behind an image - you can select it but as far as Calibre's PDF conversion is concerned it cannot see it as anything but an image. If you find a book where you can convert the PDF to EPUB and select the ISBN as text in Calibre's ebook viewer then it might be a case the plugin should be catching. As I said above, click on the View log button to see what numbers it found if any, it might be rejecting them as invalid ISBNs or it may not be finding the numbers at all. It would be fairly unlikely that you have found another valid ISBN combination that is being rejected by the plugin due to something I can control (such as variations of emdash) but anything is possible.

As for djvu's, I don't own any so perhaps someone else can comment, but I think I read somewhere that Calibre didn't use to convert them anyways. If Calibre cannot convert a format to epub, then this plugin will not work with it.

pegasusses
09-09-2011, 04:49 PM
great plugin, thanks :D

kiwidude
09-11-2011, 06:23 AM
Changes in this release:

Upgrade to support the centralised keyboard shortcut management in Calibre


Requires Calibre 0.8.18.

maartencoertjens
11-01-2011, 09:07 PM
Hi

Thanks a lot for this plugin, my metadata searches are a lot more effective now :)
I have one small polite request: would it be possible to have an option where if the extract process doesn't find isbn, that it adds 'not found' in the ISBN column. I keep forgetting which ones I have run the search on and which ones not.
In case this is already an option that I didn't find, my apologies and please be so kind as to point me in the right direction :rolleyes:

Thanks a lot
Maarten

kiwidude
11-02-2011, 04:13 AM
@Maarten - glad you found it useful. If you just do a search for isbn:false that will tell you which books do not have isbns.

maartencoertjens
11-02-2011, 08:19 AM
Thanks a lot. However I am not clear what it shows: the books that don't have isbn, or the ones where isbn has not been extracted yet.
As my library changes all the time with new books, books moved to other libraries, etc I never can remember for which books I already did the extract isbn. So I think I keep trying to extract the isbn of books that I already tried before.
If an unsuccessful search would put something like 'not found' in the ISBN column, then I could do the search only for the books for which I never did a search.

I hope I managed to explain? :chinscratch:

Thanks a lot!
Maarten

capnm
11-02-2011, 09:10 AM
If an unsuccessful search would put something like 'not found' in the ISBN column, then I could do the search only for the books for which I never did a search.
Maarten

Create a custom column "ISBN extracted" and set it to 'true' on the books when you run the extraction.

If you only want to use extracted ISBNs you'll have to clear the ISBN field before you run the extraction -- it's not going to blank out an existing ISBN in the metadata if it doesn't find one in the text.

kiwidude
11-02-2011, 03:32 PM
Yeah it would have to be a whole new piece of functionality. And to cater for all the scenarios above it would have to be an options dropdown, something like:

If no ISBN is found then:

Do nothing (default),
Set ISBN to 'NotFound' if no existing ISBN
Set ISBN to 'NotFound' always


I leave it to others to comment whether anyone else would find it useful enough to justify the effort.

There is another option. There is already a dropdown option for this plugin which lets it display a search for books that have new or updated ISBNs after an extract completes. I could add another option to this which would allow you to display books for which no ISBNs were extracted. If you want that result to be more "permanent" (as it will get reset when you run another extract) you could select those books and assign some value to a custom column or tag using the bulk metadata screen.

That latter option would certainly be trivial to add if it would be useful.

kovidgoyal
11-02-2011, 11:16 PM
@kiwidude: Do not set an isbn to something that is not valid as an ISBN. Instead use a tag or an extra identifier.

kiwidude
11-03-2011, 04:09 AM
@Kovid - I did wonder if it was even technically possible (whether you had some sort of validation preventing it), but I found that I could indeed in the edit metadata dialog set it to whatever I like.

I completely agree it isn't very desirable from a data integrity point of view.

kovidgoyal
11-03-2011, 05:42 AM
There is validation in place whenever it is used (for example in the metadata download plugins), but not when it is set.

indigene
11-08-2011, 10:59 AM
Hello Kiwidude - Why is it that I am not able to install all the plugins you have created? I am on 0.8.8 and for each of your plugins the status mentions "Calibre upgrade required".

Funny thing is I have used many of your plugins, like ISBN extract, but after the few recent Calibre upgrades these plugins have stopped working.

kiwidude
11-08-2011, 11:32 AM
I am guessing you are being fooled by the fact that 0.8.8 is "really" a version number of 0.8.08, which is less than the minimum requirement of most of my UI plugins of 0.8.18. The majority of the plugins were upgraded to support customisable keyboard shortcuts via calibre's new centralised keyboard management (which is wonderful btw).

So just follow the instructions - upgrade your calibre to the latest build, and all should be good.
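
The 0.8.8 versus 0.8.18 confusion goes away once versions are compared as numbers rather than as text. A minimal sketch in Python, with hypothetical helper names (this is not how calibre itself performs the check):

```python
def parse_version(text):
    """Turn '0.8.18' into (0, 8, 18) so each part compares numerically."""
    return tuple(int(part) for part in text.split("."))

def meets_minimum(installed, minimum):
    """True when the installed version is at least the required minimum."""
    return parse_version(installed) >= parse_version(minimum)

print(meets_minimum("0.8.8", "0.8.18"))   # False: 8 is less than 18
print("0.8.8" >= "0.8.18")                # True: plain string comparison misleads
```

Tuples compare element by element, so the third part is treated as the number 8, which is what "really 0.8.08" means above.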

jlutes
11-10-2011, 05:05 PM
First, thanks for the plugin - it's saved me many an hour. I do have a suggestion though. It's not unusual for me to get a new book with the ISBN in the tags section or file name. Any chance of searching those as well?

kiwidude
11-11-2011, 04:12 AM
@jlutes - extracting from the filename is not at all possible - the original filename is lost the moment you add the book to calibre. However you could setup a regex to extract it from the filename as part of the adding books preferences - if it is something you want to toggle between for different filename formats I would suggest my Quick Preferences plugin.
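
As a rough illustration of such a regex, here is a sketch in Python using named groups. The filename layout ("Title - Author [ISBN]") is purely hypothetical, and the group names title, author and isbn are the ones I understand calibre's filename pattern to recognise, so check them against the adding books preferences before relying on this.

```python
import re

# Hypothetical layout: "Some Title - Some Author [9780306406157]"
FILENAME_PATTERN = re.compile(
    r"(?P<title>.+?) - (?P<author>[^\[]+?) \[(?P<isbn>[0-9Xx-]{10,17})\]"
)

def parse_filename(stem):
    """Return the named groups as a dict, or None if the name does not fit."""
    match = FILENAME_PATTERN.match(stem)
    return match.groupdict() if match else None

print(parse_filename("Some Title - Some Author [9780306406157]"))
```

Trying the pattern on a handful of real filenames first is worthwhile, since a title that itself contains " - " will split in the wrong place.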

As for the "tags" section, do you mean inside the metadata of the book? Again you "could" use the "Read metadata from file" option when adding the book to obtain this.

If like me you prefer not to have that option enabled (since the title/author are far too often garbage for my taste and I prefer getting from the filename), then I can see a case for wanting the Extract ISBN plugin to attempt to read it from metadata. I guess my hesitation is the complication of what "priority" any ISBN there should have. Bear in mind that every time a conversion is done on a book (or the Update metadata feature in the Modify ePub plugin is used) it is overwriting that book metadata with the contents of whatever is set in your library. There are cases where:
- an ISBN can be found in the book content and that is the preferred one
- the ISBN extracted from the book is for a different book (such as an advertisement for a related book for the publisher), but there is one set in metadata
- no ISBN can be found in the book content, but one is present in the metadata.

So I don't know which preference should be given if I were to include scanning the metadata :)

jlutes
11-11-2011, 08:52 AM
I was afraid that was the case with the filename. As for priority scanning the metadata tags, my first thought would be to make it user-controllable via an option. I would say set the default action as "look in metadata if there is no match elsewhere" but I could see where someone might want to reverse that under certain circumstances.
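
A small sketch of how such an option could behave, in Python. Nothing like this exists in the plugin as far as this thread shows; the function name and the "prefer" setting are invented for illustration.

```python
def choose_isbn(content_isbn, metadata_isbn, prefer="content"):
    """Pick one ISBN; the non-preferred source is only a fallback.

    prefer="content"  -> look in metadata only if the text scan found nothing
    prefer="metadata" -> the reverse
    """
    if prefer not in ("content", "metadata"):
        raise ValueError("prefer must be 'content' or 'metadata'")
    if prefer == "content":
        first, second = content_isbn, metadata_isbn
    else:
        first, second = metadata_isbn, content_isbn
    return first or second

print(choose_isbn(None, "9780306406157"))  # falls back to the metadata value
```

Note that this cannot tell which of two conflicting ISBNs is the right one; it only makes the tie-break explicit and user-controlled.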

capnm
11-12-2011, 10:08 AM
I can see a case for wanting the Extract ISBN plugin to attempt to read it from metadata.

When this plugin was in its infancy, I really wanted to be able to pull the ISBN from the in-book metadata either as a fallback, or to compare, and was frustrated that there was no easy way to, in an already added book, get info from the book's internal metadata into the calibre database.

Then I decided that the garbage level, even in commercial ebooks, was just too high, and maybe ignoring the in-book metadata wasn't so bad after all (and I'm a data miser -- I hate ignoring/discarding potentially useful data).


And while this case:
no ISBN can be found in the book content, but one is present in the metadata.
is rare, this case:
no ISBN can be found in the book content, but an accurate one is present in the metadata.
is really rare.

Unfortunately, this case:
an accurate ISBN can be found in the book content, but a different one is present in the metadata.
is really common, making this:
the ISBN extracted from the book is incorrect, but there is a correct one set in the metadata.
pretty impossible to reliably detect.


I don't mind so much when no ISBN can be found in the content, but these:
the ISBN extracted from the book is for a different book (such as an advertisement for a related book for the publisher).
really nag me, because they're stealthy errors.

Maybe the next step is a Verify ISBN plugin that would check the author/title/ISBN against one of the ISBN pools and flag mismatches and not-founds ....

capnm
11-12-2011, 10:44 AM
@kiwidude:
I still get the occasional epub where Extract ISBN misses, bafflingly, but they're rare enough I just shrug, manually find the ISBN in the text, copy, paste, and move on.

I'll PM you a sample, to look at if you're curious, to ignore if you're busy :)

Thanks.

kiwidude
11-12-2011, 01:48 PM
Changes in this release:

Exclude leading spaces before the ISBN number which prevented some valid ISBNs from being detected.


@capnm - this fixed the issue with the epub you sent me, thx.

Nyssa
12-30-2011, 01:33 PM
Will this overwrite an isbn that is already there (say from downloading metadata) or does it just add the extracted one to the others?

kiwidude
12-30-2011, 01:50 PM
@Nyssa - it will always overwrite any existing ISBN if extract ISBN finds a valid one.

Nyssa
12-30-2011, 02:25 PM
Okay. Thank you.

greatdragon
03-13-2012, 01:28 AM
Super plugin! Thanks much... Calibre still keeps getting better and better...

Any idea why it does not work on all files though? I have some books in my collection for which I CAN find the isbn number when I open the pdf file and look for it myself, but that the plugin didn't get right... Would you be interested in some books for which it doesn't work?

Cheers,

-Stephan

There is a slight mod you could make to the regex you are using to reduce its work load: the "\s*" at the start of your regex is unneeded, and in some of my formatted PDFs it caused an issue, as there were no valid spaces so the regex came up false. Removing it also offers a small performance boost; not huge, but with regexes anything you can trim out saves CPU.


but great plugin keep up the good work and if I notice any other improvement I will let you know

kiwidude
03-13-2012, 05:20 AM
@greatdragon - first the disclaimer - I am not a regex guru, I know the basics to get by. However note that it is a \s* not \s+, so it should make no difference to your pdf, as it is 0 or more matches?

I can't remember all the reasons why it is there, there have been many iterations of this plugin over its lifetime to get to where it is today. It may have been to catch some case I can't recall. Or it may have been to "soak up" leading spaces to prevent a document with loads of consecutive spaces reporting as matches (since space is a valid character in the next part of the expression).

Now if others who know far more about regex than me agree with your finding then I can look to change it, but I am firmly in the "if it ain't broke don't fix it" camp. :) Performance isn't a reason if the change were to reduce its effectiveness for some reason, particularly since it runs as a background job.
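
The difference between \s* and \s+ is easy to see in isolation. The patterns below are deliberately simplified stand-ins (13 digits only), not the plugin's real expression:

```python
import re

optional_ws = re.compile(r"\s*[0-9]{13}")   # zero or more leading whitespace
required_ws = re.compile(r"\s+[0-9]{13}")   # at least one leading whitespace

no_space = "ISBN:9780306406157"
with_spaces = "ISBN:   9780306406157"

print(optional_ws.search(no_space) is not None)    # True: \s* may match nothing
print(required_ws.search(no_space) is not None)    # False: \s+ needs a space
print(repr(optional_ws.search(with_spaces).group()))  # leading spaces are soaked up
```

So a leading \s* on its own cannot make a match fail when there are no spaces; it only changes where the match starts.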

Joanna
05-27-2012, 07:09 PM
I have just switched to a new installation of Calibre Portable and now, for some reason, I get an error every time I launch Extract ISBN on a .pdf file ("access violation"). The plugin works impeccably with epub files, no other errors occurred in Calibre. Any ideas? All help appreciated :).

Dinesh.kaundal
05-28-2012, 02:25 AM
After I upgraded Calibre from 0.8.52 to 0.8.53, the ISBN Extract plugin crashes calibre when it is executed.
My system details are:
OS Windows 7 x64 SP1


After rolling back to calibre-0.8.52 it is working fine again.

Regards

Dinesh

kiwidude
05-28-2012, 03:07 AM
Normally I run from the latest source code and the last binary install I had done was 0.8.52 (everything working fine). I just installed the binaries for 0.8.53, then I also find that calibre crashes with 0.8.53 (but only when scanning PDF files.)

Which implies that perhaps Kovid "broke something" in the PDF code (which being C++ is the most likely thing to cause such a crash).

@Kovid - here is what my code does where I believe it is crashing:

def _read_pdf_txt(self, book_path, start_page, end_page):
    from calibre.constants import plugins
    pdfreflow, pdfreflow_err = plugins['pdfreflow']
    with open(book_path, 'rb') as stream:
        tdir = PersistentTemporaryDirectory('_isbn')
        with CurrentDir(tdir):
            pages = pdfreflow.reflow(stream.read(), start_page, end_page)
            with open('index.xml', 'rb') as f:
                xml = f.read()
                #open('E:\\%d.xml'%start_page,'wb').write(xml)
            root = etree.fromstring(clean_ascii_chars(xml))
            txt = etree.tostring(root, method='text', encoding=unicode)
            return (pages, txt)

kovidgoyal
05-28-2012, 07:54 AM
@kiwidude: 0.8.53 updated to poppler 0.20 which is probably why it's crashing. I've committed some code to enable the xml output from pdftohtml; use that instead, it will prevent this kind of crash in the future.

pdftohtml(..., as_xml=True)

kiwidude
05-28-2012, 08:20 AM
@Kovid - thx for that, though if I understand you correctly you are saying to use calibre's existing PDF engine via pdftohtml rather than the poppler stuff via pdfreflow, right?

As IIRC pdftohtml is what this plugin originally used, but we found it to be very, very slow (particularly on graphical pdfs). Whereas using pdfreflow allowed the plugin to scan subsets of only the front few and last few pages.

No chance of the pdfreflow stuff getting fixed? ;)

kovidgoyal
05-28-2012, 09:07 AM
It's not a priority for me, the parts of the poppler api that pdfreflow uses are not stable, they change with pretty much every poppler 0.x release, which makes maintaining them a pain. I am switching the new pdf engine to use pdftohtml -xml which produces the same kind of output as pdfreflow, the upside being that I no longer have to maintain pdfreflow's C++ code. The downside, from your perspective, is that pdftohtml does not support specifying a pdf page range for conversion. You have four choices:

1) Maintain pdfreflow yourself, i'm happy to accept patches.

2) Ask the poppler people to implement page ranges for pdftohtml

3) Use another pdf library (calibre has both podofo and pypdf) to first extract the relevant pages and then run pdftohtml on them.

4) Live with the reduced performance

kiwidude
05-28-2012, 10:02 AM
Thanks again Kovid.

(1) would mean more work for me than I want to deal with, (plus cross platform is a big problem I can't support) and (2) would likely involve a lag of months even if they did agree, leaving users stuck in the meantime. BTW if anyone else out there wants to raise the feature request to the poppler team on our behalf anyway please do so, I can't be bothered with mailing lists personally as a way of support.

Option (3) of pypdf sounds a possibility, obviously it will be slower than pdfreflow but given that isn't an option any more it should hopefully still be faster than pdftohtml on the worst pdfs. I shall have to do some testing.

Worst case it is going to be (4). Though I recall from some examples in the early days of this plugin calibre easily taking an hour or more to process the pdf which is not nice when all you want is to see if it has an ISBN - particularly if after all that time the PDF doesn't actually have one!

kovidgoyal
05-28-2012, 11:03 AM
I just committed code that will allow you to use podofo to extract the pages, which should be pretty fast. You will need to wait till the next binary calibre release to actually use the code, since it involved making additions to the podofo C bindings in calibre.

kiwidude
05-28-2012, 12:14 PM
Thanks Kovid.

I had started doing some testing with pyPdf (and now podofo as well, for the latter I am doing nothing but calling open at this point obviously). It is interesting that unlike pdftohtml & pdfreflow, both pyPdf and podofo choke on a PDF which has security applied to it. pyPdf throws a DRMError. podofo spits an *enormous* # of console lines out, all looking like this:
3 m_nPredictor=12 m_nCurPredictor=12
before throwing an exception of: ePdfError_UnsupportedFilter

Any way of preventing the console output?

Still working out which is the least evil combination as yet :)

kovidgoyal
05-28-2012, 12:29 PM
Any way of preventing the console output?


Run it in a separate process with fork_job() in simple_worker.py. Both podofo and pypdf are pdf altering libraries and as such they choke on drmed pdfs, where poppler will not.
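
For readers unfamiliar with the idea, here is a generic sketch of running risky work in a child process using only the Python standard library. It is not calibre's fork_job() (whose signature is not shown in this thread) and the helper name is made up; it only illustrates why a separate process helps: the child's console output is captured instead of flooding the parent, and a crash or hang in the child does not take the parent down.

```python
import subprocess
import sys

def run_isolated(code, timeout=30):
    """Run Python source in a child interpreter.

    Returns (returncode, stdout, stderr); returncode is None on timeout.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,  # keep the child's console noise out of ours
            text=True,
            timeout=timeout,      # the child is killed if it runs too long
        )
    except subprocess.TimeoutExpired:
        return None, "", "timed out"
    return proc.returncode, proc.stdout, proc.stderr

rc, out, err = run_isolated("print('noisy library output'); raise SystemExit(3)")
print(rc, out.strip())
```

The same arrangement also gives a natural place to hang a per-book timeout, since the parent can give up on a child that never finishes.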

kiwidude
05-29-2012, 03:02 PM
@Kovid - was there any particular reason why you are steering me towards podofo rather than pyPdf? Is it purely performance? As the pyPdf API (with a DRMError) and not having to fork a console make it a bit nicer to use. Plus I can reverse order the back pages which might make a difference for a small # of books to reliability of match.

Obviously once 0.8.54 is released I will do my own testing to see how significant the performance difference is. The reality is that I just need to grab 10 front and 5 back pages so it may not be that significant?
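
The page selection itself is simple whichever PDF library ends up doing the extraction. A sketch in Python of the "10 front pages, then 5 back pages in reverse order" plan, assuming 1-based page numbers and a hypothetical function name (this is not the plugin's code):

```python
def pdf_scan_order(page_count, front=10, back=5):
    """Return page numbers to scan: the front pages first, then the last
    pages in reverse order, never listing the same page twice."""
    front_pages = list(range(1, min(front, page_count) + 1))
    # Back pages start after the front block so short PDFs do not overlap.
    back_start = max(page_count - back + 1, len(front_pages) + 1)
    back_pages = list(range(page_count, back_start - 1, -1))
    return front_pages + back_pages

print(pdf_scan_order(300))
print(pdf_scan_order(12))
```

Keeping the order explicit means the scan can stop at the first page that yields a valid ISBN, and the reverse order at the back checks the final page before the ones preceding it.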

@everyone else - just to summarise if you hadn't guessed already from all the techy speak, this plugin is broken using calibre 0.8.53. I have a fixed version here ready to go, but it needs calibre 0.8.54. I should be able to release it by this weekend I would assume, so either avoid using the plugin until then or use 0.8.52 in the meantime...

kovidgoyal
05-29-2012, 03:33 PM
pypdf hangs (as in goes into an infinite loop) on some PDF files. So really if you want to use it you should be running it in a worker process anyway. That's why calibre uses podofo rather than pypdf to set pdf metadata.

kiwidude
05-29-2012, 03:39 PM
Thanks for the warning, that sounds like it is best avoided then, darn it. :)

kiwidude
06-01-2012, 12:47 PM
As posted above, this plugin has been broken for PDF scanning due to a change made in calibre 0.8.53. So users have either had to stick to earlier versions, or learn not to extract from PDFs :).

Attached is my next intended version which requires calibre 0.8.54 - it would be appreciated if someone could give it a quick whirl before I officially release it.

I believe there is still a very slim chance in exceptional circumstances that scanning a PDF could still cause a calibre crash. I haven't had this happen, but in theory it could do. However when the next calibre 0.8.55 is available some code I have already included in this plugin version will automatically become active and safely handle that situation without crashing. I didn't want users to be stuck waiting for another week for the sake of something they probably won't have happen.

Please can one or two of you try this and let me know on the thread if any issues, then I will officially release it.

kiwidude
06-03-2012, 08:12 AM
Changes in this release:

Minimum version set to calibre 0.8.54 (but preferred version is 0.8.55)
Performance optimisation for epubs for calibre 0.8.51 to reduce unneeded computation
Change to calibre API for deprecated dialog which caused issues that intermittently crashed calibre
Minor fix to ensure HTMLPreProcessor object is initialised correctly
Change to using different pdf engines for pdf processing due to calibre 0.8.53 breaking the one I was using.
Stability improvement will activate with calibre 0.8.55 by running pdf analysis on a forked thread


Anyone who tried the beta version above, please make sure you force an update to this officially released version (has the same version number).

PeterT
06-03-2012, 11:50 AM
I might be pedantic but I am confused by these entries in the change log...


Minimum version set to calibre 0.8.54 (but preferred version is 0.8.55)
Performance optimisation for epubs for calibre 0.8.51 to reduce unneeded computation


Surely no users on calibre 0.8.51 can install this version to get the reduced computation fix..

kiwidude
06-03-2012, 11:54 AM
Yes you are being pedantic :). The performance change Kovid made was in calibre 0.8.51, and that *was* going to be the minimum version for the next release of this plugin. Then 0.8.53 came out, broke the PDF extraction so I delayed releasing and the new minimum is now 0.8.54, with 0.8.55 preferred when it gets released this week.

Joanna
06-10-2012, 03:02 PM
I'm so glad to see your plugin working again! Thank you!

And one little improvement idea:
I guess plugin authors cannot change anything in the "Edit metadata" dialog but there is one thing that has been bugging me for quite a long time: I would love to have a little "Extract ISBN" button next to the IDs row in "Edit metadata" (right next to "Clear IDs" or "Paste the contents..."). It's not doable, though, is it? This would be great as I would often go to the "Edit metadata" dialog, then realize there is no ISBN for a given book, close the dialog, extract ISBN and open the "Edit metadata" dialog once again... not very practical.

kiwidude
06-10-2012, 03:42 PM
@Joanna, no it is not currently possible. There is no provision in calibre to enable launching a plugin from the edit metadata dialog, only from the main toolbars or context menu.

All I can suggest is having a custom column to display isbn or having a look at the book details panel on the right before opening the edit metadata window.

Joanna
06-10-2012, 04:11 PM
Thanks, that's what I thought :(. It's a pity, Extract ISBN should be a built-in Calibre feature :).

Thanks for the tips; unfortunately I don't have enough space to have ISBN shown as a custom column or in book details. My workaround consists of trying to extract ISBN before I even go to the "Edit metadata" dialog :).

affa
06-15-2012, 10:10 PM
can i ask a reallllllly silly question? what is the main benefit of extracting an isbn? is it so you can then easier grab proper metadata?

kiwidude
06-16-2012, 02:07 AM
The biggest use I personally make of it is where I am importing books where the title and author fields are not set, such as by the book having a random filename. Rather than manually typing it in you could extract isbn and do a metadata download with the option to overwrite title and author.

Metadata downloads with an isbn all but guarantee you a better chance of getting the right metadata from the website, since most metadata plugins will look up by isbn if available and fall back to a title and author search if not. The latter is more error prone due to spelling errors, typos, series info in the title field etc.

And a small minority undoubtedly use it because they are sufficiently fussy to want the isbn field to contain the value for their specific edition of that book.

stanmarsh
07-28-2012, 07:09 PM
hello kiwidude,

i'm wondering if it's possible to add some sort of group limit to extraction?
i tried selecting all books but it will take too long to finish. the group limit will group the queues into 10, 20 or 50 (depending on how powerful the computer is).

e.g.

library 5000 books with no identifier
using identifiers:"=false"
select all the books
extract isbn (group into 10 - slow computer/usb2)
5000 books = 500 groups (groupings could be similar to find duplicate)
translate to 500 jobs
--go to work - shut down computer--
check calibre-jobs for what group you are in (N)
close calibre
--start calibre--
extract isbn from group N to group N


something like that, is that possible? it will prevent calibre from hanging.

thanks

kiwidude
07-31-2012, 05:28 AM
@stanmarsh - give this version a whirl. By default the batch size is 100, but you can increase/reduce it in Preferences -> Plugins -> Extract ISBN -> Configure plugin.

Note that there are a couple of side effects if the number of books you have selected is more than your batch size causing multiple jobs to be run:

You will get prompted each and every time a batch completes. This will not be changed and is the same behaviour as what you see with bulk metadata downloads.
As part of the output this plugin displays what books it did not even attempt to retrieve data for (e.g. book had no formats). This information will now get displayed on the first batch job completing only.
There is an option which some users have turned on to execute a search to show which books have been updated. If you use this option, you are only going to see books for that batch. When the next batch completes, you will then only see books for that batch and so on.
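For anyone curious how splitting a selection into batches works in principle, here is a minimal sketch. It is not the plugin's actual code: the function name `split_into_batches` and the plain list of book ids are assumptions made purely for illustration, and only the default of 100 comes from the post above.

```python
def split_into_batches(book_ids, batch_size=100):
    """Yield consecutive slices of book_ids, each at most batch_size long.

    Hypothetical helper for illustration only; the default of 100 mirrors
    the default batch size mentioned in the post.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(book_ids), batch_size):
        yield book_ids[start:start + batch_size]


# 5000 selected books in groups of 10 would give 500 separate jobs,
# matching the arithmetic in stanmarsh's request.
selected = list(range(1, 5001))
batches = list(split_into_batches(selected, batch_size=10))
print(len(batches))
```

Each batch would then be queued as its own background job, which is why the completion prompt appears once per batch.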

kiwidude
08-01-2012, 05:15 AM
Changes in this release:

Split bulk extraction into batches with size changeable via plugin configuration

stanmarsh
08-05-2012, 10:01 PM
hello kiwidude!

thanks for implementing the feature request!:thanks: will test it out!:thanks:

userpaul
09-26-2012, 07:34 PM
Fantastic...thank you

myce
10-05-2012, 05:20 AM
Extract ISBN is really great at extracting ISBNs from the book's text. But this made it stumble.

From "The Definitive Guide to How Computers Do Math: Featuring the Virtual Diy Calculator" page 2:
For general information on our other products and services please contact our Customer Care
Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993 or fax 317-572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print,
however, may not be available in electronic format.
Library of Congress Cataloging-in-Publication Data is available.
ISBN-13 978-0471-73278-5
ISBN-10 0-471-73278-8
results in the log file:

Invalid ISBN match: 877-762-2974
Valid ISBN10: 3175723993
Invalid ISBN match: 317-572-4002
Invalid ISBN match: -13 978-0471-73278
Invalid ISBN match: -10 0-471-73278-8

I understand that it detects 3175723993 as a valid ISBN. But maybe you could make it reparse substrings if the number it found is longer than 10/13 digits. Or maybe even look for the string ISBN.{,3}1[03] explicitly and give the numbers in its vicinity higher precedence.
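A rough sketch of the second idea, giving precedence to numbers that directly follow an "ISBN-10"/"ISBN-13" style label. This is only an illustration of the suggestion, not how the plugin works: the patterns `LABELLED` and `GENERIC` and the function names are made up for this example.

```python
import re

# Hypothetical patterns: a label such as "ISBN-13" or "ISBN 10",
# followed by a run of digits, spaces and dashes.
LABELLED = re.compile(r'ISBN.{0,3}1[03][\s:]*([0-9][0-9\- ]{8,15}[0-9Xx])')
GENERIC = re.compile(r'[0-9][0-9\- ]{8,15}[0-9Xx]')


def normalise(raw):
    """Strip everything except digits and a possible X check character."""
    return re.sub(r'[^0-9Xx]', '', raw).upper()


def candidates_by_precedence(text):
    """Return candidate numbers, those following an ISBN label first."""
    labelled = [normalise(m.group(1)) for m in LABELLED.finditer(text)]
    generic = [normalise(m.group(0)) for m in GENERIC.finditer(text)]
    rest = [c for c in generic if c not in labelled]
    return labelled + rest


page = ("outside the U.S. at 317-572-3993 or fax 317-572-4002.\n"
        "ISBN-13 978-0471-73278-5\n"
        "ISBN-10 0-471-73278-8")
print(candidates_by_precedence(page)[:2])
```

The candidates would still need checksum validation afterwards; the ordering only decides which ones are tried first.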

theducks
10-05-2012, 09:51 AM
IMHO only 1 parse rule at a time should be used. the last 2 broke that rule and therefore failed to find a valid ISBN. Space or Dash, not both in the same substring

Once found (a 10 character ISBN 10), the check digit should validate. A NANP phone number should fail in nearly 100% of cases; the 317-572-3993 number is one of those :rolleyes: edge cases.
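For reference, the standard ISBN-10 check (weights 10 down to 1, sum divisible by 11) can be sketched as below. The function name `is_valid_isbn10` is just for illustration and is not taken from the plugin. It shows why one of the phone numbers in the log slips through while the other two are rejected.

```python
import re


def is_valid_isbn10(candidate):
    """Standard ISBN-10 check: weighted sum (10..1) must be divisible by 11."""
    if not re.fullmatch(r'[0-9]{9}[0-9Xx]', candidate):
        return False
    total = sum(int(d) * w for d, w in zip(candidate[:9], range(10, 1, -1)))
    last = candidate[9]
    total += 10 if last in 'Xx' else int(last)
    return total % 11 == 0


for number in ("8777622974", "3175723993", "3175724002", "0471732788"):
    print(number, is_valid_isbn10(number))
```

Since roughly one in eleven random 10-digit strings passes this check, a phone number will occasionally validate by chance.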

myce
10-05-2012, 05:45 PM
IMHO only 1 parse rule at a time should be used. the last 2 broke that rule and therefore failed to find a valid ISBN. Space or Dash, not both in the same substring

Well, yes and no. Had the publisher decided to use spaces instead of dashes, your suggestion would still find the number 13 978 0471 73278 5 which wouldn't be valid without parsing all substrings of 13 digits length.
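To make the substring idea concrete, here is a small sketch that slides a 13-digit window over an over-long match and keeps only the windows that pass the ISBN-13 checksum. The names `is_valid_isbn13` and `isbn13_windows` are hypothetical; the prefix list reuses the 977/978/979 defaults from the first post.

```python
import re


def is_valid_isbn13(digits, prefixes=("977", "978", "979")):
    """ISBN-13 check: allowed prefix, and alternating 1/3 weights sum to 0 mod 10."""
    if not re.fullmatch(r'[0-9]{13}', digits):
        return False
    if not digits.startswith(prefixes):
        return False
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0


def isbn13_windows(raw):
    """Return every valid ISBN-13 found as a 13-digit window inside raw."""
    digits = re.sub(r'[^0-9]', '', raw)
    return [digits[i:i + 13]
            for i in range(len(digits) - 12)
            if is_valid_isbn13(digits[i:i + 13])]


print(isbn13_windows("13 978 0471 73278 5"))
```

With the prefix restriction in place, the windows starting at "139" and "397" are discarded before the checksum is even computed.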

theducks
10-05-2012, 06:25 PM
Well, yes and no. Had the publisher decided to use spaces instead of dashes, your suggestion would still find the number 13 978 0471 73278 5 which wouldn't be valid without parsing all substrings of 13 digits length.

have you seen a book written ISBN 10 or ISBN 13 ?

ISBN and ISBN13 are more normal (ISBN 10 is redundant. ISBN is 10 chars)