Old 03-23-2011, 09:37 AM   #1
kiwidude
calibre/Sigil Developer
Posts: 4,228
Karma: 1334002
Join Date: Oct 2010
Location: London, UK
Device: Kindle Paperwhite 3G, iPad 3, iPad Air
[GUI Plugin] Extract ISBN

This plugin can be used to try to find the ISBN for a book using the text within a book format. It is intended as an alternative to various script-based solutions to this problem posted in this thread.

Main Features of v1.4.4
  • Scans all formats for the selected book(s) in preferred input format order until an ISBN-13 or ISBN-10 is found
  • Runs as a background job in Calibre, prompting you to update when the scanning is completed.
  • Scans only the book content, excluding HTML tag markup.
  • For PDF formats, scans only the first 10 pages, then, if no ISBN is found, the last 5 pages in reverse order.
  • For other formats, scans files at the front, then a number of end files in reverse order before the remainder of the book.
  • Restricts valid ISBN-13s to those that start with 977, 978 or 979. You can add additional prefixes in the configuration if required.
  • Optionally perform a search when completed showing you only the books updated (default is off). Some users may use this to then perform a metadata download.
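
The scan order described in the feature list can be sketched roughly as follows. This is only an illustration in Python 3, not the plugin's actual code: the function names `pdf_page_order` and `file_order` are hypothetical, and the `front`/`back` counts for non-PDF formats are placeholders because the post does not state them.

```python
def pdf_page_order(page_count, front=10, back=5):
    """Yield 0-based page indexes: the first `front` pages, then the
    last `back` pages in reverse, never visiting a page twice."""
    front_end = min(front, page_count)
    yield from range(front_end)
    # Do not revisit pages already covered by the front pass.
    back_start = max(front_end, page_count - back)
    yield from range(page_count - 1, back_start - 1, -1)


def file_order(file_count, front, back):
    """Yield 0-based file indexes: front files, then end files in
    reverse, then whatever remains in the middle.
    `front` and `back` are hypothetical parameters."""
    front_end = min(front, file_count)
    back_start = max(front_end, file_count - back)
    yield from range(front_end)
    yield from range(file_count - 1, back_start - 1, -1)
    yield from range(front_end, back_start)
```

For a 12-page PDF, `list(pdf_page_order(12))` visits pages 0 to 9 and then 11 and 10, so a short document is still scanned completely without repeats.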

Special Notes:
  • Requires calibre v0.8.54 or later.
  • As this runs in the background, you must be careful not to change the books being scanned while it is running. Changing the metadata such as title or author, deleting a book or performing a conversion will risk causing a problem. Restrict any editing to other books in your library while the scan is running and you will be fine.

Installation Notes:
  • Download the attached zip file and install the plugin/add to context menu or toolbar/restart calibre as described in the Introduction to plugins thread.

Paypal Donations:
  • If you find this or any of my other plugins useful please feel free to show your appreciation. I have spent many hundreds of unpaid hours in their development and support so any encouragement for me to continue is appreciated!

Version History:
Version 1.4.4 - 30 Jul 2014
Support for upcoming calibre 2.0

Version 1.4.3 - 01 Aug 2012
Split bulk extraction into batches with size changeable via plugin configuration

Version 1.4.2 - 03 Jun 2012
Minimum version set to calibre 0.8.54 (but preferred version is 0.8.55)
Performance optimisation for epubs for calibre 0.8.51 to reduce unneeded computation
Change to calibre API for deprecated dialog which caused issues that intermittently crashed calibre
Minor fix to ensure HTMLPreProcessor object is initialised correctly
Change to using different pdf engines for pdf processing due to calibre 0.8.53 breaking the one I was using.
Stability improvement will activate with calibre 0.8.55 by running pdf analysis on a forked thread

Version 1.4.1 - 12 Nov 2011
Exclude leading spaces before the ISBN number which prevented some valid ISBNs from being detected.

Version 1.4.0 - 11 Sep 2011
Upgrade to support the centralised keyboard shortcut management in Calibre

Version 1.3.7 - 02 Jul 2011
Fix bug where the question dialog was not displayed when metadata had changed

Version 1.3.6 - 12 Jun 2011
Fix bug occurring when same ISBN extracted for a book
For non-PDF file types, based on the number of files in the book, scan the first x files, then the last y in reverse, then the rest
When scan fails, still give option to view the log rather than standard error dialog

Version 1.3.5 - 25 May 2011
Add yet another unicode variation of the hyphen separator to the regex

Version 1.3.4 - 21 May 2011
Run the ISBN extraction out of process to get around the memory leak issues

Version 1.3.3 - 19 May 2011
Ensure stripped HTML tags are replaced with a ! to prevent an ISBN running into another number and becoming invalid

Version 1.3.2 - 17 May 2011
Strip the <style> tag contents to ensure panose-1 numbers are not picked up as false positives

Version 1.3.1 - 06 May 2011
Strip non-ascii characters from the pdfreflow xml which caused it to be invalid
Support the ^ character being part of the ISBN number
Attempt to minimise any memory leak issues caused by this plugin itself

Version 1.3 - 29 Apr 2011
Do all scanning as a background job to keep the UI responsive
Remove all interactive UI options - it will now always scan all formats in preferred order
Make sure that ISBN-13s start with 977, 978 or 979 (configurable).
Exclude the various repeating digit ISBNs of 1111111111 etc.
Exclude all html markup tags to prevent issues like the svg sizes being picked up as ISBNs
Include endash and other dash variants as possible separators
When scanning PDF documents, scan the last 5 pages in reverse order so it is the last ISBN found
Configuration option for ISBN13 prefixes and option to show updated books when extract completes

Version 1.2.1 - 09 Apr 2011
Support skinning of icons by putting them in a plugin name subfolder of local resources/images

Version 1.2 - 03 Apr 2011
Rewritten for new plugin infrastructure in Calibre 0.7.53
ISBN matching regex replaced using an approach from drMerry
PDFs now processed with new Calibre PDF engine to scan just first 10 and last 5 pages

Version 1.1 - 28 Mar 2011
Add configuration options over the scan behaviour (default + alternate)
The options you have are:
Ask me which format to scan
Scan only the first format in preferred input order
Scan all formats in preferred input order until an ISBN found

Version 1.0.1 - 24 Mar 2011
Skip book formats which we are unable to read, such as djvu
Display progress in the status bar
Ctrl+click or shift+click on the toolbar button to do a non-interactive choice of formats where your book has multiple.
It will use the first found based on your preferred input format order list from Preferences->Behaviour

Version 1.0 - 24 Mar 2011
Initial release of Extract ISBN plugin
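
Several history entries above describe filtering rules: strip `<style>` contents, replace stripped tags with a `!`, require ISBN-13s to start with 977, 978 or 979, and reject repeating-digit values such as 1111111111. A minimal sketch of those rules is below. It is an assumption-laden illustration, not the plugin's code: the regular expressions and the names `strip_markup` and `is_valid_isbn` are hypothetical, and the checksum tests are the standard ISBN-10 and ISBN-13 check-digit rules, which the thread itself does not spell out.

```python
import re

# Hypothetical patterns for illustration only.
STYLE_RE = re.compile(r"<style\b[^>]*>.*?</style>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
DEFAULT_PREFIXES = ("977", "978", "979")


def strip_markup(html):
    """Drop <style> blocks, then replace each remaining tag with '!'
    so digits on either side of a tag cannot run together."""
    without_style = STYLE_RE.sub("!", html)
    return TAG_RE.sub("!", without_style)


def is_valid_isbn(candidate, prefixes=DEFAULT_PREFIXES):
    """Check a candidate ISBN-10 or ISBN-13 after removing separators."""
    digits = re.sub(r"[^0-9Xx]", "", candidate).upper()
    if len(set(digits)) == 1:
        return False  # repeating digits such as 1111111111
    if len(digits) == 10:
        if not re.fullmatch(r"\d{9}[\dX]", digits):
            return False
        # ISBN-10: weights 10..1, 'X' counts as 10, sum divisible by 11.
        total = sum((10 - i) * (10 if ch == "X" else int(ch))
                    for i, ch in enumerate(digits))
        return total % 11 == 0
    if len(digits) == 13:
        if not digits.isdigit() or not digits.startswith(tuple(prefixes)):
            return False
        # ISBN-13: alternating weights 1 and 3, sum divisible by 10.
        total = sum(int(ch) * (1 if i % 2 == 0 else 3)
                    for i, ch in enumerate(digits))
        return total % 10 == 0
    return False
```

Replacing tags with a visible character rather than an empty string is the point of the v1.3.3 entry: `<p>123</p><span>456</span>` becomes `!123!!456!` instead of `123456`.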
Attached Thumbnails
Screenshot_1_Summary.png
Screenshot_2_Configuration.png
Attached Files
File Type: zip Extract ISBN.zip (76.6 KB, 4390 views)

Last edited by kovidgoyal; 07-29-2014 at 11:41 PM. Reason: v1.4.4 Released
Old 03-23-2011, 09:54 AM   #2
talonius
Junior Member
Posts: 9
Karma: 12
Join Date: Mar 2011
Device: Kindle
I. Love. You.

Now... if we could add the extraction to the Edit Book Details window (like to the right of the ISBN text box) and then have an option to download metadata if an ISBN is found... I would have your baby.

(Although, yes, I can edit a batch and then download a batch. I tend to edit one at a time so I think one at a time. )

This has worked beautifully on 480 out of 500 books. And the 20 that didn't work I confirmed were PDFs where the contents were JPG images rather than text -- so no way for the regex to pick up the ISBN.

Oh, some sort of progress indicator would be beneficial. (Dunno if possible.)

Last edited by talonius; 03-23-2011 at 10:07 AM.
Old 03-23-2011, 10:16 AM   #3
kiwidude
Cool, glad it worked for you!

I agree that some sort of progress indicator would be useful. I just wanted to get "something" out there to see what the interest was, how people wanted to approach the multiple format/selection issue etc.

Your point about the edit book details window also confirms why I did not invest a great deal more effort at this point beyond "proving it was possible". As this plugin just wires together and reuses a few bits of Calibre code there really isn't any technical reason why it couldn't be built natively into Calibre. It is entirely down to Kovid and whether he wants to make the functionality available from screens like the Edit Metadata and Bulk Metadata dialogs.
Old 03-23-2011, 11:22 AM   #4
pchrist7
Addict
Posts: 385
Karma: 6514
Join Date: Aug 2010
Location: Denmark
Device: Kindle 3 3G+Wifi
Just me being SILLY ! - Sorry

Quote:
Originally Posted by talonius
I. Love. You.
.. I would have your baby.

(Although, yes, I can edit a batch and then download a batch. I tend to edit one at a time so I think one at a time. )
Wow - I'm getting old here.
Being a bachelor, old and all, I haven't been keeping up to date with procreation, I see

Sorry All, especially talonius & kiwidude !!!
WILL TRY to just read from now on, instead of being "funny"

Last edited by pchrist7; 03-23-2011 at 11:27 AM.
Old 03-23-2011, 01:08 PM   #5
talonius
Minor issue: If there's a format stored in Calibre that Calibre doesn't know how to handle (DjVu in this instance) the plugin throws an error and aborts processing.

Possible optimization: Abort searching through the book once a certain percentage/amount of text has been searched. This would help speed up the search for 95% of the books.

Building it into Calibre would be fantastic but since this is the major roadblock to me finishing my catalog, I'm going to continue to push it. <g> No worries, I'm looking at how to do all of my suggestions myself as possible improvements. I work in C#/C++ professionally, just not Python/Calibre. I'll just have to buckle down and do some (gasp!) reading.

As for jokes... ha! Trust me, I'm far from serious. One reason I don't participate in projects is because my joking attitude tends to grate on the more serious folks who tend to inhabit the programmer's world.
Old 03-23-2011, 03:32 PM   #6
kiwidude
@Talonius - I will push a 1.0.1 version shortly which will ensure any errors are more gracefully handled. It will also display progress in the status bar.

The optimization stuff is a tough one. The problem is that I have seen books where the copyright/ISBN information has been put at the end of the EPUB. Granted this is the exception rather than the rule, but maybe others have seen it frequently? This is the sort of operation that you will only do once on your books though so performance shouldn't be too much of an issue...

Also, I think most of the slowdown will be in the time taken to convert each book into text, not the part the plugin does of applying regular expressions to each file in it. I haven't profiled it but I am pretty confident that will be the case.

What I have done is get it to short-circuit gathering ISBNs once it has found an ISBN and finished processing the current internal file of the converted format. The logic I "borrowed" from bazbar scanned the whole book and built up lists of ISBNs should a book have multiple ISBN13s for instance. I don't know enough about when that ever happens (most books I have seen have only either one or both of an ISBN10/ISBN13 but not more than that). Finishing processing a file (hopefully all ISBNs are on the same one) and then stopping should be enough. This won't help speed up books with no ISBN inside though.
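
The short-circuit behaviour described above (finish the current internal file, then stop) might look something like this sketch. It is not the plugin's code: `SIMPLE_ISBN_RE` is a deliberately simplified stand-in for the real matching regex, and `find_isbns` and `scan_files` are hypothetical names.

```python
import re

# Simplified stand-in pattern: 13 bare digits, or 9 digits plus a
# digit or X. The plugin's real regex also handles separators.
SIMPLE_ISBN_RE = re.compile(r"\b(?:\d{13}|\d{9}[\dXx])\b")


def find_isbns(text):
    """Return every ISBN-like string found in one file's text."""
    return SIMPLE_ISBN_RE.findall(text)


def scan_files(texts, finder=find_isbns):
    """Scan files in order and stop after the first file that yields
    any ISBNs. All matches from that file are returned, so an ISBN-10
    and ISBN-13 printed together are both collected."""
    for index, text in enumerate(texts):
        found = finder(text)
        if found:
            return index, found
    return None, []
```

Because `texts` can be a generator, files after the first hit are never produced at all. As the post notes, a book with no ISBN still has every file scanned.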

I am also about to make ctrl+click or shift+click on the toolbar button do a non-interactive choice of which format to interrogate when you have multiple. This will be based on your preferred input format list in Preferences for now. I'll wait for suggestions for alternatives before doing anything else around that. For people who only have formats produced by converting the same version that will work well. Where it won't work is if, say, they got a PDF from somewhere and an EPUB from somewhere else, and the EPUB has had the ISBN stuff removed. Still, at least you will see in the report which books it failed to find an ISBN for, and you can always then just do a normal toolbar button click to get the interactive choice of format to extract from.

Last edited by kiwidude; 03-23-2011 at 04:06 PM. Reason: Added more info about performance bottleneck
Old 03-23-2011, 04:15 PM   #7
kiwidude
v1.0.1 Released

I've mentioned most of this in the previous post but to recap:
  • Skip book formats which we are unable to read, such as djvu
  • Display progress in the status bar
  • Ctrl+click or shift+click on the toolbar button to do a non-interactive choice of formats where your book has multiple. It will use the first found based on your preferred input format order list from Preferences->Behaviour
Old 03-25-2011, 03:01 AM   #8
garcle
Connoisseur
Posts: 54
Karma: 442
Join Date: Oct 2010
Location: Detroit
Device: iPad
Great and very useful plugin, thanks much.

One comment though: I have been able to (inadvertently) "choke" the plugin on a document with 1800 pages and 2 million words. It is a text PDF, and as it turns out there is no ISBN amongst the 2 million words. Is it possible to have a "fail gracefully after x time" capability?

Thanks again for what is otherwise a very useful plugin.
Old 03-25-2011, 05:11 AM   #9
kiwidude
@garcle - see my comments above in post #6. The way to test this would be to go into convert, choose search & replace and click one of the wizard buttons. That will ask Calibre to convert the document in the exact same way that my ISBN extract does. Check how long it takes for it to do this with your big PDF file to get to a point of text being displayed in the wizard box, versus how long it takes the extract ISBN functionality.

If the times are comparable, there is nothing I can do, at least not without rewriting the text conversion functionality to, say, convert just a small % of the document. Which I have no intention of doing myself.

OTOH if you think the ISBN functionality is still significantly slower than the S&R wizard then I could take a look at it. If you point me at a download somewhere of a PDF typical of the issue I will see what I can do.
Old 03-26-2011, 12:15 AM   #10
Doug-W
Member
Posts: 18
Karma: 10
Join Date: Feb 2011
Device: Nook
Quote:
Originally Posted by kiwidude
I've mentioned most of this in the previous post but to recap:
  • Skip book formats which we are unable to read, such as djvu
  • Display progress in the status bar
  • Ctrl+click or shift+click on the toolbar button to do a non-interactive choice of formats where your book has multiple. It will use the first found based on your preferred input format order list from Preferences->Behaviour
Could you make that last step be two options?
1) Run in non-interactive by default or interactive by default.
2) Follow preferred input format, or continue searching all if not found in first? I format down some of my epubs which is my preferred format.
Old 03-26-2011, 07:35 AM   #11
kiwidude
@Doug-W - thanks for the suggestions. I am applying them at the moment and will push a new version when done.

When searching all formats, do you think that option should be dependent on whether the user has interactively chosen a format? i.e. If I have interactively chosen a specific format, it should always stop after searching just that format. Whereas the "search all formats until found using preferred order" only applies when you are doing a non-interactive search?

Hope I explained myself, it is very difficult to wrap the wording around as per the screenshot - any suggestions for alternate wording welcomed

EDIT: Removed the screenshot, came up with a simpler approach...

Last edited by kiwidude; 03-27-2011 at 11:01 AM.
Old 03-27-2011, 11:16 AM   #12
kiwidude
v1.1 Released

This release adds some configuration options over the scan behaviour for when there are multiple formats for a book. You can configure both a default behaviour and an alternate behaviour (the latter when you shift+click or ctrl+click on the plugin as a toolbar button).

The options you have are:
  • Ask me which format to scan
  • Scan only the first format in preferred input order
  • Scan all formats in preferred input order until an ISBN found
Note that the last option can be slow, if you care about performance. As I have commented previously in this thread, there is very little I can do about any performance issues: the bulk of the time is spent in the converters and depends on the nature of the conversion.
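
The three options could be modelled along these lines. This is a hedged sketch only: the function `formats_to_scan`, the mode names `"ask"`, `"first"` and `"all"`, and the `ask` callback are all hypothetical and not taken from the plugin. It also assumes that formats absent from the preferred input order list are tried last, which the thread does not specify.

```python
def formats_to_scan(available, preferred_order, mode, ask=None):
    """Return the formats to scan for one book, in scan order.

    mode is one of (hypothetical names):
      "ask"   - let the user pick a single format via the `ask` callback
      "first" - only the first format in preferred input order
      "all"   - every format in preferred input order; the caller
                stops scanning as soon as an ISBN is found
    """
    available_upper = {fmt.upper() for fmt in available}
    ordered = [fmt for fmt in (p.upper() for p in preferred_order)
               if fmt in available_upper]
    # Assumption: formats not in the preference list go last, sorted.
    ordered += sorted(available_upper - set(ordered))

    if mode == "ask":
        if ask is None:
            raise ValueError("mode 'ask' needs an ask callback")
        choice = ask(ordered)
        return [choice] if choice else []
    if mode == "first":
        return ordered[:1]
    if mode == "all":
        return ordered
    raise ValueError("unknown mode: %r" % (mode,))
```

With a default and an alternate behaviour configured, a plain click and a shift+click or ctrl+click would simply pass different `mode` values to the same function.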
Old 03-27-2011, 11:16 PM   #13
garcle
Any way to force a refresh on the book list?
The ISBNs don't show up in the book list (but do show in the book metadata editor form) after the plugin runs.
Old 03-28-2011, 06:52 AM   #14
kiwidude
v1.1.1 Released

This adds two things:
  • Ensure an ISBN custom column or the Identifiers count in the tag browser is refreshed after retrieving ISBN values
  • Adds an Abort button to the select format dialog, in case you accidentally started interactively searching a large selection

Thanks @garcle for reporting the refresh issue.
Old 03-29-2011, 04:23 AM   #15
Calibrefan
Enthusiast
Posts: 47
Karma: 12
Join Date: Feb 2011
Device: Kobo Aura, Sony PRS-350 and PRS-T1
Thanks kiwidude for this very useful plugin!

All times are GMT -4.