[GUI Plugin] Extract ISBN - Page 2

sdspieg · 03-29-2011, 06:33 PM

Super plugin! Thanks much... Calibre still keeps getting better and better...

Any idea why it does not work on all files though? I have some books in my collection for which I CAN find the isbn number when I open the pdf file and look for it myself, but that the plugin didn't get right... Would you be interested in some books for which it doesn't work?

Cheers,

-Stephan

kiwidude · 03-29-2011, 07:01 PM

Hi Stephan - There are two reasons I can think of. The first is that your PDF does not actually contain text and instead just contains images of the content. If that is the case, there is nothing that can be done until the book is OCR'd.

If that is not the case, then perhaps I need to adjust the regular expression that finds the ISBN numbers. If you have a PDF example, drop me a PM with a link to somewhere I can download it from.

drMerry · 03-29-2011, 08:10 PM

Hi, great plugin.

Here is some code you can add to your plugin.
It will read ISBNs that use spaces instead of -
It will also import ISBN labels that start as
lsbn (L)
1sbn (one)
IS8N (eight)
I8BN
and combinations of these.
This is for scanned books.

Please keep up the good work!

(and to other people: this is an edit of the original plugin. It works in my case but I do not guarantee it works for you. To be safe, please use the original plugin and wait for the official update!)
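For readers following along, here is a minimal Python sketch of the label matching described in the post above. The attached code is no longer available, so the pattern and the function name `has_isbn_label` are illustrative assumptions, not the plugin's actual regex:

```python
import re

# Assumed pattern: I, l or 1, then S or 8, then B or 8, then N, as a whole word.
# IGNORECASE lets it accept ISBN, isbn, lsbn, 1sbn, IS8N, I8BN and mixes of these.
OCR_LABEL = re.compile(r"\b[il1][s8][b8]n\b", re.IGNORECASE)


def has_isbn_label(line):
    """Return True if the line contains an ISBN label or an OCR look-alike."""
    return OCR_LABEL.search(line) is not None


for sample in ("ISBN 978 90 274 3964 2", "1sbn 0-306-40615-2", "I8BN: 0306406152"):
    print(sample, "->", has_isbn_label(sample))
```

The word boundaries keep ordinary words such as "Lisbon" from matching, but a loose label like this will still accept more noise than a strict `ISBN` test.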

sdspieg · 03-29-2011, 08:28 PM

Quote:

Originally Posted by kiwidude

Hi Stephan - There are two reasons I can think of. The first is that your PDF does not actually contain text and instead just contains images of the content. If that is the case, there is nothing that can be done until the book is OCR'd.

Nope. They're text-based

Quote:

If that is not the case, then perhaps I need to adjust the regular expression that finds the ISBN numbers. If you have a PDF example, drop me a PM with a link to somewhere I can download it from.

I did.

-Stephan

kiwidude · 03-29-2011, 08:31 PM

Quote:

Originally Posted by drMerry

Hi, great plugin.

Here is some code you can add to your plugin.
It will read ISBNs that use spaces instead of -
It will also import ISBN labels that start as
lsbn (L)
1sbn (one)
IS8N (eight)
I8BN
and combinations of these.
This is for scanned books.

Please keep up the good work!

(and to other people: this is an edit of the original plugin. It works in my case but I do not guarantee it works for you. To be safe, please use the original plugin and wait for the official update!)

Hi DrMerry,

I've just taken a quick look at your changes to look to incorporate into the next release this weekend (most of my plugins will get a new release to support some changes Kovid is making). However I am not too sure about that replacement of spaces in the text. Doesn't that cause a problem for the entire rest of the regex, such as "International Standard Book Number" etc? What was the thinking behind that?

EDIT: Never mind, sorry, clearly looked at it *too* quickly, I see you are applying it at the point after the regex has been applied. I will look to include in my next release, thx.

drMerry · 03-29-2011, 08:42 PM

Quote:

Originally Posted by kiwidude

Hi DrMerry,

I've just taken a quick look at your changes to look to incorporate into the next release this weekend (most of my plugins will get a new release to support some changes Kovid is making). However I am not too sure about that replacement of spaces in the text. Doesn't that cause a problem for the entire rest of the regex, such as "International Standard Book Number" etc? What was the thinking behind that?

I've got a lot of books that had no ISBN and did not get one after using your plugin.
It occurred to me that 90% of my books without an ISBN tag (2018 in total have no ISBN in calibre) had an ISBN in the text.

Some had 1sbn or IS8N, but most had 978 123 456 x

It does not cause problems with the ISBN match; you can try some in http://www.regextester.com/
This is because you built a good regex yourself

I strip the spaces out of match(1). This is the match that only contains numbers, spaces and - (at the moment, I think; maybe there are also 978.123.456.x numbers).
The last part had to move to avoid a space at the end of the code. This gave some errors.
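To illustrate the point about stripping spaces only from the captured group, here is a small Python sketch. The pattern `LABELLED` and the function `extract` are assumptions made up for this example; they are not the regex the plugin inherited from bazbar's script:

```python
import re

# Assumed pattern: an ISBN label, optional colon/whitespace, then a captured
# group of digits, whitespace and dashes that ends in a digit or X.
LABELLED = re.compile(r"ISBN[:\s]*([\d\s\-]{9,24}[\dxX])", re.IGNORECASE)


def extract(text):
    """Return the first labelled ISBN with separators removed, or None."""
    m = LABELLED.search(text)
    if m is None:
        return None
    # Separators are removed from group(1) only, after matching, so phrases
    # such as "International Standard Book Number" in the text are untouched.
    return re.sub(r"[\s\-]", "", m.group(1))


print(extract("ISBN 978 90 274 3964 2\nPrinted in NL"))
print(extract("ISBN: 0-306-40615-2"))
```

Ending the group on `[\dxX]` is what prevents a trailing space or linefeed from being captured along with the number.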

One thing: you used match(1), but I'm not sure why. Why did you make 2 matches? Match 0 is never used.

One other thing: I also have books with ISBN on the last, second-last or last page before the footnotes. Is it possible to add an option to start the search at the end of the file?

Hope I've been clear enough. Going to sleep now (it's 1.41 here, 6.00 is wake-up time)

kiwidude · 03-29-2011, 08:54 PM

Thx for the extra info, I had edited my post while you were typing obviously

I claim no credit for the regex or that loop of code applying the matches - as per my very first post I took this code from bazbar's script that people were using. I figured they had in turn built it based on all of the earlier versions so just assumed it was "proven". Since you have questioned it I will take a look.

As for your question about searching from the end of the book. I assume this is a performance thing - and my answer remains the same as previously. I am at the mercy of the current implementation of the Calibre input converters. They do not stream the results to me, and I cannot control their direction. I give them a path, and when they are "done" they give me a bunch of stuff representing the converted EPUB back.

I hadn't appreciated any real performance issues with piggy-backing off this until just now when I tried a PDF which had graphics in it that Stephan sent me above, and now I see why some of you would like something faster! I will ask in the dev forum if there is any possibility of an overload or something that would support that - all we really want for this functionality is something like the first 10 pages and (maybe) last 5 of a book. I'm not optimistic they will consider this plugin worth the effort if it is anything but trivial to support it but you don't know unless you ask...
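As a rough illustration of the "first 10 pages and (maybe) last 5" idea, here is a Python sketch that assumes the book text is already available as a list of per-page strings. That assumption is exactly what the current converters do not provide, so `scan_pages` and `ISBN_LIKE` are hypothetical names, not calibre API:

```python
import re

# Assumed loose pattern for a 978/979-prefixed number with optional separators.
ISBN_LIKE = re.compile(r"97[89][\d\s\-]{6,24}[\dxX]")


def scan_pages(pages, first=10, last=5):
    """Search only the first `first` and last `last` pages for an ISBN-like number."""
    head = pages[:first]
    # max() keeps head and tail from overlapping in short books.
    tail = pages[max(first, len(pages) - last):] if last else []
    for page in head + tail:
        m = ISBN_LIKE.search(page)
        if m:
            return re.sub(r"[\s\-]", "", m.group(0))
    return None


book = ["some text"] * 20
book[17] = "ISBN 978-0-306-40615-7"
print(scan_pages(book))
```

The saving only materialises if the pages can be produced lazily; when the whole book has to be converted first, as described above, limiting the search range helps very little.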

kiwidude · 03-29-2011, 08:58 PM

@Stephan - Thx for sending me the PDF/EPUB. The EPUB returns an ISBN with the changes made by DrMerry - it is because the ISBN has spaces in it rather than dashes. So my next release will support that for you.

The PDF I don't know why it failed as yet (DrMerry's additions did not find that ISBN either). I'll post back here when I figure out why...

EDIT: Ok, the reason for the PDF you sent me is as per my first response: that particular part of the PDF is being seen by Calibre as an image, not as parseable text. You can see this if you do a conversion to EPUB in Calibre; the first pages at least are converted to images.

I also discovered that it is the cover page that causes the Calibre conversion to run so slowly in this instance (I stripped off the leading 5 pages into a new PDF). I will raise a ticket on the bug tracker and see if someone can take a look, perhaps they can both explain why the ISBN page is being treated as an image (when the text is selectable in a PDF reader) and maybe an optimisation to not take so long on the cover page.

user_none · 03-29-2011, 10:13 PM

Quote:

Originally Posted by kiwidude

the first pages of this at least are converted to images.

The current PDF engine does not support text under images. This is why you can select it but it's coming through as an image.

Quote:

Originally Posted by kiwidude

I also discovered that it is the cover page that causes the Calibre conversion to run so slowly in this instance

calibre uses pdftohtml to turn the PDF into HTML which it then cleans up and converts. There are easily 100 special processing rules to clean up the HTML from pdftohtml. They are all regular expression based. Most likely those pages are producing very complex and messy output which is causing a large number of rules to be run.

No one has any desire to fix the existing engine. A new PDF engine is in the works but development has stalled. Finishing the new engine would be a better time investment than trying to further work around pdftohtml issues.

drMerry · 03-30-2011, 02:35 AM

Quote:

Originally Posted by kiwidude

Thx for the extra info, I had edited my post while you were typing obviously

I claim no credit for the regex or that loop of code applying the matches - as per my very first post I took this code from bazbar's script that people were using. I figured they had in turn built it based on all of the earlier versions so just assumed it was "proven". Since you have questioned it I will take a look.

Some new hobby time to spend

But if you have questions, please mail me, you've got my address.

4 new problems occurred:
1. I could not read the ISBN in this case (# stands for a \d, not x). This is the complete line, so after the last number there is a linefeed.
ISBN 978 ## ### #### #
2. I've seen some really strange ISBN numbers. I do not know if you even want to support them, but in a few cases the check digit was a letter (not x). The strangest thing is that where the number should be 4, they used D. As a programmer you know about the problem of starting with 0 or 1. I do not know if this type of ISBN is used often. In my collection of 3000+ books I have only seen it 4 times, but at the moment I've got 2000 books without an ISBN (a lot are from Project Gutenberg, so they will never get an ISBN).
3. I've also got some books with only an ISBN and no indication that it is an ISBN. In all these cases the ISBN is on one of the first 3 pages or on one of the last 2. This is a moment to use the regex without match 0. I could think of a (dropdown/submenu) option to use the regex this way. By default you use the standard way as it is / will be now. If you still get a no-ISBN error, you could use the second option (in this case I would not update existing ISBN numbers because it is less certain that it is an ISBN, or you would have to check the ISBN's validity before inserting it).
4. I have some ISBN numbers that did not get converted very well by OCR, so I got the letters or signs below in place of digits. To correct this, you have to check the validity of the ISBN after you replace the letters. That makes for a much slower script. I would like to have it, but I will build it on my own if I would be the only user.
Signs I got:
i I (i) ! l (L) | (or) { } (all for the number 1)
o O B 8 (for a badly printed zero but also for an 8)
b (for a 6 or 10)
d (for a 01)

So some new regex needs to be made. But the question in this case is what the best way is. To get all the mentioned cases recognised as ISBNs, the plugin will be slower (in bad cases), and it is possible the parser will find ISBN numbers in books that do not mention one (I think there will be cases...)
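A rough Python sketch of case 4, replacing look-alike characters and keeping only candidates whose check digit is valid. The mapping is taken from the list of signs above, with one simplifying assumption: genuine digits (including a real 8) are left alone. The names `OCR_DIGITS`, `is_valid_isbn` and `repair` are made up for this example:

```python
from itertools import product

# Look-alike glyph -> possible digit strings (assumed from the list above).
OCR_DIGITS = {
    "i": ("1",), "I": ("1",), "!": ("1",), "l": ("1",),
    "|": ("1",), "{": ("1",), "}": ("1",),
    "o": ("0", "8"), "O": ("0", "8"), "B": ("0", "8"),
    "b": ("6", "10"),
    "d": ("01",),
}
DIGITS = set("0123456789")


def is_valid_isbn(code):
    """Check the ISBN-13 (mod 10) or ISBN-10 (mod 11) check digit."""
    if len(code) == 13 and set(code) <= DIGITS:
        total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(code))
        return total % 10 == 0
    if len(code) == 10 and set(code[:9]) <= DIGITS and code[9] in "0123456789xX":
        total = sum(int(c) * (10 - i) for i, c in enumerate(code[:9]))
        total += 10 if code[9] in "xX" else int(code[9])
        return total % 11 == 0
    return False


def repair(raw):
    """Return every valid ISBN reachable by substituting look-alike glyphs."""
    options = [OCR_DIGITS.get(ch, (ch,)) for ch in raw if ch not in " -."]
    found = []
    for combo in product(*options):
        code = "".join(combo)
        if is_valid_isbn(code) and code not in found:
            found.append(code)
    return found


print(repair("978-O-3O6-4O6l5-7"))
```

The number of candidates doubles with every ambiguous glyph, and more than one candidate can pass the checksum, which is the slowdown and false-positive risk raised in the post.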

Quote:

Originally Posted by kiwidude

As for your question about searching from the end of the book. I assume this is a performance thing - and my answer remains the same as previously. I am at the mercy of the current implementation of the Calibre input converters. They do not stream the results to me, and I cannot control their direction. I give them a path, and when they are "done" they give me a bunch of stuff representing the converted EPUB back.

If it is possible and only a matter of performance, you could possibly add it as a submenu option. It could not be the default, since that would be a performance loss, but if someone wants it, he/she could use it.

Thank you for your time and if you want (even more) info, just reply here or mail me.

kiwidude · 03-30-2011, 05:17 AM

@user_none - thanks for taking the time to take a look and explain. I really know zero about PDF conversion so appreciate the information you have given here and on the ticket. I fully expected that if it was anything other than a trivial change there would not be any interest in making a change to the code. I will try Kovid's suggestion of taking a look at the reflow.py stuff for processing PDF files.

@drMerry - to be honest I don't have a massive interest in trying to support really badly OCR'd documents. I would much rather support the majority of what users are after which is an ISBN from valid documents (that they are less likely to be binning!). As performance is already an issue I don't intend to compound that.

I have a suggestion from Kovid about some alternative code in Calibre to use for handling PDFs (part of the new in progress PDF engine) so I will give that a go and see what options if any I could introduce around it. I haven't looked at it yet but from what Kovid mentioned scanning the first 10 pages only will be easy to support, however I would guess scanning the last few might not be possible without scanning the whole thing. Still, at least offering that as a config option could help performance for the majority of docs where ISBN is at the front.

In terms of your case (1) above of the ISBN immediately followed by a linefeed. If you can PM me a link to where I can download the doc I will give it a spin once I have changed the PDF handling code and see if we can handle that case.

drMerry · 03-30-2011, 05:21 AM

@kiwidude

I made a mistake in my regex.
The new regex will not find ISBN numbers with a length of 10 without any spaces, dots or dashes. This is because it now checks 2 groups, 10-24 positions and 1 last position. This must be 9-24

After I realized from your post that the regex was used on a proof-of-concept basis, I thought about some optimization and concluded this (will test it tonight at home)

isbn will always start with 978 or 979 and will have 10 or 13 digits (or 1 less and x at the end)
http://www.isbn-international.org/faqs/view/5#q_5

so you do not have to test for words like isbn or something like that. Every extra test will consume extra time (you have to test for isbn AND isbn: AND 1sbn and.....)

so I thought of this implementation
it will search for digits (\d, not 0-9; that is 10 tests, a digit is 1 test)
it will search for whitespace characters (\s), so spaces are counted but tabs are too
quick search: (97[89](\d{6}|\d{9})[\dxX]) use match(0)
optimal search (97[89][\d\s\-\.]{6,24}[\dxX]) use match(0)
extended search: like optimal, but you also add some of the mentioned characters to get more ISBN numbers. This is a heavy implementation because you have to replace the characters and afterwards test whether you got a real ISBN; otherwise you still have to report it as not found.

To be more sure you get ISBN numbers and not phone numbers, you can add some extra checks, for example that the match may not be prefixed with the word tel or + or 0
But I think this is not needed as far as I know. I did a quick test with implementation 1 and 2 on a text file (not with calibre).
Both processed the file in a fraction of the time compared to the original regex, and I got more numbers than in the original case.
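For anyone who wants to try the two patterns, here is a small Python harness. The two regexes are copied from the post; the sample strings, the helper names `find` and `normalise`, and the reordered variant `QUICK_LONGEST_FIRST` are additions for illustration only:

```python
import re

QUICK = re.compile(r"(97[89](\d{6}|\d{9})[\dxX])")
OPTIMAL = re.compile(r"(97[89][\d\s\-\.]{6,24}[\dxX])")
# Illustrative variant: try the longer alternative first.
QUICK_LONGEST_FIRST = re.compile(r"(97[89](\d{9}|\d{6})[\dxX])")


def find(pattern, text):
    """Return match(0) of the first hit, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None


def normalise(raw):
    """Drop whitespace, dashes and dots from a matched number."""
    return re.sub(r"[\s\-.]", "", raw)


print(find(QUICK, "code 9780306406157 end"))                # 9780306406
print(find(QUICK_LONGEST_FIRST, "code 9780306406157 end"))  # 9780306406157
print(normalise(find(OPTIMAL, "ISBN 978 90 274 3964 2\nPrinted in NL")))
print(find(QUICK, "ISBN 0-306-40615-2"))                    # None
```

Two things show up here. Python tries alternatives left to right, so the quick search as written stops after 10 characters of a 13-digit number unless the `\d{9}` branch comes first. And neither pattern finds an ISBN-10 such as 0-306-40615-2, because ISBN-10 numbers carry no 978/979 prefix.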

I post this on the forum because I hope there are people who can think of a (regular) case where my idea would fail but where it would work if you just tested whether the word ISBN was present as a prefix.

drMerry · 03-30-2011, 05:24 AM

again we were working at the same time in this thread

I can see your opinion about bad OCR. That is a valid opinion (of course, opinions are always valid...)
Maybe it is an idea for another plugin. I will think about that in the (near) future.

kiwidude · 03-30-2011, 05:44 AM

@drMerry - yeah we crossed posts again. I have no objection at all to changing the regex or matching algorithm if you have a better one - as I said above I did not write that part of the code nor have I ever investigated all the variations. I just wanted to offer an easy to use wrapper around something that people could use within Calibre rather than running external scripts etc.

So if you are willing to do the investigation and can come up with an improved version that isn't noticeably slower than the existing one then by all means please send it to me. PM me your email address if you like and we can swap info there. My main concern would be false positives as you say from telephone numbers or similar. If you come up with something that you are confident will not suffer from that issue then I'm sure everyone would be grateful for your effort.

theducks · 03-30-2011, 12:53 PM

Quote:

isbn will always start with 978 or 979 and will have 10 or 13 digits (or 1 less and x at the end)
http://www.isbn-international.org/faqs/view/5#q_5

Note:
ISBN-13 can not have an 'X' (EAN13 is digits only), that only applies to ISBN-10
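The reason for this can be seen from the two check-digit rules, sketched here in Python. The function names are made up for this example; the arithmetic is the standard ISBN-10 (mod 11) and ISBN-13/EAN-13 (mod 10) calculation:

```python
def isbn10_check(first9):
    """Check character for the first 9 digits of an ISBN-10 (weights 10..2, mod 11)."""
    total = sum(int(c) * (10 - i) for i, c in enumerate(first9))
    value = (11 - total % 11) % 11
    # A remainder of 10 cannot be written as one digit, hence the X.
    return "X" if value == 10 else str(value)


def isbn13_check(first12):
    """Check digit for the first 12 digits of an ISBN-13 (weights 1,3 alternating, mod 10)."""
    total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(first12))
    return str((10 - total % 10) % 10)


def to_isbn13(isbn10):
    """Convert an ISBN-10 to ISBN-13 by prefixing 978 and recomputing the check digit."""
    body = "978" + isbn10[:9]
    return body + isbn13_check(body)


print(isbn10_check("080442957"))   # X
print(to_isbn13("080442957X"))     # 9780804429573
```

Because the ISBN-13 check is taken modulo 10, its result always fits in a single digit, so an X can only ever appear as the last character of an ISBN-10.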



