[GUI Plugin] Extract ISBN - Page 15

jlutes · 11-11-2011, 08:52 AM

I was afraid that was the case with the filename. As for the priority given to scanning the metadata tags, my first thought would be to make it user-controllable via an option. I would set the default action to "look in metadata if there is no match elsewhere", but I could see where someone might want to reverse that under certain circumstances.
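As an illustration of the option described above, here is a minimal sketch of a user-controllable priority between the two sources. The function name `choose_isbn` and the `prefer_metadata` flag are hypothetical and are not part of the plugin; the default mirrors "look in metadata if there is no match elsewhere".

```python
def choose_isbn(content_isbn, metadata_isbn, prefer_metadata=False):
    """Pick one ISBN from two possible sources.

    Hypothetical helper: by default the ISBN found in the book content wins
    and the in-book metadata is only a fallback. Setting prefer_metadata=True
    reverses that order.
    """
    if prefer_metadata:
        first, second = metadata_isbn, content_isbn
    else:
        first, second = content_isbn, metadata_isbn
    # Empty strings and None both count as "nothing found".
    return first or second or None


print(choose_isbn("9780306406157", "9781861972712"))
print(choose_isbn(None, "9781861972712"))
print(choose_isbn("9780306406157", "9781861972712", prefer_metadata=True))
```

With the default setting the metadata value is only used when the content scan finds nothing.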

capnm · 11-12-2011, 10:08 AM

Quote:

Originally Posted by kiwidude

I can see a case for wanting the Extract ISBN plugin to attempt to read it from metadata.

When this plugin was in its infancy, I really wanted to be able to pull the ISBN from the in-book metadata either as a fallback, or to compare, and was frustrated that there was no easy way to, in an already added book, get info from the book's internal metadata into the calibre database.

Then I decided that the garbage level, even in commercial ebooks, was just too high, and maybe ignoring the in-book metadata wasn't so bad after all (and I'm a data miser -- I hate ignoring/discarding potentially useful data).

And while this case:
    no ISBN can be found in the book content, but one is present in the metadata.
is rare, this case:
    no ISBN can be found in the book content, but an accurate one is present in the metadata.
is really rare.

Unfortunately, this case:
    an accurate ISBN can be found in the book content, but a different one is present in the metadata.
is really common, making this:
    the ISBN extracted from the book is incorrect, but there is a correct one set in the metadata.
pretty impossible to reliably detect.

I don't mind so much when no ISBN can be found in the content, but these:
    the ISBN extracted from the book is for a different book (such as an advertisement for a related book for the publisher).
really nag me, because they're stealthy errors.

Maybe the next step is a Verify ISBN plugin that would check the author/title/ISBN against one of the ISBN pools and flag mismatches and not-founds ....
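A purely local pre-check could come before any lookup against an ISBN pool. The sketch below only validates check digits and flags disagreement between the content and metadata ISBNs. All function names are hypothetical. It cannot tell whether a valid ISBN belongs to a different book, which still needs the author/title lookup described above.

```python
DIGITS = "0123456789"


def clean(isbn):
    """Strip hyphens, spaces and other separators; keep digits and X."""
    return "".join(ch for ch in isbn.upper() if ch in DIGITS or ch == "X")


def is_valid_isbn10(isbn):
    s = clean(isbn)
    if len(s) != 10 or "X" in s[:9]:
        return False
    # Weights run 10 down to 1; a trailing X stands for the value 10.
    total = sum((10 - i) * (10 if ch == "X" else int(ch)) for i, ch in enumerate(s))
    return total % 11 == 0


def is_valid_isbn13(isbn):
    s = clean(isbn)
    if len(s) != 13 or "X" in s:
        return False
    # Weights alternate 1, 3, 1, 3, ...
    total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(s))
    return total % 10 == 0


def to_isbn13(isbn):
    """Normalize a valid ISBN-10 or ISBN-13 to its 13-digit form."""
    s = clean(isbn)
    if len(s) == 13:
        return s
    body = "978" + s[:9]
    total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(body))
    return body + str((10 - total % 10) % 10)


def compare_isbns(content_isbn, metadata_isbn):
    """Return 'match', 'mismatch', or 'unverifiable' for the two sources."""
    normalized = []
    for raw in (content_isbn, metadata_isbn):
        if not raw or not (is_valid_isbn10(raw) or is_valid_isbn13(raw)):
            return "unverifiable"
        normalized.append(to_isbn13(raw))
    return "match" if normalized[0] == normalized[1] else "mismatch"


print(compare_isbns("0-306-40615-2", "9780306406157"))
print(compare_isbns("9780306406157", "9781861972712"))
```

Books that come back as "mismatch" are the ones worth a manual look, since either source could be the wrong one.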

capnm · 11-12-2011, 10:44 AM

@kiwidude:
I still get the occasional epub where Extract ISBN misses, bafflingly, but they're rare enough I just shrug, manually find the ISBN in the text, copy, paste, and move on.

I'll PM you a sample, to look at if you're curious, to ignore if you're busy.

Thanks.

kiwidude · 11-12-2011, 01:48 PM

v1.4.1 Released

Changes in this release:

Exclude leading spaces before the ISBN number which prevented some valid ISBNs from being detected.

@capnm - this fixed the issue with the epub you sent me, thx.

Nyssa · 12-30-2011, 01:33 PM

Will this overwrite an ISBN that is already there (say from downloading metadata) or does it just add the extracted one to the others?

kiwidude · 12-30-2011, 01:50 PM

@Nyssa - it will always overwrite any existing ISBN if Extract ISBN finds a valid one.

Nyssa · 12-30-2011, 02:25 PM

Okay. Thank you.

greatdragon · 03-13-2012, 01:28 AM

Quote:

Originally Posted by sdspieg

Super plugin! Thanks much... Calibre still keeps getting better and better...

Any idea why it does not work on all files though? I have some books in my collection for which I CAN find the isbn number when I open the pdf file and look for it myself, but that the plugin didn't get right... Would you be interested in some books for which it doesn't work?

Cheers,

-Stephan

There is a slight mod worth making to the regex you are using: the "\s*" at the start of your regex is unneeded, and in some of my formatted PDFs it caused an issue, as there were no valid spaces, so the regex came up false. Removing it also offers a small performance boost. It's not huge, but with regexes anything you can trim out saves CPU.

But great plugin, keep up the good work, and if I notice any other improvements I will let you know.

kiwidude · 03-13-2012, 05:20 AM

@greatdragon - first the disclaimer - I am not a regex guru, I know the basics to get by. However note that it is a \s* not \s+, so it should make no difference to your pdf, as it is 0 or more matches?

I can't remember all the reasons why it is there, there have been many iterations of this plugin over its lifetime to get to where it is today. It may have been to catch some case I can't recall. Or it may have been to "soak up" leading spaces to prevent a document with loads of consecutive spaces reporting as matches (since space is a valid character in the next part of the expression).

Now if others who know far more about regex than me agree with your finding then I can look to change it, but I am firmly in the "if it ain't broke don't fix it" camp.

Performance isn't a reason if the change were to reduce its effectiveness for some reason, particularly since it runs as a background job.
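The `\s*` versus `\s+` point can be checked with a small experiment. The pattern below is a hypothetical, heavily simplified ISBN-13 expression and not the plugin's actual regex. It only shows that a leading `\s*` still matches when no whitespace precedes the number, whereas `\s+` would not.

```python
import re

# Hypothetical, simplified ISBN-13 pattern: "978" or "979", then ten more
# digits, each optionally followed by a hyphen or a space.
CORE = r"97[89][- ]?(?:\d[- ]?){10}"

with_star = re.compile(r"\s*(" + CORE + r")")  # zero or more leading spaces
with_plus = re.compile(r"\s+(" + CORE + r")")  # at least one leading space


def find_isbn(pattern, text):
    """Return the captured ISBN text, or None if the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None


no_space = "ISBN:9780306406157"
spaced = "ISBN:   978-0-306-40615-7"

print(find_isbn(with_star, no_space))  # 9780306406157
print(find_isbn(with_plus, no_space))  # None
print(find_isbn(with_star, spaced))    # 978-0-306-40615-7
```

This does not explain the misses greatdragon saw in his PDFs. It only confirms that an optional `\s*` prefix cannot, by itself, make an otherwise matching expression fail.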

Joanna · 05-27-2012, 07:09 PM

I have just switched to a new installation of Calibre Portable and now, for some reason, I get an error every time I launch Extract ISBN on a .pdf file ("access violation"). The plugin works impeccably with epub files, and no other errors occurred in Calibre. Any ideas? All help appreciated.

Dinesh.kaundal · 05-28-2012, 02:25 AM

After I upgraded calibre from 0.8.52 to 0.8.53, the Extract ISBN plugin (version 1.4.1) crashes calibre when it is executed.
My system details: OS Windows 7 x64 SP1.

I rolled back to calibre 0.8.52 and it is working fine again.

Regards

Dinesh

kiwidude · 05-28-2012, 03:07 AM

Normally I run from the latest source code and the last binary install I had done was 0.8.52 (everything working fine). I just installed the binaries for 0.8.53, then I also find that calibre crashes with 0.8.53 (but only when scanning PDF files.)

Which implies that perhaps Kovid "broke something" in the PDF code (which being C++ is the most likely thing to cause such a crash).

@Kovid - here is what my code does where I believe it is crashing:

Spoiler:

kovidgoyal · 05-28-2012, 07:54 AM

@kiwidude: 0.8.53 updated to poppler 0.20, which is probably why it's crashing. I've committed some code to enable the xml output from pdftohtml. Use that instead; it will prevent this kind of crash in the future.

pdftohtml(..., as_xml=True)

kiwidude · 05-28-2012, 08:20 AM

@Kovid - thx for that, though if I understand you correctly you are saying to use calibre's existing PDF engine via pdftohtml rather than the poppler stuff via pdfreflow, right?

As IIRC pdftohtml is what this plugin originally used, but we found it to be very, very slow (particularly on graphical pdfs). Whereas using pdfreflow allowed the plugin to scan subsets of only the front few and last few pages.

No chance of the pdfreflow stuff getting fixed?

kovidgoyal · 05-28-2012, 09:07 AM

It's not a priority for me, the parts of the poppler api that pdfreflow uses are not stable, they change with pretty much every poppler 0.x release, which makes maintaining them a pain. I am switching the new pdf engine to use pdftohtml -xml which produces the same kind of output as pdfreflow, the upside being that I no longer have to maintain pdfreflow's C++ code. The downside, from your perspective, is that pdftohtml does not support specifying a pdf page range for conversion. You have four choices:

1) Maintain pdfreflow yourself, I'm happy to accept patches.

2) Ask the poppler people to implement page ranges for pdftohtml

3) Use another pdf library (calibre has both podofo and pypdf) to first extract the relevant pages and then run pdftohtml on them.

4) Live with the reduced performance
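Choice 3 might look roughly like the sketch below. It assumes the modern `pypdf` package (`PdfReader`, `PdfWriter`) rather than whichever pypdf version calibre bundled at the time. The helper names and the default page counts are hypothetical. The resulting smaller PDF would then be handed to pdftohtml, a step not shown here.

```python
def pages_to_scan(total_pages, front=10, back=5):
    """Return sorted 0-based indices of the first `front` and last `back` pages."""
    if total_pages <= 0:
        return []
    head = range(0, min(front, total_pages))
    tail = range(max(total_pages - back, 0), total_pages)
    # A set removes the overlap when the document is shorter than front + back.
    return sorted(set(head) | set(tail))


def write_page_subset(src_path, dst_path, front=10, back=5):
    """Copy only the pages worth scanning into a smaller PDF at dst_path."""
    # Imported lazily so pages_to_scan() works without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for index in pages_to_scan(len(reader.pages), front, back):
        writer.add_page(reader.pages[index])
    with open(dst_path, "wb") as out:
        writer.write(out)
    return dst_path


print(pages_to_scan(100, front=3, back=2))  # [0, 1, 2, 98, 99]
```

Whether the extra extraction pass is cheaper than converting the whole file would have to be measured, particularly for the graphical PDFs mentioned earlier.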


Code from the spoiler in kiwidude's post of 05-28-2012, 03:07 AM:

```python
def _read_pdf_txt(self, book_path, start_page, end_page):
    from calibre.constants import plugins
    pdfreflow, pdfreflow_err = plugins['pdfreflow']
    with open(book_path, 'rb') as stream:
        tdir = PersistentTemporaryDirectory('_isbn')
        with CurrentDir(tdir):
            # Reflow only the requested page range; the output is read back
            # from index.xml in the temporary directory
            pages = pdfreflow.reflow(stream.read(), start_page, end_page)
            with open('index.xml', 'rb') as f:
                xml = f.read()
            #open('E:\\%d.xml'%start_page,'wb').write(xml)
            root = etree.fromstring(clean_ascii_chars(xml))
            # Flatten the XML to plain text (Python 2: unicode is the text type)
            txt = etree.tostring(root, method='text', encoding=unicode)
            return (pages, txt)
```
