Old 03-30-2011, 10:41 PM   #31
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Quote:
Originally Posted by kiwidude View Post
My main concern would be false positives as you say from telephone numbers or similar. If you come up with something that you are confident will not suffer from that issue then I'm sure everyone would be grateful for your effort.
There is a 'check_isbn' function already in use in the various calibre metadata plugins that validates whether a specific string of digits is truly an ISBN rather than a random string of numbers such as a phone number. It is used before the metadata plugins send an ISBN to a metadata provider, but it should be good for this too.

from calibre.ebooks.metadata import check_isbn
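
For readers without calibre at hand, the kind of validation described above can be sketched with the standard ISBN check-digit rules. This is a minimal standalone sketch and not calibre's actual check_isbn implementation; the names is_valid_isbn10 and is_valid_isbn13 are made up for illustration.

```python
ASCII_DIGITS = set("0123456789")


def is_valid_isbn10(s):
    """Return True if s is 10 characters with a correct ISBN-10 check digit."""
    if len(s) != 10:
        return False
    body, check = s[:9], s[9]
    if not all(c in ASCII_DIGITS for c in body):
        return False
    if check not in ASCII_DIGITS and check not in "Xx":
        return False
    # Weights run 10, 9, ..., 2 over the first nine digits; the check
    # character has weight 1, and X stands for the value 10.
    total = sum((10 - i) * int(c) for i, c in enumerate(body))
    total += 10 if check in "Xx" else int(check)
    return total % 11 == 0


def is_valid_isbn13(s):
    """Return True if s is 13 digits with a correct ISBN-13 check digit."""
    if len(s) != 13 or not all(c in ASCII_DIGITS for c in s):
        return False
    # Weights alternate 1, 3, 1, 3, ... across all thirteen digits.
    total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(s))
    return total % 10 == 0


if __name__ == "__main__":
    print(is_valid_isbn10("0306406152"))     # a real ISBN-10
    print(is_valid_isbn13("9780306406157"))  # the same book as ISBN-13
    print(is_valid_isbn10("5551234567"))     # a phone-like number
```

A checksum alone does not rule out every phone number: roughly one random 10-digit string in eleven still passes, so it reduces false positives rather than eliminating them.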
Old 03-31-2011, 04:49 AM   #32
kiwidude
calibre/Sigil Developer
Posts: 4,230
Karma: 1345754
Join Date: Oct 2010
Location: London, UK
Device: Kindle Paperwhite 3G, iPad 3, iPad Air
@ldolse - thanks, yes, I do indeed already make use of that in the plugin, so if that should be a sufficient failsafe then that is good news.
Old 03-31-2011, 09:16 AM   #33
drMerry
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
An update.

check_isbn is indeed used and works well, as far as I can see.
I have made this version.
It runs twice as fast as the original.
I scanned 600 EPUBs that had no ISBN in their metadata (I did not check whether there was an ISBN inside the files).
I got 100 new ISBNs.

Seems nice, BUT:
I had 2 numbers that were not the book's ISBN (though they passed the validity check).
There were real ISBNs in those files. The numbers I found were there because of a bad EPUB conversion.

You cannot use \d; you have to use 0-9, because with \d calibre freezes on some files.
I have some trouble with multi-line matches.

I can detect:

NUR 123
ISBN 1234567890

and

NUR 123
ISBN 123 456.78
9

0

and

123 456.789

0

but NOT
NUR 123
1234567890

In this case 1231234567 is returned as a possible ISBN and rejected as invalid.
(EDIT: added the 7. Of course I do not get 213123456..)

Maybe someone can find a solution?
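
One possible approach to the "NUR 123 / 1234567890" case, offered as a sketch rather than as the plugin's method: collect each run of digits and separators, strip the separators, and then slide 13- and 10-character windows over the run until one passes the checksum. The names find_isbn, _isbn10_ok and _isbn13_ok are hypothetical, and the example uses the real ISBN 0306406152 because the placeholder 1234567890 does not have a valid check digit.

```python
import re

DIGITS = set("0123456789")


def _isbn10_ok(s):
    # Nine ASCII digits plus a check character (digit or X), weights 10..1.
    if len(s) != 10 or not all(c in DIGITS for c in s[:9]):
        return False
    if s[9] not in DIGITS and s[9] != "X":
        return False
    total = sum((10 - i) * int(c) for i, c in enumerate(s[:9]))
    total += 10 if s[9] == "X" else int(s[9])
    return total % 11 == 0


def _isbn13_ok(s):
    # Thirteen ASCII digits, 978/979 prefix, alternating weights 1 and 3.
    if len(s) != 13 or not all(c in DIGITS for c in s):
        return False
    if not s.startswith(("978", "979")):
        return False
    return sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(s)) % 10 == 0


# A run of digits that may be interrupted by spaces, tabs, dots, hyphens or
# line breaks, optionally ending in X (the ISBN-10 check character).
RUN = re.compile(r"[0-9](?:[ \t.\-\r\n]*[0-9])*(?:[ \t.\-\r\n]*[Xx])?")


def find_isbn(text):
    """Return the first checksum-valid ISBN found in text, or None."""
    for match in RUN.finditer(text):
        chars = re.sub(r"[^0-9Xx]", "", match.group()).upper()
        # Try 13-character windows first, then 10-character windows.
        for size, ok in ((13, _isbn13_ok), (10, _isbn10_ok)):
            for start in range(len(chars) - size + 1):
                candidate = chars[start:start + size]
                if ok(candidate):
                    return candidate
    return None


if __name__ == "__main__":
    print(find_isbn("NUR 123\n0306406152"))
    print(find_isbn("ISBN 030 640.61\n5\n\n2"))
```

The trade-off is that every extra window is another chance for a random number to pass the checksum, so this widens the net for false positives as well as for real ISBNs.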

I built in some restrictions to avoid some problems.
A run of 10 or 13 zeros is a valid ISBN, but you don't want to extract that.
I also test whether 13-digit numbers start with 978 or 979. If not, I do not even test validity.
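
The two restrictions above can be expressed as a cheap pre-filter that runs before any checksum test. This is only a sketch of the idea; worth_validating is a made-up name and not a function from the plugin.

```python
def worth_validating(digits):
    """Cheap pre-filter applied before the (more expensive) checksum test.

    digits is a candidate with separators already removed.
    """
    if len(digits) not in (10, 13):
        return False
    # All zeros passes both checksums but is never a real ISBN.
    if set(digits) == {"0"}:
        return False
    # ISBN-13 numbers for books start with the 978 or 979 prefix.
    if len(digits) == 13 and not digits.startswith(("978", "979")):
        return False
    return True


if __name__ == "__main__":
    for candidate in ("0000000000", "0000000000000",
                      "1230306406152", "9780306406157", "0306406152"):
        print(candidate, worth_validating(candidate))
```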

I am bad at keeping a changelog, but here is some log info:
I changed extract_isbn_code
Added strings at the top of the file
Changed the regex
Changed loor_for_isbn_in_text

I'm not a Python programmer, so if someone knows a better way to do the txt.replace (stripping all whitespace, including \n and \r, and removing - and .), please say so.

On the other hand, I have sometimes put an ISBN including - into the metadata and calibre cleaned it up itself, so maybe only \n and \r need to be removed?
(In that case you don't even have to, and can't, test for 10 versus 13 digits, so it should go even faster.)
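
On the question of a better way than chained txt.replace calls: a single regular-expression substitution can remove whitespace, hyphens and dots in one pass. A minimal sketch, with strip_separators as a made-up helper name:

```python
import re

# \s covers spaces, tabs, \n and \r; the class also removes dots and hyphens.
_SEPARATORS = re.compile(r"[\s.\-]+")


def strip_separators(txt):
    """Remove whitespace (including line breaks), dots and hyphens."""
    return _SEPARATORS.sub("", txt)


if __name__ == "__main__":
    print(strip_separators("123 456.78\n9\r\n\n0"))
    print(strip_separators("0-306-40615-2"))
```

Compiling the pattern once at module level avoids rebuilding it for every candidate, which may matter when scanning hundreds of books.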

I also included a PDF with the legal ISBN ranges. If you add this check next to the validity check, you can be 99.99999% sure that it is an ISBN.
Attached Files
File Type: pdf RangeMessage.pdf (1.27 MB, 180 views)

Last edited by kiwidude; 05-28-2012 at 11:34 AM. Reason: Remove attachment so others do not get confused
Old 03-31-2011, 11:11 AM   #34
drMerry
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
I just tested some new ebooks.
PDF is still extremely slow.
The PDF slowness is caused by the pdftohtml process. On all my PCs it uses 50% of the CPU (one complete core). Maybe a bug in calibre?

There will be more errors if you try to index a math book or a technical manual (because of the large number of large numbers).
But that will be a problem for only a minority of users (including me).
Maybe you can add an option to only check numbers with ISBN notation in front (as it works at this moment).

Last edited by drMerry; 03-31-2011 at 11:14 AM. Reason: pdf-error info
Old 03-31-2011, 12:17 PM   #35
theducks
Grand Sorcerer
Posts: 15,269
Karma: 6022733
Join Date: Aug 2009
Location: (The original) Silicon Valley, USA
Device: Galaxy Tab 2, Astak Pocket Pro, K4NT
Some of my really old Dead Tree ™ books had SAN or ISBNs without the check digit (9 characters long). (I think there may have been confusion at the time, with the check digit not considered part of the ISBN to be printed.)

I backed in the check digit by trying [0-9X] until calibre gave me a green ISBN-10 confirmation.
Note: (USA publishers) space or dash were the only separators I saw. Never mixed. Never interrupted by other characters.
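
Instead of trying each of [0-9X] by hand, the missing check character can be computed directly from the nine printed digits using the ISBN-10 weighting. A small sketch; isbn10_check_digit is a made-up name, not a calibre function.

```python
def isbn10_check_digit(first_nine):
    """Return the ISBN-10 check character for a nine-digit string."""
    if len(first_nine) != 9 or not all(c in "0123456789" for c in first_nine):
        raise ValueError("expected exactly nine ASCII digits")
    # Weights 10, 9, ..., 2 for the nine digits; the check value brings the
    # weighted sum up to a multiple of 11, with 10 written as X.
    total = sum((10 - i) * int(c) for i, c in enumerate(first_nine))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


if __name__ == "__main__":
    nine = "030640615"
    print(nine + isbn10_check_digit(nine))
```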
Old 03-31-2011, 12:37 PM   #36
kiwidude
calibre/Sigil Developer
Posts: 4,230
Karma: 1345754
Join Date: Oct 2010
Location: London, UK
Device: Kindle Paperwhite 3G, iPad 3, iPad Air
drMerry - I will take a look at your changes once 0.7.53 goes out. Kovid has made some changes which will allow scanning pdfs with the new pdf engine for just a selected number of pages from the front and back. We will have to see whether that significantly improves the performance or not. My initial testing of scanning the whole document actually found that the new engine is currently slower than the existing one, but that should change when not doing the whole document hopefully.

In terms of options to check numbers on technical manuals etc, I don't see where the issue will be. You say there will be more "errors" - do you mean more matches that are rejected? The logic I have will remain the same in terms of stopping searching after finding a valid ISBN. So surely the only issue will be for a manual that does not have an ISBN but does have lots of numbers in it will run slower? If I am able to somehow try to only scan a small front/back portion of all books (not just pdf ones) that shouldn't be an issue. I will look into that.

As for all the variations of ISBN being split across lines - I will be honest with my selfishness and repeat my statement above that I really don't care if there are really badly scanned documents that this fails to pick up an ISBN from. It is just a tool, not a miracle worker. If your ISBNs are so badly formatted, the rest of the content of that document will surely also be dire - not getting an ISBN may force you to open it and see for yourself and perhaps either decide to look for a better copy or edit it.

I don't want to have a whole bunch of options on this plugin, it is why I have resisted putting a menu onto it as there are too many permutations. I think of how I see people using it - they will give it a one click shot at trying to find an ISBN, and after that they will use a metadata download type lookup based on title/author matching. I really don't see them wasting a lot of time bothering making multiple attempts on the same book using different options? If it fails and they believe there "really must" be an ISBN in there, they will view the book and type it in if it means that much to them (which they will have to do for any graphical based PDFs anyways).

However, that is just my opinion on how I see people using it. If it handles 98% of the book ISBNs out there, that is still an improvement over not having it.
Old 03-31-2011, 01:39 PM   #37
drMerry
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
Quote:
Originally Posted by kiwidude View Post
drMerry - I will take a look at your changes once 0.7.53 goes out. Kovid has made some changes which will allow scanning pdfs with the new pdf engine for just a selected number of pages from the front and back. We will have to see whether that significantly improves the performance or not. My initial testing of scanning the whole document actually found that the new engine is currently slower than the existing one, but that should change when not doing the whole document hopefully.
I think so.
I had a PDF of 700 pages, 163 MB.
It took more than half an hour to find out that your tagger (also with my regex) could not find an ISBN.

Quote:
Originally Posted by kiwidude View Post
In terms of options to check numbers on technical manuals etc, I don't see where the issue will be. You say there will be more "errors" - do you mean more matches that are rejected? The logic I have will remain the same in terms of stopping searching after finding a valid ISBN. So surely the only issue will be for a manual that does not have an ISBN but does have lots of numbers in it will run slower? If I am able to somehow try to only scan a small front/back portion of all books (not just pdf ones) that shouldn't be an issue. I will look into that.
I mean that a book with a lot of numbers has a greater chance of containing a number that conforms to the ISBN standard, so this could give a false positive.
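
To put a rough figure on that chance: the ISBN-10 checksum works modulo 11, so about one arbitrary 10-digit number in eleven passes it. The sketch below simply counts passes over the first 100,000 zero-padded numbers; checksum_pass_rate is a made-up name and the sample range is arbitrary.

```python
def passes_isbn10_checksum(ten_digits):
    """Checksum test for a string of exactly ten ASCII digits (no X)."""
    total = sum((10 - i) * int(c) for i, c in enumerate(ten_digits))
    return total % 11 == 0


def checksum_pass_rate(limit=100_000):
    """Fraction of the zero-padded numbers 0..limit-1 that pass."""
    hits = sum(passes_isbn10_checksum(f"{n:010d}") for n in range(limit))
    return hits / limit


if __name__ == "__main__":
    print(f"pass rate: {checksum_pass_rate():.4f}")
```

A manual with a few hundred 10-digit numbers is therefore likely to contain several that pass the checksum, which is why extra filters such as a required "ISBN" prefix or a range check help.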

Quote:
Originally Posted by kiwidude View Post
As for all the variations of ISBN being split across lines - I will be honest with my selfishness and repeat my statement above that I really don't care if there are really badly scanned documents that this fails to pick up an ISBN from. It is just a tool, not a miracle worker. If your ISBNs are so badly formatted, the rest of the content of that document will surely also be dire - not getting an ISBN may force you to open it and see for yourself and perhaps either decide to look for a better copy or edit it.
I often see PDF files with the ISBN spread across the front page (because the OCR cannot handle the front page picture). The rest of the document is good in this case.
This is of course an OCR error, and I can understand that you do not want to invest in bad OCR. Because I have seen it often in books with the ISBN on the front cover, I will add the newline option for myself. Testing ISBN numbers and trying to recover a good ISBN out of iop830l|Ix would be something else.
On the other hand, if I do not add the \s to the regex, I cannot retrieve ISBNs whose last digit comes right before a linefeed.

@your opinion about 98% and a lot of sub-options:
agree
Old 03-31-2011, 01:42 PM   #38
drMerry
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
Quote:
Originally Posted by theducks View Post
Some of my really old Dead Tree ™ books had SAN or ISBN's without the check digit (9 chars long) ( I think there may have been confusion at the time, that the check digit was not part of the ISBN to be printed)
That's a nasty one. If this is the case for a lot of books from this period, it would be a drawback.

Quote:
Originally Posted by theducks View Post
I backed in the check digit by trying [0-9X] until Calibre gave me a Green ISBN-10 confirmation.
Note: (USA publishers) space or dash were the only separators I saw. never mixed. never interrupted with other characters.
I can confirm I also have never seen it mixed. You have not seen dots between them?
Old 03-31-2011, 03:08 PM   #39
theducks
Grand Sorcerer
Posts: 15,269
Karma: 6022733
Join Date: Aug 2009
Location: (The original) Silicon Valley, USA
Device: Galaxy Tab 2, Astak Pocket Pro, K4NT
Quote:
Originally Posted by drMerry View Post
You have not seen dots between them?
You really expect me to remember a possible 1 or 2 out of 900+?

All I remember is that my attempt to validate ISBNs in my collection database (Paradox DOS) returned inconsistent results (90% passed).

All that checking the (failed) entries against the printed books showed was that it was not a 'fat fingering' problem.

I vaguely remember some non-US published books having more than 1 ('Country specific') ISBN on the copyright page.
Old 03-31-2011, 07:37 PM   #40
drMerry
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
Quote:
Originally Posted by theducks View Post
You really expect me to remember a possible 1 or 2 out of 900+
You don't?
Not a real e-book reader then.

Quote:
Originally Posted by theducks View Post
All I remember, was my attempt to Validate ISBN's in my collection Database (Paradox DOS), returned inconsistent results (90% passed)
...
I vaguely remember some non-US published books having more than 1 ('Country specific') ISBN on the copyright page.
I myself am not sure whether I have seen dots.
Spaces and dashes I am sure of.
Every added character will slow down the process a bit (noticeable when a large number of pages has to be scanned).

But I think that to speed up the process we will have to wait for the mentioned replacement of pdftohtml.
Old 04-02-2011, 10:36 AM   #41
Loeffel
Connoisseur
Posts: 58
Karma: 10
Join Date: Mar 2011
Device: Kindle 3 3G
Hi,
Normally the extraction runs fine, but if I try to scan many ebooks at once with the auto feature, the plugin hangs and the only way to continue is to kill calibre (I don't know at what number of books).
At first I thought it was OK, but then I saw that whenever this happens the plugin does not progress at all; it stops at the first ebook. Scanning that book alone, or just a few, is no problem.
I can scan about 300 books at once without problems.
Old 04-02-2011, 10:45 AM   #42
kiwidude
calibre/Sigil Developer
Posts: 4,230
Karma: 1345754
Join Date: Oct 2010
Location: London, UK
Device: Kindle Paperwhite 3G, iPad 3, iPad Air
@Loeffel - I would be surprised if it is "hanging"; I think it more likely you are hitting some large PDFs that take a long time to analyse. If you run the plugin in debug mode (Ctrl+Shift+R) you should see it continuing to display output as the input converters do their thing.
Old 04-02-2011, 12:35 PM   #43
Loeffel
Connoisseur
Posts: 58
Karma: 10
Join Date: Mar 2011
Device: Kindle 3 3G
I have no really big ebooks, but some are in different formats; perhaps that's the problem. Is there any way to tell the plugin to search only the first format found?
Old 04-02-2011, 12:58 PM   #44
kiwidude
calibre/Sigil Developer
Posts: 4,230
Karma: 1345754
Join Date: Oct 2010
Location: London, UK
Device: Kindle Paperwhite 3G, iPad 3, iPad Air
Go to the customisation for the plugin, and you can set its behaviour. However I think by default the alternate search is set to only check the first format in preferred input order.

It doesn't necessarily have to be a massive PDF, but just PDFs in general will slow it down, by how much depends on the content I think moreso than size. If it has lots of graphics I think that makes it grind rather slowly. There's a few posts in this thread about it if you read back. Now 0.7.53 is out I can start experimenting more seriously with the "first 10 pages/last 5 pages" approach to scanning which hopefully will improve things.
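
The "first 10 pages/last 5 pages" idea boils down to choosing which page indices to convert and scan. The sketch below shows only that selection step and assumes nothing about calibre's PDF engine; pages_to_scan and its parameters are made-up names.

```python
def pages_to_scan(page_count, front=10, back=5):
    """Return 0-based page indices to scan: front pages first, then back.

    Short documents are returned whole, without duplicate indices.
    """
    front_pages = list(range(min(front, page_count)))
    # Start the back section after the front section so nothing is repeated.
    back_start = max(page_count - back, len(front_pages))
    return front_pages + list(range(back_start, page_count))


if __name__ == "__main__":
    print(pages_to_scan(700))
    print(pages_to_scan(12))
```

For the 700-page PDF mentioned earlier in the thread, this would mean looking at 15 pages instead of 700, provided the ISBN sits near the front or back matter.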
Old 04-02-2011, 09:39 PM   #45
Loeffel
Connoisseur
Posts: 58
Karma: 10
Join Date: Mar 2011
Device: Kindle 3 3G
I found it. I will let it run while I'm sleeping and see what happened when I come back: whether it just looks like a hang or whether it really f...s up.
All times are GMT -4.