03-30-2011, 04:21 AM   #27
drMerry
@kiwidude

I made a mistake in my regex.
The new regex will not find ISBN numbers with a length of 10 that contain no spaces, dots or dashes. This is because it now checks two groups: 10-24 positions, plus 1 final position, which requires at least 11 characters. The range must be 9-24.

After I realized from your post that the regex was only meant as a proof of concept, I thought about some optimization and came to the following conclusion (I will test it tonight at home).

An ISBN will always start with 978 or 979 and will have 10 or 13 digits (or one digit less, with an x at the end).
http://www.isbn-international.org/faqs/view/5#q_5

So you do not have to test for words like isbn or anything like that. Every extra test will consume exponential time (you have to test for isbn AND isbn: AND 1sbn and.....)

So I thought of this implementation:
It searches for digits with \d (not 0-9: that is 10 tests, while a digit is 1 test).
It searches for whitespace characters (\s), so spaces are counted, but tabs are too.
quick search: (97[89](\d{6}|\d{9})[\dxX]) use match(0)
optimal search: (97[89][\d\s\-\.]{6,24}[\dxX]) use match(0)
extended search: like optimal, but you also add some of the mentioned characters to catch more ISBN numbers. This is a heavy implementation, because you have to replace characters in the number and afterwards test whether you got a real ISBN; otherwise you still have to report it as not found.
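
For anyone who wants to try the quick and optimal patterns outside calibre, here is a minimal sketch using Python's re module. The two patterns are copied from this post; the names QUICK, OPTIMAL, QUICK_LONG_FIRST, find_first and strip_separators are made up for the example, and QUICK_LONG_FIRST is a variant of mine, not something tested in the thread. It is there because a backtracking engine tries \d{6} before \d{9}, so on a 13-digit number the quick pattern as written stops after 10 characters.

```python
import re

# Patterns copied from the post (names are hypothetical, for this sketch only).
QUICK = re.compile(r"(97[89](\d{6}|\d{9})[\dxX])")
OPTIMAL = re.compile(r"(97[89][\d\s\-\.]{6,24}[\dxX])")

# Assumed variant, not from the post: try the longer alternative first so a
# 13-digit number is not cut off after 10 characters.
QUICK_LONG_FIRST = re.compile(r"(97[89](\d{9}|\d{6})[\dxX])")


def find_first(pattern, text):
    """Return the whole first match (group 0), or None if nothing matches."""
    match = pattern.search(text)
    return match.group(0) if match else None


def strip_separators(candidate):
    """Remove whitespace, dashes and dots from a matched candidate."""
    return re.sub(r"[\s\-.]", "", candidate)


if __name__ == "__main__":
    plain = "9780306406157"
    dashed = "ISBN 978-0-306-40615-7 (paperback)"

    print(find_first(QUICK, plain))             # only the first 10 characters
    print(find_first(QUICK_LONG_FIRST, plain))  # all 13 digits
    print(find_first(QUICK, dashed))            # None: quick allows no separators
    print(strip_separators(find_first(OPTIMAL, dashed)))
```

Both patterns only find numbers that begin with 978 or 979. Older 10-digit ISBNs are normally printed without that prefix, so neither pattern would pick those up.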

To be more sure you get ISBN numbers and not phone numbers, you can add some extra conditions, for example that the match may not be prefixed with the word tel, a + or a 0.
But as far as I know this is not needed. I did a quick test with implementations 1 and 2 on a text file (not with calibre).
Both processed the file in a fraction of the time compared to the original regex, and I got more numbers than with the original.
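
Another way to separate real ISBN numbers from phone numbers, and to do the "did I get a real ISBN" test from the extended search, is the check digit. This is not something I used in my test; it is a sketch of the standard ISBN-10 (mod 11) and ISBN-13 (mod 10) check-digit rules in Python, with made-up function names. It assumes the candidate comes out of one of the searches and may still contain spaces, dashes or dots.

```python
import re


def is_valid_isbn13(digits):
    """ISBN-13 rule: weights alternate 1, 3, 1, 3, ... and the sum is a multiple of 10."""
    if not re.fullmatch(r"[0-9]{13}", digits):
        return False
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(digits))
    return total % 10 == 0


def is_valid_isbn10(chars):
    """ISBN-10 rule: weights run 10 down to 1, a final X counts as 10, sum is a multiple of 11."""
    if not re.fullmatch(r"[0-9]{9}[0-9xX]", chars):
        return False
    values = [int(ch) for ch in chars[:9]]
    values.append(10 if chars[9] in "xX" else int(chars[9]))
    total = sum(value * (10 - i) for i, value in enumerate(values))
    return total % 11 == 0


def is_valid_isbn(candidate):
    """Strip separators, then apply the rule that fits the remaining length."""
    cleaned = re.sub(r"[\s\-.]", "", candidate)
    if len(cleaned) == 13:
        return is_valid_isbn13(cleaned)
    if len(cleaned) == 10:
        return is_valid_isbn10(cleaned)
    return False


if __name__ == "__main__":
    for candidate in ["978-0-306-40615-7", "0-306-40615-2", "978 1234567"]:
        print(candidate, is_valid_isbn(candidate))
```

A random digit string that happens to start with 978 will usually fail this test, so it could replace the prefix checks (tel, + or 0), at the cost of one extra pass over each match.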

I am posting this on the forum because I hope there are people who can think of a (regular) case where my idea would fail, but where it would work if you just tested whether the word ISBN was present as a prefix.