Old 03-31-2011, 12:39 PM   #37
drMerry
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
Quote:
Originally Posted by kiwidude View Post
drMerry - I will take a look at your changes once 0.7.53 goes out. Kovid has made some changes which will allow scanning pdfs with the new pdf engine for just a selected number of pages from the front and back. We will have to see whether that significantly improves the performance or not. My initial testing of scanning the whole document actually found that the new engine is currently slower than the existing one, but that should change when not doing the whole document hopefully.
I think so.
I had a 700-page PDF (163 MB).
It took more than half an hour before I knew that your tagger (also with my regex) could not find an ISBN.
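
To make the "front and back pages only" idea concrete, here is a minimal sketch of picking which pages to scan. It is not the plugin's or calibre's actual code; the function name `pages_to_scan` and the default page counts are made up for illustration.

```python
def pages_to_scan(total_pages, front=10, back=5):
    """Return sorted 0-based indices of the first `front` and last `back` pages.

    Hypothetical helper: the name and the defaults are assumptions,
    not part of calibre or the plugin.
    """
    if total_pages <= 0:
        return []
    front_pages = range(0, min(front, total_pages))
    back_pages = range(max(total_pages - back, 0), total_pages)
    # A set union removes overlap when the document is shorter than front + back.
    return sorted(set(front_pages) | set(back_pages))


if __name__ == "__main__":
    selected = pages_to_scan(700)
    print(len(selected), "of 700 pages would be scanned")
```

For a 700-page file like mine this would touch 15 pages instead of 700, which is where the speed-up would have to come from.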

Quote:
Originally Posted by kiwidude View Post
In terms of options to check numbers on technical manuals etc, I don't see where the issue will be. You say there will be more "errors" - do you mean more matches that are rejected? The logic I have will remain the same in terms of stopping searching after finding a valid ISBN. So surely the only issue will be for a manual that does not have an ISBN but does have lots of numbers in it will run slower? If I am able to somehow try to only scan a small front/back portion of all books (not just pdf ones) that shouldn't be an issue. I will look into that.
I mean that a book with a lot of numbers has a greater chance of containing a number that happens to conform to the ISBN standard. So this could give a false positive.
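
To illustrate why that can happen, here is a small sketch of the standard ISBN-10 and ISBN-13 check-digit tests. The function names are my own and this is not the plugin's code. The checksum alone is a weak filter: about 1 in 11 arbitrary 10-digit strings passes the ISBN-10 test, and 1 in 10 arbitrary 13-digit strings passes the ISBN-13 test (fewer if you also require a 978/979 prefix).

```python
DIGITS = "0123456789"


def is_valid_isbn10(s):
    """ISBN-10: weights 10..1, the weighted sum must be divisible by 11."""
    if len(s) != 10 or any(c not in DIGITS for c in s[:9]):
        return False
    if s[9] not in DIGITS + "Xx":
        return False
    values = [int(c) for c in s[:9]] + [10 if s[9] in "Xx" else int(s[9])]
    return sum((10 - i) * v for i, v in enumerate(values)) % 11 == 0


def is_valid_isbn13(s):
    """ISBN-13: alternating weights 1 and 3, the sum must be divisible by 10."""
    if len(s) != 13 or any(c not in DIGITS for c in s):
        return False
    total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(s))
    return total % 10 == 0


if __name__ == "__main__":
    # Count how many arbitrary 10-digit strings pass the ISBN-10 checksum.
    sample = ["%010d" % n for n in range(11000)]
    hits = sum(1 for s in sample if is_valid_isbn10(s))
    print(hits, "of", len(sample), "arbitrary strings pass the ISBN-10 check")
```

So a technical manual full of part numbers will, sooner or later, contain something that looks like a valid ISBN.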

Quote:
Originally Posted by kiwidude View Post
As for all the variations of ISBN being split across lines - I will be honest with my selfishness and repeat my statement above that I really don't care if there are really badly scanned documents that this fails to pick up an ISBN from. It is just a tool, not a miracle worker. If your ISBNs are so badly formatted the rest of the content of that document will surely also be dire - not getting an ISBN may force you to open it and see for yourself and perhaps either decide to look for a better copy or edit it.
I often see PDF files with the ISBN broken up across the front page (because the OCR cannot handle the front page/picture). The rest of the document is good in this case.
This is of course an OCR error, and I can understand that you do not want to invest in bad OCR. Because I've often seen it in books with the ISBN on the front cover, I would add the newline option myself. Testing ISBN numbers and trying to recover a good ISBN out of iop830l|Ix would be something else.
On the other hand, if I do not add the \s to the regex, I cannot retrieve ISBNs whose last digit comes right before a linefeed.
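
A rough sketch of what I mean by the newline option, assuming a pattern that allows hyphens and optionally whitespace (which includes linefeeds) between the digits. This is a made-up pattern for illustration, not the regex from the plugin or from my patch; `build_pattern` and `find_candidates` are hypothetical names.

```python
import re


def build_pattern(allow_whitespace=True):
    """Build a candidate pattern; `\\s` in the separator also covers linefeeds."""
    sep = r"[\s\-]*" if allow_whitespace else r"-*"
    isbn13 = r"[0-9](?:" + sep + r"[0-9]){12}"
    isbn10 = r"[0-9](?:" + sep + r"[0-9]){8}" + sep + r"[0-9Xx]"
    # The lookarounds keep the match from starting or ending inside a longer number.
    return re.compile(r"(?<![0-9])(?:" + isbn13 + "|" + isbn10 + r")(?![0-9])")


def find_candidates(text, allow_whitespace=True):
    """Return candidate ISBN strings with separators stripped (not yet validated)."""
    pattern = build_pattern(allow_whitespace)
    return [re.sub(r"[\s\-]", "", m.group()) for m in pattern.finditer(text)]


if __name__ == "__main__":
    page = "ISBN 978-0-306-\n40615-7\nnext line of the cover"
    print(find_candidates(page))                          # with \s
    print(find_candidates(page, allow_whitespace=False))  # hyphens only
```

The price of allowing \s is that unrelated numbers on neighbouring lines can be glued together into a candidate, so every candidate still needs a checksum test afterwards.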

Regarding your opinion about 98% and a lot of sub-options:
agreed