Quote:
Originally Posted by myce
Extract ISBN is really great at extracting ISBNs from the book's text. But this made it stumble.
From "The Definitive Guide to How Computers Do Math: Featuring the Virtual Diy Calculator" page 2:
Code:
For general information on our other products and services please contact our Customer Care
Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993 or fax 317-572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print,
however, may not be available in electronic format.
Library of Congress Cataloging-in-Publication Data is available.
ISBN-13 978-0471-73278-5
ISBN-10 0-471-73278-8
results in the log file:
Code:
Invalid ISBN match: 877-762-2974
Valid ISBN10: 3175723993
Invalid ISBN match: 317-572-4002
Invalid ISBN match: -13 978-0471-73278
Invalid ISBN match: -10 0-471-73278-8
I understand that it detects 3175723993 as a valid ISBN. But maybe you could make it reparse substrings if the number it found is longer than 10/13 digits. Or maybe even look for the string ISBN.{,3}1[03] explicitly and give the numbers in its vicinity higher precedence.
|
IMHO only one parse rule should be used at a time. The last two matches broke that rule and therefore failed to find a valid ISBN: space or dash, not both in the same substring.
Once a 10-character candidate (ISBN-10) is found, the check digit should validate. A NANP phone number should fail that check in most cases (roughly 10 out of 11); the outside-U.S. number 317-572-3993 is one of the edge cases that happens to pass.
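To make both suggestions concrete, here is a rough Python sketch. It is not how Extract ISBN actually works; the `LABELLED` pattern and all function names are made up for illustration. It only accepts numbers that follow an "ISBN" label, rejects candidates that mix spaces and dashes, and then validates the check digit.

```python
import re

def is_valid_isbn10(digits: str) -> bool:
    """ISBN-10 check: weights 10..1, sum must be divisible by 11."""
    if not re.fullmatch(r"\d{9}[\dXx]", digits):
        return False
    total = sum((10 - i) * (10 if ch in "Xx" else int(ch))
                for i, ch in enumerate(digits))
    return total % 11 == 0

def is_valid_isbn13(digits: str) -> bool:
    """ISBN-13 check: weights 1,3,1,3,..., sum must be divisible by 10."""
    if not re.fullmatch(r"\d{13}", digits):
        return False
    total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(digits))
    return total % 10 == 0

# Hypothetical pattern (an assumption, not the plugin's regex):
# the label "ISBN", an optional "-10"/"-13" suffix, then the number itself.
LABELLED = re.compile(r"ISBN(?:[-\s]?1[03])?:?\s*([\dXx][\dXx -]{8,20})", re.I)

def find_labelled_isbns(text: str) -> list[str]:
    """Return checksum-valid ISBNs that directly follow an ISBN label."""
    found = []
    for m in LABELLED.finditer(text):
        raw = m.group(1).strip()
        # One separator style per candidate: space or dash, not both.
        if " " in raw and "-" in raw:
            continue
        digits = re.sub(r"[ -]", "", raw)
        if is_valid_isbn13(digits) or is_valid_isbn10(digits):
            found.append(digits.upper())
    return found

page = (
    "Department within the U.S. at 877-762-2974, outside the U.S. at "
    "317-572-3993 or fax 317-572-4002.\n"
    "ISBN-13 978-0471-73278-5\n"
    "ISBN-10 0-471-73278-8\n"
)
print(find_labelled_isbns(page))      # ['9780471732785', '0471732788']
print(is_valid_isbn10("3175723993"))  # True: the phone number passes the checksum
```

The last line shows why the check digit alone is not enough here: 317-572-3993 really does satisfy the ISBN-10 checksum, so only the label requirement keeps it out of the results.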