I just ran across this editorial (via a /. posting) on some of the problems with the book database Google is building, and with relying on OCR for classification, specifically as they might frustrate scholars trying to mine this data:
Start with publication dates. To take Google's word for it, 1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler's Killer in the Rain, The Portable Dorothy Parker, André Malraux's La Condition Humaine, Stephen King's Christine, The Complete Shorter Fiction of Virginia Woolf, Raymond Williams's Culture and Society 1780-1950, and Robert Shelton's biography of Bob Dylan, to name just a few. And while there may be particular reasons why 1899 comes up so often, such misdatings are spread out across the centuries. A book on Peter F. Drucker is dated 1905, four years before the management consultant was even born; a book of Virginia Woolf's letters is dated 1900, when she would have been 8 years old. Tom Wolfe's Bonfire of the Vanities is dated 1888, and an edition of Henry James's What Maisie Knew is dated 1848.
...
Here, too, Google has blamed the errors on the libraries and publishers who provided the books. But the libraries can't be responsible for books mislabeled as Health and Fitness and Antiques and Collectibles, for the simple reason that those categories are drawn from the Book Industry Standards and Communications codes, which are used by the publishers to tell booksellers where to put books on the shelves, not from any of the classification systems used by libraries. And BISAC classifications weren't in wide use before the last decade or two, so only Google can be responsible for their misapplications on numerous books published earlier than that: the 1919 edition of Robinson Crusoe assigned to Crafts & Hobbies or the 1907 edition of Sir Thomas Browne's Hydriotaphia: Urne-Buriall, which has been assigned to Gardening.