05-05-2017, 08:37 PM   #1
jgray
Fanatic
Posts: 554
Karma: 2928497
Join Date: Mar 2008
Device: Clara 2E & Sage
Public Domain books inaccessible at Hathi Trust

If this has been discussed before, feel free to delete this post. I did a search of the forums, but didn't find anything.

I don't know what kind of deal Hathi Trust and the associated universities have with Google, but I find it very annoying that books clearly labeled "Public Domain" cannot be downloaded in full without a login from one of a specific list of universities. Sure, you can view the book online and download one page at a time, but I don't call that proper access to public domain material.

After quite a bit of searching for a downloadable copy of a particular book, I stumbled across something called "Hathi Download Helper".

https://sourceforge.net/projects/hathidownloadhelper/

This program automates the tedious process of downloading one page at a time. It then assembles everything into one PDF for you.

Once installed, set your defaults under the "Options" menu. What works best is to set the download option to "1 pdf per page, searchable text". Also set the "Destination folder" and check "create pdf book after download".

To download a book, paste the URL for that book into the appropriate box in the program and click "Get book info". Next, click "Start download".

Because the Hathi web server throttles requests, the program can only download a few pages at a time, then has to wait five minutes before continuing. Once all pages are downloaded, the program will assemble them into a single PDF, with the OCR'ed text included.
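For anyone curious what that download-then-wait pattern looks like, here is a rough Python sketch. It is not the helper program's actual code, just an illustration of the idea. The function names, the batch size of 5 (the program only downloads "a few" pages at a time, so the exact number is a guess), and the `fetch_page` callback are all made up for the example; only the five-minute pause comes from what I observed.

```python
import time
from typing import Callable, Iterable, List


def download_in_batches(
    pages: Iterable[int],
    fetch_page: Callable[[int], bytes],   # hypothetical: returns one page's PDF bytes
    batch_size: int = 5,                  # assumption: "a few pages at a time"
    pause_seconds: float = 300.0,         # five minutes between batches
    sleep: Callable[[float], None] = time.sleep,
) -> List[bytes]:
    """Fetch pages in small batches, pausing between batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    page_list = list(pages)
    results: List[bytes] = []
    for start in range(0, len(page_list), batch_size):
        if start > 0:
            # Wait before every batch except the first one.
            sleep(pause_seconds)
        for page in page_list[start:start + batch_size]:
            results.append(fetch_page(page))
    return results


def estimated_wait_minutes(
    page_count: int, batch_size: int = 5, pause_minutes: float = 5.0
) -> float:
    """Total pause time only; actual download time comes on top of this."""
    if page_count <= 0:
        return 0.0
    batches = -(-page_count // batch_size)  # ceiling division
    return (batches - 1) * pause_minutes
```

With those assumed numbers, `estimated_wait_minutes(300)` shows why a 300-page book takes several hours: 60 batches means 59 five-minute pauses. Start the download and go do something else.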

What started me on this journey was my mother's cookbook, "Rumford Complete Cook Book", 1950 edition. My sister has what remains of my mother's cookbook, but the covers and several pages are missing. The cookbook saw quite a bit of use over the years.

I did find the 1947 edition online and downloadable, but the recipes were somewhat different from my mother's copy. I was glad that I found the 1950 edition, but then was very annoyed that I could not download this public domain book. I wanted to give a copy to my sister for use on her tablet, and spare the well-used original. Finding the download helper program solved the problem.

When you view a book at Hathi Trust with their online viewer, you will see the typical "Scanned by Google" watermark on every page. It gets worse. On the downloaded PDF pages, Hathi adds another watermark on every page. These watermarks interfere with the auto-crop feature in many PDF reader programs. Well, there are programs that will remove them.

Since the Rumford cookbook was clearly listed as public domain, I removed the watermarks, then spent a few hours creating a table of contents, listing every recipe and other pertinent pages. If you want a copy, you can find it on my web page: www.zianet.com/jgray/

Joe

Last edited by issybird; 05-06-2017 at 04:16 PM. Reason: Post restored.