06-13-2008, 12:49 PM | #1 |
Groupie
Posts: 189
Karma: 2190
Join Date: Aug 2007
Device: Sony PRS-500
|
WSJ Eastern Edition from Public Library
Hello there,
I would like to know if there is any tool out there that can clean up a bunch of HTML files quickly. This includes removing buttons, search boxes, drop-down menus, etc., and then combining several HTML files into a single page.
The reason I am asking is that my local public library gives me access to ProQuest, which has access to quite a few newspapers. This includes full access to, for instance, the Los Angeles Times and the Wall Street Journal Eastern Edition. I have been trying to put the WSJ on the Sony PRS-500 with some success. However, it requires some manual editing of the HTML pages and downloading the graphics on each page.
Here are the two options available to me:
1) Email 50 pages at a time of any selected articles. This email will be in HTML format with no buttons, links, or drop-down menus. Usually this comes out to 3-4 separate HTML files, since there are roughly 150 articles daily.
2) Email an index page with links to all the articles.
For option 1, here is the workflow to get it onto my reader:
1) Email myself the pages and save them to disk.
2) Perform a small amount of editing: remove the header, footer, and space filler using Notepad.
3) Reopen the file in Firefox and save it to disk using the "ScrapBook" plugin, which basically converts all the links to local files.
4) Convert the HTML page into LRF using Calibre and upload it to the Reader.
5) Repeat steps 1-4 for all the pages. (This usually takes me 15 to 20 minutes.)
For option 2, here is what I have so far:
1) Email myself the index page and then save it to disk.
2) Open the page in Firefox and use the "ScrapBook" plugin to download all the pages linked from the index file. This results in many separate HTML files and their associated images.
With this option, the articles are not just text and images. They include buttons, search boxes, and drop-down menus, so once converted they look pretty untidy. I have tried "HTML tidy" but that doesn't really work. Also, I can't seem to combine them into a single ebook using Calibre.
With option 1, I can get the pages onto the reader quickly. However, all the articles are in a single document, so I can't jump from one article to the next; it requires sequential navigation. Option 2 has more potential but requires editing over 150 HTML pages.
Please let me know what my options are. Thank you!!!
Last edited by OrcaBlue; 06-13-2008 at 02:01 PM. Reason: typo |
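For the cleanup part of option 2, a minimal sketch in Python (standard library only) might look like the following. It assumes the unwanted buttons, search boxes, and drop-down menus are ordinary `form`, `input`, `select`, and `button` elements; the real ProQuest markup may differ, so the tag lists are a guess to adjust. The names `WidgetStripper` and `strip_widgets` are made up for this illustration.

```python
from html.parser import HTMLParser

# Assumption: the page widgets live in these elements; adjust after
# inspecting the saved pages.
DROP_WITH_CONTENT = {"form", "select", "button", "script", "style", "noscript"}
DROP_VOID = {"input"}  # void tags dropped on their own


class WidgetStripper(HTMLParser):
    """Re-emit HTML while skipping interactive widgets.

    Comments and the doctype are not re-emitted either.
    """

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
        elif tag not in DROP_VOID and self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> pass through unless dropped.
        dropped = tag in DROP_WITH_CONTENT or tag in DROP_VOID
        if not dropped and self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag not in DROP_VOID and self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&#{name};")


def strip_widgets(html_text: str) -> str:
    """Return html_text with widget elements removed."""
    parser = WidgetStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Running `strip_widgets` over each saved file before conversion would replace the hand editing. One caveat: a dropped element that is never closed in the source would swallow everything after it, so it is worth spot-checking a few pages.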
06-15-2008, 11:14 PM | #2 |
Zealot
Posts: 138
Karma: 372
Join Date: Apr 2008
Location: New York, NY
Device: Sony PRS-600, Nook Color, iPad
|
This sounds like it's begging to be scripted. Take a look at how your library does authentication to the web site. It's probably a form post with a session cookie returned. Using curl, you can log in and fetch the pages. grep, sed, and awk are your friends; you can do streaming editing with them. For a little more work, you can do the whole thing in Perl. You can also finish off by converting the HTML to LRF.
- Ed |
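The login-then-fetch idea can also be sketched in Python rather than curl (a swapped-in tool, not what the post names). This assumes the library really does use a plain form POST that sets a session cookie, which is only a guess in this thread. The URL and the form field names `barcode` and `pin` are hypothetical placeholders; the real ones would come from the library's login page source.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical values: replace with whatever the library's login form uses.
LOGIN_URL = "https://library.example.org/login"
LOGIN_FIELDS = {"barcode": "YOUR-CARD-NUMBER", "pin": "YOUR-PIN"}


def make_session():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(url, fields):
    """Build the form POST a browser would send from the login page."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(url, data=body, method="POST")


def fetch(opener, url):
    """Fetch one page through the session and decode it as text."""
    with opener.open(url, timeout=30) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def login_and_fetch(urls):
    """Log in once, then fetch every URL with the same session cookie."""
    opener, _jar = make_session()
    opener.open(build_login_request(LOGIN_URL, LOGIN_FIELDS), timeout=30).close()
    return [fetch(opener, u) for u in urls]
```

Calling `login_and_fetch` with the article links taken from the index page would return the page sources, ready for cleanup and conversion. If the library redirects through several hosts, the opener follows the redirects and keeps the cookies for each host.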
06-20-2008, 02:31 AM | #3 |
Groupie
Posts: 189
Karma: 2190
Join Date: Aug 2007
Device: Sony PRS-500
|
Thanks for the suggestion, edembowski.
Yes, I thought about using sed/awk or Perl to accomplish this. I will need to spend some time learning how to use these tools first. In order to access ProQuest, I have to log in via my library's website. You are right about it using cookies. When I fetch the site directly via the RSS feed, which is an available option, I am prompted for a login and username, which I don't have. Hence, the only way to access the documents is via my library's redirection.
For now, I am able to get all the articles and create a table of contents using Calibre. This involves manually removing the excess page elements that I don't want, then downloading a local copy and combining the pages using ScrapBook (a Mozilla plugin). I also add a keyword at the beginning of each article that Calibre can search for to create the TOC. It is a somewhat clumsy hack. Also, because there are more than 150 items in the TOC, it takes at least a few minutes to jump from a document to the TOC and back.
In future attempts, I might reduce the number of articles I download, since I can't really read all of them in one day. In any case, thanks for the tips. I will post any additional progress later. |
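The combine-and-mark step described above could be scripted roughly as follows. This is only a sketch: the marker word `ARTICLE`, the `art1`, `art2`, ... anchor ids, and the function names are all invented for illustration, and it assumes each saved page has a `<title>` and a `<body>`. How the converter is told to look for the marker is left out, since that depends on the Calibre version in use.

```python
import html
import re
from pathlib import Path

MARKER = "ARTICLE"  # hypothetical keyword for chapter detection to search for


def extract_title(page: str, fallback: str) -> str:
    """Return the page's <title> text, or fallback if there is none."""
    m = re.search(r"<title[^>]*>(.*?)</title>", page, re.I | re.S)
    return html.unescape(m.group(1)).strip() if m else fallback


def extract_body(page: str) -> str:
    """Return what is inside <body>, or the whole page if no body tag."""
    m = re.search(r"<body[^>]*>(.*)</body>", page, re.I | re.S)
    return m.group(1).strip() if m else page


def combine(pages):
    """Merge (name, html) pairs into one document with a linked contents list."""
    toc, sections = [], []
    for i, (name, page) in enumerate(pages, start=1):
        title = html.escape(extract_title(page, name))
        toc.append(f'<li><a href="#art{i}">{title}</a></li>')
        # The marker heading starts each article so it can be found later.
        sections.append(
            f'<h2 id="art{i}">{MARKER}: {title}</h2>\n{extract_body(page)}'
        )
    return (
        '<html><head><meta charset="utf-8">'
        "<title>Combined articles</title></head><body>\n"
        '<h1 id="toc">Contents</h1>\n<ol>\n'
        + "\n".join(toc)
        + "\n</ol>\n"
        + "\n".join(sections)
        + "\n</body></html>\n"
    )


def combine_directory(src_dir, out_file):
    """Combine every *.html file in src_dir (sorted by name) into out_file."""
    pages = [
        (p.stem, p.read_text(encoding="utf-8", errors="replace"))
        for p in sorted(Path(src_dir).glob("*.html"))
    ]
    Path(out_file).write_text(combine(pages), encoding="utf-8")
```

Limiting the input to fewer files is then just a matter of what is placed in the source folder, which fits the plan to download fewer articles per day.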