WSJ Eastern Edition from Public Library
Hello there,
I would like to know if there is any tool out there that can clean up a bunch of HTML files quickly. This includes removing buttons, search boxes, drop-down menus, etc., and then combining several HTML files into a single page.
The reason I am asking is that my local public library gives me access to ProQuest, which has access to quite a few newspapers. This includes full access to, for instance, the Los Angeles Times and the Wall Street Journal Eastern Edition. I have been trying to put the WSJ on the Sony PRS500 with some success. However, it requires some manual editing of the HTML pages and downloading the graphics on each page.
Here are the two options that are available to me.
1) Email 50 pages at a time of any selected articles. This email will be in HTML format with no buttons, links, or drop-down menus. Usually, this comes out to 3-4 separate HTML files, since there are roughly 150 articles daily.
2) Email an index page with links to all the articles.
For option 1, here is the workflow to get it onto my reader.
1) Email myself the pages and save them to disk.
2) Perform a small amount of editing: remove the header, footer, and space filler using Notepad.
3) Reopen the file in Firefox and save it to disk using the "ScrapBook" plugin, which basically converts all the links to local files.
4) Convert the HTML page into LRF using Calibre and upload it to the Reader.
5) Repeat steps 1-4 for all the pages (this usually takes me 15 to 20 minutes).
For option 2, here is what I have so far.
1) Email myself the index page and then save it to disk.
2) Open the page in Firefox and use the "ScrapBook" plugin to download all the pages that are linked from this index file. This gives a bunch of separate HTML files and their associated images.
This results in many HTML pages. With this option, the articles are not just text and images; they also include buttons, search boxes, and drop-down menus. Hence, once converted, they look pretty untidy. I have tried "HTML Tidy" but that doesn't really work. Also, I can't seem to combine them into a single ebook using Calibre.
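To make the cleanup part concrete, here is a rough sketch in Python of the kind of thing I mean, using only the standard library's `html.parser`. It assumes the unwanted widgets are ordinary `form`, `select`, `button`, and `input` elements (plus `script` and `style`); I have not checked this against the actual ProQuest markup, and the names `WidgetStripper` and `strip_widgets` are made up for the example.

```python
from html.parser import HTMLParser

# Assumption: the page's widgets live in these elements. Adjust for the real markup.
SKIP_WITH_CONTENT = {"form", "select", "button", "script", "style"}
SKIP_VOID = {"input"}  # no closing tag, so dropped on its own


class WidgetStripper(HTMLParser):
    """Re-emit the HTML it is fed, minus the widget elements listed above."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_WITH_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag not in SKIP_VOID:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closed tags like <br/> are copied through unless they are widgets.
        if (self.skip_depth == 0 and tag not in SKIP_VOID
                and tag not in SKIP_WITH_CONTENT):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in SKIP_WITH_CONTENT:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif self.skip_depth == 0 and tag not in SKIP_VOID:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&#{name};")


def strip_widgets(html_text):
    """Return html_text with widget elements (and their contents) removed."""
    parser = WidgetStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Two limits of this sketch: comments and the doctype are dropped, and a widget element that is never closed in the source would swallow everything after it, so real pages may need a more forgiving parser.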
With option 1, I can get the pages onto the reader quickly. However, all the articles are in a single document, so I can't jump from one article to the next and have to navigate sequentially. Option 2 has more potential but requires editing over 150 HTML pages.
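For the combining part, something along these lines is what I picture: one page with a linked index at the top and each article under its own anchor. This is only a sketch in Python; the function name `combine_articles` and the `article-N` anchor ids are invented for the example, and I have not verified whether the LRF conversion keeps internal links working on the reader.

```python
from html import escape


def combine_articles(articles, page_title="Combined articles"):
    """Build one HTML page from (title, body_html) pairs, in reading order.

    The body HTML is inserted as-is, so it should already be cleaned up.
    """
    toc = []
    sections = []
    for i, (title, body) in enumerate(articles, start=1):
        anchor = f"article-{i}"  # hypothetical id scheme
        toc.append(f'<li><a href="#{anchor}">{escape(title)}</a></li>')
        sections.append(f'<h2 id="{anchor}">{escape(title)}</h2>\n{body}')
    return (
        "<html><head><title>" + escape(page_title) + "</title></head><body>\n"
        "<ul>\n" + "\n".join(toc) + "\n</ul>\n"
        + "\n<hr>\n".join(sections)
        + "\n</body></html>\n"
    )
```

Example use: `combine_articles([("Front page", "<p>...</p>"), ("Markets", "<p>...</p>")])` returns a single HTML string that can be saved to disk and fed to the converter.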
Please let me know what my options are. Thank you!
Last edited by OrcaBlue; 06-13-2008 at 02:01 PM.
Reason: typo