Old 06-13-2008, 12:49 PM   #1
OrcaBlue
Device: Sony PRS-500
Question WSJ Eastern Edition from Public Library

Hello there,

I would like to know if there is a tool that can quickly clean up a batch of HTML files. This includes removing buttons, search boxes, drop-down menus, etc., and then combining several HTML files into a single page.

The reason I am asking is that my local public library gives me access to ProQuest, which has access to quite a few newspapers. This includes full access to, for instance, the Los Angeles Times and the Wall Street Journal Eastern Edition. I have been trying to put the WSJ on the Sony PRS-500 with some success. However, it requires some manual editing of the HTML pages and downloading the graphics on each page.

Here are the two options available to me.

1) Email 50 pages at a time of any selected articles. The email will be in HTML format with no buttons, links, or drop-down menus. Usually, this comes out to 3-4 separate HTML files, since there are roughly 150 articles daily.

2) Email an index page with links to all the articles.

For option 1, here is the workflow to get it onto my reader.

1) Email myself the pages and save them to disk.

2) Perform a small amount of editing: remove the header, footer, and space filler using Notepad.

3) Reopen the file in Firefox and save it to disk using the "ScrapBook" plugin, which basically converts all the links to local files.

4) Convert the HTML page into LRF using Calibre and upload it to the Reader.

5) Repeat steps 1-4 for all the pages (this usually takes me 15 to 20 minutes).

For option 2, here is what I have so far.

1) Email myself the index page and then save it to disk.

2) Open the page in Firefox and use the "ScrapBook" plugin to download all the pages linked from this index file. This produces a bunch of separate HTML files and their associated images.

This results in many HTML pages. With this option, the articles are not just text and images; they include buttons, search boxes, and drop-down menus. Hence, once converted, the result looks pretty untidy. I have tried "HTML Tidy", but that doesn't really work. Also, I can't seem to combine them into a single ebook using Calibre.

With option 1, I can get the pages onto the reader quickly. However, all the pages are in a single document, so I can't jump from one article to the next and have to navigate sequentially. Option 2 has more potential but requires editing over 150 HTML pages.
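
For the option 2 cleanup, a minimal Python sketch along these lines might do the bulk of the editing. It uses only the standard library's html.parser, drops form controls and scripts, and re-emits everything else. The tag lists and the names WidgetStripper and strip_widgets are assumptions for illustration; the real ProQuest pages may mark up their buttons and menus differently. Comments and doctype declarations are also dropped, and an unclosed dropped tag would swallow the rest of the page.

```python
from html.parser import HTMLParser

# Elements dropped together with their content (assumed list; adjust to the real pages).
DROP_WITH_CONTENT = {"form", "select", "button", "script", "style"}
# Void elements dropped on their own.
DROP_VOID = {"input"}


class WidgetStripper(HTMLParser):
    """Re-emit HTML while skipping form controls, scripts, and styles."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_VOID:
            return
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        keep = tag not in DROP_VOID and tag not in DROP_WITH_CONTENT
        if keep and self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in DROP_VOID:
            return
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&#{name};")


def strip_widgets(html_text):
    """Return html_text without the dropped elements."""
    parser = WidgetStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Calling strip_widgets on the text of each saved file, in a loop over the folder ScrapBook wrote, would replace the manual pass over 150 pages. The output can then be saved under a new name and given to Calibre.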

Please let me know what my options are. Thank you!

Last edited by OrcaBlue; 06-13-2008 at 02:01 PM. Reason: typo
Old 06-15-2008, 11:14 PM   #2
edembowski
Device: Sony PRS-600, Nook Color, iPad
This sounds like it's begging to be scripted. Take a look at how your library does authentication to the website. It's probably a form POST with a session cookie returned. Using curl, you can log in and fetch the pages. grep, sed, and awk are your friends; you can do streaming editing with them. For a little more work, you can do the whole thing in Perl. You can also finish off by converting the HTML to LRF.
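
As a rough illustration of the form-post-plus-session-cookie idea, here is a sketch in Python rather than curl (swapped in only because it keeps everything in one script). The URLs, the form field names barcode and pin, and the function names are all hypothetical; the real ones would have to be read from the library's login page.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical addresses; replace with the library's real login and index URLs.
LOGIN_URL = "https://library.example.org/login"
INDEX_URL = "https://library.example.org/proquest/index"


def make_session():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(card_number, pin):
    """Build the login form POST. The field names are guesses."""
    body = urllib.parse.urlencode({"barcode": card_number, "pin": pin})
    # Passing data makes urllib send the request as a POST.
    return urllib.request.Request(LOGIN_URL, data=body.encode("ascii"))


def fetch_index(card_number, pin):
    """Log in, then fetch the index page using the same cookie session."""
    opener, _jar = make_session()
    with opener.open(build_login_request(card_number, pin), timeout=30) as resp:
        resp.read()  # the session cookie is captured by the jar
    with opener.open(INDEX_URL, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Nothing here touches the network until fetch_index is called. If the library redirects through several pages before reaching ProQuest, the same opener should carry the cookies along, but that would need checking against the real site.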

- Ed
Old 06-20-2008, 02:31 AM   #3
OrcaBlue
Thanks for the suggestion, edembowski.

Yes, I thought about using sed/awk or Perl to accomplish this. I will need to spend some time learning how to use these tools first. In order to access ProQuest, I have to log in via my library's website. You are right about it using cookies. When I fetch the site directly via the RSS feed, which is an available option, I am prompted for a login and username, which I don't have. Hence, the only way to access the documents is via my library's redirection.

For now, I am able to get all the articles and create a table of contents using Calibre. This involves manually removing the excess page elements that I don't want, then downloading a local copy and combining the pages using ScrapBook (a Mozilla plugin). I also add a keyword at the beginning of each article so that Calibre can search for it and create the TOC. Because there are more than 150 items in the TOC, it takes at least a few minutes to jump from a document to the TOC and back. It is a somewhat clumsy hack.
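
The combining and keyword steps could be scripted roughly as follows. This is only a sketch: the marker word ARTICLE, the anchor ids, and the function names are made up for illustration, and whether Calibre picks up the marker depends on its conversion settings. Each article gets a heading that starts with the marker, and a linked contents list goes at the top of the combined page.

```python
import html
import re

# Grabs whatever sits between <body ...> and </body>.
BODY_RE = re.compile(r"<body[^>]*>(.*?)</body>", re.IGNORECASE | re.DOTALL)


def body_of(page):
    """Return the inner body markup, or the whole page if there is no body tag."""
    match = BODY_RE.search(page)
    return match.group(1) if match else page


def combine(articles, marker="ARTICLE"):
    """Merge (title, html_text) pairs into one HTML document with a contents list."""
    toc, sections = [], []
    for number, (title, page) in enumerate(articles, start=1):
        safe = html.escape(title)
        toc.append(f'<li><a href="#a{number}">{safe}</a></li>')
        # The marker word at the start of the heading is the searchable keyword.
        sections.append(
            f'<h2 id="a{number}">{marker} {number}: {safe}</h2>\n{body_of(page)}'
        )
    return (
        "<html><body>\n<h1>Contents</h1>\n<ul>\n"
        + "\n".join(toc)
        + "\n</ul>\n"
        + "\n".join(sections)
        + "\n</body></html>\n"
    )
```

Passing in fewer articles, say only the sections actually read each day, would also shorten the TOC and should make jumping back and forth less painful.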

In future attempts, I might reduce the number of articles that I download, since I can't really read all of them in one day.

In any case, thanks for the tips. I will post any additional progress later.