MobileRead Forums - View Single Post

geekraver · 04-05-2007, 05:27 PM

Quote:

Originally Posted by nmackay

Like many others I was surprised at how poor the Sony Connect software is for such a good unit, and delighted when I found web2book. I use it for several RSS feeds I watch. Now, I have used computers for probably longer than many of the contributors to this forum (As a Capetonian, Geekraver might like to know that at UCT in the early 70's I used to work with the Psychology Department main frame - & yes, the units were literally mounted on a frame), however, I do not have the knowledge to customize my feed/web information to pick out particular sub feeds, or threads (eg this Mobileread one here) or to manage one that needs a password. Is there any chance that someone might write a basic set of instructions for those like me? I expect that there are others who want this but feel too awed by the high geek quotient of the forum contributors to ask.

He he - well, I do remember the old Sperry 1100 well, writing Fortran progs in punched cards.

By this stage Adin is probably more of an expert than I am. He gave a pretty detailed description of his approach (which I haven't yet read in detail). I'll add mine as it may be slightly different and have some value.

1. First you need to get the URL for the RSS feed of the site you care about. Enter it into your browser and look at the results. If they have the content you want, then all you really need to do is add the URL to web2book; you shouldn't even need to bother with the settings under 'Customize'.

2. Assuming they don't have the content you want (e.g. they have an excerpt and end with "Read More" or something like that), then you will need to customize them. Typically I will at this point do two things:

i) right click in the browser and select 'View Source', and look at the RSS XML, to make sure that the permalink or other link has an XML tag that web2book expects; you can see which one web2book expects by going to Customize and clicking on Help. If this feed for some reason has an unusual XML element tag, then you'll need to enter its name in the Link Element field.

ii) in the original page in the browser, click on the title link of the first story to have the browser load up the referenced page. We now want to deal with this page, which we'll do in step 3.

3. If the page has a "Printable version" or "Print" link at the top or bottom, we probably want to use that version of the page, as it will have less fluff like ads that needs to be stripped out (if there is no such link go to step 4). So we have to figure out how to get at the link for that. I'll typically hover over the "Print" or "Printable Version" button/link, and see in the status bar of the browser what the URL is for that version. We want to either munge the original article link into this new print one (which we might be able to do just with the link extraction pattern and link reformatter), or we may have to suck the link out of the page we are now viewing (which requires checking the checkbox which says "Apply extractor to linked content instead of link text"). In the latter case we have to look at the web page source, find the part that has the HREF for the printable version, and figure out a regexp pattern to get at that. Regexp patterns and reformatting are a whole separate topic that I will discuss later. Once the link extractor and link reformatter are done, we should have a URL that refers to the low-fluff version of the content. Load up that content in your browser.

4. Now we want to remove ads, etc, from the page. You have to 'View Source' in your browser, and look for the start and end of the content you care about. Then comes the tricky part, which is trying to find some unique delimiters that bracket this content. Once you've found these (and sometimes it isn't possible) you can create a content extraction pattern, and perhaps a content reformatter (if necessary) for getting the content out. A content reformatter is usually only useful if you need to rebalance some HTML tags in the extracted content, or in cases where the content extraction pattern is complex and extracts the content in multiple pieces ("groups") that must be reassembled.
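
To make the link handling in steps 2 and 3 a bit more concrete, here is a rough Python sketch. It is not web2book's own code and does not use web2book's pattern or reformatter syntax; the feed snippet, the example.com URLs, the /print/ URL scheme and the function names are all made up for illustration. It just shows the idea: pull the link element out of each feed item, then use a pattern with one group plus a replacement template to turn the article link into the printable one.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical feed snippet; a real feed would be fetched from its URL.
FEED = """<rss version="2.0"><channel>
<item><title>Story one</title>
<link>http://example.com/news/story-123.html</link>
<description>An excerpt... Read More</description></item>
</channel></rss>"""


def item_links(feed_xml, link_element="link"):
    """Step 2 i: return the text of the chosen link element for every item."""
    root = ET.fromstring(feed_xml)
    return [item.findtext(link_element) for item in root.iter("item")]


def to_print_url(url):
    """Step 3: link extraction pattern plus reformatter, done as one re.sub.

    The group (\\d+) captures the story number and \\1 puts it back into
    the printable URL. The /print/ scheme is invented for this example.
    """
    return re.sub(
        r"^http://example\.com/news/story-(\d+)\.html$",
        r"http://example.com/print/\1",
        url,
    )


links = item_links(FEED)
print_links = [to_print_url(u) for u in links]
print(print_links)
```

If a feed used some other element for its permalink, you would pass that name as link_element, which is roughly what the Link Element field is for.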

The regular expression helper in the tools menu is very useful for testing your regular expressions. You can do a "View Source" in your browser and paste the full HTML content of a page in the Input box, and enter your regular expression in the RegExp box, and click the Test button, to see what parts of a page your pattern will extract. You must use grouping (which is done with parentheses) to specify the content you want to keep, and if you use more than one group you will need a reformatter to specify how those groups get put back into a single piece of text. When learning to use the regexps also pay attention to "greedy" (match as much text as possible) versus non-greedy (match as little text as possible) matching, as sometimes you need one style and sometimes the other.
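
Here is a small Python sketch of the greedy versus non-greedy difference and of reassembling two groups. The page text is invented, and the \1 and \2 template is Python's notation, used here as a stand-in for a reformatter; I am not claiming it matches web2book's reformatter syntax.

```python
import re

# Invented page source, kept on one line so "." needs no special flags.
# For real multi-line pages, pass re.DOTALL so "." also matches newlines.
PAGE = ("<p>menu</p><h1>Title</h1><p>ad</p>"
        "<p>Body text</p><p>footer</p>")

# Greedy: .* runs to the LAST </p> on the page.
greedy = re.search(r"<p>(.*)</p>", PAGE).group(1)

# Non-greedy: .*? stops at the FIRST </p> it can.
lazy = re.search(r"<p>(.*?)</p>", PAGE).group(1)

# Two groups: keep the title and the body, skip the ad between them.
match = re.search(r"<h1>(.*?)</h1>.*?<p>ad</p>(.*?)<p>footer</p>", PAGE)

# The "reformatter" says how the groups are put back into one piece of text.
rebuilt = match.expand(r"<h2>\1</h2>\2")

print(greedy)
print(lazy)
print(rebuilt)
```

The greedy version swallows nearly the whole page while the non-greedy one stops at "menu", which is why it pays to test both styles against the real page source.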

If you're lucky you might find DIV HTML tags with "class" attributes that bracket the content you want. This is fairly common. Comment blocks are also commonly used to identify the article content start and end. An excellent way to master this stuff is to look at the existing published feeds, and work through them yourself, trying to understand how the existing settings make them tick. Do test them though, as websites change and some of the published entries may break, and you might go nuts trying to understand how something works when in fact it doesn't work any more!
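
As an illustration of using DIV classes or comment blocks as delimiters, here is a Python sketch. The page, the class names and the comment text are all made up, and the extract helper is just a hypothetical name for this example; a real site will have its own markers that you need to find with View Source.

```python
import re

# Invented page; real pages will use different class names and comments.
PAGE = """<html><body>
<div class="nav">Home | News</div>
<!-- article start -->
<div class="story"><p>First paragraph.</p>
<p>Second paragraph.</p></div>
<!-- article end -->
<div class="ads">Buy now</div>
</body></html>"""


def extract(pattern, page):
    """Return group 1 of the first match, or None if the delimiters are absent."""
    m = re.search(pattern, page, re.DOTALL)  # DOTALL: "." also matches newlines
    return m.group(1).strip() if m else None


# Comment blocks bracketing the article.
by_comment = extract(r"<!-- article start -->(.*?)<!-- article end -->", PAGE)

# A DIV with a distinctive class attribute.
by_div = extract(r'<div class="story">(.*?)</div>', PAGE)

# A pattern from an older page layout simply stops matching.
missing = extract(r'<div class="article-body">(.*?)</div>', PAGE)

print(by_comment)
print(by_div)
print(missing)
```

Note that the non-greedy DIV pattern stops at the first closing DIV tag, so it will cut the content short if the article itself contains nested DIVs; in that case the comment markers, or some other text after the article, make safer end delimiters. The None result is what a broken published entry looks like after a site changes its layout.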