There are a few important settings on the Advanced tab that you should consider (see Figure 7, bottom of this post).
Just like your regular PC browser, Sunrise XP can cache downloaded content, and I suggest checking that box if it isn’t already checked. The cache is the same one that Internet Explorer uses. With caching turned on, Sunrise will not download content again if you already have the identical file cached on your PC. This makes the overall process faster and saves bandwidth for the web site’s host. You control the size of this cache through Internet Explorer’s menu (or the Control Panel’s “Internet Options” menu).
Priority is something I don’t use, but basically you can prioritize Sunrise’s sequence for downloading documents. Unless you prioritize them, Sunrise will update documents in alphabetical order by document name (the first item on the “Main” tab).
I always have the “Include URL info” box checked. What this means is that if you are in a Plucker document and you want to know the specific URL of what you’re reading for future reference, Plucker will be able to display it. It’s also helpful if you want to view the URL for a website that is beyond your link depth, or has Flash or other content that Plucker cannot display. In either case, Plucker can copy the URL to the PDA’s Memo Pad, a very useful feature. Laurens’ instructions state that the “Include URL info” should not be checked if your Source document is a local file on your hard drive (which also can be processed by Sunrise for viewing on your PDA).
The “Don’t display unresolved links” checkbox is a matter of personal taste; I never check it. As noted earlier, Plucker can display unreachable links (in red) or accessible links (in blue). If you check this box, the unreachable links will not be marked as links at all and will just look like plain text, which might be less distracting for you. I like to know if there’s a link I can’t reach, because I might want to find out the URL for later viewing.
The “Link Filters” is a VERY important setting. We’ve already had the option to filter out content from different domains, etc. Here, you can designate specific URLs that you don’t want to download, or provide Sunrise with wildcards that filter URLs that have a specific pattern of characters. If you were good with your earlier “site reconnaissance”, you’ll know which links you DON’T want and which links you DO want. Link filters are processed in the order you present them. The easiest way to illustrate what the filters do is to look at the link filters that I use for all New York Times downloads (again, Figure 7).
Here’s what each filter in the image does from top to bottom:
Basically, I want to read the New York Times articles, and nothing else. This first filter limits most of my downloads to actual articles. Any page whose URL starts with “http://www.nytimes.com/20*” will be downloaded. (The asterisk is a wildcard; it stands for any possible text.) This is how almost all of the NY Times’ article URLs are set up. For example, the first article on today’s page has the URL: http://www.nytimes.com/2006/04/09/wo...cnd-nepal.html so this filter would allow this link to be downloaded. Any links that don’t follow this convention are probably not content I want. The main section pages (National, Washington, Sports, etc.) do NOT follow this URL convention, but they seemed superfluous since I primarily want to see content from the Source (front) page, so I intentionally had the filter work as it does. (See Figure 8).
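If it helps to see the wildcard idea spelled out, here is a minimal Python sketch of how an asterisk pattern can be tested against a URL. It is only an illustration, not Sunrise's actual code: the function names are made up, the two sample URLs are hypothetical, and it assumes the asterisk is the only special character and can stand for any run of text (including nothing).

```python
import re

def wildcard_to_regex(pattern):
    """Turn an asterisk wildcard into a compiled regular expression.

    Assumption: '*' matches any run of characters (including none) and
    every other character in the pattern must match literally.
    """
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(literal_parts), re.DOTALL)

def matches(pattern, url):
    # fullmatch: the pattern has to account for the whole URL
    return wildcard_to_regex(pattern).fullmatch(url) is not None

ARTICLE_FILTER = "http://www.nytimes.com/20*"

# Hypothetical URLs, made up for illustration only
article_url = "http://www.nytimes.com/2006/04/09/example-article.html"
section_url = "http://www.nytimes.com/pages/national/index.html"

print(matches(ARTICLE_FILTER, article_url))  # True
print(matches(ARTICLE_FILTER, section_url))  # False
```

The dated article URL passes because it begins with the fixed text before the asterisk; the section page does not, which is the behavior described above.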
The next filter is probably not necessary, but I wanted to be sure that all images come through, so by using “*com/images”, any URLs that end with those characters will be included.
I noticed after using Sunrise for a while that none of the travel articles were ever downloaded. At one time, the New York Times had a different convention for assigning URLs in the travel section, though this no longer seems to be the case. The http://travel2* wildcard enabled me to get all the travel articles.
Many articles in the New York Times website are stretched over multiple pages, and you are provided a link for a “single page version.” You also are often provided a “printer friendly version”. Both of these alternate versions are superfluous, since they have the same content as the primary pages, so I wanted to not download those versions. The URLs of all printer-friendly versions of articles end with the text “pagewanted=print”, so I had the filter ignore those URLs. Similarly, “pagewanted=all” gives you the single-page duplicate, so I set up the wildcard to filter out those URLs as well. I’ll leave it to you to figure out what the filters */fashion/* and *privacy* filter out.
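To make the include/exclude idea concrete, here is a rough Python sketch that runs a few URLs through the patterns mentioned above. Several things in it are assumptions rather than facts about Sunrise: the rule "keep a link if it matches any include pattern and no exclude pattern" is a simplification (Sunrise processes filters in order, and its exact logic may differ), the spelling of the two “pagewanted” wildcards is guessed from the description, and the sample URLs are hypothetical.

```python
import re

def matches(pattern, url):
    """Asterisk-only wildcard match: '*' is any text, the rest is literal."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, url, re.DOTALL) is not None

# Patterns named in the text. The exact form of the two "pagewanted"
# wildcards is an assumption based on "URLs ... end with".
INCLUDE = ["http://www.nytimes.com/20*", "*com/images", "http://travel2*"]
EXCLUDE = ["*pagewanted=print", "*pagewanted=all", "*/fashion/*", "*privacy*"]

def should_download(url):
    # Assumed rule, not Sunrise's documented logic:
    # keep the link if any include pattern matches and no exclude pattern does.
    included = any(matches(p, url) for p in INCLUDE)
    excluded = any(matches(p, url) for p in EXCLUDE)
    return included and not excluded

# Hypothetical URLs, made up for illustration only
samples = [
    "http://www.nytimes.com/2006/04/09/example.html",
    "http://www.nytimes.com/2006/04/09/example.html?pagewanted=print",
    "http://www.nytimes.com/2006/04/09/example.html?pagewanted=all",
    "http://www.nytimes.com/pages/national/index.html",
]

for url in samples:
    print(should_download(url), url)
```

Under these assumptions only the first sample is kept: the printer-friendly and single-page duplicates are excluded, and the section page never matches an include pattern in the first place.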
You’ll have to do a little trial-and-error if you have a lot of things you’ll want to filter in or filter out, but once you’re set up (as long as the web site doesn’t change its conventions for URLs), it works great. To create a filter, hit the “new” button and you’ll get the “link filter” box, which gives you various choices. For the pattern, you can put either a “regular expression” (a more general pattern-matching syntax, though a specific URL can go here too), or a Wildcard, which uses asterisks as I did above to represent any text. I never change the “Filter all Links” drop-down, but you could have it filter specific HTML tags. (Don’t worry, I barely know what that means myself…) You then need to decide whether you want to only include or only exclude URLs following your wildcard designation. I’ve never used the “rewrite links matching this pattern” setting; perhaps Laurens can help explain that one. Update: DTM has helpfully included an explanation about how link re-writing works; it's in his post further down below some of the comments here...
At this point, if you have different documents that you still want to add to the SXL, go back to Step 6 and add new documents, filling in all the needed data from the four tabs.
Once you have all your documents entered into the SXL, you’re going to want to save the SXL somewhere on your PC. I keep all my SXLs in a folder called …/My Documents/Plucker.