Quote:
Originally Posted by fritz_the_blank
Are you wanting *just* the week's cover story? Here's what I came up with for this entry (please pardon any typos, since the Parallels clipboard isn't wanting to work tonight... but I did publish this particular feed, so there is a known working version):
Code:
Link: http://feeds.newsweek.com/CoverStory
Link Element: origLink
Link Extractor Pattern: id/(\d+)/site
Link Reformatter: http://www.msnbc.msn.com/id/{0}/site/newsweek/print/1/displaymode/1098/
Content Extraction Pattern: (<div class="caption">.*)
The process I go through to get all this stuff (I may break this into a few messages):
- I enter the RSS feed link (I try to get RSS 2.0 links, since some Atom date formats aren't completely supported by web2book). I set the days to "0" and I select test. If the full content of the articles is in the feed and everything is good, you don't have to do anything other than select the number of days you want, name the entry, and select the "enabled" box. If you are just getting a small snippet and want additional content, you need to fill in the "Link Element" so that web2book knows what link to follow.
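If you want to sanity-check a feed outside web2book first, here's a rough Python sketch (my own illustration, not anything web2book does) that fetches the feed and shows roughly how much content each item's description carries:
Code:
# Minimal sketch for checking whether a feed carries full article text
# or only snippets. The feed URL is the Newsweek one from above;
# everything else is just illustration.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "http://feeds.newsweek.com/CoverStory"

with urllib.request.urlopen(FEED_URL) as resp:
    root = ET.fromstring(resp.read())

# RSS 2.0 puts items under channel/item; each item's <description>
# holds whatever content the publisher chose to include.
for item in root.iter("item"):
    title = item.findtext("title", default="(no title)")
    desc = item.findtext("description", default="")
    # A very short description usually means a snippet, so you will
    # need the "Link Element" steps described below.
    print(f"{title}: {len(desc)} characters of description")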
- Since you have to find the right link for web2book to follow, view the source of the feed. I do this by typing the URL of the feed into Firefox, right-clicking on the loaded page, and selecting "view source". I then look for which tag in the page source holds the "real" link to the story (not a link that goes through FeedBurner or some in-between website). In this case the source was really funky and tough to read, but the origLink tag had the real link... and presto, that's the "Link Element".
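If wading through raw feed source in a browser gets painful, a little script can do the same hunt. Again, just a sketch of the idea, not a web2book feature:
Code:
# List the element names inside the first feed item so you can spot
# which one holds the "real" link (origLink in this case). Namespaced
# tags, such as FeedBurner's origLink, show up with a {uri} prefix.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "http://feeds.newsweek.com/CoverStory"

with urllib.request.urlopen(FEED_URL) as resp:
    root = ET.fromstring(resp.read())

first_item = next(root.iter("item"), None)
if first_item is not None:
    for child in first_item:
        text = (child.text or "").strip()
        print(child.tag, "->", text[:80])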
- The next step is to run the test again. The output will probably be weird, but if you have the correct link element, the log should show web2book following the link and then converting the raw HTML it got into PDF.
- Assuming that web2book grabbed the article page that you wanted, you just have to figure out the "content extraction pattern" that will pull out the content without all the ads. Finding the correct regular expression is a bit of an art. I would recommend using the regular expression helper in web2book's tools menu to test and experiment until you find the right content extraction pattern. Paste the source of the page that web2book grabbed the HTML from in the earlier steps into the input field, type your regular expression into the RegExp field, and click test. The "Group" field will be the HTML that would be sent on to be turned into PDF. A good guide that I refer to for building regular expressions is
http://www.regular-expressions.info/tutorial.html. This is *definitely* an art form, and you might want to search the net for other, more complete tools to assist in building regular expressions. I know that I put in about a full week's worth of time to spin myself back up on complex regexes.
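To get a feel for it outside the built-in helper, here's a quick Python harness along the same lines. It's a sketch: I'm assuming the "dot matches newline" behaviour that the helper appears to use, and saved_article.html is just a file you save by hand from the page web2book grabbed:
Code:
# Rough stand-in for the web2book regex helper: apply a
# "Content Extraction Pattern" to saved page source and show the
# captured group. re.DOTALL makes "." match newlines, which I'm
# assuming matches how web2book applies the pattern.
import re

page_source = open("saved_article.html", encoding="utf-8").read()

pattern = r'(<div class="caption">.*)'  # the Newsweek pattern from above

match = re.search(pattern, page_source, re.DOTALL)
if match:
    # Group 1 is what would be handed on to the PDF conversion.
    print(match.group(1)[:500])
else:
    print("No match - check your backslashes and parentheses.")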
***Tip: Test your regular expressions before even trying them in web2book. web2book just takes the regular expressions and applies them to the HTML, so even if you *think* you have it right (which I did many, many times when I didn't have it right), you are probably missing a backslash or a parenthesis somewhere.
***Tip: If web2book doesn't actually generate a PDF during a test, take a look at the log. If the extracted link and the reformatted link both look good, then there is an error in your "content extraction pattern" regular expression. If you don't see a correct extracted or reformatted link, then there is an error in your "link element", your "link extractor pattern" regular expression, or your "link reformatter".
- If there is a "print me" link on the page and you want to use that page as your content source instead of the page at the destination of the "link element", then things get a little more complicated. You will have to figure out whether you can jump to the print page by grabbing the article ID from the "link element" URL, or whether you have to look in the destination of the "link element" for the URL of the print page. In this example we can grab the article ID directly out of the link element URL using another regular expression ("id/(\d+)/site") and paste it into the middle of a fairly static URL for printing ("http://www.msnbc.msn.com/id/{0}/site/newsweek/print/1/displaymode/1098/").
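In other words, the extractor's capture group fills the {0} slot in the reformatter. A tiny sketch of that mechanism (the article ID in the example URL is made up):
Code:
# Sketch of what the "Link Extractor Pattern" + "Link Reformatter"
# pair does for this Newsweek entry: pull the article ID out of the
# origLink URL and drop it into the print-page URL template.
import re

orig_link = "http://www.msnbc.msn.com/id/12345678/site/newsweek/"  # made-up example

extractor = r"id/(\d+)/site"
reformatter = "http://www.msnbc.msn.com/id/{0}/site/newsweek/print/1/displaymode/1098/"

m = re.search(extractor, orig_link)
if m:
    print_url = reformatter.format(m.group(1))
    print(print_url)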
If Newsweek weren't so nice and were complicated like Time, you would tick the box "Apply extractor to linked content instead of link text" and you would have to write *another* regular expression, applied to the *content* of the *destination* of the "Link Element", to find the link to the printable version of the page. Take a look at the published Time feeds for a good example of having to go all the way down the rabbit hole to get to the printable versions of the page.
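Here's a rough sketch of that two-hop case: fetch the page the Link Element points at, then run a second regex over its HTML to find the printable-version URL. The URL and the pattern below are made up for illustration, not the ones from the published Time feeds:
Code:
# Sketch of the "Apply extractor to linked content" case: fetch the
# destination of the Link Element, then look for a "print me" link
# inside that page's HTML with a second regular expression.
import re
import urllib.request

article_url = "http://www.example.com/some/article"  # placeholder destination URL

with urllib.request.urlopen(article_url) as resp:
    html = resp.read().decode("utf-8", errors="replace")

print_link_pattern = r'href="([^"]*print[^"]*)"'  # hypothetical pattern for a print link
m = re.search(print_link_pattern, html)
if m:
    print("Printable version:", m.group(1))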
Some sites just plain won't let an automated "scraper" program like web2book grab the printable versions of their pages. They may "lie" and tell you they're going to the printable version of the page and not actually go there. It's tough to debug and will require a bit of intuition.
- Once you get the URL for the printable page, you still need a "Content Extraction Pattern" to apply to the printable page; make sure that your pattern excludes the "<title>" tag, or else you will have a funky title in the finished PDF.
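Just to make that gotcha concrete, here's a throwaway illustration (made-up HTML stub, nothing to do with Newsweek's markup) of why a capture that starts too high up drags the <title> along:
Code:
# A pattern that captures from the top of the page pulls the <title>
# tag into the PDF; one anchored at the article body does not.
import re

html = ('<html><head><title>Print: Some Story</title></head>'
        '<body><div class="article">Story text...</div></body></html>')

too_greedy = re.search(r'(<html>.*)', html, re.DOTALL)
just_right = re.search(r'(<div class="article">.*?</div>)', html, re.DOTALL)

print("title" in too_greedy.group(1))   # True - the funky title sneaks in
print("title" in just_right.group(1))   # False - only the article content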
So, that's it for the moment; time for bed tonight, but hopefully this helps a little in getting a good page. I've published a lot of examples, so subscribe to a few feeds using the File|Subscribe command and take a look.
Good luck, and good hunting!