MobileRead Forums - View Single Post

geekraver · 10-08-2006, 03:34 AM

Hi all

Here's a program to make HTML, RTF, LRF or PDF files from RSS feeds and other websites (PDF output supports rich formatting if you have htmldoc installed). You need .NET Framework 2.0 or later installed to run it. PDF output uses the ISO-8859-15 character set, so some European languages are supported.

The program can write the output files on your PC or sync them directly to the Sony Reader over USB.

Just go to Tools-Options and make sure the options are set the way you want them, add a bunch of RSS feeds to the datagrid on the main window, and hit Go!

If you want to use feeds that others have already set up, open the File menu, and select Subscribe. You'll be shown the set of available published feeds. Click the checkboxes next to the ones you want and click the Subscribe button, and they'll be added to your setup. Note that you can subscribe separately to webpage entries from the webpages tab.

Attached are three screenshots; if all you want to do is look at RSS content then the last screenshot covers most of what you'll deal with (once you've checked the options in the first screenshot). The complex-looking dialog in the middle is for extracting full HTML from RSS feeds that only include summaries; with some tweaking you can get the app to fetch full content with ads and other noise stripped.

Non-geeks can stop reading here and should just try the app out using the Subscribe facility in the File menu. Hardcore geeks who understand regular expressions, read on for details of how to add new feeds that no-one has published yet.

web2book supports a fairly powerful extension mechanism. Selecting a feed entry and clicking the Customize button brings up the advanced settings. Once in this property view you can also use the Test button to test your configuration for that feed; if all is well it will eventually open your PDF reader with the output for that site. A fairly detailed log is also generated to help with troubleshooting. Once you are satisfied with the results for the entry you created, you can share it with others by clicking the Publish button.

The properties are mostly to support getting full versions of articles, possibly via modified links that point to lower noise printable versions, and extracting a subset of the article HTML (to skip ads, etc).

The various properties for Feeds are:

Url - pretty obvious; this is the RSS feed URL.

Enabled - whether to include this feed when you click on Go! from the main view.

Days - how many days back to go when using RSS entries.

Content Element - in most cases you can leave this blank; if specified (and if the Link Element field described below is blank) then the body of the element with this name will be used for the article text. If blank then web2book will look for any of 'description', 'summary' or 'content'.
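
For the curious, here is a rough Python sketch of that lookup. It is not web2book's actual code (the app is .NET); the sample item, the function name article_text and the order in which the three default element names are tried are all my assumptions.

```python
import xml.etree.ElementTree as ET

# Made-up RSS item, for illustration only.
ITEM_XML = """<item>
  <title>Example</title>
  <link>http://example.com/story/123</link>
  <description>Short summary text.</description>
</item>"""

# Element names tried when Content Element is blank (order is an assumption).
DEFAULT_ELEMENTS = ("description", "summary", "content")

def article_text(item, content_element=""):
    """Return the text of the named element, or of the first default found."""
    names = (content_element,) if content_element else DEFAULT_ELEMENTS
    for name in names:
        node = item.find(name)
        if node is not None and node.text:
            return node.text
    return None  # nothing usable in this item

item = ET.fromstring(ITEM_XML)
print(article_text(item))           # Short summary text.
print(article_text(item, "title"))  # Example
```

With Content Element blank the sketch falls back to 'description'; naming an element explicitly overrides that.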

Link Element - the element in the RSS feed that specifies the link to the full article. Leave this blank unless you actually want the full article; when you do, it will typically be either 'link' or 'guid' for most RSS feeds.

Link Extractor Pattern - this is an optional regular expression that will be applied to the link element to parse it into a collection of one or more substrings. You need to use unnamed groups (i.e. bits of regular expression pattern enclosed in parentheses) to identify the various substrings. If you leave this blank the original link will be used to create a single-element collection. Two simple examples:

(\d+) - will extract the first sequence of numbers found in the link element

http://(.*) - will strip off the leading http:// from the link element
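
To make the "collection of substrings" idea concrete, here is a small Python sketch using the two patterns above. Python's re module stands in for the .NET regex engine the app uses, the link is made up, and returning an empty list when nothing matches is my assumption (the post does not say what web2book does in that case).

```python
import re

def extract_link_parts(link, pattern=""):
    """Return the substrings captured by the unnamed groups in pattern.

    A blank pattern gives a single-element collection holding the link.
    """
    if not pattern:
        return [link]
    match = re.search(pattern, link)
    if match is None:
        return []  # assumption: behaviour on no match is not documented
    return list(match.groups())

link = "http://example.com/id/2151234/story"  # hypothetical link

print(extract_link_parts(link, r"(\d+)"))        # ['2151234']
print(extract_link_parts(link, r"http://(.*)"))  # ['example.com/id/2151234/story']
print(extract_link_parts(link))                  # ['http://example.com/id/2151234/story']
```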

Apply extractor to linked content instead of link text - if this is checked, then the extractor pattern above is not applied to the link; instead, we follow the link and retrieve the web page at that link, then apply the extractor pattern to the contents of that page. This is useful, for example, to extract 'printable version' URLs from article pages if there is no simple textual mapping from an article URL to the corresponding 'printable version' URL, but the 'printable version' URL is contained in the article page (tip: for web pages that have printable versions, the printable version is preferable).
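
A sketch of what that option amounts to, in Python. The page HTML, the pattern and the function name are invented for illustration; a real run would download the page at the feed's link first, which I leave out here.

```python
import re

# Made-up article page; the real one would be fetched from the link element.
PAGE_HTML = '''<html><body>
<a href="/print/story-42.html">Printable version</a>
<p>Article text surrounded by ads...</p>
</body></html>'''

def extract_from_page(page_html, pattern):
    """Apply the extractor pattern to the page contents, not the link text."""
    match = re.search(pattern, page_html)
    return list(match.groups()) if match else []

parts = extract_from_page(PAGE_HTML, r'<a href="([^"]+)">Printable version</a>')
print(parts)  # ['/print/story-42.html']
```

The captured path would then go through the link formatter like any other extracted substring.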

Link Formatter - this is a format string that gets used to create a new link from the collection created above by the link extractor. It consists of a string with parameters {0}, {1}, {2}, etc, which are expanded to the various substrings in the collection. If you leave it blank that is equivalent to "{0}" - i.e. just use the first substring.

Content Extraction Pattern - this is a regular expression that is applied to the article content HTML from the previous step. It should have a single unnamed group; the text that matches that group is used as the final article content HTML. If left blank then the full article content from the link processing step is used.

Content Reformatter - this is similar to the link formatter. It can be used to wrap or insert some additional HTML around the content extracted by the pattern in the last step. If left blank it has no effect. Once again positional parameters {0}, ... are used to identify the matched groups from the content extraction step.
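
Here is how the last two steps (content extraction, then reformatting) might fit together, again as a Python sketch rather than the app's real code. The sample HTML and pattern are made up, and two choices are my assumptions: letting '.' match newlines (re.DOTALL) and returning the page unchanged when the pattern does not match.

```python
import re

def extract_content(html, pattern="", reformatter=""):
    """Keep only the text matched by the group, then wrap it if asked to."""
    if not pattern:
        return html  # blank pattern: use the full article content
    match = re.search(pattern, html, re.DOTALL)  # DOTALL is an assumption
    if match is None:
        return html  # assumption: fall back to the unfiltered page
    template = reformatter or "{0}"  # blank reformatter has no effect
    return template.format(*match.groups())

html = ('<div class="ad">Buy now</div>'
        '<div id="story"><p>Body</p></div>'
        '<div class="footer">x</div>')

print(extract_content(html, r'<div id="story">(.*?)</div>', "<h1>Story</h1>{0}"))
# <h1>Story</h1><p>Body</p>
```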

The Tools menu has a regular expression tester that you may find helpful when doing advanced feed setups.

Okay, this probably sounds more complicated than it is, so here are some examples:

Name: BBC News
URL: http://newsrss.bbc.co.uk/rss/newsonl...t_page/rss.xml
Link Element: guid
Link Extractor Pattern: http://(.*)
Link Reformatter: http://newsvote.bbc.co.uk/mpapps/pagetools/print/{0}
Content Extraction Pattern:

i.e. get the RSS feed from the URL, pull out the links in the 'guid' elements, strip off the 'http://' part, prepend 'http://newsvote.bbc.co.uk/mpapps/pagetools/print/', then get the HTML at that link.
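
Worked through in Python for a single entry, the BBC link rewriting looks roughly like this. The guid value is a made-up story URL, not a real article, and Python's re and str.format are stand-ins for what the app does internally.

```python
import re

guid = "http://news.bbc.co.uk/1/hi/world/1234567.stm"  # hypothetical guid

# Link Extractor Pattern: capture everything after the leading http://
parts = re.search(r"http://(.*)", guid).groups()

# Link Reformatter: prepend the print-page prefix to the captured text
print_url = "http://newsvote.bbc.co.uk/mpapps/pagetools/print/{0}".format(*parts)

print(print_url)
# http://newsvote.bbc.co.uk/mpapps/pagetools/print/news.bbc.co.uk/1/hi/world/1234567.stm
```

Since the Content Extraction Pattern is blank, the whole page at that URL would be used as the article.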

Name: Slate
URL: http://www.slate.com/rss/
Link Element: link
Link Extractor Pattern: (\d+)
Link Reformatter: http://www.slate.com/toolbar.aspx?action=read&id={0}
Content Extraction Pattern: (\<font.*)Article URL

I.e. get the RSS from http://www.slate.com/rss/, pull out each 'link' element, extract the sequence of digits from such an element and append it to 'http://www.slate.com/toolbar.aspx?action=read&id=', fetch the HTML at that URL, then extract everything starting from the first '<font>' tag up to but not including the text 'Article URL'.
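
The same Slate settings traced in Python, including the content extraction step. The link and the page HTML are invented samples (the real page would be fetched from the rewritten URL), and Python's re is only a stand-in for the .NET regex engine.

```python
import re

link = "http://www.slate.com/id/2151234/"  # hypothetical 'link' element

# Link Extractor Pattern + Link Reformatter
article_id = re.search(r"(\d+)", link).group(1)
url = "http://www.slate.com/toolbar.aspx?action=read&id={0}".format(article_id)
print(url)
# http://www.slate.com/toolbar.aspx?action=read&id=2151234

# Made-up stand-in for the HTML that would be downloaded from url.
page = ('<html><body><div>nav</div><font size="2">Story text.</font>'
        '<p>Article URL: http://example.com/</p></body></html>')

# Content Extraction Pattern: from the first <font up to 'Article URL'
content = re.search(r"(\<font.*)Article URL", page).group(1)
print(content)
# <font size="2">Story text.</font><p>
```

Note the extracted HTML can end with an unclosed tag, as it does here.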

Name: Reuters Top News
URL: http://feeds.reuters.com/reuters/topNews/
Link Element: guid
Link Reformatter: http://today.reuters.com/misc/PrinterFriendlyPopup.aspx?type=topNews&storyID={0}
Content Extraction Pattern: (<span class=\"artTitle.*)</td>

i.e. get the RSS at http://feeds.reuters.com/reuters/topNews/, pull out each guid element, append the guid to 'http://today.reuters.com/misc/PrinterFriendlyPopup.aspx?type=topNews&storyID=', get the HTML at that URL, then keep everything from the '<span class="artTitle' tag up to but not including a closing '</td>'.

If you want to put Wikipedia articles on your reader, use something like:

URL: http://en.wikipedia.org/wiki/Nikola_Tesla
HTML: checked
Content Extraction Pattern: <!-- start content -->(.*)<!-- end content -->

Website entries support some metacharacters in the URL for dates, namely @yyyy, @yy, @mm and @dd. These are expanded to the year, month or day (either 4 or 2 digits for the year; two digits for the others). If you specify a Number Of Days entry, then the URL will be expanded for each day in range and the contents for each day will be concatenated, starting with the oldest, and ending with the current day. For example, the following will get one week of Dilbert comic strips:

Url: http://www.unitedmedia.com/comics/di...yyy@mm@dd.html
Number Of Days: 6
Content Extractor Pattern: (<IMG SRC="/comics/dilbert/archive/images/dilbert[^>]*>)
Content Reformatter: {0}<br>
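
The date expansion can be sketched in Python as follows. The example.com template and function names are hypothetical, and I am assuming that Number Of Days: 6 means six days back plus today (seven URLs), which matches the "one week" description.

```python
from datetime import date, timedelta

def expand_date_url(template, day):
    """Expand @yyyy, @yy, @mm and @dd for one day."""
    # Replace @yyyy before @yy so the longer token is not half-consumed.
    return (template
            .replace("@yyyy", f"{day.year:04d}")
            .replace("@yy", f"{day.year % 100:02d}")
            .replace("@mm", f"{day.month:02d}")
            .replace("@dd", f"{day.day:02d}"))

def urls_for_range(template, number_of_days, today):
    """One URL per day, oldest first, ending with the current day."""
    return [expand_date_url(template, today - timedelta(days=n))
            for n in range(number_of_days, -1, -1)]

urls = urls_for_range("http://example.com/strips/@yyyy@mm@dd.html",
                      6, date(2006, 10, 8))
print(len(urls))  # 7
print(urls[0])    # http://example.com/strips/20061002.html
print(urls[-1])   # http://example.com/strips/20061008.html
```

Each page would then go through the content extractor and reformatter, and the results would be concatenated in that order.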

Last edited by geekraver; 04-16-2007 at 11:46 AM.