MobileRead Forums - View Single Post

toomanybarts · 09-11-2007, 06:23 PM

If someone can help me understand how I would pull content from the following website (using the "Web Page" tab of rss2book), it will go a long way towards my understanding not only how this program works, but also the regex expressions required to get at the content (and only the content) we are all using this program for:
"http://www.timesonline.co.uk/tol/comment/columnists/jeremy_clarkson/"

There are a number of links on the page that reference the various blog entries I want to pull, but when I change the rss2book "followlinks" setting to depth 2 (or more), I get this error:
"Processing clarkson

System.UriFormatException: Invalid URI: The URI scheme is not valid.
at System.Uri.CreateThis(String uri, Boolean dontEscape, UriKind uriKind)
at System.Uri..ctor(String uriString)
at web2book.Utils.ExtractContent(String contentExtractor, String contentFormatter, String url, String html, String linkProcessor, Int32 depth, StringBuilder log)
at web2book.Utils.GetContent(String link, String html, String linkProcessor, String contentExtractor, String contentFormatter, Int32 depth, StringBuilder log)
at web2book.Utils.GetHtml(String url, Int32 numberOfDays, String linkProcessor, String contentExtractor, String contentFormatter, Int32 depth, StringBuilder log)
at web2book.WebPage.GetHtml(ISource mySourceGroup, Int32 displayWidth, Int32 displayHeight, Int32 displayDepth, StringBuilder log)
at web2book.MainForm.AddSource(ContentSourceList sourceClass, ContentSource source, Boolean isAutoUpdate)"
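
For anyone following along: the trace shows the exception coming out of the System.Uri constructor inside ExtractContent, so one plausible cause (an assumption on my part, I have not seen the rss2book source) is that a link found on the page is not an absolute http URL, for example a relative href such as "/tol/js/tol.js" or a "javascript:" link. Below is a minimal Python sketch of how a link follower could resolve hrefs against the page URL and skip anything that is not http(s). The function name resolve_links and the sample hrefs "article123.ece" and "javascript:void(0)" are hypothetical, purely for illustration; this is not how rss2book itself does it.

```python
from urllib.parse import urljoin, urlparse


def resolve_links(base_url, hrefs):
    """Resolve raw href values against base_url, keeping only http(s) links."""
    resolved = []
    for href in hrefs:
        # urljoin turns relative hrefs into absolute URLs and leaves
        # already-absolute ones (including javascript:) unchanged.
        absolute = urljoin(base_url, href.strip())
        if urlparse(absolute).scheme in ("http", "https"):
            resolved.append(absolute)
    return resolved


base = "http://www.timesonline.co.uk/tol/comment/columnists/jeremy_clarkson/"
# Sample hrefs: the first appears in the page dump, the others are made up.
hrefs = ["/tol/js/tol.js", "article123.ece", "javascript:void(0)"]
links = resolve_links(base, hrefs)
for link in links:
    print(link)
```

If the real cause is something else, the same idea still applies: check each link before handing it to whatever constructs the URI.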

If I leave it set at 1, I get:
"Processing clarkson

Final content:
===================

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8" /><meta name="ROBOTS" content="NOARCHIVE" /><script type="text/javascript">
// Variables required for DART. MUST BE IN THE HEAD.
var time = new Date();
randnum = (time.getTime());
</script><title> Jeremy Clarkson Columns & Comment | Times Online </title><meta name="Description" content="The UKs favourite motoring journalist comments on British society and culture in his weekly columns on Times Online"><link rel="shortcut icon" type="image/x-icon" href="/tol//img/favicon.ico" type="image/x-icon" /><link rel="stylesheet" type="text/css" href="/tol/css/alternate.css" title="Alternate Style Sheet" /><link rel="stylesheet" type="text/css" href="/tol/css/tol.css"/>
<link rel="stylesheet" type="text/css" href="/tol/css/ie.css"/><link rel="stylesheet" type="text/css" href="/tol/css/typography.css"/><script language="javascript" type="text/javascript" src="/tol/js/tol.js"></script></head><body><div id="top"/><div id="shell"><div id="page"><script language="javascript" type="text/javascript" src="/tol/js/DM_client.js"></script><script language="javascript" type="text/javascript">
DM_addToLoc("Network",escape("Times"));
DM_addToLoc("SiteName",escape("Times Online"));
</script><script language="javascript" type="text/javascript">
// Index page for Revenue sciences"

...there's loads more; this is just part of the content. The point is, I thought that changing the "Follow links to Depth" setting to 2 would grab not only the page referred to in the URL, but also the pages linked from that URL's page.
I would then need to work out what regex would be needed to tidy up the resulting mass of content. (That would be problem / lesson 2, but one thing at a time!)
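
To make the tidy-up step concrete, here is a rough Python sketch of the kind of regex I have in mind: one pattern that removes script blocks and one that pulls out the page title. It only shows the pattern style. rss2book is a .NET program, so I am assuming its regex fields accept similar syntax, and I do not know exactly how its content extractor applies a pattern. The names strip_scripts and extract_title are my own, and the sample string is a shortened piece of the dump above.

```python
import re

# Non-greedy match so each <script>...</script> pair is removed separately.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script>", re.IGNORECASE | re.DOTALL)
# Capture the title text without the surrounding whitespace.
TITLE_RE = re.compile(r"<title>\s*(.*?)\s*</title>", re.IGNORECASE | re.DOTALL)


def strip_scripts(html):
    """Return html with every script element removed."""
    return SCRIPT_RE.sub("", html)


def extract_title(html):
    """Return the text of the first title element, or None if there is none."""
    match = TITLE_RE.search(html)
    return match.group(1) if match else None


sample = (
    '<head><script type="text/javascript">\n'
    "var time = new Date();\n"
    "</script>"
    "<title> Jeremy Clarkson Columns & Comment | Times Online </title></head>"
)
cleaned = strip_scripts(sample)
title = extract_title(cleaned)
print(cleaned)
print(title)
```

Regex on HTML is fragile in general, so a pattern like this would need checking against the real article pages.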

Am I missing something?
(I realise there is an RSS feed page where I can pull the current top 4 or 5 blog entries, and adinb has helped me clean this up to be readable. What I want to understand is how to manipulate web pages.)

(Thank you again to adinb, who has been helping me with this problem using the RSS feed and the "Feed" tab of rss2book, via PM. It's people like him that keep these types of forums useful. I thought it may be useful for others to understand how it all works, and to lighten the load on adinb!)

Thank you all in advance.

09-11-2007, 06:23 PM	#219