Thread: Rss2Book
Old 11-04-2006, 02:59 PM   #37
geekraver
You need to filter the article content, which is done by the 'Content Extraction Pattern'. This will work:

(<div id="post.*)<div class="postMetaData">

Alternatively, import the attached xml file.

The stuff that gets included is the stuff in parentheses, so this pattern says: include everything starting from the first occurrence of '<div id="post' up to but not including the last occurrence of '<div class="postMetaData">'.
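To make the "stuff in parentheses" point concrete, here is a rough sketch in Python. Python's re module is not what Rss2Book itself uses, the sample HTML is made up, and the re.DOTALL flag (so that '.' also crosses line breaks) is my assumption, not a documented Rss2Book setting. It just shows the difference between the text the whole pattern matches and the text the parenthesized part keeps.

```python
import re

# Hypothetical page source, invented purely for illustration.
html = (
    '<body><h1>Site header</h1>'
    '<div id="post-42"><p>Article text.</p>'
    '<div class="postMetaData">Posted by someone</div>'
    '</div></body>'
)

# Same pattern as in the post. re.DOTALL is an assumption so that
# '.' can also match newlines in multi-line page source.
pattern = re.compile(r'(<div id="post.*)<div class="postMetaData">', re.DOTALL)
match = pattern.search(html)

whole_match = match.group(0)  # everything the full pattern matched
extracted = match.group(1)    # only the parenthesized part

print(whole_match)
print(extracted)
```

The whole match still ends with the '<div class="postMetaData">' tag, while the parenthesized part stops just before it, and the site header before the post is in neither.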

The .* matches any text of zero or more characters. The match is 'greedy', i.e. as much text as possible gets matched, which is why we end with the LAST occurrence of '<div class="postMetaData">'. (We start with the FIRST occurrence of '<div id="post' simply because a regular expression takes the earliest match it can find.) There's probably only one occurrence of each anyway, but it's worth mentioning the greedy aspect as it can cause confusion.
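If the greedy behaviour is hard to picture, this small Python sketch may help. Again, this is Python's re module and a made-up page with two postMetaData blocks, not Rss2Book's own engine or a real feed. The '.*?' (non-greedy) variant is shown only for contrast; I'm not claiming it is what you need here.

```python
import re

# Hypothetical page with TWO postMetaData blocks, invented for illustration.
html = (
    '<div id="post-1"><p>Article text.</p>'
    '<div class="postMetaData">Posted by A</div>'
    '<p>More text.</p>'
    '<div class="postMetaData">Filed under B</div>'
    '</div>'
)

# Greedy: .* grabs as much as possible, so the match runs up to the LAST
# '<div class="postMetaData">'. re.DOTALL is an assumption (see lead-in).
greedy = re.compile(r'(<div id="post.*)<div class="postMetaData">', re.DOTALL)

# Non-greedy: .*? grabs as little as possible, so it stops at the FIRST one.
lazy = re.compile(r'(<div id="post.*?)<div class="postMetaData">', re.DOTALL)

greedy_content = greedy.search(html).group(1)
lazy_content = lazy.search(html).group(1)

print(greedy_content)
print(lazy_content)
```

With the greedy pattern the first metadata block ("Posted by A") ends up inside the extracted content, which is exactly the kind of surprise that causes confusion when a page has more than one occurrence.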

When experimenting with the patterns, use the RegExp Helper under the Tools menu. You can paste the web page HTML source into the Input box, then enter different patterns in the RegExp textbox. Click on Test and you will be shown the text that matches the whole pattern and the text that matches the parenthesized part of the pattern (i.e. the ultimately important stuff).
Attached Files
File Type: xml Damn Interesting.xml (403 Bytes, 1210 views)

Last edited by geekraver; 11-04-2006 at 03:04 PM.