MobileRead Forums

geekraver · 11-04-2006, 02:59 PM

You need to filter the article content, which is done by the 'Content Extraction Pattern'. This will work:

(<div id="post.*)<div class="postMetaData">

Alternatively, import the attached XML file.

The stuff that gets included is the stuff in parentheses, so this pattern says: include everything starting from the first occurrence of '<div id="post' up to, but not including, the last occurrence of '<div class="postMetaData">'.
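
If you want to see the same idea outside the tool, here is a rough sketch using Python's re module as a stand-in for the tool's regex engine. The HTML fragment is made up for illustration, and the re.DOTALL flag is an assumption on my part so that '.' also matches line breaks; the tool may or may not behave that way by default.

```python
import re

# Hypothetical page fragment; the real page HTML will differ.
html = (
    '<html><body><div id="header">Site</div>'
    '<div id="post-1"><h2>Title</h2><p>Body text.</p>'
    '<div class="postMetaData">Posted by someone</div></div>'
    '</body></html>'
)

# Same pattern as above. re.DOTALL (assumed) lets '.' cross line breaks.
pattern = re.compile(r'(<div id="post.*)<div class="postMetaData">', re.DOTALL)

m = pattern.search(html)
whole_match = m.group(0)  # everything the pattern matched
extracted = m.group(1)    # only the parenthesized part, i.e. what gets included

print(extracted)
```

Here group(1) holds just the post content, while group(0) also contains the trailing '<div class="postMetaData">' tag that the pattern uses as its end marker.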

The .* matches any text of zero or more characters. The match is 'greedy', i.e. as much text as possible gets matched, which is why we start with the FIRST occurrence of '<div id="post' and end with the LAST occurrence of '<div class="postMetaData">'. There's probably only one occurrence of each anyway, but it's worth mentioning the greedy aspect as it can cause confusion.
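
To make the greedy behaviour concrete, here is a small sketch, again in Python rather than the tool itself. The sample text is invented and deliberately has two posts so the difference shows up. The '.*?' (non-greedy) form is standard in Python's re module; whether the tool's engine accepts it is an assumption you would need to check in the RegExp Helper.

```python
import re

# Invented sample with two posts, so greedy and non-greedy differ.
sample = (
    '<div id="post-1">first'
    '<div class="postMetaData">meta 1</div>'
    '<div id="post-2">second'
    '<div class="postMetaData">meta 2</div>'
)

# Greedy: .* grabs as much as it can, so the capture runs up to the LAST marker.
greedy = re.search(
    r'(<div id="post.*)<div class="postMetaData">', sample, re.DOTALL
).group(1)

# Non-greedy: .*? grabs as little as it can, so the capture stops at the FIRST marker.
lazy = re.search(
    r'(<div id="post.*?)<div class="postMetaData">', sample, re.DOTALL
).group(1)

print(greedy)
print(lazy)
```

With only one occurrence of each marker on the page, both forms give the same result, which is why the greedy pattern is fine here.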

When experimenting with the patterns, use the RegExp Helper under the Tools menu. You can paste the web page HTML source into the Input box, then enter different patterns in the RegExp textbox. Click on Test and you will be shown both the text that matches the whole pattern and the text that matches the parenthesized part of the pattern (i.e. the part that actually ends up being included).