Picking up b_k's idea
Quote:
Originally Posted by b_k
well, not clean text, but look what is in a tagesschau.de html between "<div class="contModule conttext article">" and "<div class="standDatum">Stand: DD.MM.YYYY HH:MM Uhr</div>"
proved fruitful:
Now it is possible to retrieve and include the contents of a linked article and have it displayed in either HTML or LaTeX.
To achieve this, an additional flag, -r, had to be (re-)introduced, and the syntax of the -f flag was extended. It is now
Code:
-f <URL>;<start>;<stop>
where <URL> is the address of the feed itself, and <start> and <stop> are tags (N.B. not necessarily HTML tags!) used to identify the starting and stopping positions, respectively, at which the article is cut out of the page downloaded for a given item.
Unless -r is set, there won't be any downloads, irrespective of whether any <start> or <stop> tags are given.
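For illustration, using b_k's tagesschau.de markers from the quote above, an invocation could look roughly like this (the script name getfeed.pl and the feed URL are placeholders, and the single quotes are just one way of getting the semicolon-separated argument past the shell):
Code:
getfeed.pl -r -f 'http://example.org/feed.rss;<div class="contModule conttext article">;<div class="standDatum">'
Only the opening <div> of the stop marker is used here, because after the re-formatting mentioned in the cautions below, each tag sits on its own line anyway.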
Details on the usage can be found in my personal .getfeedrc I attached.
A few words of caution are in order (before I get flamed):
- DON'T use a line containing more than one HTML tag for <start> or <stop>. During the parsing of a page, its content is re-formatted so that only one HTML tag appears per line, so such a marker will never be found!
- The search for <start> and <stop> employs Perl's regular-expression matching (see the sketch below this list). If you are aware of that and know regexps, this comes in quite handy; otherwise it may turn out rather annoying.
- Not all standard HTML/XHTML tags are recognised and translated to the respective LaTeX commands.
- A number of tags are removed completely from the original HTML, such as <html>, <input>, <img>, <select>, <form> etc.
- Tables - although quite fancy in HTML - are not rendered into their LaTeX equivalents (at least not yet...), nor are they copied as such into the HTML output.
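To make the first two points a bit more tangible, here is a minimal Perl sketch of such a regexp-based cut. It is NOT the actual getfeed code, just my reading of the behaviour described above: the page is re-flowed to one tag per line (which is why a marker spanning more than one tag can never match), and <start> and <stop> are treated as Perl regexps.
Code:
#!/usr/bin/perl
# Minimal sketch of the regexp-based cut (not the actual getfeed code).
# usage: perl cut.pl '<start-regexp>' '<stop-regexp>' < page.html
use strict;
use warnings;

my ($start, $stop) = @ARGV;           # both patterns are Perl regexps
my $html = do { local $/; <STDIN> };  # slurp the whole page

# Re-flow the page so that (roughly) every HTML tag starts its own line.
$html =~ s/</\n</g;

my $in = 0;
my @article;
for my $line (split /\n/, $html) {
    if (!$in) {
        $in = 1 if $line =~ /$start/; # first line matching <start> opens the cut
        next;
    }
    last if $line =~ /$stop/;         # first subsequent match of <stop> closes it
    push @article, $line;
}
print join("\n", @article), "\n";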
And if you want to know whether this is something for you, just have a look at the PDF attached.
Hoping that someone finds this useful...