Old 01-28-2007, 11:10 AM   #16
Tommy
Enthusiast
 
Posts: 32
Karma: 10
Join Date: Oct 2006
Location: Germany
Device: Iliad, Sony 505
Picking up b_k's idea
Quote:
Originally Posted by b_k
well, not clean text, but look what is in a tagesschau.de html between "<div class="contModule conttext article">" and "<div class="standDatum">Stand: DD.MM.YYYY HH:MM Uhr</div>"
proved fruitful:
It is now possible to retrieve and include the contents of a linked article and have it displayed in either HTML or LaTeX.
To achieve this, an additional flag, -r, had to be (re-)introduced, and the syntax of the -f flag was extended. It is now
Code:
 -f <URL>;<start>;<stop>
where <URL> is the address of the feed itself, and <start> and <stop> are tags (N.B. not necessarily HTML tags!) used to identify the starting and stopping positions, respectively, at which to cut the article out of the page downloaded for a given item.
Unless -r is set, there won't be any downloads, irrespective of whether any <start> or <stop> tags are given.
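
For concreteness, here is what such a feed line could look like for the tagesschau.de case quoted above; the start/stop tags are the ones b_k identified, but the feed URL is only an assumption from memory and may need checking:
Code:
 -f http://www.tagesschau.de/newsticker.rdf;<div class="contModule conttext article">;<div class="standDatum">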

Details on the usage can be found in my personal .getfeedrc I attached.

A few words of caution should perhaps be said (before I get flamed):
  • DON'T use a line containing more than one HTML tag for <start> or <stop>. During the parsing of a page, its content is re-formatted so that only one HTML tag is contained per line; such a tag will therefore never be found!
  • The search for <start> and <stop> employs Perl's REGULAR EXPRESSION matching. If you know regexps, this will come in quite handy; otherwise it might turn out rather annoying (see the escaping example after this list).
  • Not all standard HTML/XHTML tags are recognised and translated to the respective LaTeX commands.
  • A number of tags are removed completely from the original HTML, such as <html>, <input>, <img>, <select>, <form> etc.
  • Tables, although quite fancy in HTML, are not rendered into their respective LaTeX equivalents (at least not yet...), and neither are they copied as such into the HTML output.
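
On the escaping point: since <start> and <stop> are treated as Perl regular expressions, any regexp metacharacters they contain must be escaped. A hypothetical illustration (the feed URL and class names are made up):
Code:
 -f http://example.com/feed.rss;<div class="story (main)">;<div class="footer">
would fail, because ( and ) are regexp metacharacters and would not match the literal parentheses; escaping them works:
Code:
 -f http://example.com/feed.rss;<div class="story \(main\)">;<div class="footer">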

And if you want to know whether this is something for you, just have a look at the PDF attached.

Hoping that someone finds this useful...
Attached Files
File Type: pl getfeed.pl (19.5 KB, 353 views)
File Type: gz .getfeedrc.gz (685 Bytes, 368 views)
File Type: pdf myFeeds.pdf (392.6 KB, 1247 views)

Old 10-20-2007, 07:28 PM   #17
ebookie
Entrepreneur
 
Posts: 36
Karma: 10
Join Date: Oct 2007
Location: California
Device: Iliad v2
I am struck by how cool this could be if it were done legitimately. What I mean is: if you came up with a way to pay an author for his or her reportage, and a way to select what you were willing to pay for an article, you could actually create something really useful out of this, rather than trying to steal the content out from under some web site which is using it to generate the advertising revenue that pays the salaries of the people running the site in the first place.

It is too bad that "real" newspapers are so hung up on "protecting" their cash cow (which is hemorrhaging, but somehow they can't start raising a new cow before the old one dies) that they don't really "get" this opportunity.

--Chuck
Old 10-20-2007, 08:01 PM   #18
kovidgoyal
creator of calibre
 
Posts: 45,359
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
I just noticed this thread. I've made a lot more progress on this, though for the SONY Reader: I can generate beautifully formatted LRF files with a nice hierarchical table of contents from the RSS feeds of the NYTimes, BBC and Newsweek. It uses the print version of the articles, so there are no pictures, but otherwise it generates a very pretty ebook.

It's based on a pretty simple plug-in system that should allow people to write plugins for their favorite feeds.

All part of libprs500.
Old 10-26-2007, 07:53 AM   #19
Tommy
Enthusiast
 
Posts: 32
Karma: 10
Join Date: Oct 2006
Location: Germany
Device: Iliad, Sony 505
Quote:
Originally Posted by ebookie View Post
I am struck by how cool this could be if it were done legitimately. What I mean is: if you came up with a way to pay an author for his or her reportage, and a way to select what you were willing to pay for an article, you could actually create something really useful out of this, rather than trying to steal the content out from under some web site which is using it to generate the advertising revenue that pays the salaries of the people running the site in the first place.

...
--Chuck
You have a point or two there; however, I cannot fully agree with what you say about "stealing". That assertion is not true, for news agencies put these RSS feeds online as a free (as in beer) service, and if they didn't want people to read them free of charge, they should stop publishing them, or charge for them.

And as for the ads that come along: first, an RSS feed does not carry any ads; secondly, the ads will actually be fetched if they are on the page... admittedly only by the tool, but so what? When it comes to counting hits that doesn't matter, and the content provider can still tell the advertiser how many hits he got on this particular page. And thirdly, depending on where and how an ad is placed, it may still appear in the ebook.

Tommy
Old 11-02-2007, 10:40 PM   #20
ebookie
Entrepreneur
 
Posts: 36
Karma: 10
Join Date: Oct 2007
Location: California
Device: Iliad v2
I agree with you, Tommy, that pulling the RSS feed and putting it on the Iliad is a perfectly legitimate use of the feed. I wrote something similar to your Perl script in Python. The "stealing" part involves fetching the whole story, stripping off the window dressing it had on its web site, and putting that on the Iliad. The provider of the feed expects you to click a link in your RSS reader and go to their web site, which will display a bunch of annoying ads and, on the off chance you click on one, will pay them a bit of coin. So if you suck the story off the site, strip out their ads and such, and put it on the Iliad, they think of that as 'stealing' their content, just like they complain when people put their web page in a frame with someone else's advertising outside the frame.

I make no claim as to the rightness or wrongness of this, but for better or worse it is the current business model people like Reuters, AP, etc. use to "monetize" their work (that is code for getting paid for having people do this all day long). I managed to get AP to tell me what it would cost to push the whole story to an Iliad, and they said between $400 and $600 per story, depending on how many people it was being sent to. (I know that probably doesn't make sense, but they see it as a way of collecting a fraction of the money you will be making off the story, as sized by your readership; they are stuck in the magazine/newspaper model where the number of subscribers determines what you can charge for ads, so if you have a lot of subscribers you can charge a lot for ads and make more per page, etc.)

Personally, I'd like to cut AP out of the loop: basically, create an automated system whereby people could submit a story for publication, get paid a fixed price for it, and then put together a newspaper from the best stories. But some people can't write, and other people are carrying some sekrit agenda (like working for Microsoft in their day job), so out of the chute I don't want to pay people $500 a story but rather $1 a story, then publish it and figure out some way of measuring their credibility; as their credibility index goes up, I would be happy to pay them more. Sort of like reading Slashdot at a high moderation level. I figure an honest, hard-working journalist who reports a balanced account of the story is worth 500x more than one who is compensated to be the mouthpiece of some special interest. Unfortunately, there isn't a "Special Interest Lapdog" registry.

So the value-add of an Associated Press is that they have, in theory, screened their journalists and pay them an appropriate amount to keep them honest. If someone wrote two decent articles a week and got paid $500 each for them, that would be a pretty decent wage in many parts of the USA.

Anyway, to hammer the point home: ask any "famous" blogger for permission to pull their blog entries and publish them in your e-paper magazine. I expect most of them would ask you to pay them for that right, and if you said "But I don't pay anything to read your blog on Blogger", they would say that they get advertising revenue from visits to their blog page which they wouldn't get from you. So if you re-published them without their permission, they might call it 'stealing' from them.

--Chuck
Old 12-21-2007, 05:19 AM   #21
fodiator
Member
 
Posts: 21
Karma: 12
Join Date: Sep 2007
Device: Irex ILiad
Multiple LaTeX errors

Dear Tommy,
as Xmas is near ;-) could you please provide a hint on how to getfeed "Der Standard" and "Spiegel" properly?

I have tried several configs, and the most reasonable one for me would be:
Code:
-f http://derStandard.at/?page=rss&ressort=Newsroom;<!-- google_ad_section_start -->;<!-- google_ad_section_end -->
but no full-text pages appear in the PDF.

Unfortunately, even worse is:
Code:
-f http://www.spiegel.de/schlagzeilen/rss/0,5291,,00.xml;<h4>;<div class="spDottedLine">
which just produces multiple LaTeX errors and no PDF file at all.


Kind regards
Harald

Old 12-26-2007, 10:25 AM   #22
thetechnobear
Connoisseur
 
Posts: 65
Karma: 256
Join Date: Nov 2007
Location: Switzerland
Device: Iliad, Kindle K3, iPad , iPhone, etc...
Some feeds are harder than others, as the script needs to strip 'code' out of the news feed.

A modification I made (easy enough if you look at the scripts) is to get the script to download the 'print' version rather than the web version. (Usually, if you go to the print view, you will see that its URL is a modification of the original URL you got from the RSS feed.)

The print version often (I haven't checked Spiegel) has less code and formatting, so the scraping works better. A sketch of the idea follows.
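
To give a flavour of the kind of modification meant here, a minimal sketch in Perl, assuming the BBC-style print URLs that come up later in this thread; the article URL and the rewrite rule are purely illustrative and feed-specific:
Code:
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical example: rewrite a BBC article URL from the RSS feed
# into the corresponding 'print' URL by prefixing the print gateway.
my $url = 'http://news.bbc.co.uk/2/hi/technology/1234567.stm';   # made-up article URL
(my $print_url = $url) =~
    s{^http://}{http://newsvote.bbc.co.uk/mpapps/pagetools/print/};
print "$print_url\n";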

Tommy, are you still around these parts? If so, I could send you my modifications for inclusion, if you wish.
Old 12-28-2007, 03:17 AM   #23
Tommy
Enthusiast
 
Posts: 32
Karma: 10
Join Date: Oct 2006
Location: Germany
Device: Iliad, Sony 505
Hi Harald,

First, sorry for the late reply; I saw your message only today.

I had a look at the "articles" of the Standard, and all I found was essentially some JavaScript code. Therefore the tags you provided for that feed cannot pull anything from the respective articles, so, as the previous poster mentioned, one might need to change the code to get at the actual articles buried somewhere out there.

As for the Spiegel, I saw that the feed doesn't provide a description tag?! So all we get there are the headlines..., but the links work!
One of the LaTeX errors you receive for Spiegel articles is due to the start tag you specified:
Code:
<h4>
This starts the article right after the <h4>, i.e. with the headline of the article. But after the headline there is a closing tag </h4>, which gets translated into a "}", causing (part of) the fuss.
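
To illustrate: the exact commands getfeed emits may differ, but the shape of the problem is that the cut-out text begins with something like
Code:
Some headline} % the stray "}" from </h4> has no matching "{"
and an unbalanced "}" is exactly the kind of thing that makes LaTeX abort, typically with a "Too many }'s" error.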

I regret I cannot look deeper into this right now, but as I'll be heading off on holiday today, I'm a bit in a hurry.

Guten Rutsch ("a good slide" into the new year), and for the EN-speakers among us, Happy New Year,
Tommy


Quote:
Originally Posted by fodiator View Post
Dear Tommy,
as Xmas is near ;-) could you please provide a hint on how to getfeed "Der Standard" and "Spiegel" properly?
<snip>
Kind regards
Harald
Old 12-28-2007, 03:35 AM   #24
Tommy
Enthusiast
 
Posts: 32
Karma: 10
Join Date: Oct 2006
Location: Germany
Device: Iliad, Sony 505
Quote:
Originally Posted by thetechnobear View Post
(...)
tommy, are you still around these parts? if so i could send you my modifications for inclusion if you wish.
I'm still around, and I'd be pleased if you sent your changes.
My email would be: tommy.berndt(at)gmx.de.
But as I'll be away for a fortnight, it'll take some time until I can have a look at the code. So I think it would be better if you published your version directly here in the forum, so that everyone can access it immediately.

I uploaded my current version of getfeed.pl together with a config file that might (or might not) be useful...

Tommy
Attached Files
File Type: gz getfeed.tar.gz (10.8 KB, 283 views)
Old 01-11-2008, 08:41 AM   #25
fodiator
Member
 
Posts: 21
Karma: 12
Join Date: Sep 2007
Device: Irex ILiad
Suggestion

Hi,
I have jumped into Perl and fiddled around with the getfeed code to get DerStandard.at to work. Although my implementation is quite ugly (hard-coded) and I still have some problems concerning charmaps and special characters, the result is promising.
I found out that Tommy's improvement of defining start and stop tags in the getfeedrc file would not necessarily be enough for more complex web services. I would therefore like to discuss the idea of implementing a kind of module (containing Perl code) to handle the specific formatting of index and content pages.
As I am a Perl newcomer, I would not dare to propose how this could best be done, so feedback is kindly welcome!
Nevertheless, I would be glad to provide my getfeed patch if there is any interest.

Kind regards
Harald

Old 01-11-2008, 04:43 PM   #26
thetechnobear
Connoisseur
 
Posts: 65
Karma: 256
Join Date: Nov 2007
Location: Switzerland
Device: Iliad, Kindle K3, iPad , iPhone, etc...
Sorry, been away.

Attached is the changed file.

My extra option is -P (for the printed version). It takes a parameter of the form
from-url;to-url;
where from-url is a normal regular expression.

An example of my command line would be:

Code:
getfeed.pl 
-o BBC.tex 
-F tex 
-S ../res/iliad.sty 
-s 
-r 
-t BBC 
-C "/usr/texbin/pdflatex -interaction=nonstopmode BBC.tex" 
-f "http://newsrss.bbc.co.uk/rss/newsonline_world_edition/front_page/rss.xml;<div class="headline">;<div class="footer">" 
-P 'http:\/\/(.*);http://newsvote.bbc.co.uk/mpapps/pagetools/print/;'
The trick is to work out how to get from the RSS URL to the print URL, but often it's surprisingly simple (as is the case with the BBC).
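
For example, the -P rule above would map an article URL like the one below; the article URL is made up, and I'm assuming the part captured by (.*) is kept after the print-gateway prefix:
Code:
http://news.bbc.co.uk/2/hi/technology/1234567.stm
  -> http://newsvote.bbc.co.uk/mpapps/pagetools/print/news.bbc.co.uk/2/hi/technology/1234567.stm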
Attached Files
File Type: pl getfeed.pl (19.8 KB, 301 views)
Old 01-16-2008, 01:22 PM   #27
Tommy
Enthusiast
 
Posts: 32
Karma: 10
Join Date: Oct 2006
Location: Germany
Device: Iliad, Sony 505
Hi,
back from holidays (unfortunately, but all things have to end at some point...)

The idea technobear implemented is actually a very neat one
...provided you have something like a regexp view to easily see how the original URL converts to the "printed view" URL.
I have to admit, I wouldn't have come up with that regexp.
But now that the idea is out, and if all printed-view URLs are really so easy to transform, it's absolutely worth incorporating. (However, it currently works only with a single feed; given more, the second and all following feeds will fail. So a little further hacking will be needed.)

Quote:
Originally Posted by thetechnobear View Post
Sorry, been away.
<some snipping>
Code:
getfeed.pl <further snipping>
-f "http://newsrss.bbc.co.uk/rss/newsonline_world_edition/front_page/rss.xml;<div class="headline">;<div class="footer">" 
-P 'http:\/\/(.*);http://newsvote.bbc.co.uk/mpapps/pagetools/print/;'
Old 01-20-2008, 12:25 PM   #28
Tommy
Enthusiast
 
Posts: 32
Karma: 10
Join Date: Oct 2006
Location: Germany
Device: Iliad, Sony 505
RE: Suggestion

Hi Harald,
Quote:
Originally Posted by fodiator View Post
Hi,
I have jumped into Perl and fiddled around with the getfeed code to get DerStandard.at to work. Although my implementation is quite ugly (hard-coded) and I still have some problems concerning charmaps and special characters, the result is promising.
I found out that Tommy's improvement of defining start and stop tags in the getfeedrc file would not necessarily be enough for more complex web services. I would therefore like to discuss the idea of implementing a kind of module (containing Perl code) to handle the specific formatting of index and content pages.
<snip>
Kind regards
Harald
I'd refrain from publishing something that is hard-wired for a single feed (such as the Standard), but the idea of providing something like "plugins" tailored for specific feeds sounds like a very good one. (I think something similar has already been done by kovidgoyal for the Sony, as mentioned a few posts up...)
But I'm afraid this road would (further) alienate non-Perl-speaking people from getfeed, unless some default behaviour remains in the core for standard feeds that don't need special processing.

However, the more I think about it, the more I warm to the idea... (some ideas have already started popping up)

If anyone has already thought out something in this direction, please speak out!
Tommy

Old 01-26-2008, 05:12 AM   #29
Tommy
Enthusiast
 
Posts: 32
Karma: 10
Join Date: Oct 2006
Location: Germany
Device: Iliad, Sony 505
New features

Hi all,

here comes a new version of getfeed, which incorporates both thetechnobear's and fodiator's ideas proposed above.

Some (sort of) documentation:
Code:

getfeed V0.9e (c) by T.Berndt
This program comes with ABSOLUTELY NO WARRANTY.

usage: getfeed [...] [-o <outfile>] [-f] <feed> [<feed_1> ...]
  -f <feed>[;<start>;<stop>;<filter>;<server>;<srcURL>;<toURLa>;<toURLb>]
               : <feed> is a URL or a filename.
  -d <directory> : saves output into <directory>
  -o <outfile> : saves output into <outfile>
  -t <title>   : Title of this news edition
  -r           : Retrieve and append linked articles. Default: no
  -R <file>    : Reads <file> instead of .getfeedrc
  -e <charset> : Use <charset> for encoding. Default: utf-8
  -F <format>  : Output format: html (obvious) or tex (LaTeX). Default: html
  -S <style>   : Reads <style> and adds its content as style information.
  -P <package> : Adds a \usepackage{<package>} to the LaTeX file
  -C <cmd>     : Execute <cmd>
  -m           : format text in two columns
  -a           : Auto-name the output as news_YYYYMMDD.<format>. Default: no
  -v           : Print debugging info to STDERR/<log>.
  -s           : Suppress all output. Default: no (i.e. not silent)
  -l <log>     : Writes debugging information to <log>

Run getfeed -v -h for more information!

getfeed reads news feeds and converts them into either an HTML or a LaTeX file.
The feeds currently understood are RSS, ATOM and RDF {0.91, 1.0, 2.0}.
And some more explanation:
  • config file:
    If the home directory contains a config file named .getfeedrc,
    it will be parsed for the above flags, and those will be used
    as default settings. Settings/input from the command line
    override the default settings/values.
    By means of -R <file> a different file can be specified.
  • On the -f switch:
    If the feed starts with http:// the respective file will be
    downloaded; otherwise it will be read from the filesystem.
  • On the <start> and <stop> tags:
    The <start> and <stop> tags are used as markers to cut the
    interesting part out of the downloaded article. They need to
    be given only if -r is given.
    If <start> is provided but <stop> is not, <start> is interpreted
    as a program to receive and process the downloaded page:
    this program must accept the name of the file into which getfeed saves
    the current page. After processing the page, the program must write
    its results back to this file (a sample "plugin" is sketched further
    down in this post).
  • On the <filter> tag:
    If this tag is given, it will be interpreted as the name of a file
    to be looked up for key words to check against.
    The format of this file is <word> <weight>. When the checks
    are made, the feed's header and description are parsed, and whenever
    a word is found a counter is increased by the weighting
    factor associated with that word. If this sum exceeds some
    threshold, the item is rejected. The threshold is given
    as #! n anywhere in the file (see the sample filter file after
    this list).
  • On the <server> tag:
    If specified, it will be used to download images from and include
    them in the output.
  • On the tags <srcURL>, <toURLa> and <toURLb>:
    These tags can be specified to "redirect" the URL of the current
    feed to point to a different page, e.g. the print edition of the
    current page.
    Credit for this needs to go to 'thetechnobear', as he proposed this
    feature and provided a prototype. Check
    https://www.mobileread.com/forums/sho...?t=7796&page=2
  • On the -r switch:
    The linked page is downloaded only if <start> and <stop> tags
    are given for this feed.
  • On the -a switch:
    This is to allow keeping the news sorted for later look-up.
    If <outfile> is given as well, it will override -a.
  • On debugging/logging:
    The -v switch is stackable, i.e. -v -v will produce more output.
  • On the behaviour of "binary" switches:
    -a, -r, -s toggle, i.e. -a -a effectively turns auto-naming off.
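
To illustrate the <filter> file format (the words, weights and threshold are of course made up), a file like this would reject any item whose header/description accumulates a weight above 5:
Code:
#! 5
lottery 4
horoscope 3
celebrity 2
An item mentioning both "lottery" and "celebrity" would score 4 + 2 = 6, exceed the threshold of 5, and be dropped.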

WARNING:
As can be read above, fodiator's idea of facilitating plugins has been realised by simply calling an external program to "massage" the current item's page and return its result to getfeed for inclusion. Of course, this opens every door for malicious code to wreak havoc on your computer, so it's up to you to check such a program carefully beforehand.

I chose this approach as
(i) it lets users provide their own logic in any language they like,
(ii) it doesn't impose any artificial restrictions like interfaces or APIs, and
(iii) it is the simplest approach to realise.

Second WARNING:
I haven't checked this feature myself! I only wrote two sample programs - caller.pl and callee.pl - as a proof of principle.

Hoping you find it useful...
Regards,
Tommy

---
UPDATE
The "plugin" mechanism did not work in the version first posted here, but it has been fixed and is working now!
I uploaded the latest version (0.9e) of getfeed along with an example "plugin" (callee.pl). This program does nothing but turn the text into upper case, to illustrate the usage of this feature.
However, it might also serve as a template or a starting point for your own "plugins".
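
For reference, a minimal sketch of what such a "plugin" can look like, following the contract described above (filename as its argument, results written back to the same file). This mirrors what the attached callee.pl is described to do, but the attached file remains the authoritative version:
Code:
#!/usr/bin/perl
# Sketch of a getfeed "plugin": getfeed passes the name of the file
# holding the current page; we read it, process it (here: upper-case
# everything), and write the result back to the same file.
use strict;
use warnings;

my $file = shift or die "usage: callee.pl <file>\n";

open my $in, '<', $file or die "cannot read $file: $!\n";
my $text = do { local $/; <$in> };    # slurp the whole page
close $in;

$text = uc $text;                     # the actual "processing"

open my $out, '>', $file or die "cannot write $file: $!\n";
print $out $text;
close $out;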
Attached Files
File Type: pl getfeed.pl (32.0 KB, 331 views)
File Type: pl callee.pl (377 Bytes, 270 views)

Old 01-30-2008, 04:26 AM   #30
fodiator
Member
 
Posts: 21
Karma: 12
Join Date: Sep 2007
Device: Irex ILiad
Quote:
Originally Posted by Tommy View Post
Hi all,

here comes a new version of getfeed, which incorporates both thetechnobear's and fodiator's ideas proposed above.
Tommy, thank you for this quick response! I will try to merge your new version with my dirty one and provide a working plugin. Unfortunately, I am very busy for the upcoming two weeks, so it might take some time. Until then, I attach my recent version, which leeches DerStandard.at. The major changes relative to thetechnobear's last version:
  • the HTTP GET method now supports all charsets (esp. UTF-8)
  • some of the links in Der Standard's RSS feeds point to context pages, where further links to the content pages are provided. These pages are now parsed, and the topmost link is referenced as the content (a sketch of the idea follows below).
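
A rough sketch of that context-page idea (the regexp, the variable names and the sample HTML are illustrative only; see the attached file for the real implementation):
Code:
#!/usr/bin/perl
use strict;
use warnings;

# Pick the topmost link out of a "context" page's HTML and treat it
# as the real content URL.
my $context_html = '<p><a href="http://derStandard.at/story1">Story</a> ...</p>';
my ($content_url) = $context_html =~ m{<a[^>]+href="([^"]+)"}i;
print "$content_url\n";    # -> http://derStandard.at/story1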
Attached Files
File Type: pl getfeed.pl (21.3 KB, 267 views)