#1 |
Member
Posts: 18
Karma: 10
Join Date: Sep 2010
Device: Kindle 3 3G intl
|
Rules for mediapart.fr and rue89.com (French news websites)
One new rule for rue89.com, a free French news website:
Spoiler:
And a much improved one (original version by Mathieu Godlewski) for Mediapart, a well-known online-only newspaper with a paid subscription:
Spoiler:
I've been testing them for a few days, but there's probably room for improvement. |
#2 |
Junior Member
Posts: 1
Karma: 10
Join Date: May 2011
Device: sony PRS-650
|
Hello, Mediapart recently changed their home page, making the login form the first form on the page, which broke the Mediapart collection script.
I have updated the existing rule to account for that and it works pretty well: http://arzur.net/2011/05/22/calibre-mediapart-ftw/ (in French). I just wish br.select_form could address the form's id attribute, or that Mediapart would put a name= on their form :-) |
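One possible way around the missing name= is the predicate argument of mechanize's select_form, which is called with each form and can inspect its attrs dictionary. A minimal sketch, assuming the login form carries a hypothetical id of 'user-login' (check the real page source); FakeForm only stands in for a mechanize form so the helper can be run without a network connection.

```python
def form_with_id(form_id):
    """Return a predicate that is true for a form whose id attribute equals form_id."""
    def predicate(form):
        # mechanize forms expose their HTML attributes as a dict named attrs
        return getattr(form, 'attrs', {}).get('id') == form_id
    return predicate


class FakeForm(object):
    """Stand-in for a mechanize form, used only for this demonstration."""
    def __init__(self, **attrs):
        self.attrs = attrs


# 'search-form' and 'user-login' are made-up ids, not Mediapart's real ones
forms = [FakeForm(id='search-form'), FakeForm(id='user-login')]
matches = [f for f in forms if form_with_id('user-login')(f)]
print(len(matches))

# In a recipe's get_browser() this would be used as (untested sketch):
#   br.select_form(predicate=form_with_id('user-login'))
```

Selecting by id this way would not depend on the order of the forms on the page, which is what broke nr=0.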
#3 |
Member
Posts: 18
Karma: 10
Join Date: Sep 2010
Device: Kindle 3 3G intl
|
Great, thanks for the update. I saw it wasn't working anymore, but I have been too busy recently to fix it.
And happy to see there are other users around. |
#4 |
Junior Member
Posts: 8
Karma: 10
Join Date: Dec 2011
Device: Kindle
|
This recipe seems broken, at least for me it fetches only rubbish. Does it still work for you?
Best regards -br |
#5 |
Member
Posts: 18
Karma: 10
Join Date: Sep 2010
Device: Kindle 3 3G intl
|
Oops, sorry, I updated the Mediapart one some time ago but forgot to share it.
I've switched to the print version, which they have much improved on the site (most of the code already existed but was commented out):
Spoiler:
I don't have a fix for rue89 right now though; I'll try to find the time to look into it. |
#6 |
Junior Member
Posts: 8
Karma: 10
Join Date: Dec 2011
Device: Kindle
|
Hi,
great work, thanks a lot! That will make mediapart far more comfortable to read. Have a nice week-end! -br |
#7 |
Member
Posts: 18
Karma: 10
Join Date: Sep 2010
Device: Kindle 3 3G intl
|
OK, it still needs some polishing (some articles pick up a little garbage), but I've made the rue89 recipe work again.
I've put both recipes in a git repo: https://github.com/AltGr/Calibre-french-news-rules
The Mediapart recipe there is updated as well. |
#8 |
Member
Posts: 18
Karma: 10
Join Date: Sep 2010
Device: Kindle 3 3G intl
|
OK, I think I got rid of the garbage; anyone is welcome to test the new rue89 recipe and report back.
Video articles ("zapnet") should be removed, but it seems they're not parsed correctly (the soup has everything inside <script> tags). Any hints on how to detect and remove them? |
#9 |
creator of calibre
Posts: 45,235
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Add a preprocess_regexps entry to your recipe to remove <script>.*?</script>
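As a sketch of that suggestion: calibre's preprocess_regexps recipe attribute takes a list of (compiled pattern, replacement function) pairs that are applied to the raw HTML before it is parsed. The snippet below applies such a pair by hand so it can be run outside calibre; the apply_regexps helper and the sample markup are assumptions for illustration, not rue89's actual page.

```python
import re

# Same shape as a recipe's preprocess_regexps attribute:
# a list of (compiled pattern, function taking the match and returning the replacement).
preprocess_regexps = [
    # DOTALL lets .*? span newlines; IGNORECASE also catches <SCRIPT>
    (re.compile(r'<script\b.*?</script>', re.DOTALL | re.IGNORECASE),
     lambda match: ''),
]


def apply_regexps(html, rules):
    """Apply each (pattern, replacement) rule in order, standing in for calibre."""
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html


sample = ('<p>Intro</p><script type="text/javascript">\n'
          'var zapnet = 1;\n</script><p>Body</p>')
cleaned = apply_regexps(sample, preprocess_regexps)
print(cleaned)
```

Without re.DOTALL the pattern would miss scripts that span several lines, which is the usual case.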
|
#10 |
Member
Posts: 18
Karma: 10
Join Date: Sep 2010
Device: Kindle 3 3G intl
|
That helped, thanks. I think the recipe is all right now, except for a left margin that I can't get rid of.
I removed the different feeds from that site because they mostly overlap (they are more tags than sections ;) ). At the moment there is no way to detect multiple links to the same article and make them point to the same place in the ebook, is there? |
#11 |
creator of calibre
Posts: 45,235
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You can easily have calibre ignore duplicate links; see the sticky for a technique to do that. But there is no easy way to have the entries point to a single place in the book.
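A minimal sketch of URL-based de-duplication, which is not necessarily the technique described in the sticky. It works on plain (feed title, list of article dicts) pairs rather than calibre's own feed objects, and the helper name, feed titles and URLs are made up for the example. Recent calibre versions also provide an ignore_duplicate_articles recipe attribute (for example {'title', 'url'}) for the same purpose.

```python
def drop_duplicate_articles(feeds):
    """Keep only the first occurrence of each article URL across all feeds."""
    seen = set()
    result = []
    for feed_title, articles in feeds:
        kept = []
        for article in articles:
            url = article['url']
            if url in seen:
                continue  # already listed in an earlier feed
            seen.add(url)
            kept.append(article)
        if kept:  # drop feeds that end up empty
            result.append((feed_title, kept))
    return result


# Hypothetical overlapping feeds, as happens when feeds are tags rather than sections
feeds = [
    ('Politique', [{'title': 'A', 'url': 'http://example.com/a'},
                   {'title': 'B', 'url': 'http://example.com/b'}]),
    ('Monde', [{'title': 'B', 'url': 'http://example.com/b'},
               {'title': 'C', 'url': 'http://example.com/c'}]),
    ('Doublons', [{'title': 'A', 'url': 'http://example.com/a'}]),
]
deduped = drop_duplicate_articles(feeds)
for feed_title, articles in deduped:
    print(feed_title, [a['title'] for a in articles])
```

This only removes the repeated entries; as noted above, it does not make several table-of-contents entries point to one place in the book.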
|
#12 |
Member
Posts: 18
Karma: 10
Join Date: Sep 2010
Device: Kindle 3 3G intl
|
I haven't worked on the multi-RSS setup yet, but I just pushed a few fixes to the git repo.
|
#13 | |
Junior Member
Posts: 6
Karma: 10
Join Date: Jul 2013
Device: kindle 4
|
Hello, Mediapart has adopted a new editorial format and I think that has broken the current recipe.
Sorry for my broken English. Here is the error:
Quote:
|
|
#14 |
Junior Member
Posts: 6
Karma: 10
Join Date: Jul 2013
Device: kindle 4
|
Hello, I have updated the recipe. This is the first time I have tried this, but it works.
What I changed (for each item, the old line is followed by its replacement):
Code:
    # 1: find the print link by its href instead of its title
    link = soup.find('a', {'title':'Imprimer'})                     # old
    link = soup.find('a', {'href':re.compile('^/print/[0-9]+')})    # new
    # 2: prefix the site root to the href
    return link['href']                                             # old
    return 'http://www.mediapart.fr' + link['href']                 # new
    # 3: open a different page to log in
    br.open('http://www.mediapart.fr/')                             # old
    br.open('http://blogs.mediapart.fr/editions/guide-du-coordonnateur-d-edition')  # new
    # 4: the login form is now the second form on that page
    br.select_form(nr=0)                                            # old
    br.select_form(nr=1)                                            # new
    # 5: I also added:
    masthead_url = 'https://upload.wikimedia.org/wikipedia/fr/2/23/Mediapart.png'
Code:
    __license__ = 'GPL v3'
    __copyright__ = '2009, Mathieu Godlewski <mathieu at godlewski.fr>; 2010-2012, Louis Gesbert <meta at antislash dot info>; 2013, Malah <malah at neuf.fr>'
    '''
    Mediapart
    '''

    __author__ = '2009, Mathieu Godlewski <mathieu at godlewski.fr>; 2010-2012, Louis Gesbert <meta at antislash dot info>; 2013, Malah <malah at neuf.fr>'

    import re  # used by print_version below

    from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag
    from calibre.web.feeds.news import BasicNewsRecipe

    class Mediapart(BasicNewsRecipe):
        title = 'Mediapart'
        __author__ = 'Mathieu Godlewski, Louis Gesbert, Malah'
        description = 'Global news in french from news site Mediapart'
        oldest_article = 7
        language = 'fr'
        needs_subscription = True
        max_articles_per_feed = 50
        use_embedded_content = False
        no_stylesheets = True
        masthead_url = 'https://upload.wikimedia.org/wikipedia/fr/2/23/Mediapart.png'
        cover_url = 'http://static.mediapart.fr/files/pave_mediapart.jpg'

        feeds = [
            ('Les articles', 'http://www.mediapart.fr/articles/feed'),
        ]

        # -- print-version

        conversion_options = { 'smarten_punctuation' : True }

        remove_tags = [ dict(name='div', attrs={'class':'print-source_url'}) ]

        def print_version(self, url):
            # Fetch the article page and look for its /print/<id> link
            raw = self.browser.open(url).read()
            soup = BeautifulSoup(raw.decode('utf8', 'replace'))
            link = soup.find('a', {'href':re.compile('^/print/[0-9]+')})
            if link is None:
                return None
            return 'http://www.mediapart.fr' + link['href']

        # -- Handle login

        def get_browser(self):
            br = BasicNewsRecipe.get_browser(self)
            if self.username is not None and self.password is not None:
                br.open('http://blogs.mediapart.fr/editions/guide-du-coordonnateur-d-edition')
                br.select_form(nr=1)
                br['name'] = self.username
                br['pass'] = self.password
                br.submit()
            return br

        def preprocess_html(self, soup):
            # Turn page titles into headings and put legends on their own line
            for title in soup.findAll('p', {'class':'titre_page'}):
                title.name = 'h3'
            for legend in soup.findAll('span', {'class':'legend'}):
                legend.insert(0, Tag(soup, 'br', []))
                legend.name = 'small'
            return soup

Last edited by malah; 08-07-2013 at 04:48 PM. |
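Change #2 builds the absolute print URL by string concatenation. A minimal alternative sketch uses urljoin so that a missing or doubled slash cannot creep in. The helper name and the sample hrefs are hypothetical, and the code targets Python 3, whereas calibre recipes of that period ran on Python 2.

```python
import re
from urllib.parse import urljoin

BASE_URL = 'http://www.mediapart.fr'
# Same pattern the recipe uses to recognise a print link
PRINT_HREF = re.compile(r'^/print/[0-9]+')


def absolute_print_url(href):
    """Return the absolute print URL for a /print/<id> href, or None otherwise."""
    if href is None or not PRINT_HREF.match(href):
        return None
    return urljoin(BASE_URL, href)


print(absolute_print_url('/print/12345'))
print(absolute_print_url('/journal/france'))
```

Returning None for anything that is not a print link mirrors the recipe's own "if link is None: return None" guard.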
#15 |
Junior Member
Posts: 6
Karma: 10
Join Date: Jul 2013
Device: kindle 4
|
Hello, two days ago the print version of Mediapart changed slightly, and that broke the recipe. I haven't found a simple way to keep using it, so the new recipe no longer uses the print version. I don't know if that is the best way, but it works.
Code:
    __license__ = 'GPL v3'
    __copyright__ = '2009, Mathieu Godlewski <mathieu at godlewski.fr>; 2010-2012, Louis Gesbert <meta at antislash dot info>; 2013, Malah <malah at neuf dot fr>'
    '''
    Mediapart
    '''

    __author__ = '2009, Mathieu Godlewski <mathieu at godlewski.fr>; 2010-2012, Louis Gesbert <meta at antislash dot info>; 2013, Malah <malah at neuf dot fr>'

    import re  # used by print_version below

    from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag
    from calibre.web.feeds.news import BasicNewsRecipe

    class Mediapart(BasicNewsRecipe):
        title = 'Mediapart'
        __author__ = 'Mathieu Godlewski, Louis Gesbert, Malah'
        description = 'Global news in french from news site Mediapart'
        oldest_article = 7
        language = 'fr'
        needs_subscription = True
        max_articles_per_feed = 50
        use_embedded_content = False
        no_stylesheets = True
        masthead_url = 'https://upload.wikimedia.org/wikipedia/fr/2/23/Mediapart.png'
        cover_url = 'http://static.mediapart.fr/files/pave_mediapart.jpg'

        feeds = [
            ('Les articles', 'http://www.mediapart.fr/articles/feed'),
        ]

        # -- full-page-version

        conversion_options = { 'smarten_punctuation' : True }

        keep_only_tags = [
            dict(name='div', attrs={'class':'col-left fractal-desktop fractal-10-desktop collapse-7-desktop fractal-tablet fractal-6-tablet collapse-4-tablet'}),
            dict(name='div', attrs={'id':'pageFirstContent'})
        ]

        remove_tags = [
            dict(name='div', attrs={'id':'lire-aussi'}),
            dict(name='div', attrs={'class':'col-right-content'})
        ]

        def print_version(self, url):
            # Fetch the article page and look for its single-page ("onglet=full") link
            raw = self.browser.open(url).read()
            soup = BeautifulSoup(raw.decode('utf8', 'replace'))
            link = soup.find('a', {'href':re.compile('^.*?onglet=full$')})
            if link is None:
                return None
            return link['href']

        # -- Handle login

        def get_browser(self):
            br = BasicNewsRecipe.get_browser(self)
            if self.username is not None and self.password is not None:
                br.open('http://blogs.mediapart.fr/editions/guide-du-coordonnateur-d-edition')
                br.select_form(nr=1)
                br['name'] = self.username
                br['pass'] = self.password
                br.submit()
            return br

        def preprocess_html(self, soup):
            # Turn page titles into headings and put legends on their own line
            for title in soup.findAll('p', {'class':'titre_page'}):
                title.name = 'h3'
            for legend in soup.findAll('span', {'class':'legend'}):
                legend.insert(0, Tag(soup, 'br', []))
                legend.name = 'small'
            return soup

Last edited by malah; 08-07-2013 at 04:49 PM. |
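The first keep_only_tags entry in this recipe matches the complete class string, so it would stop matching if the site added or reordered a single layout class. A possible hardening, offered only as a sketch: match one token with a regular expression, which BeautifulSoup-style attrs filters accept in place of a string. Which token is actually stable ('col-left' here) is an assumption to verify against the live page.

```python
import re

# Matches a class attribute that contains col-left as a whole token
COL_LEFT = re.compile(r'(?:^|\s)col-left(?:\s|$)')

# The class string currently hard-coded in the recipe
full_class = ('col-left fractal-desktop fractal-10-desktop collapse-7-desktop '
              'fractal-tablet fractal-6-tablet collapse-4-tablet')

print(bool(COL_LEFT.search(full_class)))
# A different class that merely starts with the same text must not match
print(bool(COL_LEFT.search('col-left-wide other')))

# In the recipe this could replace the long literal (untested sketch):
#   keep_only_tags = [dict(name='div', attrs={'class': COL_LEFT}), ...]
```

The trade-off is that a looser match may keep extra divs, so remove_tags might need another entry.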