Old 09-01-2011, 05:47 PM   #1
fluzao
Member
fluzao began at the beginning.
 
Posts: 15
Karma: 10
Join Date: Apr 2011
Device: Kindle
Folha de São Paulo - Printed Edition

Guys, I finally managed to produce a working recipe for the printed edition of the famous Brazilian newspaper Folha de São Paulo. Please give me some feedback and help me tackle the long list of pending issues.

What does the recipe currently do?
1. Logs in using a UOL login.
2. Recognizes sections.
3. Downloads all articles from the current edition and assigns them to the correct section.

To-do list:
1. It takes 15 minutes to run the recipe. Can we improve its speed?
2. Section names come from <a name=""> attributes and are sometimes truncated (e.g. Ilustrada is shown as ilustra). Should be easy to fix with a dictionary.
3. Get rid of the copyright footer and the “Texto Anterior” and “Próximo Texto” bits.
4. General beautification/cleanup of the articles.
5. Get the publication date and use it appropriately.
6. Get masthead. DONE
7. Find the current cover and use it as cover object.
8. Fix the name to Folha de São Paulo, with ~. DONE
9. Currently works for UOL subscribers. Ideally, should also work for FOLHA subscribers.
10. Allow users to decide which sections they want to download (e.g. never download Campinas, Ribeirão, Comida).
11. The first three articles are usually “capa” (which is the website cover), “fac-simile da capa” (which is the actual newspaper front page) and “arquivo”. Decide what to do with those.
12. Show an error message if the login/password is wrong.
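
For item 10, here is a minimal sketch of how section filtering might work, assuming it is applied to the list of (section title, articles) tuples that parse_index returns. The helper name filter_feeds and the EXCLUDED_SECTIONS set are hypothetical; they are not part of the recipe or of calibre.

```python
# Hypothetical helper, not part of the recipe: drops unwanted sections
# from the (section_title, articles) list that parse_index returns.
EXCLUDED_SECTIONS = {'campinas', 'ribeirao', 'comida'}  # assumed section names


def filter_feeds(feeds, excluded=EXCLUDED_SECTIONS):
    """Return only the feeds whose lowercased title is not in `excluded`."""
    return [(title, articles) for title, articles in feeds
            if title.lower() not in excluded]


# Made-up data in the same shape that parse_index produces.
feeds = [
    ('brasil', [{'title': 'A', 'url': 'http://example.com/a'}]),
    ('Campinas', [{'title': 'B', 'url': 'http://example.com/b'}]),
    ('comida', [{'title': 'C', 'url': 'http://example.com/c'}]),
]
kept = filter_feeds(feeds)
```

In the recipe, the call would presumably go right before `return feeds`, so the excluded sections are never downloaded.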

Having said all that, I am glad it works as it is.

Fellow Brazilians, send in your feedback and lend a hand.

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

class FSP(BasicNewsRecipe):

    title      = u'Folha de S\xE3o Paulo - Printed Edition'
    __author__ = 'fluzao'
    description = u'Folha de S\xE3o Paulo - Printed Edition (UOL subscription required)'
    INDEX = 'http://www1.folha.uol.com.br/fsp/indices/'
    language = 'pt'
    no_stylesheets = True
    max_articles_per_feed  = 30
    remove_javascript     = True
    needs_subscription = True
    remove_tags_before = dict(name='b')
    remove_tags_after  = dict(name='!--/NOTICIA--')
    remove_attributes = ['height','width']
    masthead_url = 'http://f.i.uol.com.br/fsp/furniture/images/lgo-fsp-430x50-ffffff.gif'
        
    # this solves the problem with truncated content in Kindle
    conversion_options = {'linearize_tables' : True}
	
    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None and self.password is not None:
            br.open('https://acesso.uol.com.br/login.html')
            br.form = br.forms().next()
            br['user']   = self.username
            br['pass'] = self.password
            raw = br.submit().read()
##            if 'Please try again' in raw:
##                raise Exception('Your username and password are incorrect')
        return br


    def parse_index(self):
        soup = self.index_to_soup(self.INDEX)
        feeds = []
        articles = []
        section_title = "Preambulo"
        for post in soup.findAll('a'):
            # if name=True => new section
            strpost = str(post)
            if strpost.startswith('<a name'):
                if articles:
                    feeds.append((section_title, articles))
                    self.log()
                    self.log('--> new section found, creating old section feed: ', section_title)
                section_title = post['name']
                articles = []
                self.log('--> new section title:   ', section_title)
            if strpost.startswith('<a href'):
                url = post['href']
                if url.startswith('/fsp'):
                    url = 'http://www1.folha.uol.com.br'+url
                    title = self.tag_to_string(post)
                    self.log()
                    self.log('--> post:  ', post)
                    self.log('--> url:   ', url)
                    self.log('--> title: ', title)
                    articles.append({'title':title, 'url':url})
        if articles:
            feeds.append((section_title, articles))
        return feeds
Old 09-27-2011, 05:09 PM   #2
fluzao
Please find below a working version of the Folha de São Paulo printed edition recipe. I believe it is ready to be included in the next update cycle of calibre.

Please note that this recipe allows UOL subscribers to access the full content of the newspaper. The existing recipe doesn't require a subscription but covers less content, so I believe it shouldn't be deleted.

I've addressed some of the issues mentioned in the previous post:

2. Section names come from <a name=""> attributes and are sometimes truncated (e.g. Ilustrada is shown as ilustra). Should be easy to fix with a dictionary. DONE
6. Get masthead. DONE
7. Find the current cover and use it as cover object. DONE
8. Fix the name to Folha de São Paulo, with ~. DONE
11. The first three articles are usually “capa” (which is the website cover), “fac-simile da capa” (which is the actual newspaper front-page) and “arquivo”. Decide what to do with those. DONE

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

class FSP(BasicNewsRecipe):

    title      = u'Folha de S\xE3o Paulo - Jornal'
    __author__ = 'fluzao'
    description = u'Printed edition contents. UOL subscription required (Folha subscription currently not supported).' + \
                  u' [Conte\xfado completo da edi\xe7\xe3o impressa. Somente para assinantes UOL.]'
    INDEX = 'http://www1.folha.uol.com.br/fsp/indices/'
    language = 'pt'
    no_stylesheets = True
    max_articles_per_feed  = 30
    remove_javascript     = True
    needs_subscription = True
    remove_tags_before = dict(name='b')
    remove_tags_after  = dict(name='!--/NOTICIA--')
    remove_attributes = ['height','width']
    masthead_url = 'http://f.i.uol.com.br/fsp/furniture/images/lgo-fsp-430x50-ffffff.gif'

    # fixes the problem with the section names
    section_dict = {'cotidian' : 'cotidiano', 'ilustrad': 'ilustrada', \
                    'quadrin': 'quadrinhos' , 'opiniao' : u'opini\xE3o', \
                    'ciencia' : u'ci\xeancia' , 'saude' : u'sa\xfade', \
                    'ribeirao' : u'ribeir\xE3o' , 'equilibrio' : u'equil\xedbrio'}

    # this solves the problem with truncated content in Kindle
    conversion_options = {'linearize_tables' : True}
	
    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None and self.password is not None:
            br.open('https://acesso.uol.com.br/login.html')
            br.form = br.forms().next()
            br['user']   = self.username
            br['pass'] = self.password
            raw = br.submit().read()
##            if 'Please try again' in raw:
##                raise Exception('Your username and password are incorrect')
        return br


    def parse_index(self):
        soup = self.index_to_soup(self.INDEX)
        cover = None
        feeds = []
        articles = []
        section_title = "Preambulo"
        for post in soup.findAll('a'):
            # if name=True => new section
            strpost = str(post)
            if strpost.startswith('<a name'):
                if articles:
                    feeds.append((section_title, articles))
                    self.log()
                    self.log('--> new section found, creating old section feed: ', section_title)
                section_title = post['name']
                if section_title in self.section_dict:
                    section_title = self.section_dict[section_title]
                articles = []
                self.log('--> new section title:   ', section_title)
            if strpost.startswith('<a href'):
                url = post['href']
                if url.startswith('/fsp'):
                    url = 'http://www1.folha.uol.com.br'+url
                    title = self.tag_to_string(post)
                    self.log()
                    self.log('--> post:  ', post)
                    self.log('--> url:   ', url)
                    self.log('--> title: ', title)
                    articles.append({'title':title, 'url':url})
        if articles:
            feeds.append((section_title, articles))

        # keeping the front page url
        minha_capa = feeds[0][1][1]['url']

        # removing the 'Preambulo' section
        del feeds[0]
        
        # creating the url for the cover image
        coverurl = feeds[0][1][0]['url']
        coverurl = coverurl.replace('/opiniao/fz', '/images/cp')
        coverurl = coverurl.replace('01.htm', '.jpg')
        self.cover_url = coverurl

        # inserting the cover page as the first article (nicer for kindle users)
        feeds.insert(0,(u'primeira p\xe1gina', [{'title':u'Primeira p\xe1gina' , 'url':minha_capa}]))
        return feeds
Attached Files
File Type: zip folhadesaopaulo_printed.zip (1.6 KB, 77 views)
Old 09-28-2011, 02:00 PM   #3
fluzao
3. Get rid of the copyright footer and the “Texto Anterior” and “Próximo Texto” bits. DONE

Improved recipe (also attached):

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

import re

class FSP(BasicNewsRecipe):

    title      = u'Folha de S\xE3o Paulo'
    __author__ = 'fluzao'
    description = u'Printed edition contents. UOL subscription required (Folha subscription currently not supported).' + \
                  u' [Conte\xfado completo da edi\xe7\xe3o impressa. Somente para assinantes UOL.]'
    INDEX = 'http://www1.folha.uol.com.br/fsp/indices/'
    language = 'pt'
    no_stylesheets = True
    max_articles_per_feed  = 40
    remove_javascript     = True
    needs_subscription = True
    remove_tags_before = dict(name='b')
    remove_tags  = [dict(name='td', attrs={'align':'center'})]
    remove_attributes = ['height','width']
    masthead_url = 'http://f.i.uol.com.br/fsp/furniture/images/lgo-fsp-430x50-ffffff.gif'

    # fixes the problem with the section names
    section_dict = {'cotidian' : 'cotidiano', 'ilustrad': 'ilustrada', \
                    'quadrin': 'quadrinhos' , 'opiniao' : u'opini\xE3o', \
                    'ciencia' : u'ci\xeancia' , 'saude' : u'sa\xfade', \
                    'ribeirao' : u'ribeir\xE3o' , 'equilibrio' : u'equil\xedbrio'}

    # this solves the problem with truncated content in Kindle
    conversion_options = {'linearize_tables' : True}

    # this bit removes the footer where there are links for Proximo Texto, Texto Anterior,
    #    Indice e Comunicar Erros
    preprocess_regexps = [(re.compile(r'<BR><BR>Texto Anterior:.*<!--/NOTICIA-->',
                                      re.DOTALL|re.IGNORECASE), lambda match: r''),
                          (re.compile(r'<BR><BR>Pr&oacute;ximo Texto:.*<!--/NOTICIA-->',
                                      re.DOTALL|re.IGNORECASE), lambda match: r'')]  
	
    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None and self.password is not None:
            br.open('https://acesso.uol.com.br/login.html')
            br.form = br.forms().next()
            br['user']   = self.username
            br['pass'] = self.password
            raw = br.submit().read()
##            if 'Please try again' in raw:
##                raise Exception('Your username and password are incorrect')
        return br


    def parse_index(self):
        soup = self.index_to_soup(self.INDEX)
        cover = None
        feeds = []
        articles = []
        section_title = "Preambulo"
        for post in soup.findAll('a'):
            # if name=True => new section
            strpost = str(post)
            if strpost.startswith('<a name'):
                if articles:
                    feeds.append((section_title, articles))
                    self.log()
                    self.log('--> new section found, creating old section feed: ', section_title)
                section_title = post['name']
                if section_title in self.section_dict:
                    section_title = self.section_dict[section_title]
                articles = []
                self.log('--> new section title:   ', section_title)
            if strpost.startswith('<a href'):
                url = post['href']
                if url.startswith('/fsp'):
                    url = 'http://www1.folha.uol.com.br'+url
                    title = self.tag_to_string(post)
                    self.log()
                    self.log('--> post:  ', post)
                    self.log('--> url:   ', url)
                    self.log('--> title: ', title)
                    articles.append({'title':title, 'url':url})
        if articles:
            feeds.append((section_title, articles))

        # keeping the front page url
        minha_capa = feeds[0][1][1]['url']

        # removing the 'Preambulo' section
        del feeds[0]
        
        # creating the url for the cover image
        coverurl = feeds[0][1][0]['url']
        coverurl = coverurl.replace('/opiniao/fz', '/images/cp')
        coverurl = coverurl.replace('01.htm', '.jpg')
        self.cover_url = coverurl

        # inserting the cover page as the first article (nicer for kindle users)
        feeds.insert(0,(u'primeira p\xe1gina', [{'title':u'Primeira p\xe1gina' , 'url':minha_capa}]))
        return feeds
Attached Files
File Type: zip folhadesaopaulo_printed.zip (1.8 KB, 61 views)

Last edited by fluzao; 09-28-2011 at 02:10 PM. Reason: max_articles_per_feed fix (2 to 40)
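
To see what the footer pattern in the recipe above does, here is a small standalone check using plain `re`, outside calibre. The sample markup is made up for illustration and the real Folha pages may differ. Note that `.*` with `re.DOTALL` is greedy, so everything up to the last `<!--/NOTICIA-->` marker is removed.

```python
import re

# Same pattern as the first entry of preprocess_regexps in the recipe.
footer_re = re.compile(r'<BR><BR>Texto Anterior:.*<!--/NOTICIA-->',
                       re.DOTALL | re.IGNORECASE)

# Made-up page fragment; the real markup on the site may differ.
sample = ('<p>Corpo do artigo.</p>'
          '<br><br>Texto Anterior: <a href="fz01.htm">Outro texto</a>\n'
          '<br><br>Pr&oacute;ximo Texto: <a href="fz03.htm">Mais um</a>\n'
          '<!--/NOTICIA--><p>rodape</p>')

# calibre pairs each pattern with a function of the match object;
# the same callable style is used here with re.sub.
cleaned = footer_re.sub(lambda match: '', sample)
```

Because of `re.IGNORECASE`, the lowercase `<br>` tags in the sample still match the uppercase `<BR>` in the pattern.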
Old 11-09-2011, 07:07 AM   #4
luis.nando
Member
luis.nando began at the beginning.
 
Posts: 10
Karma: 18
Join Date: Aug 2011
Device: Kindle 3
Not Working

It suddenly stopped working two days ago (same error on two different computers). The message is:

calibre, version 0.8.25
Spoiler:
Code:
ERROR: Erro ao converter: <b>Falha</b>: Obter notícias de Folha de São Paulo

Obter notícias de Folha de São Paulo
Resolved conversion options
calibre version: 0.8.25
{'asciiize': False,
 'author_sort': None,
 'authors': None,
 'base_font_size': 0,
 'book_producer': None,
 'change_justification': 'original',
 'chapter': None,
 'chapter_mark': 'pagebreak',
 'comments': None,
 'cover': None,
 'debug_pipeline': None,
 'dehyphenate': True,
 'delete_blank_paragraphs': True,
 'disable_font_rescaling': False,
 'dont_download_recipe': False,
 'dont_split_on_page_breaks': True,
 'duplicate_links_in_toc': False,
 'enable_heuristics': False,
 'epub_flatten': False,
 'extra_css': None,
 'extract_to': None,
 'fix_indents': True,
 'flow_size': 260,
 'font_size_mapping': None,
 'format_scene_breaks': True,
 'html_unwrap_factor': 0.4,
 'input_encoding': None,
 'input_profile': <calibre.customize.profiles.InputProfile object at 0x04EAC950>,
 'insert_blank_line': False,
 'insert_blank_line_size': 0.5,
 'insert_metadata': False,
 'isbn': None,
 'italicize_common_cases': True,
 'keep_ligatures': False,
 'language': None,
 'level1_toc': None,
 'level2_toc': None,
 'level3_toc': None,
 'line_height': 0,
 'linearize_tables': False,
 'lrf': False,
 'margin_bottom': 5.0,
 'margin_left': 5.0,
 'margin_right': 5.0,
 'margin_top': 5.0,
 'markup_chapter_headings': True,
 'max_toc_links': 50,
 'minimum_line_height': 120.0,
 'no_chapters_in_toc': False,
 'no_default_epub_cover': False,
 'no_inline_navbars': False,
 'no_svg_cover': False,
 'output_profile': <calibre.customize.profiles.OutputProfile object at 0x04EACB30>,
 'page_breaks_before': None,
 'password': '<redacted>',
 'prefer_metadata_cover': False,
 'preserve_cover_aspect_ratio': False,
 'pretty_print': True,
 'pubdate': None,
 'publisher': None,
 'rating': None,
 'read_metadata_from_opf': None,
 'remove_fake_margins': True,
 'remove_first_image': False,
 'remove_paragraph_spacing': False,
 'remove_paragraph_spacing_indent_size': 1.5,
 'renumber_headings': True,
 'replace_scene_breaks': '',
 'series': None,
 'series_index': None,
 'smarten_punctuation': False,
 'sr1_replace': '',
 'sr1_search': '',
 'sr2_replace': '',
 'sr2_search': '',
 'sr3_replace': '',
 'sr3_search': '',
 'tags': None,
 'test': False,
 'timestamp': None,
 'title': None,
 'title_sort': None,
 'toc_filter': None,
 'toc_threshold': 6,
 'unsmarten_punctuation': False,
 'unwrap_lines': True,
 'use_auto_toc': False,
 'username': '<redacted>',
 'verbose': 2}
InputFormatPlugin: Recipe Input running

--> post:   <a href="/fsp/">capa</a>
--> url:    http://www1.folha.uol.com.br/fsp/
--> title:  capa

--> post:   <a href="/fsp/arquivo.htm">arquivo</a>
--> url:    http://www1.folha.uol.com.br/fsp/arquivo.htm
--> title:  arquivo

--> new section found, creating old section feed:  Preambulo
--> new section title:    ciência
Python function terminated unexpectedly
  list index out of range (Error Code: 1)
Traceback (most recent call last):
  File "site.py", line 132, in main
  File "site.py", line 109, in run_entry_point
  File "site-packages\calibre\utils\ipc\worker.py", line 187, in main
  File "site-packages\calibre\gui2\convert\gui_conversion.py", line 25, in gui_convert
  File "site-packages\calibre\ebooks\conversion\plumber.py", line 949, in run
  File "site-packages\calibre\customize\conversion.py", line 204, in __call__
  File "site-packages\calibre\web\feeds\input.py", line 105, in convert
  File "site-packages\calibre\web\feeds\news.py", line 824, in download
  File "site-packages\calibre\web\feeds\news.py", line 961, in build_index
  File "c:\users\consul~1\appdata\local\temp\calibre_0.8.25_tmp_31lyds\kcvodl_recipes\recipe0.py", line 93, in parse_index
    coverurl = feeds[0][1][0]['url']
IndexError: list index out of range

Last edited by Starson17; 11-09-2011 at 10:26 AM.
Old 11-10-2011, 12:49 PM   #5
fluzao
It is working again. They had a bug in their index page for a couple of days.
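
Reading the log in post #4, it looks like the index page listed only "capa" and "arquivo" and no articles in any section, so after `del feeds[0]` the list was empty and `feeds[0][1][0]` raised the IndexError. Below is a rough sketch of a guard that fails with a readable message instead. The helper name split_front_matter is hypothetical and is not part of the recipe.

```python
def split_front_matter(feeds):
    """Hypothetical helper mirroring the tail of parse_index.

    Takes the (section_title, articles) list built by the loop and returns
    (front_page_url, remaining_feeds). Raises ValueError with a readable
    message instead of an IndexError when the index page is incomplete.
    """
    if not feeds or len(feeds[0][1]) < 2:
        raise ValueError('index page has no usable front-matter section')
    # second link of the first section, as in the recipe
    front_page_url = feeds[0][1][1]['url']
    remaining = feeds[1:]
    if not remaining:
        raise ValueError('index page lists no sections with articles')
    return front_page_url, remaining


# Feeds as they appear to have been on the day of the failure (from the log).
broken = [('Preambulo', [
    {'title': 'capa', 'url': 'http://www1.folha.uol.com.br/fsp/'},
    {'title': 'arquivo', 'url': 'http://www1.folha.uol.com.br/fsp/arquivo.htm'},
])]

try:
    split_front_matter(broken)
    message = None
except ValueError as err:
    message = str(err)
```

With a check like this, a broken index page would produce a clear error in the calibre log rather than a bare "list index out of range".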

Quote:
Originally Posted by luis.nando
It suddenly stopped working two days ago (same error on two different computers) [...]
Old 11-11-2011, 05:16 AM   #6
fluzao
Worked yesterday, didn't work today. I'm trying to understand what is going on, but I guess they are making changes to the website. Will keep you posted.
Old 11-13-2011, 09:02 PM   #7
fluzao
Ok guys, problem solved. Fixed - and slightly improved - recipe below and attached. Kovid, please update it in the database when you have a chance.

Code:
from calibre.web.feeds.news import BasicNewsRecipe

import re

class FSP(BasicNewsRecipe):

    title      = u'Folha de S\xE3o Paulo'
    __author__ = 'fluzao'
    description = u'Printed edition contents. UOL subscription required (Folha subscription currently not supported).' + \
                  u' [Conte\xfado completo da edi\xe7\xe3o impressa. Somente para assinantes UOL.]'

    #found this to be the easiest place to find the index page (13-Nov-2011).
    #  searching for the "Indice Geral" link
    HOMEPAGE = 'http://www1.folha.uol.com.br/fsp/'
    masthead_url = 'http://f.i.uol.com.br/fsp/furniture/images/lgo-fsp-430x50-ffffff.gif'

    language = 'pt'
    no_stylesheets = True
    max_articles_per_feed  = 40
    remove_javascript     = True
    needs_subscription = True

    remove_tags_before = dict(name='p')
    remove_tags  = [dict(name='td', attrs={'align':'center'})]
    remove_attributes = ['height','width']
    # fixes the problem with the section names
    section_dict = {'cotidian' : 'cotidiano', 'ilustrad': 'ilustrada', \
                    'quadrin': 'quadrinhos' , 'opiniao' : u'opini\xE3o', \
                    'ciencia' : u'ci\xeancia' , 'saude' : u'sa\xfade', \
                    'ribeirao' : u'ribeir\xE3o' , 'equilibrio' : u'equil\xedbrio', \
                    'imoveis' : u'im\xf3veis', 'negocios' : u'neg\xf3cios', \
                    'veiculos' : u've\xedculos', 'corrida' : 'folha corrida'}

    # this solves the problem with truncated content in Kindle
    conversion_options = {'linearize_tables' : True}

    # this bit removes the footer where there are links for Proximo Texto, Texto Anterior,
    #    Indice e Comunicar Erros
    preprocess_regexps = [(re.compile(r'<!--/NOTICIA-->.*Comunicar Erros</a>',
                                      re.DOTALL|re.IGNORECASE), lambda match: r'')]

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None and self.password is not None:
            br.open('https://acesso.uol.com.br/login.html')
            br.form = br.forms().next()
            br['user']   = self.username
            br['pass'] = self.password
            br.submit().read()
##            if 'Please try again' in raw:
##                raise Exception('Your username and password are incorrect')
        return br


    def parse_index(self):
        #Searching for the index page on the HOMEPAGE
        hpsoup = self.index_to_soup(self.HOMEPAGE)
        indexref = hpsoup.find('a', href=re.compile('^indices.*'))
        self.log('--> tag containing the today s index: ', indexref)
        if indexref is None:
            # fail with a readable message if the homepage layout changes
            raise ValueError('Could not find the index link on ' + self.HOMEPAGE)
        INDEX = indexref['href']
        INDEX = 'http://www1.folha.uol.com.br/fsp/'+INDEX
        self.log('--> INDEX after extracting href and adding prefix: ', INDEX)
        # ... and taking the opportunity to get the cover image link
        # (find() returns None when nothing matches, so test the tag
        #  before reading its href)
        covertag = hpsoup.find('a', href=re.compile('^cp.*'))
        if covertag is not None:
            coverurl = covertag['href']
            self.log('--> tag containing the today s cover: ', coverurl)
            coverurl = coverurl.replace('htm', 'jpg')
            coverurl = 'http://www1.folha.uol.com.br/fsp/images/'+coverurl
            self.log('--> coverurl after extracting href and adding prefix: ', coverurl)
            self.cover_url = coverurl
        
        #soup = self.index_to_soup(self.INDEX)
        soup = self.index_to_soup(INDEX)

        feeds = []
        articles = []
        section_title = "Preambulo"
        for post in soup.findAll('a'):
            # if name=True => new section
            strpost = str(post)
            if strpost.startswith('<a name'):
                if articles:
                    feeds.append((section_title, articles))
                    self.log()
                    self.log('--> new section found, creating old section feed: ', section_title)
                section_title = post['name']
                if section_title in self.section_dict:
                    section_title = self.section_dict[section_title]
                articles = []
                self.log('--> new section title:   ', section_title)
            if strpost.startswith('<a href'):
                url = post['href']
                #this bit is kept if they ever go back to the old format (pre Nov-2011)
                if url.startswith('/fsp'):
                    url = 'http://www1.folha.uol.com.br'+url
                #
                if url.startswith('http://www1.folha.uol.com.br/fsp'):
                    #url = 'http://www1.folha.uol.com.br'+url
                    title = self.tag_to_string(post)
                    self.log()
                    self.log('--> post:  ', post)
                    self.log('--> url:   ', url)
                    self.log('--> title: ', title)
                    articles.append({'title':title, 'url':url})
        if articles:
            feeds.append((section_title, articles))

        # keeping the front page url
        minha_capa = feeds[0][1][1]['url']

        # removing the first section (now called 'top')
        del feeds[0]

        # inserting the cover page as the first article (nicer for kindle users)
        feeds.insert(0,(u'primeira p\xe1gina', [{'title':u'Primeira p\xe1gina' , 'url':minha_capa}]))
        return feeds
Attached Files
File Type: zip folhadesaopaulo_sub.zip (2.0 KB, 87 views)
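
The cover URL step in parse_index can also be tried on its own. This is only a sketch: the function name cover_image_url and the example href are made up, and the real href on the homepage may look different. Unlike the `replace('htm', 'jpg')` call in the recipe, this version swaps only a trailing `.htm` extension.

```python
import re

FSP_BASE = 'http://www1.folha.uol.com.br/fsp/'


def cover_image_url(href):
    """Build the cover image URL from the relative href of the cover link.

    Returns None when no href was found. Only a trailing '.htm'
    extension is replaced with '.jpg'.
    """
    if not href:
        return None
    return FSP_BASE + 'images/' + re.sub(r'\.htm$', '.jpg', href)


# Hypothetical href; the actual file name pattern on the site is an assumption.
example = cover_image_url('cp13112011.htm')
```

Returning None for a missing href lets the caller simply skip setting cover_url when the homepage has no cover link.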