05-15-2016, 10:36 PM | #17 |
Member
Posts: 21
Karma: 10
Join Date: May 2016
Device: Kindle Paper White
|
Hi guys,
Why don't you use the recipe created by Euler Alves and Alex Mitrani, which is already included in the latest calibre version (no UOL login required)? Since it had not been updated, I've easily improved it by including new columnists and sections. It works perfectly. |
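Adding a section or columnist to an RSS-based recipe mostly means adding one more `(title, url)` tuple to its `feeds` list. The sketch below is only an illustration: the helper names `section_feed` and `columnist_feed` are hypothetical, and the URL pattern is assumed from the feed addresses in the recipe posted later in this thread.

```python
# Hypothetical helpers (not part of calibre): they build (title, url)
# tuples in the shape used by a recipe's `feeds` list.
BASE = 'http://feeds.folha.uol.com.br'


def section_feed(title, slug):
    # Assumed pattern for section feeds, e.g. .../poder/rss091.xml
    return (title, '{}/{}/rss091.xml'.format(BASE, slug))


def columnist_feed(title, slug):
    # Assumed pattern for columnist feeds, e.g. .../colunas/pvc/rss091.xml
    return (title, '{}/colunas/{}/rss091.xml'.format(BASE, slug))


feeds = [
    section_feed(u'Poder', 'poder'),
    section_feed(u'Mundo', 'mundo'),
    columnist_feed(u'Elio Gaspari', 'eliogaspari'),
    columnist_feed(u'Tostao', 'tostao'),
]
```

Inside a recipe class, the resulting list would simply be assigned to `feeds`; whether a given slug actually exists on the Folha feed server has to be checked by hand.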
05-16-2016, 07:14 PM | #18 |
Member
Posts: 22
Karma: 20
Join Date: Aug 2011
Device: Kindle 3
|
Their recipe is great, you are right. However, a lot of poor journalistic material goes onto the Folha website that does not creep into the printed version. Also, AFAIK there is no way of isolating what went into the printed version from the aggregated .rss articles.
Thus, I can see a use for both recipes; they are complementary. |
05-16-2016, 10:42 PM | #19 |
Member
Posts: 21
Karma: 10
Join Date: May 2016
Device: Kindle Paper White
|
Luis, you are right about the printed version issue, but the quality of the articles is great; no "bad materials" are coming through.
Anyway, here is my improved version of the original recipe. Cover, sections and columnists updated. Code:
from calibre.web.feeds.news import BasicNewsRecipe
from datetime import datetime, timedelta
from calibre.ebooks.BeautifulSoup import Tag,BeautifulSoup
from calibre.utils.magick import Image, PixelWand
from urllib2 import Request, urlopen, URLError

class FolhaOnline(BasicNewsRecipe):
    THUMBALIZR_API = '' # ---->Get your at http://www.thumbalizr.com/ and put here
    LANGUAGE = 'pt_br'
    language = 'pt_BR'
    LANGHTM = 'pt-br'
    ENCODING = 'cp1252'
    ENCHTM = 'iso-8859-1'
    directionhtm = 'ltr'
    requires_version = (0,7,47)
    news = True

    title = u'Folha de S\xE3o Paulo improved'
    __author__ = 'Euler Alves and Alex Mitrani, improved by Bola de Fogo'
    description = u'Brazilian news from Folha de S\xE3o Paulo'
    publisher = u'Folha de S\xE3o Paulo'
    category = 'news, rss'

    oldest_article = 4
    max_articles_per_feed = 200
    summary_length = 1000

    remove_javascript = True
    no_stylesheets = True
    use_embedded_content = False
    remove_empty_feeds = True
    timefmt = ' [%d %b %Y (%a)]'

    html2lrf_options = [
        '--comment', description
        ,'--category', category
        ,'--publisher', publisher
    ]

    html2epub_options = 'publisher="' + publisher + '"\ncomments="' + description + '"\ntags="' + category + '"'

    hoje = datetime.now()
    pubdate = hoje.strftime('%a, %d %b')
    if hoje.hour<6:
        hoje = hoje-timedelta(days=1)
    CAPA = 'http://img.kiosko.net/'+hoje.strftime('%Y')+'/'+hoje.strftime('%m')+'/'+hoje.strftime('%d')+'/br/br_folha_spaulo.200.jpg'
    SCREENSHOT = 'http://www1.folha.uol.com.br/'
    cover_margins = (0,0,'white')
    masthead_url = 'http://f.i.uol.com.br/fsp/furniture/images/lgo-fsp-430x50-ffffff.gif'

    keep_only_tags = [
        dict(name='div', attrs={'id':'articleNew'}),
        dict(name='article', id='news'),
    ]

    feeds = [
        (u'Em cima da hora', u'http://feeds.folha.uol.com.br/emcimadahora/rss091.xml')
        ,(u'Poder', u'http://feeds.folha.uol.com.br/poder/rss091.xml')
        ,(u'Cotidiano', u'http://feeds.folha.uol.com.br/cotidiano/rss091.xml')
        ,(u'Mercado', u'http://feeds.folha.uol.com.br/mercado/rss091.xml')
        ,(u'Mundo', u'http://feeds.folha.uol.com.br/mundo/rss091.xml')
        ,(u'Esporte', u'http://feeds.folha.uol.com.br/esporte/rss091.xml')
        ,(u'Comida', u'http://feeds.folha.uol.com.br/comida/rss091.xml')
        ,(u'Tec', u'http://feeds.folha.uol.com.br/tec/rss091.xml')
        ,(u'Ilustrada', u'http://feeds.folha.uol.com.br/ilustrada/rss091.xml')
        ,(u'Ambiente', u'http://feeds.folha.uol.com.br/ambiente/rss091.xml')
        ,(u'Opiniao', u'http://feeds.folha.uol.com.br/opiniao/rss091.xml')
        ,(u'Ci\xEAncia', u'http://feeds.folha.uol.com.br/ciencia/rss091.xml')
        ,(u'Equil\xEDbrio e Sa\xFAde', u'http://feeds.folha.uol.com.br/equilibrioesaude/rss091.xml')
        ,(u'Elio Gaspari', u'http://feeds.folha.uol.com.br/colunas/eliogaspari/rss091.xml')
        ,(u'Tati Bernardi', u'http://feeds.folha.uol.com.br/colunas/tatibernardi/rss091.xml')
        ,(u'PVC', u'http://feeds.folha.uol.com.br/colunas/pvc/rss091.xml')
        ,(u'Clóvis Rossi', u'http://feeds.folha.uol.com.br/colunas/clovisrossi/rss091.xml')
        ,(u'Hélio Schwartsman', u'http://feeds.folha.uol.com.br/colunas/helioschwartsman/rss091.xml')
        ,(u'Humberto Luiz Peron', u'http://feeds.folha.uol.com.br/colunas/futebolnarede/rss091.xml')
        ,(u'João Pereira Coutinho', u'http://feeds.folha.uol.com.br/colunas/joaopereiracoutinho/rss091.xml')
        ,(u'Cony', u'http://feeds.folha.uol.com.br/colunas/carlosheitorcony/rss091.xml')
        ,(u'Juca', u'http://feeds.folha.uol.com.br/colunas/jucakfouri/rss091.xml')
        ,(u'Viniciu Torres Freitas', u'http://feeds.folha.uol.com.br/colunas/viniciustorres/rss091.xml')
        ,(u'Monica Bergamo', u'http://feeds.folha.uol.com.br/colunas/monicabergamo/rss091.xml')
        ,(u'Vinicius Mota', u'http://feeds.folha.uol.com.br/colunas/viniciusmota/rss091.xml')
        ,(u'Bernardo Guimaraes', u'http://aeconomianoseculo21.blogfolha.uol.com.br/feed/')
        ,(u'Tostao', u'http://feeds.folha.uol.com.br/colunas/tostao/rss091.xml')
        ,(u'Valdo Cruz', u'http://feeds.folha.uol.com.br/colunas/valdocruz/rss091.xml')
    ]

    conversion_options = {
        'title' : title
        ,'comments' : description
        ,'publisher' : publisher
        ,'tags' : category
        ,'language' : LANGUAGE
        ,'linearize_tables': True
    }

    def preprocess_html(self, soup):
        for item in soup.findAll(style=True):
            del item['style']
        if not soup.find(attrs={'http-equiv':'Content-Language'}):
            meta0 = Tag(soup,'meta',[("http-equiv","Content-Language"),("content",self.LANGHTM)])
            soup.head.insert(0,meta0)
        if not soup.find(attrs={'http-equiv':'Content-Type'}):
            meta1 = Tag(soup,'meta',[("http-equiv","Content-Type"),("content","text/html; charset="+self.ENCHTM)])
            soup.head.insert(0,meta1)
        return soup

    def postprocess_html(self, soup, first):
        # process all the images. assumes that the new html has the correct path
        for tag in soup.findAll(lambda tag: tag.name.lower()=='img' and 'src' in tag):
            iurl = tag['src']
            img = Image()
            img.open(iurl)
            width, height = img.size
            print 'img is: ', iurl, 'width is: ', width, 'height is: ', height
            if img < 0:
                raise RuntimeError('Out of memory')
            pw = PixelWand()
            if(width > height and width > 590) :
                print 'Rotate image'
                img.rotate(pw, -90)
                img.save(iurl)
        return soup

    def get_cover_url(self):
        cover_url = self.CAPA
        pedido = Request(self.CAPA)
        pedido.add_header('User-agent','Mozilla/5.0 (Windows; U; Windows NT 5.1; '+self.LANGHTM+'; userid='+self.THUMBALIZR_API+') Calibre/0.8.47 (like Gecko)')
        pedido.add_header('Accept-Charset',self.ENCHTM)
        pedido.add_header('Referer',self.SCREENSHOT)
        try:
            resposta = urlopen(pedido)
            soup = BeautifulSoup(resposta)
            cover_item = soup.find('body')
            if cover_item:
                cover_url='http://api.thumbalizr.com/?api_key='+self.THUMBALIZR_API+'&url='+self.SCREENSHOT+'&width=600&quality=90'
            return cover_url
        except URLError:
            cover_url='http://api.thumbalizr.com/?api_key='+self.THUMBALIZR_API+'&url='+self.SCREENSHOT+'&width=600&quality=90'
            return cover_url
|
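The cover-date rule in the recipe (`hoje`, `CAPA`) can be tried out on its own. The sketch below is a standalone restatement, assuming only what the recipe shows: before 06:00 the previous day's date goes into the kiosko.net URL. The function name `cover_url` is hypothetical, and the code does not check that an image exists at the resulting address.

```python
from datetime import datetime, timedelta


def cover_url(now):
    """Build the kiosko.net cover URL for a given moment.

    Mirrors the recipe's rule: before 06:00, use yesterday's date.
    """
    day = now - timedelta(days=1) if now.hour < 6 else now
    return ('http://img.kiosko.net/' + day.strftime('%Y/%m/%d') +
            '/br/br_folha_spaulo.200.jpg')


# An early-morning run points at the previous day's cover.
early = cover_url(datetime(2016, 5, 17, 3, 0))
# A run later in the day uses the current date.
later = cover_url(datetime(2016, 5, 17, 9, 0))
```

Taking `now` as a parameter, rather than calling `datetime.now()` at class-definition time as the recipe does, makes the rule easy to test for any date, including month boundaries.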
05-17-2016, 01:08 AM | #20 |
Member
Posts: 22
Karma: 20
Join Date: Aug 2011
Device: Kindle 3
|
Dear Bola de Fogo,
Thanks for sharing. Note that the calibre devs keep an eye on these posts, as they updated to the latest version as soon as I linked to it. So perhaps it is a good idea to post the recipe in Euler's original post as well, which I couldn't find. Here is one thread about that version, though: https://www.mobileread.com/forums/sho...d.php?t=146959 Up to you if you want to leave a heads-up there. Best, |
05-20-2016, 06:47 AM | #21 |
Member
Posts: 21
Karma: 10
Join Date: May 2016
Device: Kindle Paper White
|
Luis,
Thanks for the help. My improved FSP version was included in the new calibre version released today (thanks to Mr Goyal as well). |
08-22-2017, 05:57 AM | #22 |
Member
Posts: 22
Karma: 20
Join Date: Aug 2011
Device: Kindle 3
|
We once again have a broken recipe for the printed Folha version. Some API changes in get_browser() may be the culprit. It would be nice if @kovidgoyal could help out on this one, as I can't figure out how to correct it.
Below is the current broken code: Code:
from calibre.web.feeds.news import BasicNewsRecipe

import re
import datetime

class FSP(BasicNewsRecipe):

    title = u'Folha de S\xE3o Paulo'
    __author__ = 'Joao Eduardo Bertacchi'
    description = u'Printed edition contents. UOL subscription required.' + \
        u' [Conte\xfado completo da edi\xe7\xe3o impressa. Somente para assinantes UOL.]'

    today=datetime.date.today()

    masthead_url = 'http://f.i.uol.com.br/fsp/furniture/images/lgo-fsp-430x50-ffffff.gif'

    language = 'pt_BR'
    no_stylesheets = True
    max_articles_per_feed = 100
    remove_javascript = True
    needs_subscription = True

    keep_only_tags = [
        dict(name='div', id='articleNew'),
        dict(name='table', attrs={'class':'articleGraphic'}),
        dict(name='article', id='news'),
    ]

    publication_type = 'newspaper'
    simultaneous_downloads = 5
    remove_attributes = ['height','width']

    # The following is an attempt to fix the problem with the section names, but whenever new sections are added it can generate accentuation problems still
    section_dict = {'cotidian' : 'cotidiano', 'ilustrad': 'ilustrada',
                    'quadrin': 'quadrinhos' , 'opiniao' : u'opini\xE3o',
                    'ciencia' : u'cincia' , 'saude' : u'sa\xfade',
                    'ribeirao' : u'ribeir\xE3o' , 'equilibrio' : u'equil\xedbrio',
                    'imoveis' : u'im\xf3veis', 'negocios' : u'neg\xf3cios',
                    'veiculos' : u've\xedculos', 'corrida' : 'folha corrida',
                    'turismo':'turismo'}

    # this solves the problem with truncated content in Kindle
    conversion_options = {'linearize_tables' : True}

    extra_css = """
    #articleNew { font: 18px Times New Roman,verdana,arial; }
    img { background: none !important; float: none; margin: 0px; }
    .newstexts { list-style-type: none; height: 20px; margin: 15px 0 10px 0; }
    .newstexts.last { border-top: 1px solid #ccc; margin: 5px 0 15px 0; padding-top: 15px; }
    .newstexts li { display: inline; padding: 0 5px; }
    .newstexts li.prev { float: left; }
    .newstexts li.next { float: right; }
    .newstexts li span { width: 12px; height: 15px; display: inline-block; }
    .newstexts li.prev span { background-position: -818px -46px; }
    .newstexts li.next span { background-position: -832px -46px; }
    .newstexts li a { font: bold 12px arial, verdana, sans-serif; text-transform: uppercase; color: #999; text-decoration: none !important; }
    .newstexts li a:hover { text-decoration: underline !important }
    .headerart { font-weight: bold; }
    .title { font: bold 39px Times New Roman,verdana,arial; margin-bottom: 15px; margin-top: 10px; }
    .creditart, .origin { font: bold 12px arial, verdana, sans-serif; color: #999; margin: 0px; display: block; }
    .headerart p, .fine_line p { margin: 0 !important; }
    .fine_line { font: bold 18px Times New Roman,verdana,arial; }
    .fine_line p { margin-bottom: 18px !important; }
    .fine_line p:first-child { font-weight: normal; font-style: italic; font-size: 20px !important; }
    .eye { display: block; width: 317px; border-top: 2px solid #666; padding: 7px 0 7px; border-bottom: 2px solid #666; font-style: italic; font-weight: bold; }
    .kicker { font-weight: bold; text-transform: uppercase; font-size: 18px; font-family: Times New Roman,verdana,arial !important; }
    .blue { color: #000080; }
    .red { color: #F00; }
    .blue { color: #000080; }
    .green { color: #006400; }
    .orange { color: #FFA042; }
    .violet { color: #8A2BE2; }
    .text_footer { font-size: 15px; }
    .title_end { font-size: 23px; font-weight: bold; }
    .divisor { text-indent: -9999px; border-bottom: 1px solid #ccc; height: 1px; margin: 0; }
    .star { background: none !important; height: 15px; }
    .articleGraphic { margin-bottom: 20px; }
    """

    # This is the code for login, here a mini browser is called and id entered
    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            br.open('https://acesso.uol.com.br/login.html')
            br.form = br.forms()
            br['user'] = self.username
            br['pass'] = self.password
            br.submit().read()
        return br

    # Parsing the index webpage
    def parse_index(self):
        # In the last version, the index page became simpler:
        INDEX = 'http://www1.folha.uol.com.br/fsp/'
        self.log('--> INDEX set ', INDEX)
        soup = self.index_to_soup(INDEX)

        feeds = []
        articles = []
        section_title = u'Primeira p\xe1gina'
        for post in soup.findAll('a'):
            strpost = str(post)
            if re.match('<a href="http://www1.folha.uol.com.br/.*/"><span.class="', strpost):
                if articles:
                    feeds.append((section_title, articles))
                    self.log()
                    self.log('--> new section found, creating old section feed: ', section_title)
                # section_title = post['name']
                section_title = self.tag_to_string(post)
                if section_title in self.section_dict:
                    section_title = self.section_dict[section_title]
                articles = []
                self.log('--> new section title: ', section_title)
            elif strpost.startswith('<a href="/fsp/cp'):
                break
            elif strpost.startswith('<a href'):
                url = post['href']
                if url.startswith('http://www1.folha.uol.com.br/'):
                    title = self.tag_to_string(post)
                    self.log()
                    self.log('--> post: ', post)
                    self.log('--> url: ', url)
                    self.log('--> title: ', title)
                    articles.append({'title':title, 'url':url})
        if articles:
            feeds.append((section_title, articles))

        del feeds[0]
        return feeds

Code:
calibre, version 3.4.0 (linux2, embedded-python: False)
Conversion error: Falha: Fetch news from Folha de São Paulo

Fetch news from Folha de São Paulo
Conversion options changed from defaults:
  output_profile: 'kindle'
  verbose: 2
Resolved conversion options
calibre version: 3.4.0
{'asciiize': False, 'author_sort': None, 'authors': None, 'base_font_size': 0, 'book_producer': None, 'change_justification': 'original', 'chapter': None, 'chapter_mark': 'pagebreak', 'comments': None, 'cover': None, 'debug_pipeline': None, 'dehyphenate': True, 'delete_blank_paragraphs': True, 'disable_font_rescaling': False, 'dont_compress': False, 'dont_download_recipe': False, 'duplicate_links_in_toc': False, 'embed_all_fonts': False, 'embed_font_family': None, 'enable_heuristics': False, 'expand_css': False, 'extra_css': None, 'extract_to': None, 'filter_css': None, 'fix_indents': True, 'font_size_mapping': None, 'format_scene_breaks': True, 'html_unwrap_factor': 0.4, 'input_encoding': None, 'input_profile': <calibre.customize.profiles.InputProfile object at 0x7fa8a5e11850>, 'insert_blank_line': False, 'insert_blank_line_size': 0.5, 'insert_metadata': False, 'isbn': None, 'italicize_common_cases': True, 'keep_ligatures': False, 'language': None, 'level1_toc': None, 'level2_toc': None, 'level3_toc': None, 'line_height': 0, 'linearize_tables': False, 'lrf': False, 'margin_bottom': 5.0, 'margin_left': 5.0, 'margin_right': 5.0, 'margin_top': 5.0, 'markup_chapter_headings': True, 'max_toc_links': 50, 'minimum_line_height': 120.0, 'mobi_file_type': 'old', 'mobi_ignore_margins': False, 'mobi_keep_original_images': False, 'mobi_toc_at_start': False, 'no_chapters_in_toc': False, 'no_inline_navbars': True, 'no_inline_toc': False, 'output_profile': <calibre.customize.profiles.KindleOutput object at 0x7fa8a5e11f10>, 'page_breaks_before': None, 'personal_doc': '[PDOC]', 'prefer_author_sort': False, 'prefer_metadata_cover': False, 'pretty_print': False, 'pubdate': None, 'publisher': None, 'rating': None, 'read_metadata_from_opf': None, 'remove_fake_margins': True, 'remove_first_image': False, 'remove_paragraph_spacing': False, 'remove_paragraph_spacing_indent_size': 1.5, 'renumber_headings': True, 'replace_scene_breaks': '', 'search_replace': None, 'series': None, 'series_index': None, 'share_not_sync': False, 'smarten_punctuation': False, 'sr1_replace': '', 'sr1_search': '', 'sr2_replace': '', 'sr2_search': '', 'sr3_replace': '', 'sr3_search': '', 'start_reading_at': None, 'subset_embedded_fonts': False, 'tags': None, 'test': False, 'timestamp': None, 'title': None, 'title_sort': None, 'toc_filter': None, 'toc_threshold': 6, 'toc_title': None, 'transform_css_rules': None, 'unsmarten_punctuation': False, 'unwrap_lines': True, 'use_auto_toc': False, 'verbose': 2}
InputFormatPlugin: Recipe Input running
Using custom recipe
Traceback (most recent call last):
  File "/usr/bin/calibre-parallel", line 20, in <module>
    sys.exit(main())
  File "/usr/lib/calibre/calibre/utils/ipc/worker.py", line 195, in main
    result = func(*args, **kwargs)
  File "/usr/lib/calibre/calibre/gui2/convert/gui_conversion.py", line 26, in gui_convert
    plumber.run()
  File "/usr/lib/calibre/calibre/ebooks/conversion/plumber.py", line 1088, in run
    accelerators, tdir)
  File "/usr/lib/calibre/calibre/customize/conversion.py", line 245, in __call__
    log, accelerators)
  File "/usr/lib/calibre/calibre/ebooks/conversion/plugins/recipe_input.py", line 118, in convert
    ro = recipe(opts, log, self.report_progress)
  File "/usr/lib/calibre/calibre/web/feeds/news.py", line 904, in __init__
    self.browser = self.get_browser()
  File "<string>", line 86, in get_browser
AttributeError: 'list' object has no attribute 'next'
|
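The last line of the traceback suggests that the recipe actually run called `.next()` on a value that is now a plain list, inside `get_browser()`. The code posted above does not show that call, so this reading is an assumption. The sketch below uses stand-in values rather than a real browser object. It reproduces the error and shows one generic way to take the first item that works for both a list and an iterator. It is not the fix applied in the builtin recipe, and `first_form` is a hypothetical name.

```python
def first_form(forms):
    """Return the first item, whether `forms` is a list or an iterator."""
    return next(iter(forms))


# Stand-ins only: strings in place of real form objects.
forms_as_list = ['login_form', 'search_form']  # a list, as the traceback implies
forms_as_iter = iter(forms_as_list)            # an iterator, assumed older behaviour

# Calling .next() on a list raises the error seen in the log.
try:
    forms_as_list.next()
except AttributeError as err:
    message = str(err)

picked_from_list = first_form(forms_as_list)
picked_from_iter = first_form(forms_as_iter)
```

Using `next(iter(...))` avoids depending on which of the two return types the browser object provides.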
08-22-2017, 07:20 AM | #23 |
creator of calibre
Posts: 43,796
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
That is already fixed in the builtin recipe.
|
08-22-2017, 07:29 AM | #24 |
Member
Posts: 22
Karma: 20
Join Date: Aug 2011
Device: Kindle 3
|
Thank you! It is working perfectly...
|
10-24-2017, 04:36 AM | #25 | |
Member
Posts: 22
Karma: 20
Join Date: Aug 2011
Device: Kindle 3
|
Hi everyone,
We are back with a broken recipe. The error is: Quote:
|
|