#17
Member
Posts: 21
Karma: 10
Join Date: May 2016
Device: Kindle Paper White
Hi guys,
Why don't you use the recipe created by Euler Alves and Alex Mitrani, which is already included in the latest version of calibre (no UOL login required)? Since it had not been updated, I easily improved it by adding new columnists and sections. It works perfectly.
#18
Member
Posts: 22
Karma: 20
Join Date: Aug 2011
Device: Kindle 3
Their recipe is great, you are right. However, a lot of poor journalistic material goes onto the Folha website that does not make it into the printed version. Also, AFAIK there is no way to isolate what went into the printed version from the aggregated RSS articles.
Thus, I can see a use for both recipes; they are complementary.
#19
Member
Posts: 21
Karma: 10
Join Date: May 2016
Device: Kindle Paper White
Luis, you are right about the printed version issue, but the quality of the articles is great; no "bad material" is coming through.
Anyway, here is my improved version of the original recipe. Cover, sections and columnists updated. Code:
from calibre.web.feeds.news import BasicNewsRecipe
from datetime import datetime, timedelta
from calibre.ebooks.BeautifulSoup import Tag,BeautifulSoup
from calibre.utils.magick import Image, PixelWand
from urllib2 import Request, urlopen, URLError

class FolhaOnline(BasicNewsRecipe):
    THUMBALIZR_API = '' # ---->Get your at http://www.thumbalizr.com/ and put here
    LANGUAGE = 'pt_br'
    language = 'pt_BR'
    LANGHTM = 'pt-br'
    ENCODING = 'cp1252'
    ENCHTM = 'iso-8859-1'
    directionhtm = 'ltr'
    requires_version = (0,7,47)
    news = True
    title = u'Folha de S\xE3o Paulo improved'
    __author__ = 'Euler Alves and Alex Mitrani, improved by Bola de Fogo'
    description = u'Brazilian news from Folha de S\xE3o Paulo'
    publisher = u'Folha de S\xE3o Paulo'
    category = 'news, rss'
    oldest_article = 4
    max_articles_per_feed = 200
    summary_length = 1000
    remove_javascript = True
    no_stylesheets = True
    use_embedded_content = False
    remove_empty_feeds = True
    timefmt = ' [%d %b %Y (%a)]'
    html2lrf_options = [
        '--comment', description
        ,'--category', category
        ,'--publisher', publisher
    ]
    html2epub_options = 'publisher="' + publisher + '"\ncomments="' + description + '"\ntags="' + category + '"'
    hoje = datetime.now()
    pubdate = hoje.strftime('%a, %d %b')
    if hoje.hour<6:
        hoje = hoje-timedelta(days=1)
    CAPA = 'http://img.kiosko.net/'+hoje.strftime('%Y')+'/'+hoje.strftime('%m')+'/'+hoje.strftime('%d')+'/br/br_folha_spaulo.200.jpg'
    SCREENSHOT = 'http://www1.folha.uol.com.br/'
    cover_margins = (0,0,'white')
    masthead_url = 'http://f.i.uol.com.br/fsp/furniture/images/lgo-fsp-430x50-ffffff.gif'
    keep_only_tags = [
        dict(name='div', attrs={'id':'articleNew'}),
        dict(name='article', id='news'),
    ]
    feeds = [
        (u'Em cima da hora', u'http://feeds.folha.uol.com.br/emcimadahora/rss091.xml')
        ,(u'Poder', u'http://feeds.folha.uol.com.br/poder/rss091.xml')
        ,(u'Cotidiano', u'http://feeds.folha.uol.com.br/cotidiano/rss091.xml')
        ,(u'Mercado', u'http://feeds.folha.uol.com.br/mercado/rss091.xml')
        ,(u'Mundo', u'http://feeds.folha.uol.com.br/mundo/rss091.xml')
        ,(u'Esporte', u'http://feeds.folha.uol.com.br/esporte/rss091.xml')
        ,(u'Comida', u'http://feeds.folha.uol.com.br/comida/rss091.xml')
        ,(u'Tec', u'http://feeds.folha.uol.com.br/tec/rss091.xml')
        ,(u'Ilustrada', u'http://feeds.folha.uol.com.br/ilustrada/rss091.xml')
        ,(u'Ambiente', u'http://feeds.folha.uol.com.br/ambiente/rss091.xml')
        ,(u'Opiniao', u'http://feeds.folha.uol.com.br/opiniao/rss091.xml')
        ,(u'Ci\xEAncia', u'http://feeds.folha.uol.com.br/ciencia/rss091.xml')
        ,(u'Equil\xEDbrio e Sa\xFAde', u'http://feeds.folha.uol.com.br/equilibrioesaude/rss091.xml')
        ,(u'Elio Gaspari', u'http://feeds.folha.uol.com.br/colunas/eliogaspari/rss091.xml')
        ,(u'Tati Bernardi', u'http://feeds.folha.uol.com.br/colunas/tatibernardi/rss091.xml')
        ,(u'PVC', u'http://feeds.folha.uol.com.br/colunas/pvc/rss091.xml')
        ,(u'Clóvis Rossi', u'http://feeds.folha.uol.com.br/colunas/clovisrossi/rss091.xml')
        ,(u'Hélio Schwartsman', u'http://feeds.folha.uol.com.br/colunas/helioschwartsman/rss091.xml')
        ,(u'Humberto Luiz Peron', u'http://feeds.folha.uol.com.br/colunas/futebolnarede/rss091.xml')
        ,(u'João Pereira Coutinho', u'http://feeds.folha.uol.com.br/colunas/joaopereiracoutinho/rss091.xml')
        ,(u'Cony', u'http://feeds.folha.uol.com.br/colunas/carlosheitorcony/rss091.xml')
        ,(u'Juca', u'http://feeds.folha.uol.com.br/colunas/jucakfouri/rss091.xml')
        ,(u'Viniciu Torres Freitas', u'http://feeds.folha.uol.com.br/colunas/viniciustorres/rss091.xml')
        ,(u'Monica Bergamo', u'http://feeds.folha.uol.com.br/colunas/monicabergamo/rss091.xml')
        ,(u'Vinicius Mota', u'http://feeds.folha.uol.com.br/colunas/viniciusmota/rss091.xml')
        ,(u'Bernardo Guimaraes', u'http://aeconomianoseculo21.blogfolha.uol.com.br/feed/')
        ,(u'Tostao', u'http://feeds.folha.uol.com.br/colunas/tostao/rss091.xml')
        ,(u'Valdo Cruz', u'http://feeds.folha.uol.com.br/colunas/valdocruz/rss091.xml')
    ]
    conversion_options = {
        'title' : title
        ,'comments' : description
        ,'publisher' : publisher
        ,'tags' : category
        ,'language' : LANGUAGE
        ,'linearize_tables': True
    }

    def preprocess_html(self, soup):
        for item in soup.findAll(style=True):
            del item['style']
        if not soup.find(attrs={'http-equiv':'Content-Language'}):
            meta0 = Tag(soup,'meta',[("http-equiv","Content-Language"),("content",self.LANGHTM)])
            soup.head.insert(0,meta0)
        if not soup.find(attrs={'http-equiv':'Content-Type'}):
            meta1 = Tag(soup,'meta',[("http-equiv","Content-Type"),("content","text/html; charset="+self.ENCHTM)])
            soup.head.insert(0,meta1)
        return soup

    def postprocess_html(self, soup, first):
        # process all the images. assumes that the new html has the correct path
        for tag in soup.findAll(lambda tag: tag.name.lower()=='img' and 'src' in tag):
            iurl = tag['src']
            img = Image()
            img.open(iurl)
            width, height = img.size
            print 'img is: ', iurl, 'width is: ', width, 'height is: ', height
            if img < 0:
                raise RuntimeError('Out of memory')
            pw = PixelWand()
            if(width > height and width > 590) :
                print 'Rotate image'
                img.rotate(pw, -90)
                img.save(iurl)
        return soup

    def get_cover_url(self):
        cover_url = self.CAPA
        pedido = Request(self.CAPA)
        pedido.add_header('User-agent','Mozilla/5.0 (Windows; U; Windows NT 5.1; '+self.LANGHTM+'; userid='+self.THUMBALIZR_API+') Calibre/0.8.47 (like Gecko)')
        pedido.add_header('Accept-Charset',self.ENCHTM)
        pedido.add_header('Referer',self.SCREENSHOT)
        try:
            resposta = urlopen(pedido)
            soup = BeautifulSoup(resposta)
            cover_item = soup.find('body')
            if cover_item:
                cover_url='http://api.thumbalizr.com/?api_key='+self.THUMBALIZR_API+'&url='+self.SCREENSHOT+'&width=600&quality=90'
            return cover_url
        except URLError:
            cover_url='http://api.thumbalizr.com/?api_key='+self.THUMBALIZR_API+'&url='+self.SCREENSHOT+'&width=600&quality=90'
            return cover_url
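For anyone adapting the cover logic, the dated kiosko.net URL that the recipe builds at class level can be sketched as a standalone function. This is only a minimal sketch, not part of the recipe: `cover_url_for` and `KIOSKO_TEMPLATE` are hypothetical names, and the 06:00 cutoff simply mirrors the `hoje.hour<6` check in the code.

```python
from datetime import datetime, timedelta

# Hypothetical constant: the same URL pattern the recipe assembles by string concatenation.
KIOSKO_TEMPLATE = 'http://img.kiosko.net/%Y/%m/%d/br/br_folha_spaulo.200.jpg'


def cover_url_for(now):
    """Return the front-page image URL for the edition current at `now`."""
    # Before 06:00, fall back to the previous day's edition, as the recipe does.
    edition_day = now - timedelta(days=1) if now.hour < 6 else now
    return edition_day.strftime(KIOSKO_TEMPLATE)


if __name__ == '__main__':
    print(cover_url_for(datetime.now()))
```

Taking the timestamp as a parameter, instead of calling `datetime.now()` inside, makes the early-morning fallback easy to check by hand.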
#20
Member
Posts: 22
Karma: 20
Join Date: Aug 2011
Device: Kindle 3
Dear Bola de Fogo,
Thanks for sharing. Note that the calibre devs keep an eye on these posts, as they updated to the latest version as soon as I linked to it. So perhaps it is a good idea to post the recipe in Euler's original post as well, which I couldn't find. Here is one thread about that version, though: https://www.mobileread.com/forums/sho...d.php?t=146959 Up to you if you want to leave a heads-up there. Best,
#21
Member
Posts: 21
Karma: 10
Join Date: May 2016
Device: Kindle Paper White
Luis,
Thanks for the help; my improved FSP version was included in the new calibre version released today. (Thanks to Mr. Goyal as well.)
#22
Member
Posts: 22
Karma: 20
Join Date: Aug 2011
Device: Kindle 3
We again have a broken recipe for the printed Folha version. Some API changes in get_browser() may be the culprit. It would be nice if @kovidgoyal could help out on this one, as I can't figure out how to correct it.
Below is the current broken code: Code:
from calibre.web.feeds.news import BasicNewsRecipe
import re
import datetime

class FSP(BasicNewsRecipe):
    title = u'Folha de S\xE3o Paulo'
    __author__ = 'Joao Eduardo Bertacchi'
    description = u'Printed edition contents. UOL subscription required.' + \
        u' [Conte\xfado completo da edi\xe7\xe3o impressa. Somente para assinantes UOL.]'
    today=datetime.date.today()
    masthead_url = 'http://f.i.uol.com.br/fsp/furniture/images/lgo-fsp-430x50-ffffff.gif'
    language = 'pt_BR'
    no_stylesheets = True
    max_articles_per_feed = 100
    remove_javascript = True
    needs_subscription = True
    keep_only_tags = [
        dict(name='div', id='articleNew'), dict(name='table', attrs={'class':'articleGraphic'}),
        dict(name='article', id='news'),
    ]
    publication_type = 'newspaper'
    simultaneous_downloads = 5
    remove_attributes = ['height','width']
    # The following is an attempt to fix the problem with the section names, but whenever new sections are added it can generate accentuation problems still
    section_dict = {'cotidian' : 'cotidiano', 'ilustrad': 'ilustrada',
                    'quadrin': 'quadrinhos' , 'opiniao' : u'opini\xE3o',
                    'ciencia' : u'ci\xeancia' , 'saude' : u'sa\xfade',
                    'ribeirao' : u'ribeir\xE3o' , 'equilibrio' : u'equil\xedbrio',
                    'imoveis' : u'im\xf3veis', 'negocios' : u'neg\xf3cios',
                    'veiculos' : u've\xedculos', 'corrida' : 'folha corrida',
                    'turismo':'turismo'}
    # this solves the problem with truncated content in Kindle
    conversion_options = {'linearize_tables' : True}
    extra_css = """
    #articleNew { font: 18px Times New Roman,verdana,arial; }
    img { background: none !important; float: none; margin: 0px; }
    .newstexts { list-style-type: none; height: 20px; margin: 15px 0 10px 0; }
    .newstexts.last { border-top: 1px solid #ccc; margin: 5px 0 15px 0; padding-top: 15px; }
    .newstexts li { display: inline; padding: 0 5px; }
    .newstexts li.prev { float: left; }
    .newstexts li.next { float: right; }
    .newstexts li span { width: 12px; height: 15px; display: inline-block; }
    .newstexts li.prev span { background-position: -818px -46px; }
    .newstexts li.next span { background-position: -832px -46px; }
    .newstexts li a { font: bold 12px arial, verdana, sans-serif; text-transform: uppercase; color: #999; text-decoration: none !important; }
    .newstexts li a:hover { text-decoration: underline !important }
    .headerart { font-weight: bold; }
    .title { font: bold 39px Times New Roman,verdana,arial; margin-bottom: 15px; margin-top: 10px; }
    .creditart, .origin { font: bold 12px arial, verdana, sans-serif; color: #999; margin: 0px; display: block; }
    .headerart p, .fine_line p { margin: 0 !important; }
    .fine_line { font: bold 18px Times New Roman,verdana,arial; }
    .fine_line p { margin-bottom: 18px !important; }
    .fine_line p:first-child { font-weight: normal; font-style: italic; font-size: 20px !important; }
    .eye { display: block; width: 317px; border-top: 2px solid #666; padding: 7px 0 7px; border-bottom: 2px solid #666; font-style: italic; font-weight: bold; }
    .kicker { font-weight: bold; text-transform: uppercase; font-size: 18px; font-family: Times New Roman,verdana,arial !important; }
    .blue { color: #000080; }
    .red { color: #F00; }
    .blue { color: #000080; }
    .green { color: #006400; }
    .orange { color: #FFA042; }
    .violet { color: #8A2BE2; }
    .text_footer { font-size: 15px; }
    .title_end { font-size: 23px; font-weight: bold; }
    .divisor { text-indent: -9999px; border-bottom: 1px solid #ccc; height: 1px; margin: 0; }
    .star { background: none !important; height: 15px; }
    .articleGraphic { margin-bottom: 20px; }
    """

    # This is the code for login, here a mini browser is called and id entered
    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            br.open('https://acesso.uol.com.br/login.html')
            br.form = br.forms()
            br['user'] = self.username
            br['pass'] = self.password
            br.submit().read()
        return br

    # Parsing the index webpage
    def parse_index(self):
        # In the last version, the index page became simpler:
        INDEX = 'http://www1.folha.uol.com.br/fsp/'
        self.log('--> INDEX set ', INDEX)
        soup = self.index_to_soup(INDEX)
        feeds = []
        articles = []
        section_title = u'Primeira p\xe1gina'
        for post in soup.findAll('a'):
            strpost = str(post)
            if re.match('<a href="http://www1.folha.uol.com.br/.*/"><span.class="', strpost):
                if articles:
                    feeds.append((section_title, articles))
                    self.log()
                    self.log('--> new section found, creating old section feed: ', section_title)
                # section_title = post['name']
                section_title = self.tag_to_string(post)
                if section_title in self.section_dict:
                    section_title = self.section_dict[section_title]
                articles = []
                self.log('--> new section title: ', section_title)
            elif strpost.startswith('<a href="/fsp/cp'):
                break
            elif strpost.startswith('<a href'):
                url = post['href']
                if url.startswith('http://www1.folha.uol.com.br/'):
                    title = self.tag_to_string(post)
                    self.log()
                    self.log('--> post: ', post)
                    self.log('--> url: ', url)
                    self.log('--> title: ', title)
                    articles.append({'title':title, 'url':url})
        if articles:
            feeds.append((section_title, articles))
        del feeds[0]
        return feeds
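As a side note on the `section_dict` comment in the recipe above, the lookup that `parse_index` performs can be sketched on its own. This is a minimal sketch under assumptions: `display_section` and `SECTION_NAMES` are hypothetical names, and only a few of the recipe's entries are copied here. It shows why a section missing from the table keeps its unaccented name.

```python
# Hypothetical subset of the recipe's section_dict: raw site name -> display name.
SECTION_NAMES = {
    'cotidian': u'cotidiano',
    'ilustrad': u'ilustrada',
    'opiniao': u'opini\xe3o',
    'saude': u'sa\xfade',
}


def display_section(raw, table=SECTION_NAMES):
    """Return the accented display name if known, otherwise the raw name unchanged."""
    # Same effect as the recipe's "if section_title in self.section_dict" check.
    return table.get(raw, raw)


if __name__ == '__main__':
    for name in ('opiniao', 'mercado'):
        print(repr(display_section(name)))
```

Any new section the site adds falls through to the raw name until an entry is added to the table, which matches the limitation the comment describes.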
Code:
calibre, version 3.4.0 (linux2, embedded-python: False)
Conversion error: Falha: Fetch news from Folha de São Paulo
Fetch news from Folha de São Paulo
Conversion options changed from defaults:
output_profile: 'kindle'
verbose: 2
Resolved conversion options
calibre version: 3.4.0
{'asciiize': False,
'author_sort': None,
'authors': None,
'base_font_size': 0,
'book_producer': None,
'change_justification': 'original',
'chapter': None,
'chapter_mark': 'pagebreak',
'comments': None,
'cover': None,
'debug_pipeline': None,
'dehyphenate': True,
'delete_blank_paragraphs': True,
'disable_font_rescaling': False,
'dont_compress': False,
'dont_download_recipe': False,
'duplicate_links_in_toc': False,
'embed_all_fonts': False,
'embed_font_family': None,
'enable_heuristics': False,
'expand_css': False,
'extra_css': None,
'extract_to': None,
'filter_css': None,
'fix_indents': True,
'font_size_mapping': None,
'format_scene_breaks': True,
'html_unwrap_factor': 0.4,
'input_encoding': None,
'input_profile': <calibre.customize.profiles.InputProfile object at 0x7fa8a5e11850>,
'insert_blank_line': False,
'insert_blank_line_size': 0.5,
'insert_metadata': False,
'isbn': None,
'italicize_common_cases': True,
'keep_ligatures': False,
'language': None,
'level1_toc': None,
'level2_toc': None,
'level3_toc': None,
'line_height': 0,
'linearize_tables': False,
'lrf': False,
'margin_bottom': 5.0,
'margin_left': 5.0,
'margin_right': 5.0,
'margin_top': 5.0,
'markup_chapter_headings': True,
'max_toc_links': 50,
'minimum_line_height': 120.0,
'mobi_file_type': 'old',
'mobi_ignore_margins': False,
'mobi_keep_original_images': False,
'mobi_toc_at_start': False,
'no_chapters_in_toc': False,
'no_inline_navbars': True,
'no_inline_toc': False,
'output_profile': <calibre.customize.profiles.KindleOutput object at 0x7fa8a5e11f10>,
'page_breaks_before': None,
'personal_doc': '[PDOC]',
'prefer_author_sort': False,
'prefer_metadata_cover': False,
'pretty_print': False,
'pubdate': None,
'publisher': None,
'rating': None,
'read_metadata_from_opf': None,
'remove_fake_margins': True,
'remove_first_image': False,
'remove_paragraph_spacing': False,
'remove_paragraph_spacing_indent_size': 1.5,
'renumber_headings': True,
'replace_scene_breaks': '',
'search_replace': None,
'series': None,
'series_index': None,
'share_not_sync': False,
'smarten_punctuation': False,
'sr1_replace': '',
'sr1_search': '',
'sr2_replace': '',
'sr2_search': '',
'sr3_replace': '',
'sr3_search': '',
'start_reading_at': None,
'subset_embedded_fonts': False,
'tags': None,
'test': False,
'timestamp': None,
'title': None,
'title_sort': None,
'toc_filter': None,
'toc_threshold': 6,
'toc_title': None,
'transform_css_rules': None,
'unsmarten_punctuation': False,
'unwrap_lines': True,
'use_auto_toc': False,
'verbose': 2}
InputFormatPlugin: Recipe Input running
Using custom recipe
Traceback (most recent call last):
  File "/usr/bin/calibre-parallel", line 20, in <module>
    sys.exit(main())
  File "/usr/lib/calibre/calibre/utils/ipc/worker.py", line 195, in main
    result = func(*args, **kwargs)
  File "/usr/lib/calibre/calibre/gui2/convert/gui_conversion.py", line 26, in gui_convert
    plumber.run()
  File "/usr/lib/calibre/calibre/ebooks/conversion/plumber.py", line 1088, in run
    accelerators, tdir)
  File "/usr/lib/calibre/calibre/customize/conversion.py", line 245, in __call__
    log, accelerators)
  File "/usr/lib/calibre/calibre/ebooks/conversion/plugins/recipe_input.py", line 118, in convert
    ro = recipe(opts, log, self.report_progress)
  File "/usr/lib/calibre/calibre/web/feeds/news.py", line 904, in __init__
    self.browser = self.get_browser()
  File "<string>", line 86, in get_browser
AttributeError: 'list' object has no attribute 'next'
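The AttributeError is consistent with `.next()` being called on the value returned by `br.forms()`, which in this calibre version appears to be a plain list and not an iterator. A minimal sketch of a helper that takes the first form either way follows. `first_form` is a hypothetical name, not part of calibre or mechanize, and the strings below are stand-ins for real form objects.

```python
def first_form(forms):
    """Return the first item of `forms`, or None if there is none.

    Works whether `forms` is a list or a one-shot iterator, because
    iter() accepts both and next() takes a default value.
    """
    return next(iter(forms), None)


if __name__ == '__main__':
    print(first_form(['login_form', 'search_form']))        # list input
    print(first_form(iter(['login_form', 'search_form'])))  # iterator input
    print(first_form([]))                                   # no forms at all
```

In the recipe this would presumably be used as `br.form = first_form(br.forms())`. mechanize also has `select_form(nr=0)`, which may be the simpler route, though that is untested against the UOL login page.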
#23
creator of calibre
Posts: 45,690
Karma: 28549304
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
That is already fixed in the builtin recipe.
#24
Member
Posts: 22
Karma: 20
Join Date: Aug 2011
Device: Kindle 3
Thank you! It is working perfectly...
#25
Member
Posts: 22
Karma: 20
Join Date: Aug 2011
Device: Kindle 3
Hi everyone,
We are back with a broken recipe. The error is: Quote: