Hi
I am trying to write a recipe for downloading news from the government site pib.nic.in.
While tweaking it, I ran into this problem:
when ebook-convert downloads the news, it turns some
punctuation into junk characters.
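To show the kind of corruption I mean, here is a minimal sketch that is independent of calibre. The sample string, the garble helper, and the UTF-8/cp1252 pairing are all my assumptions for illustration; I have not confirmed which encodings are actually involved on the site.

# -*- coding: utf-8 -*-
# Hypothetical sample text, not copied from the pib.nic.in page.
original = u'India\u2019s \u201cgreen plan'


def garble(text, real_encoding='utf-8', wrong_encoding='cp1252'):
    """Encode text with one charset, then decode the bytes with another.

    This mimics a page that is served in one encoding but read as a
    different one (assumed pairing, see the note above).
    """
    return text.encode(real_encoding).decode(wrong_encoding)


garbled = garble(original)
print(repr(original))
print(repr(garbled))

# Reversing the two steps recovers the text, which suggests the bytes
# themselves are intact and only the decoding step picked the wrong charset.
repaired = garbled.encode('cp1252').decode('utf-8')
print(repaired == original)

If that guess is right, the curly apostrophe and quote each turn into a run of three unrelated characters, which looks a lot like the junk I am getting.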
Here is the code:
from __future__ import with_statement
__license__ = 'GPL 3'
__copyright__ = '2014, Amit <amitkp.ias@gmail.com>'

from calibre.web.feeds.news import BasicNewsRecipe


class My_Feeds(BasicNewsRecipe):
    title = 'PIB Daily'
    language = 'en_IN'
    oldest_article = 1.2
    __author__ = 'Amit'
    max_articles_per_feed = 100
    no_stylesheets = True
    remove_javascript = True
    center_navbar = True
    use_embedded_content = False
    remove_empty_feeds = True
    keep_only_tags = [
        dict(id=['ministry']),
        dict(attrs={'class': ['contentdiv']}),
    ]

    def preprocess_raw_html(self, raw, url):
        return raw.replace('lang=EN-US', 'lang="en_US"').replace('lang=EN-IN', 'lang="en_IN"')

    def parse_index(self):
        feeds = []
        current_section = 'Section'
        current_articles = []
        current_articles.append({
            'url': 'http://pib.nic.in/newsite/efeatures.aspx?relid=108697',
            'title': 'Climate Change Issues Need Better Attention',
            'date': '',
            'description': '',
        })
        feeds.append((current_section, current_articles))
        return feeds
Original HTML had
after running the recipe
How can I fix it?