Old 08-20-2014, 12:30 PM   #1
knowledgecrawler
Member
Posts: 12
Karma: 10
Join Date: Aug 2014
Device: kindle
Recipe turning some punctuation marks into non-printable characters

Hi

I am trying to write a recipe that downloads news from a government site, pib.nic.in.
While tweaking it, I ran into the following problem.

When ebook-convert downloads the news, it turns some punctuation marks into junk characters.

Here is the code
Spoiler:

from __future__ import with_statement
__license__ = 'GPL 3'
__copyright__ = '2014, Amit <amitkp.ias@gmail.com>'

from calibre.web.feeds.news import BasicNewsRecipe

class My_Feeds(BasicNewsRecipe):
    title = 'PIB Daily'
    language = 'en_IN'
    oldest_article = 1.2
    __author__ = 'Amit'
    max_articles_per_feed = 100
    no_stylesheets = True
    remove_javascript = True
    center_navbar = True
    use_embedded_content = False

    remove_empty_feeds = True
    keep_only_tags = [
        dict(id=['ministry']),
        dict(attrs={'class':['contentdiv']})
    ]

    def preprocess_raw_html(self, raw, url):
        return raw.replace('lang=EN-US', 'lang="en_US"').replace('lang=EN-IN', 'lang="en_IN"')

    def parse_index(self):
        feeds = []
        current_section = 'Section'
        current_articles = []
        current_articles.append({'url':'http://pib.nic.in/newsite/efeatures.aspx?relid=108697',
                                 'title':'Climate Change Issues Need Better Attention',
                                 'date': '',
                                 'description':''})
        feeds.append((current_section, current_articles))
        return feeds


The original HTML had:
Spoiler:

countries to “walk the Talk” in this regard


After running the recipe, I get:
Spoiler:

countries to “walk the Talk� in this regard



How can I fix it?
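
A possible way to narrow this down, assuming the junk character comes from the page bytes being decoded with the wrong encoding (that is a guess, not something confirmed here): the small standalone script below decodes the same bytes with a few candidate encodings and counts how many U+FFFD replacement characters each one produces. The helper names `decode_report` and `best_encoding` are made up for this sketch and are not part of calibre; the sample sentence is the one from the article.

```python
# -*- coding: utf-8 -*-
# Hypothetical diagnostic helpers; these are NOT calibre functions.

REPLACEMENT = u'\ufffd'  # the "junk" character shown when decoding fails


def decode_report(raw_bytes, encodings=('utf-8', 'cp1252')):
    """Decode raw_bytes with each candidate encoding and
    return {encoding: number of U+FFFD characters produced}."""
    report = {}
    for enc in encodings:
        text = raw_bytes.decode(enc, errors='replace')
        report[enc] = text.count(REPLACEMENT)
    return report


def best_encoding(raw_bytes, encodings=('utf-8', 'cp1252')):
    """Return the candidate that yields the fewest replacement characters
    (the first candidate wins a tie)."""
    report = decode_report(raw_bytes, encodings)
    return min(encodings, key=lambda enc: report[enc])


# The sentence from the article, with curly quotes U+201C and U+201D.
sample = u'countries to \u201cwalk the Talk\u201d in this regard'
utf8_bytes = sample.encode('utf-8')
cp1252_bytes = sample.encode('cp1252')

# In UTF-8 the closing quote ends in byte 0x9D, which Python's cp1252
# codec leaves undefined, so only the closing quote is turned into U+FFFD
# when UTF-8 bytes are read as cp1252.
print(decode_report(utf8_bytes))
print(decode_report(cp1252_bytes))
print(best_encoding(utf8_bytes), best_encoding(cp1252_bytes))
```

To try this on the real page, the raw bytes of the article URL would need to be saved to a file and passed in place of the sample bytes. If one encoding clearly wins, setting the recipe's `encoding` class attribute to that value (for example `encoding = 'utf-8'`) may be worth trying, though whether that resolves this particular page is untested.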