Old 08-20-2014, 12:30 PM   #1
knowledgecrawler
Member
Posts: 12
Karma: 10
Join Date: Aug 2014
Device: kindle
Recipe turning some punctuation marks into non-printable characters

Hi,

I am trying to write a recipe that downloads news from a government site, pib.nic.in. While tweaking it I ran into this problem:

When ebook-convert downloads the news, it turns some punctuation into junk characters.

Here is the code
Spoiler:

from __future__ import with_statement
__license__ = 'GPL 3'
__copyright__ = '2014, Amit <amitkp.ias@gmail.com>'

from calibre.web.feeds.news import BasicNewsRecipe


class My_Feeds(BasicNewsRecipe):
    title = 'PIB Daily'
    language = 'en_IN'
    oldest_article = 1.2
    __author__ = 'Amit'
    max_articles_per_feed = 100
    no_stylesheets = True
    remove_javascript = True
    center_navbar = True
    use_embedded_content = False

    remove_empty_feeds = True
    # Keep only the ministry heading and the article body.
    keep_only_tags = [
        dict(id=['ministry']),
        dict(attrs={'class': ['contentdiv']}),
    ]

    def preprocess_raw_html(self, raw, url):
        # Quote the unquoted lang attributes found in the page source.
        return raw.replace('lang=EN-US', 'lang="en_US"').replace('lang=EN-IN', 'lang="en_IN"')

    def parse_index(self):
        # Build a single section holding one hard-coded test article.
        feeds = []
        current_section = 'Section'
        current_articles = []
        current_articles.append({
            'url': 'http://pib.nic.in/newsite/efeatures.aspx?relid=108697',
            'title': 'Climate Change Issues Need Better Attention',
            'date': '',
            'description': '',
        })
        feeds.append((current_section, current_articles))
        return feeds


The original HTML had:
Spoiler:

countries to “walk the Talk” in this regard


After running the recipe:
Spoiler:

countries to “walk the Talk� in this regard



How can I fix it?
Old 08-20-2014, 12:58 PM   #2
kovidgoyal
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Use the correct value of encoding in your recipe.
Old 08-20-2014, 11:09 PM   #3
knowledgecrawler
Quote:
Originally Posted by kovidgoyal View Post
Use the correct value of encoding in your recipe.
What value should I use here?
The HTML source has EN-IN; should I leave it as it is?
Even if I don't use preprocess_raw_html, it doesn't fix it...
Old 08-20-2014, 11:13 PM   #4
kovidgoyal
You need to figure out what the encoding of the HTML pages you are scraping is. Common choices: latin1, cp1252, utf-8.
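One way to compare the candidate encodings is to decode a sample of the raw bytes with each of them and look at the result. The sketch below is standalone Python 3, not recipe code. The sample bytes are an assumption: they show what the sentence from the first post would look like if the server really sent windows-1252, where the curly quotes are the single bytes 0x93 and 0x94. `try_decodings` is a hypothetical helper name.

```python
# Assumed sample: the sentence from the first post as windows-1252 bytes.
SAMPLE = b'countries to \x93walk the Talk\x94 in this regard'

# The three encodings suggested in the thread.
CANDIDATES = ['utf-8', 'cp1252', 'latin1']


def try_decodings(raw, candidates=CANDIDATES):
    """Decode raw with each candidate; None marks a strict decode failure."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


if __name__ == '__main__':
    for name, text in try_decodings(SAMPLE).items():
        # repr() makes control characters and failures visible.
        print(name, '->', repr(text))
```

With these assumed bytes, utf-8 fails outright, cp1252 gives the curly quotes, and latin1 decodes without error but turns the quotes into the non-printable control characters U+0093 and U+0094. A decode that succeeds is therefore not proof that the encoding is right; the output still has to be checked by eye.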
Old 08-20-2014, 11:47 PM   #5
knowledgecrawler
Quote:
Originally Posted by kovidgoyal View Post
You need to figure out what the encoding of the HTML pages you are scraping is. Common choices: latin1, cp1252, utf-8.
Found the charset:
Code:
<meta http-equiv=Content-Type content="text/html; charset=windows-1252"
Tried with:
Code:
encoding = 'cp1252'
This fixed the issue:
Code:
encoding = 'utf-8'
Kudos!

Last edited by knowledgecrawler; 08-20-2014 at 11:59 PM.
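For anyone who wants to automate the lookup of the declared charset, here is a minimal sketch in standalone Python 3 (not recipe code). The `declared_codec` helper and the regular expression are assumptions made for illustration, not part of calibre. It pulls the charset label out of a meta tag like the one quoted above and maps it to Python's codec name via `codecs.lookup`.

```python
import codecs
import re

# Matches charset=... inside a <meta> tag, with or without quotes.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def declared_codec(raw_html):
    """Return the Python codec name for the page's declared charset, or None."""
    match = META_CHARSET.search(raw_html)
    if match is None:
        return None
    label = match.group(1).decode('ascii')
    try:
        # codecs.lookup resolves aliases such as windows-1252 to cp1252.
        return codecs.lookup(label).name
    except LookupError:
        return None


if __name__ == '__main__':
    page = b'<meta http-equiv=Content-Type content="text/html; charset=windows-1252">'
    print(declared_codec(page))  # prints: cp1252
```

The declared charset is only a hint. A page can announce one encoding and be served in another, so the value still needs to be checked against the converted output, as was done in this thread.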