Quote:
Originally Posted by bthoven
I took a look at your Thai source and modified my recipe to extract your links. I found a problem, however: the Thai text is not rendered correctly. While I can view the resulting e-book in Mobipocket Reader, and it looks like the desired e-book (the images in the articles appear correct), the text is not proper Unicode. The e-book crashes the Calibre EPUB reader and causes errors on my Kindle.
You may be able to use the recipe below on a computer running a Thai version of the operating system (I use English-language Windows 7 Professional), but I suspect you will see the same text problem, because it appears to be caused by the encoding of the source web pages:
content="text/html; charset=windows-874"
The source for
http://www.naewna.com/allnews.asp?ID=79 starts with:
Code:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-874">
whereas
http://www.thairath.co.th/rss/news.xml (for Thairath, a built-in Thai recipe which renders correctly for me) starts with:
Code:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>...
It seems likely to me that UTF-8 Thai pages render correctly when processed by Calibre, but that windows-874 Thai pages do not. The improper "Unicode" text then crashes the Calibre EPUB reader (Calibre itself continues to run; only the separate reader process crashes). A test on your computer, which I assume runs a Thai-language operating system, should determine whether my suspicion is correct.

I have added logging of the link extraction so that you can see what is happening even if extraction fails. I have built the e-book a number of times, but had one failure which I suspect was caused by some combination of corrupt Unicode characters.

I have also commented out the article editing, leaving the full article. I do not read Thai, so I did not spend time guessing what should be removed. When I looked at the article source, however, I noticed that there are not many id or class attributes on tags such as div or span, so I suspect that removing unwanted parts of the article page may be difficult.
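A quick way to see why a wrong codec assumption corrupts the text is to round-trip a Thai string through the two encodings (a minimal sketch; note that Python's name for the windows-874 codec is 'cp874'):

```python
# -*- coding: utf-8 -*-
# Thai text encoded as windows-874 (cp874 in Python) is not valid UTF-8,
# so any tool that assumes UTF-8 ends up with corrupt "Unicode" text.
thai = u'\u0e44\u0e17\u0e22'        # the word "Thai" in Thai script
raw = thai.encode('cp874')          # bytes as served by a windows-874 page

# Decoding with the correct codec round-trips cleanly:
print(raw.decode('cp874') == thai)  # True

# Decoding the same bytes as UTF-8 fails outright:
try:
    raw.decode('utf-8')
except UnicodeDecodeError:
    print('windows-874 bytes are not valid UTF-8')
```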
Please post the result of your test. If the problem is the encoding of the source pages, it may be worth submitting an enhancement request/bug report. Similar problems would probably arise for other languages that use a multi-byte non-Unicode encoding.
The recipe (note warning above regarding text rendering problems and crashing of the Calibre EPUB reader):
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class thai(BasicNewsRecipe):
    title             = u'thai'
    __author__        = 'oneillpt'
    INDEX             = 'http://www.naewna.com/allnews.asp?ID=79'
    language          = 'th'
    no_stylesheets    = True
    remove_javascript = True
    # Article editing commented out, leaving the full article:
    #remove_tags_before = dict(name='div', attrs={'class':'estructura_2col'})
    #keep_tags = [dict(name='div', attrs={'class':'estructura_2col'})]
    #remove_tags = [dict(name='div', attrs={'class':'votos estirar'}),
    #               dict(name='div', attrs={'id':'utilidades'}),
    #               dict(name='div', attrs={'class':'info_relacionada'}),
    #               dict(name='div', attrs={'class':'mod_apoyo'}),
    #               dict(name='div', attrs={'class':'contorno_f'}),
    #               dict(name='div', attrs={'class':'pestanias'}),
    #               dict(name='div', attrs={'class':'otros_webs'}),
    #               dict(name='div', attrs={'id':'pie'})
    #              ]

    def parse_index(self):
        feeds = []
        soup = self.index_to_soup(self.INDEX)
        for section in soup.findAll('body'):
            # The section title is buried in a td/font/strong chain;
            # log each step so a failed extraction is visible
            z = section.find('td', attrs={'background':'images/fa04.gif'})
            self.log('z', z)
            x = z.find('font')
            self.log('x', x)
            y = x.find('strong')
            self.log('y', y)
            section_title = self.tag_to_string(y)
            self.log('section_title(1): ', section_title)
            if section_title == "":
                section_title = u'Thai Feed'
            self.log('section_title(2): ', section_title)
            articles = []
            for post in section.findAll('a', href=True):
                url = post['href']
                if url.startswith('n'):
                    url = 'http://www.naewna.com/' + url
                title = self.tag_to_string(post)
                # Only anchors with class "style4 style15" are articles
                if str(post).find('class=') > 0:
                    klass = post['class']
                    if klass == "style4 style15":
                        self.log()
                        self.log('--> post: ', post)
                        self.log('--> url: ', url)
                        self.log('--> title: ', title)
                        self.log('--> class: ', klass)
                        articles.append({'title':title, 'url':url})
            if articles:
                feeds.append((section_title, articles))
        return feeds
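If the encoding really is the culprit, one thing that may be worth trying before filing a bug report is the `encoding` attribute of BasicNewsRecipe, which overrides the charset used when decoding downloaded pages. This is an untested sketch: I have not confirmed that the override is applied to the pages fetched here, only that the attribute exists.

```python
# Untested assumption: forcing Calibre to decode the pages with the
# Thai Windows codec might yield proper Unicode text.
class thai(BasicNewsRecipe):
    title    = u'thai'
    INDEX    = 'http://www.naewna.com/allnews.asp?ID=79'
    language = 'th'
    encoding = 'cp874'  # Python's name for windows-874
    # ... rest of the recipe as above ...
```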