Old 03-08-2011, 11:05 PM   #12
oneillpt
Device: Kindle 3 (cracked screen!); PW1; Oasis
Quote:
Originally Posted by bthoven View Post
I'm trying to extract articles from this page (sorry the content is in Thai language)

http://www.naewna.com/allnews.asp?ID=79

When viewing the source, I need to extract the article content from the article links on lines 418-717.

Each article link would be something like

http://www.naewna.com/news.asp?ID=241411 (or some other ID numbers)

Could you guide me?

Thanks in advance.

I took a look at your Thai source and modified my recipe to extract your links. I found a problem, however: the Thai text is not rendered correctly. I can view the resulting e-book in MobiPocket Reader, and it looks like the desired e-book (the images in the articles appear correct), but the text is not proper Unicode. The e-book crashes the Calibre EPUB reader and causes errors on my Kindle.

It is possible that you will be able to use the recipe below on a computer running a Thai version of the operating system (I use English-language Windows 7 Professional), but I suspect you will see the same text problem, because it appears to stem from the encoding of the source web pages: content="text/html; charset=windows-874".
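If the charset is the culprit, one workaround worth trying is to tell Calibre explicitly what encoding the pages use: BasicNewsRecipe has an encoding attribute, and setting encoding = 'cp874' (Python's name for windows-874) in the recipe might let the pages be decoded properly. The underlying issue can be illustrated in plain Python (the sample word is my own; this is just a sketch of the decoding behaviour, not part of the recipe):

```python
# The Thai word for "Thai", as a windows-874 page would serve it.
thai = u'\u0e44\u0e17\u0e22'
raw = thai.encode('cp874')      # cp874 is Python's codec for windows-874

# Decoded with the right codec, the bytes round-trip to proper Unicode:
assert raw.decode('cp874') == thai

# Decoded with the wrong codec, they become mojibake -- the kind of
# improper "Unicode" that can upset downstream consumers:
assert raw.decode('latin-1') != thai
```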

The source for http://www.naewna.com/allnews.asp?ID=79 starts with:
Code:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-874">
whereas http://www.thairath.co.th/rss/news.xml (the feed for Thairath, a built-in Thai recipe that renders correctly for me) starts with:
Code:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
	<title>...
It seems likely to me that UTF-8 Thai pages render correctly but windows-874 Thai pages do not when processed by Calibre. The improper "Unicode" text then crashes the Calibre EPUB reader (Calibre itself continues to run; only the separate reader process crashes). A test on your computer, which I assume runs a Thai-language operating system, should determine whether my suspicion is correct.

I have added logging of the link extraction so that you can see the links even if extraction fails. I built the e-book a number of times, with one failure that I suspect was caused by some combination of corrupt Unicode characters. I have also commented out the article editing to leave the full article: I do not read Thai, so I did not spend time guessing what should be removed. When I looked at the article source, however, I noticed that there were few id or class attributes on tags such as div or span, so removing unwanted parts of the article page may prove difficult.

Please post the result of your test. If the problem is the encoding of the source pages, it may be worth submitting this as an enhancement request/bug report. Similar problems would probably arise for other languages that use multi-byte non-Unicode encodings.
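One way to confirm which encoding a page declares, before pointing a recipe at it, is to sniff the declaration from the page head. A minimal sketch (the helper name is mine, not part of Calibre; it handles both the HTML meta form and the XML prolog form shown above):

```python
import re

def sniff_charset(head_html, default='utf-8'):
    """Return the charset declared in an HTML meta tag or XML prolog."""
    m = re.search(r'(?:charset|encoding)=["\']?([\w-]+)', head_html,
                  re.IGNORECASE)
    return m.group(1) if m else default

naewna = '<meta http-equiv="Content-Type" content="text/html; charset=windows-874">'
thairath = '<?xml version="1.0" encoding="UTF-8"?>'
print(sniff_charset(naewna))    # windows-874
print(sniff_charset(thairath))  # UTF-8
```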

The recipe (note the warning above about text rendering problems and crashes of the Calibre EPUB reader):
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class thai(BasicNewsRecipe):

    title      = u'thai'
    __author__ = 'oneillpt'
    INDEX = 'http://www.naewna.com/allnews.asp?ID=79'
    language = 'th'

    # Article editing commented out (selectors left over from another
    # recipe) so the full article is kept:
    #remove_tags_before = dict(name='div', attrs={'class':'estructura_2col'})
    #keep_tags = [dict(name='div', attrs={'class':'estructura_2col'})]
    #remove_tags = [dict(name='div', attrs={'class':'votos estirar'}),
        #dict(name='div', attrs={'id':'utilidades'}),
        #dict(name='div', attrs={'class':'info_relacionada'}),
        #dict(name='div', attrs={'class':'mod_apoyo'}),
        #dict(name='div', attrs={'class':'contorno_f'}),
        #dict(name='div', attrs={'class':'pestanias'}),
        #dict(name='div', attrs={'class':'otros_webs'}),
        #dict(name='div', attrs={'id':'pie'})
        #]
    no_stylesheets = True
    remove_javascript = True

    def parse_index(self):
        feeds = []
        soup = self.index_to_soup(self.INDEX)
        for section in soup.findAll('body'):
            # The section title lives in
            # <td background="images/fa04.gif"><font><strong>...</strong></font></td>
            z = section.find('td', attrs={'background':'images/fa04.gif'})
            self.log('z', z)
            x = z.find('font')
            self.log('x', x)
            y = x.find('strong')
            self.log('y', y)
            section_title = self.tag_to_string(y)
            self.log('section_title(1): ', section_title)
            if section_title == "":
                section_title = u'Thai Feed'
            self.log('section_title(2): ', section_title)
            articles = []
            for post in section.findAll('a', href=True):
                url = post['href']
                # Article links are relative, e.g. 'news.asp?ID=241411'
                if url.startswith('n'):
                    url = 'http://www.naewna.com/' + url
                    title = self.tag_to_string(post)
                    if str(post).find('class=') > 0:
                        klass = post['class']
                        # Only anchors with this class are article links
                        if klass == "style4 style15":
                            self.log('--> post:  ', post)
                            self.log('--> url:   ', url)
                            self.log('--> title: ', title)
                            self.log('--> class: ', klass)
                            articles.append({'title':title, 'url':url})
            if articles:
                feeds.append((section_title, articles))
        return feeds
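The link-filtering rule in parse_index (keep anchors whose href starts with 'n' and whose class is exactly "style4 style15") can be exercised outside Calibre with the standard library alone. The HTML fragment below is made up for illustration; only the class name and URL shape come from the real page:

```python
from html.parser import HTMLParser

class LinkFilter(HTMLParser):
    """Collect article links the way parse_index does: keep <a> tags whose
    href starts with 'n' and whose class is exactly "style4 style15"."""
    def __init__(self):
        super().__init__()
        self.articles = []
        self._url = None  # set while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == 'a' and a.get('href', '').startswith('n')
                and a.get('class') == 'style4 style15'):
            self._url = 'http://www.naewna.com/' + a['href']

    def handle_data(self, data):
        if self._url:
            self.articles.append({'title': data.strip(), 'url': self._url})
            self._url = None

html = ('<body><a class="style4 style15" href="news.asp?ID=241411">'
        'Article one</a><a href="allnews.asp?ID=79">Index link</a></body>')
p = LinkFilter()
p.feed(html)
print(p.articles)
# [{'title': 'Article one', 'url': 'http://www.naewna.com/news.asp?ID=241411'}]
```

The second anchor is dropped for two reasons: its href does not start with 'n' and it has no matching class attribute.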