How to convert newspaper which do not have RSS feed? - Page 2

oneillpt · 03-09-2011, 10:55 AM

Quote:

Originally Posted by bthoven

Hi oneillpt,

I tried to fetch the news by using your script, here is the error on my side, not sure what to do next:

calibre, version 0.7.48
ERROR: Conversion Error: <b>Failed</b>: Fetch news from Jermsak_Naewna

Fetch news from Jermsak_Naewna
Resolved conversion options
calibre version: 0.7.48
...
--> class: style4 style15
Python function terminated unexpectedly
'class' (Error Code: 1)
Traceback (most recent call last):
File "site.py", line 103, in main
File "site.py", line 85, in run_entry_point
File "site-packages\calibre\utils\ipc\worker.py", line 110, in main
File "site-packages\calibre\gui2\convert\gui_conversion.py", line 25, in gui_convert
File "site-packages\calibre\ebooks\conversion\plumber.py", line 904, in run
File "site-packages\calibre\customize\conversion.py", line 204, in __call__
File "site-packages\calibre\web\feeds\input.py", line 105, in convert
File "site-packages\calibre\web\feeds\news.py", line 734, in download
File "site-packages\calibre\web\feeds\news.py", line 871, in build_index
File "c:\users\chotec~1\appdata\local\temp\calibre_0.7. 48_tmp_bm8qsi\calibre_0.7.48_spw2ws_recipes\recipe 0.py", line 55, in parse_index
klass = post['class']
File "site-packages\calibre\ebooks\BeautifulSoup.py", line 518, in __getitem__
KeyError: 'class'

I found that a change to the source page caused a similar problem for me today, and the revised recipe below fixed it. Looking at your log, though, I see the same invalid "diamond" characters that I get, whereas the log from the built-in Thai recipes shows proper Thai characters. Try this revised recipe anyway and see whether the book looks right apart from the corrupted text. If it does, the next step is to report the character encoding problem. The e-book still crashes the Calibre reader, but it can be viewed in MobiPocket Reader.

I also built the e-book under Ubuntu Linux to see whether the problem was specific to Windows. The same invalid "diamond" characters appeared, but in this case the e-book did not crash the Calibre reader. The images, however, were not visible in the Calibre reader, whereas they were visible in MobiPocket Reader under Windows.

Code:

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

class thai(BasicNewsRecipe):

    title      = u'thai'
    __author__ = u'oneillpt'
    #masthead_url = 'http://www.elpais.com/im/tit_logo_int.gif'
    INDEX = 'http://www.naewna.com/allnews.asp?ID=79'
    language = 'th'

    #remove_tags_before = dict(name='div', attrs={'class':'estructura_2col'})
    #keep_tags = [dict(name='div', attrs={'class':'estructura_2col'})]
    #remove_tags = [dict(name='div', attrs={'class':'votos estirar'}),
        #dict(name='div', attrs={'id':'utilidades'}),
        #dict(name='div', attrs={'class':'info_relacionada'}),
        #dict(name='div', attrs={'class':'mod_apoyo'}),
        #dict(name='div', attrs={'class':'contorno_f'}),
        #dict(name='div', attrs={'class':'pestanias'}),
        #dict(name='div', attrs={'class':'otros_webs'}),
        #dict(name='div', attrs={'id':'pie'})
        #]
    no_stylesheets = True
    remove_javascript     = True

    def parse_index(self):
        articles = []
        soup = self.index_to_soup(self.INDEX)
        cover = None
        feeds = []
        for section in soup.findAll('body'):
            section_title = self.tag_to_string(section.find('h1'))
            z = section.find('td', attrs={'background':'images/fa04.gif'})
            self.log('z', z)
            x = z.find('font')
            self.log('x', x)
            y = x.find('strong')
            self.log('y', y)
            section_title = self.tag_to_string(y)
            self.log('section_title(1): ', section_title)
            if section_title == "":
              section_title = u'Thai Feed'
            self.log('section_title(2): ', section_title)
            articles = []
            for post in section.findAll('a', href=True):
                self.log('--> p: ', post)
                url = post['href']
                self.log('--> u: ', url)
                if url.startswith('n'):
                  url = 'http://www.naewna.com/'+url
                  self.log('--> u: ', url)
                  title = self.tag_to_string(post)
                  self.log('--> t: ', title)
                  # Tag.get() returns None when the attribute is missing,
                  # so links without a class no longer raise KeyError
                  klass = post.get('class')
                  if klass is not None:
                    self.log('--> k: ', klass)
                    if klass == "style4 style15":
                      self.log()
                      self.log('--> post:  ', post)
                      self.log('--> url:   ', url)
                      self.log('--> title: ', title)
                      self.log('--> class: ', klass)
                      articles.append({'title':title, 'url':url})
            if articles:
                feeds.append((section_title, articles))
        return feeds
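
The link filtering in parse_index can also be tried outside Calibre. What follows is only a minimal sketch, assuming Python 3 and the standard-library html.parser rather than Calibre's BeautifulSoup. The names ArticleLinkParser, extract_articles and SAMPLE, and the sample hrefs, are hypothetical and not taken from the real page. It keeps only links whose href starts with 'n' and whose class is exactly "style4 style15", reading attributes with .get() so that a link without a class cannot raise KeyError.

```python
from html.parser import HTMLParser


class ArticleLinkParser(HTMLParser):
    """Collect title/url dicts for <a> tags that look like article links."""

    BASE = 'http://www.naewna.com/'

    def __init__(self):
        super().__init__()
        self.articles = []
        self._current = None  # dict for the <a> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attrs = dict(attrs)
        href = attrs.get('href') or ''
        # .get() returns None for a missing attribute, so an <a> without
        # a class is simply skipped instead of raising KeyError.
        if href.startswith('n') and attrs.get('class') == 'style4 style15':
            self._current = {'title': '', 'url': self.BASE + href}

    def handle_data(self, data):
        # Accumulate link text only while inside a matching <a>.
        if self._current is not None:
            self._current['title'] += data

    def handle_endtag(self, tag):
        if tag == 'a' and self._current is not None:
            self._current['title'] = self._current['title'].strip()
            self.articles.append(self._current)
            self._current = None


def extract_articles(html):
    """Return the list of article dicts found in an HTML string."""
    parser = ArticleLinkParser()
    parser.feed(html)
    parser.close()
    return parser.articles


# Hypothetical sample markup; the real index page will differ.
SAMPLE = '''
<body>
<a href="index.asp">Home</a>
<a href="news.asp?ID=1" class="style4 style15">First story</a>
<a href="news.asp?ID=2">No class attribute</a>
<a href="http://example.com/x" class="style4 style15">External</a>
</body>
'''

if __name__ == '__main__':
    for article in extract_articles(SAMPLE):
        print(article['title'], article['url'])
```

Only the second link in SAMPLE passes both checks. The others are skipped because the href does not start with 'n' or the class is missing.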

oneillpt · 03-09-2011, 08:58 PM

I'm now fairly sure that the source encoding is the problem. I saved the source of the index page and of the first two links followed, then converted these Windows-874 files to UTF-8, edited INDEX in the recipe to point to the local UTF-8 converted copy, and edited the links in the index page to point to the local UTF-8 converted copies for the two links copied.
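
For anyone repeating this test, the conversion step can be scripted. This is a minimal sketch, assuming Python 3 and that the saved pages really are cp874 (the Python codec name for Windows-874). The function name convert_to_utf8 and the file names in the usage comment are hypothetical.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding='cp874'):
    """Re-encode a saved page from source_encoding to UTF-8.

    Raises UnicodeDecodeError if the file contains bytes that are
    not valid in source_encoding.
    """
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode('utf-8'))
    return text


# Usage (hypothetical file names):
# convert_to_utf8('index_saved.html', 'index_utf8.html')
```

Reading and writing bytes explicitly avoids depending on the platform's default text encoding, which differs between Windows and Linux.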

I've attached the epub and mobi versions of the e-book built for you to check. The text now looks Thai to me!

bthoven · 03-09-2011, 11:46 PM

Hi oneillpt,

I just tried your latest script, and Calibre fetched the content without error. As you said, Calibre will crash when trying to open the content.

I don't know why they are still using the 874 code page instead of more popular encodings.

The thai.mobi and thai.epub files display Thai correctly.

Thanks again for your kind help.

oneillpt · 03-24-2011, 07:24 PM

Quote:

Originally Posted by bthoven

Hi oneillpt,

I just tried your latest script, and Calibre fetched the content without error. As you said, Calibre will crash when trying to open the content.

I don't know why they are still using the 874 code page instead of more popular encodings.

The thai.mobi and thai.epub files display Thai correctly.

Thanks again for your kind help.

And now the solution: the recipe just needs one added line specifying the encoding: encoding = 'cp874'

So the recipe now starts with:

Code:

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

class thai(BasicNewsRecipe):

    title      = u'thai'
    __author__ = u'oneillpt'
    #masthead_url = 'http://www.elpais.com/im/tit_logo_int.gif'
    #  (you may want to select a masthead image from your source here)
    INDEX = 'http://www.naewna.com/allnews.asp?ID=79'
    encoding              = 'cp874'
    language = 'th_TH'
    oldest_article = 7
    max_articles_per_feed = 2

and then continues as before. The MOBI output now shows proper Thai text. The formatting may need further work, but this is as far as I can go. Not being able to read Thai, all I can add is that the text looks centred, but probably should not be centred.
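
To see why that one line matters, here is a small Python 3 sketch. It assumes the earlier "diamond" characters came from the cp874 bytes being decoded with a different codec. UTF-8 is used as the wrong guess here purely for illustration, since I have not confirmed which codec was actually applied. The variable names are hypothetical.

```python
# The word "Thai" written in Thai script, given as escapes so the
# snippet does not depend on the encoding of this file.
thai_text = '\u0e44\u0e17\u0e22'

# Bytes as a server using the 874 code page would send them.
raw = thai_text.encode('cp874')

# Wrong guess: treat the bytes as UTF-8 and substitute U+FFFD (the
# replacement character, often drawn as a diamond) for bad bytes.
garbled = raw.decode('utf-8', errors='replace')

# Right guess: decode with the code page the site actually uses.
correct = raw.decode('cp874')

print(repr(garbled))
print(correct == thai_text)
```

The wrongly decoded string contains replacement characters, while decoding as cp874 recovers the original text.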

bthoven · 03-24-2011, 09:28 PM

Wow...thanks a lot. Let me try tonight.

technicaltitch · 07-29-2011, 10:58 AM

oneillpt, thank you ENORMOUSLY for your tutorial! It was the perfect intro to recipe building: nice and simple code, with step-by-step instructions. Your guide fills the gap for noobs taking their first steps in learning Python, Soup and recipe building at the same time. I was hopeless at getting anything to work until I found your tips, and now I've got mine working!

fluzao · 07-29-2011, 01:14 PM

Thumbs up to oneillpt! You guys that are 100% programmers don't understand how useful the commented code is.

luis.nando · 08-21-2011, 12:15 PM

Hello community,

I need some help solving the problem on the thread: https://www.mobileread.com/forums/sho...d.php?t=146324

someone?
