Old 03-09-2011, 09:55 AM   #16
oneillpt
Connoisseur
oneillpt began at the beginning.
 
Posts: 62
Karma: 46
Join Date: Feb 2011
Device: Kindle 3 (cracked screen!); PW1; Oasis
Quote:
Originally Posted by bthoven View Post
Hi oneilpt,

I tried to fetch the news using your script. Here is the error on my side; I'm not sure what to do next:


calibre, version 0.7.48
ERROR: Conversion Error: <b>Failed</b>: Fetch news from Jermsak_Naewna

Fetch news from Jermsak_Naewna
Resolved conversion options
calibre version: 0.7.48
...
--> class: style4 style15
Python function terminated unexpectedly
'class' (Error Code: 1)
Traceback (most recent call last):
  File "site.py", line 103, in main
  File "site.py", line 85, in run_entry_point
  File "site-packages\calibre\utils\ipc\worker.py", line 110, in main
  File "site-packages\calibre\gui2\convert\gui_conversion.py", line 25, in gui_convert
  File "site-packages\calibre\ebooks\conversion\plumber.py", line 904, in run
  File "site-packages\calibre\customize\conversion.py", line 204, in __call__
  File "site-packages\calibre\web\feeds\input.py", line 105, in convert
  File "site-packages\calibre\web\feeds\news.py", line 734, in download
  File "site-packages\calibre\web\feeds\news.py", line 871, in build_index
  File "c:\users\chotec~1\appdata\local\temp\calibre_0.7.48_tmp_bm8qsi\calibre_0.7.48_spw2ws_recipes\recipe0.py", line 55, in parse_index
    klass = post['class']
  File "site-packages\calibre\ebooks\BeautifulSoup.py", line 518, in __getitem__
KeyError: 'class'
I found that a change to the source page caused a similar problem for me today, and the revised recipe below fixed it. Looking at your log, though, I see the same invalid "diamond" characters that I get, whereas the log from the built-in Thai recipes shows proper Thai characters. Try this revised recipe anyway and see whether the book looks right apart from the corrupted text. If it does, the next step is to report the character encoding problem. The resulting e-book still crashes the Calibre reader, but it can be viewed in MobiPocket Reader.

I also built the e-book under Ubuntu Linux to see whether the problem was specific to Windows. The same invalid "diamond" characters appeared, but in this case the e-book did not crash the Calibre reader. The images, however, were not visible in the Calibre reader, whereas they were visible in MobiPocket Reader under Windows.

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

class thai(BasicNewsRecipe):

    title      = u'thai'
    __author__ = u'oneillpt'
    #masthead_url = 'http://www.elpais.com/im/tit_logo_int.gif'
    INDEX = 'http://www.naewna.com/allnews.asp?ID=79'
    language = 'th'

    #remove_tags_before = dict(name='div', attrs={'class':'estructura_2col'})
    #keep_tags = [dict(name='div', attrs={'class':'estructura_2col'})]
    #remove_tags = [dict(name='div', attrs={'class':'votos estirar'}),
        #dict(name='div', attrs={'id':'utilidades'}),
        #dict(name='div', attrs={'class':'info_relacionada'}),
        #dict(name='div', attrs={'class':'mod_apoyo'}),
        #dict(name='div', attrs={'class':'contorno_f'}),
        #dict(name='div', attrs={'class':'pestanias'}),
        #dict(name='div', attrs={'class':'otros_webs'}),
        #dict(name='div', attrs={'id':'pie'})
        #]
    no_stylesheets = True
    remove_javascript     = True

    def parse_index(self):
        feeds = []
        soup = self.index_to_soup(self.INDEX)
        for section in soup.findAll('body'):
            # The section heading is expected inside
            # <td background="images/fa04.gif"><font><strong>...
            # Guard each step so a missing tag does not raise AttributeError.
            section_title = u''
            z = section.find('td', attrs={'background':'images/fa04.gif'})
            self.log('z', z)
            x = z.find('font') if z is not None else None
            self.log('x', x)
            y = x.find('strong') if x is not None else None
            self.log('y', y)
            if y is not None:
                section_title = self.tag_to_string(y)
            self.log('section_title(1): ', section_title)
            if section_title == "":
                section_title = u'Thai Feed'
            self.log('section_title(2): ', section_title)
            articles = []
            for post in section.findAll('a', href=True):
                self.log('--> p: ', post)
                url = post['href']
                self.log('--> u: ', url)
                # Keep only relative links whose href starts with 'n'
                if not url.startswith('n'):
                    continue
                url = 'http://www.naewna.com/' + url
                self.log('--> u: ', url)
                title = self.tag_to_string(post)
                self.log('--> t: ', title)
                # Only index post['class'] when the attribute is present;
                # links without it raised KeyError: 'class' in the earlier version.
                if str(post).find('class="style4 style15"') > 0:
                    klass = post['class']
                    self.log('--> k: ', klass)
                    if klass == "style4 style15":
                        self.log()
                        self.log('--> post:  ', post)
                        self.log('--> url:   ', url)
                        self.log('--> title: ', title)
                        self.log('--> class: ', klass)
                        articles.append({'title':title, 'url':url})
            if articles:
                feeds.append((section_title, articles))
        return feeds
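
The KeyError: 'class' in the quoted traceback comes from indexing an attribute that some <a> tags simply do not have. As a rough sketch of the same link filter outside calibre, the snippet below uses Python's standard html.parser instead of calibre's BeautifulSoup, so it is not the recipe's own method. The LinkFilter class and the sample markup are made up for illustration. Looking the attribute up with dict.get returns None instead of raising when class is missing.

```python
from html.parser import HTMLParser

BASE = 'http://www.naewna.com/'  # site prefix used in the recipe


class LinkFilter(HTMLParser):
    """Collect article links roughly the way parse_index does (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.articles = []
        self._current = None  # article dict while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attrs = dict(attrs)
        href = attrs.get('href') or ''
        klass = attrs.get('class')  # None when the attribute is absent
        if href.startswith('n') and klass == 'style4 style15':
            self._current = {'title': '', 'url': BASE + href}

    def handle_data(self, data):
        # Accumulate link text only while inside a matching <a>
        if self._current is not None:
            self._current['title'] += data

    def handle_endtag(self, tag):
        if tag == 'a' and self._current is not None:
            self._current['title'] = self._current['title'].strip()
            self.articles.append(self._current)
            self._current = None


# Made-up sample markup: only the first link passes both checks.
SAMPLE = (
    '<a class="style4 style15" href="news.asp?ID=1">First story</a>'
    '<a href="news.asp?ID=2">No class attribute</a>'
    '<a class="style4 style15" href="http://example.com/">External</a>'
)

parser = LinkFilter()
parser.feed(SAMPLE)
parser.close()
print(parser.articles)
```

The second sample link has no class attribute at all, which is the case that tripped up the earlier recipe; here it is skipped quietly.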
Old 03-09-2011, 07:58 PM   #17
oneillpt
I'm now fairly sure that the source encoding is the problem. I saved the source of the index page and of the first two linked articles, converted these Windows-874 files to UTF-8, pointed INDEX in the recipe at the local UTF-8 copy of the index page, and edited the first two links in that copy to point to the local UTF-8 copies of the articles.
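
For anyone wanting to repeat that conversion step, here is a minimal sketch in Python. It assumes the saved pages really are Windows-874 (Python's 'cp874' codec); the convert_to_utf8 helper and the file names are hypothetical, and the demonstration writes a throwaway file rather than a real saved page.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding='cp874'):
    """Re-encode a saved page from src_encoding to UTF-8 (hypothetical helper)."""
    # newline='' keeps the original line endings untouched
    with open(src_path, 'r', encoding=src_encoding, newline='') as src:
        text = src.read()
    with open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        dst.write(text)
    return text


# Throwaway input: a tiny page containing the Thai word for "news".
sample = '<html><body>\u0e02\u0e48\u0e32\u0e27</body></html>'
workdir = tempfile.mkdtemp()
src_file = os.path.join(workdir, 'index_cp874.html')
dst_file = os.path.join(workdir, 'index_utf8.html')
with open(src_file, 'wb') as f:
    f.write(sample.encode('cp874'))

convert_to_utf8(src_file, dst_file)

# Read the result back as UTF-8 to confirm the text survived the round trip.
with open(dst_file, 'rb') as f:
    converted = f.read().decode('utf-8')
print(converted == sample)
```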

I've attached the epub and mobi versions of the e-book built for you to check. The text now looks Thai to me!
Attached Files
File Type: mobi thai.mobi (103.5 KB, 191 views)
File Type: epub thai.epub (233.1 KB, 213 views)
Old 03-09-2011, 10:46 PM   #18
bthoven
Evangelist
bthoven will become famous soon enough
 
Posts: 475
Karma: 590
Join Date: Aug 2009
Location: Bangkok, Thailand
Device: Kindle Paperwhite
Hi oneillpt,

I just tried your latest script, and Calibre fetched the content without error. As you said, Calibre crashes when trying to open the content.

I don't know why they are still using code page 874 instead of a more popular encoding.

The attached thai.mobi and thai.epub display Thai correctly.

Thanks again for your kind help.
Old 03-24-2011, 06:24 PM   #19
oneillpt
Quote:
Originally Posted by bthoven View Post
Hi oneillpt,

I just tried your latest script, and Calibre fetched the content without error. As you said, Calibre crashes when trying to open the content.

I don't know why they are still using code page 874 instead of a more popular encoding.

The attached thai.mobi and thai.epub display Thai correctly.

Thanks again for your kind help.
And now the solution: it just needs one line added to the recipe, specifying the encoding:

    encoding = 'cp874'

So the recipe now starts with:
Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

class thai(BasicNewsRecipe):

    title      = u'thai'
    __author__ = u'oneillpt'
    #masthead_url = 'http://www.elpais.com/im/tit_logo_int.gif'
    #  (you may want to select a masthead image from your source here)
    INDEX = 'http://www.naewna.com/allnews.asp?ID=79'
    encoding              = 'cp874'
    language = 'th_TH'
    oldest_article = 7
    max_articles_per_feed = 2
and then continues as before. The MOBI output now shows proper Thai text. The formatting may need further work, but this is as far as I can go. Not being able to read Thai, all I can add is that the text looks centred, but probably should not be centred.
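
To illustrate why that one line matters: the "diamond" characters are consistent with Windows-874 bytes being decoded with the wrong codec. The sketch below assumes the wrong codec was UTF-8 with replacement of invalid bytes, which is my guess at what happened rather than confirmed calibre behaviour. The sample word is arbitrary.

```python
# Thai word for "news", used here only as sample text.
text = '\u0e02\u0e48\u0e32\u0e27'

# Bytes as a server using code page 874 would send them.
raw = text.encode('cp874')

# Decoding with the wrong codec: invalid bytes become U+FFFD,
# which most viewers draw as a diamond with a question mark.
wrong = raw.decode('utf-8', errors='replace')

# Decoding with the codec named in the recipe restores the text.
right = raw.decode('cp874')

print(repr(wrong))
print(right == text)
```

Setting encoding = 'cp874' in the recipe tells calibre which codec to use for the downloaded pages, in place of whatever it would otherwise pick.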
Old 03-24-2011, 08:28 PM   #20
bthoven
Wow...thanks a lot. Let me try tonight.
Old 07-29-2011, 09:58 AM   #21
technicaltitch
Member
technicaltitch began at the beginning.
 
Posts: 20
Karma: 10
Join Date: Jul 2011
Device: Sony PRS 350
oneillpt, thank you ENORMOUSLY for your tutorial! It was the perfect intro to recipe building: nice, simple code and step-by-step instructions. Your guide fills the gap for noobs taking their first steps in learning Python, Soup and recipe building at the same time. I was hopeless at getting anything to work until I found your tips; now I've got mine working!
Old 07-29-2011, 12:14 PM   #22
fluzao
Member
fluzao began at the beginning.
 
Posts: 15
Karma: 10
Join Date: Apr 2011
Device: Kindle
Thumbs up to oneillpt! You guys that are 100% programmers don't understand how useful the commented code is.
Old 08-21-2011, 11:15 AM   #23
luis.nando
Member
luis.nando began at the beginning.
 
Posts: 22
Karma: 20
Join Date: Aug 2011
Device: Kindle 3
Hello community,

I need some help solving the problem in this thread: https://www.mobileread.com/forums/sho...d.php?t=146324

Can anyone help?
All times are GMT -4.