MobileRead Forums > E-Book Software > Calibre > Recipes
03-07-2011, 05:47 AM   #1
bthoven
Evangelist
Posts: 452
Karma: 544
Join Date: Aug 2009
Location: Bangkok, Thailand
Device: Kindle'r 3, iPhone 3Gs, iPad 2, Galaxy Tab Wifi
How to convert a newspaper that does not have an RSS feed?

With Calibre, we can easily convert newspapers that have RSS feeds into e-news.

Since many newspapers do not provide RSS feeds on their websites, is there any way to automatically generate feeds from such sites and then use Calibre to convert them into full-article e-news?
03-07-2011, 10:14 AM   #2
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by bthoven View Post
there are many newspapers which do not provide RSS feeds on their website, is there anyway to automatically generate feeds from such websites and then use Calibre to convert them to full article enews?
Yes. See parse_index: http://calibre-ebook.com/user_manual...pe.parse_index
03-07-2011, 10:41 AM   #3
oneillpt
Connoisseur
Posts: 51
Karma: 46
Join Date: Feb 2011
Device: Kindle 3 (cracked screen!); PW1
Quote:
Originally Posted by bthoven View Post
With Calibre, we can easily convert newspapers, with RSS feeds, to enews.

As there are many newspapers which do not provide RSS feeds on their website, is there anyway to automatically generate feeds from such websites and then use Calibre to convert them to full article enews?
You need to override the parse_index method. The NYTimes example in the Calibre User Manual, http://calibre-ebook.com/user_manual/news.html, shows how this can be done. Grep for parse_index in the built-in recipes to find more examples.
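
For orientation before the full example below: the value a parse_index override must return is just a list of (section title, article list) pairs, where each article is a dict with 'title' and 'url' keys. A plain-Python sketch of that shape (no Calibre imports needed; build_feeds and its inputs are illustrative, not part of any recipe):

```python
# Shape of the value a parse_index override must return: a list of
# (section_title, articles) pairs, each article a dict with 'title'
# and 'url' keys. build_feeds and its inputs are illustrative only.
def build_feeds(sections):
    """sections: iterable of (section_title, [(title, url), ...])."""
    feeds = []
    for section_title, links in sections:
        articles = [{'title': t, 'url': u} for t, u in links]
        if articles:                     # drop empty sections
            feeds.append((section_title, articles))
    return feeds

feeds = build_feeds([('Front Page', [('Story A', 'http://example.com/a')])])
```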

As a simpler example may be helpful, I have added a recipe for Babelia en El Pais, recently requested in this forum, at the end of this reply, with comments immediately below to help you understand the process (note that indentation is significant in Python but is lost in these comments; see the code for the correct indentation). As the site does not return any duplicate links, I have kept the recipe simple by not checking for duplicates; see some of the built-in recipes for how duplicate checking can be carried out.

I hope this helps:

(1) import the basic recipe and needed parts from BeautifulSoup

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

(2) declare your class, derived from BasicNewsRecipe, and set the variable INDEX to the URL of the site page with the links

class ElPaisBabelia(BasicNewsRecipe):

title = 'El Pais Babelia'
__author__ = 'oneillpt'
description = 'El Pais Babelia'
INDEX = 'http://www.elpais.com/suple/babelia/'
language = 'es'

(3) examining the page source for the individual article pages, we find that the text, plus some additional matter not required, is contained in a DIV section with class="estructura_2col". keep_only_tags specifies that we work with this section; remove_tags_before removes some links which would otherwise appear before the article. Note that we deal with article extraction here, before we deal with link extraction later by overriding parse_index

remove_tags_before = dict(name='div', attrs={'class':'estructura_2col'})
keep_only_tags = [dict(name='div', attrs={'class':'estructura_2col'})]

(4) remove_tags removes the additional matter not required for the article. Add this after examining the generated article output, identifying the unwanted matter in the original page source

remove_tags = [dict(name='div', attrs={'class':'votos estirar'}),
dict(name='div', attrs={'id':'utilidades'}),
dict(name='div', attrs={'class':'info_relacionada'}),
dict(name='div', attrs={'class':'mod_apoyo'}),
dict(name='div', attrs={'class':'contorno_f'}),
dict(name='div', attrs={'class':'pestanias'}),
dict(name='div', attrs={'class':'otros_webs'}),
dict(name='div', attrs={'id':'pie'})
]

(5) you will probably want to remove JavaScript, and may want to disable loading of stylesheets. Here this makes little difference, so I have retained the no_stylesheets line for future use, but made it a comment with "#"

#no_stylesheets = True
remove_javascript = True

(6) parse_index finds the article links by fetching the INDEX page and looking for links in a DIV with class="contenedor_nuevo". No cover image is specified. All subsequent lines here are part of parse_index; see the code for the correct indentation structure

def parse_index(self):
articles = []
soup = self.index_to_soup(self.INDEX)
cover = None
feeds = []
for section in soup.findAll('div', attrs={'class':'contenedor_nuevo'}):
section_title = self.tag_to_string(section.find('h1'))
articles = []

(7) all article links have a "href" attribute

for post in section.findAll('a', href=True):
url = post['href']

(8) other links may also have a "href" attribute, but article links will start with "/", and need the base url appended

if url.startswith('/'):
url = 'http://www.elpais.es'+url
title = self.tag_to_string(post)

(9) we may still have too many links, but all article links have a class attribute. The class value varies, so we just check for its existence, not its value. Two points to note: the variable is named klass because class is a reserved word in this context, and post['class'] will raise an error if there is no class attribute, so we first convert the post soup to a string and check whether it contains "class="

if str(post).find('class=') > 0:
klass = post['class']
if klass != "":

(10) you may find it useful to log output to see what is happening. This output will appear in the job details when the e-book is built in Calibre. Remember that you can also run the recipe from a command prompt:

ebook-convert ElPaisBabelia.recipe ELPB --test -vv

and in this case you can examine the HTML source for the two articles which --test will extract into the ELPB folder structure


self.log()
self.log('--> post: ', post)
self.log('--> url: ', url)
self.log('--> title: ', title)
self.log('--> class: ', klass)

(11) build the list of article links

articles.append({'title':title, 'url':url})

(12) and if any article links have been found, append the article list to the feed list, which is finally returned

if articles:
feeds.append((section_title, articles))
return feeds
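
The duplicate checking mentioned earlier works the same way in most built-in recipes: remember every URL already appended and skip repeats. A self-contained sketch (the helper name and the sample data are illustrative, not part of this recipe):

```python
# Hypothetical sketch of duplicate-link checking, as done in several
# built-in recipes: remember each URL in a set and skip repeats.
def append_unique(articles, seen_urls, title, url):
    if url in seen_urls:
        return False                       # duplicate: skip it
    seen_urls.add(url)
    articles.append({'title': title, 'url': url})
    return True

articles, seen = [], set()
append_unique(articles, seen, 'Story A', 'http://www.elpais.es/a')
append_unique(articles, seen, 'Story A', 'http://www.elpais.es/a')  # skipped
```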

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

class ElPaisBabelia(BasicNewsRecipe):

    title      = 'El Pais Babelia'
    __author__ = 'oneillpt'
    description = 'El Pais Babelia'
    INDEX = 'http://www.elpais.com/suple/babelia/'
    language = 'es'

    remove_tags_before = dict(name='div', attrs={'class':'estructura_2col'})
    keep_only_tags = [dict(name='div', attrs={'class':'estructura_2col'})]
    remove_tags = [dict(name='div', attrs={'class':'votos estirar'}),
        dict(name='div', attrs={'id':'utilidades'}),
        dict(name='div', attrs={'class':'info_relacionada'}),
        dict(name='div', attrs={'class':'mod_apoyo'}),
        dict(name='div', attrs={'class':'contorno_f'}),
        dict(name='div', attrs={'class':'pestanias'}),
        dict(name='div', attrs={'class':'otros_webs'}),
        dict(name='div', attrs={'id':'pie'})
        ]
    #no_stylesheets = True
    remove_javascript     = True

    def parse_index(self):
        articles = []
        soup = self.index_to_soup(self.INDEX)
        cover = None
        feeds = []
        for section in soup.findAll('div', attrs={'class':'contenedor_nuevo'}):
            section_title = self.tag_to_string(section.find('h1'))
            articles = []
            for post in section.findAll('a', href=True):
                url = post['href']
                if url.startswith('/'):
                  url = 'http://www.elpais.es'+url
                  title = self.tag_to_string(post)
                  if str(post).find('class=') > 0:
                    klass = post['class']
                    if klass != "":
                      self.log()
                      self.log('--> post:  ', post)
                      self.log('--> url:   ', url)
                      self.log('--> title: ', title)
                      self.log('--> class: ', klass)
                      articles.append({'title':title, 'url':url})
            if articles:
                feeds.append((section_title, articles))
        return feeds
03-07-2011, 07:53 PM   #4
bthoven
Wow. Thanks a lot. I'll try and let you know.
03-07-2011, 08:20 PM   #5
luiscc
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2011
Device: kindle

Wonderful example. Thank you!
03-07-2011, 10:27 PM   #6
bthoven
I just wonder: both The New York Times and El Pais Babelia have RSS pages,

http://www.nytimes.com/services/xml/rss/index.html
http://www.elpais.com/rss/index.html

why don't we start from there?

The newspaper I'm interested in does not have RSS at all; just to confirm, can I still use your example above?
03-07-2011, 11:25 PM   #7
oneillpt
Quote:
Originally Posted by bthoven View Post
Just wonder both The New York Times and El Pais Babelia have their RSS pages,

http://www.nytimes.com/services/xml/rss/index.html
http://www.elpais.com/rss/index.html

why don't we start from there?

The newspaper I'm interested in does not have RSS at all; just to confirm I still can use your example above?
El Pais does have RSS feeds, but not for Babelia en El Pais, so you cannot start from an RSS feed in that case. The New York Times recipe is shown as an example in the Calibre User Manual, but I do not know why the RSS feed is not used in that case. I suspect some RSS feeds may be less usable than others.

If your newspaper does not have RSS, you need a recipe similar to mine (or to the more complex examples among the built-in recipes if the HTML structure is more complex), and you can modify my example to get started.
03-07-2011, 11:33 PM   #8
bthoven
oneillpt, thanks a lot. That's clear to me now.

So far I have only used some simple tag expressions, so this is quite a jump for me. However, it's worth trying.
03-08-2011, 03:00 AM   #9
miwie
Connoisseur
Posts: 76
Karma: 12
Join Date: Nov 2010
Device: Android, PB Pro 602
Two suggestions for improvement:
  1. Add masthead_url = 'http://www.elpais.com/im/tit_logo_int.gif'
    (e.g. before the INDEX line). This adds the logo of El Pais at the top
    of the feed overview (which contains only one feed in this case)
  2. Activate 'no_stylesheets = True' (there are articles with '<style ...'
    after the article content which otherwise gets included in the EPUB)
The name of the feed appears as "Unknown feed", which should be renamed somehow.

Good work!
03-08-2011, 03:21 AM   #10
bthoven
I'm trying to extract articles from this page (sorry, the content is in Thai):

http://www.naewna.com/allnews.asp?ID=79

Viewing the source, I need to extract article content from the article links on lines 418-717.

Each article link would be something like

http://www.naewna.com/news.asp?ID=241411 (or some other ID numbers)

Could you guide me?

Thanks in advance.
03-08-2011, 04:45 PM   #11
oneillpt
Quote:
Originally Posted by miwie View Post
Two suggestions for improvement:
  1. Add masthead_url = 'http://www.elpais.com/im/tit_logo_int.gif'
    (e.g. before the INDEX line). This adds the logo of El Pais at the top
    of the feed overview (which contains only one feed in this case)
  2. Activate 'no_stylesheets = True' (there are articles with '<style ...'
    after the article content which gets included in the EPUB otherwise)
The name of the feed appears as "Unknown feed' which should be renamed somehow.

Good work!
I've added the masthead_url as suggested and activated the 'no_stylesheets = True' option, although the styles do not seem to make any noticeable difference in this case.

I've also addressed the "Unknown feed" issue by substituting "Babelia Feed" for a missing section title. The revised recipe, with logging for the section title and URL extraction, is now:

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

class ElPaisBabelia(BasicNewsRecipe):

    title      = 'El Pais Babelia'
    __author__ = 'oneillpt'
    description = 'El Pais Babelia'
    masthead_url = 'http://www.elpais.com/im/tit_logo_int.gif'
    INDEX = 'http://www.elpais.com/suple/babelia/'
    language = 'es'

    remove_tags_before = dict(name='div', attrs={'class':'estructura_2col'})
    keep_only_tags = [dict(name='div', attrs={'class':'estructura_2col'})]
    remove_tags = [dict(name='div', attrs={'class':'votos estirar'}),
        dict(name='div', attrs={'id':'utilidades'}),
        dict(name='div', attrs={'class':'info_relacionada'}),
        dict(name='div', attrs={'class':'mod_apoyo'}),
        dict(name='div', attrs={'class':'contorno_f'}),
        dict(name='div', attrs={'class':'pestanias'}),
        dict(name='div', attrs={'class':'otros_webs'}),
        dict(name='div', attrs={'id':'pie'})
        ]
    no_stylesheets = True
    remove_javascript     = True

    def parse_index(self):
        articles = []
        soup = self.index_to_soup(self.INDEX)
        cover = None
        feeds = []
        seen_titles = set([])
        for section in soup.findAll('div', attrs={'class':'contenedor_nuevo'}):
            section_title = self.tag_to_string(section.find('h1'))
            self.log('section_title(1): ', section_title)
            if section_title == "":
              section_title = "Babelia Feed"
            self.log('section_title(2): ', section_title)
            articles = []
            for post in section.findAll('a', href=True):
                url = post['href']
                if url.startswith('/'):
                  url = 'http://www.elpais.es'+url
                  title = self.tag_to_string(post)
                  if str(post).find('class=') > 0:
                    klass = post['class']
                    if klass != "":
                      self.log()
                      self.log('--> post:  ', post)
                      self.log('--> url:   ', url)
                      self.log('--> title: ', title)
                      self.log('--> class: ', klass)
                      articles.append({'title':title, 'url':url})
            if articles:
                feeds.append((section_title, articles))
        return feeds
03-09-2011, 12:05 AM   #12
oneillpt
Quote:
Originally Posted by bthoven View Post
I'm trying to extract articles from this page (sorry the content is in Thai language)

http://www.naewna.com/allnews.asp?ID=79

When viewing the source, I need to extract article content from the article links from line 418-717.

Each article link would be something like

http://www.naewna.com/news.asp?ID=241411 (or some other ID numbers)

Could you guide me?

Thanks in advance.

I took a look at your Thai source and modified my recipe to extract your links. I find a problem, however: the Thai text is not correctly rendered. While I can view the resulting e-book in Mobipocket Reader, and it looks like the desired e-book (the images in the articles appear correct), the text is not proper Unicode. The e-book crashes the Calibre EPUB reader and causes errors on my Kindle.

You may be able to use the recipe below on a computer running a Thai version of the operating system (I use English-language Windows 7 Professional), but I suspect you will have the same text problem, because of the encoding of the source web pages: content="text/html; charset=windows-874".

The source for http://www.naewna.com/allnews.asp?ID=79 starts with:
Code:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-874">
whereas http://www.thairath.co.th/rss/news.xml (for Thairath, a built-in Thai recipe which renders correctly for me) starts with:
Code:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
	<title>...
It seems likely to me that UTF-8 Thai pages render correctly, but windows-874 Thai pages do not, when processed by Calibre. The improper "Unicode" text then causes the Calibre EPUB reader crash (Calibre itself continues to run; only the separate reader process crashes). A test on your computer, which I assume runs a Thai-language operating system, should determine whether my suspicion is correct.

I have added logging of the link extraction so that you can see it even if extraction fails. I have built the e-book a number of times, but had one failure which I suspect was caused by some combination of corrupt Unicode characters. I have also commented out the article editing to leave the full article: I do not read Thai, so I did not spend time guessing what should be removed. When I looked at the article source, however, I noticed that there were not many id or class attributes on tags such as div or span, so removing unwanted parts of the article page may also be more difficult.

Please post the result of your test. If the problem is the encoding of the source pages, it may be worth submitting this as an enhancement request/bug report. Similar problems would probably arise for other languages where multi-byte non-Unicode encoding is used.
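
One workaround that may be worth trying (an assumption on my part, untested against this site): BasicNewsRecipe has an encoding attribute for overriding a site's declared charset, and Python's name for windows-874 is 'cp874', so adding encoding = 'cp874' to the recipe class might let Calibre decode the pages correctly. A standalone check that cp874 round-trips Thai text in Python:

```python
# Python's name for the windows-874 (Thai) code page is 'cp874'.
# A recipe can set  encoding = 'cp874'  (a BasicNewsRecipe attribute)
# to override the site's charset handling; whether that fixes this
# site is an assumption. The decode step itself, demonstrated alone:
thai_word = '\u0e02\u0e48\u0e32\u0e27'    # the Thai word for "news"
raw_bytes = thai_word.encode('cp874')     # bytes as the site would serve them
decoded = raw_bytes.decode('cp874')       # correct codec recovers the text
mangled = raw_bytes.decode('latin-1')     # wrong codec: mojibake as above
```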

The recipe (note warning above regarding text rendering problems and crashing of the Calibre EPUB reader):
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class thai(BasicNewsRecipe):

    title      = u'thai'
    __author__ = 'oneillpt'
    #masthead_url = 'http://www.elpais.com/im/tit_logo_int.gif'
    INDEX = 'http://www.naewna.com/allnews.asp?ID=79'
    language = 'th'

    #remove_tags_before = dict(name='div', attrs={'class':'estructura_2col'})
    #keep_tags = [dict(name='div', attrs={'class':'estructura_2col'})]
    #remove_tags = [dict(name='div', attrs={'class':'votos estirar'}),
        #dict(name='div', attrs={'id':'utilidades'}),
        #dict(name='div', attrs={'class':'info_relacionada'}),
        #dict(name='div', attrs={'class':'mod_apoyo'}),
        #dict(name='div', attrs={'class':'contorno_f'}),
        #dict(name='div', attrs={'class':'pestanias'}),
        #dict(name='div', attrs={'class':'otros_webs'}),
        #dict(name='div', attrs={'id':'pie'})
        #]
    no_stylesheets = True
    remove_javascript     = True

    def parse_index(self):
        articles = []
        soup = self.index_to_soup(self.INDEX)
        cover = None
        feeds = []
        for section in soup.findAll('body'):
            section_title = self.tag_to_string(section.find('h1'))
            z = section.find('td', attrs={'background':'images/fa04.gif'})
            self.log('z', z)
            x = z.find('font')
            self.log('x', x)
            y = x.find('strong')
            self.log('y', y)
            section_title = self.tag_to_string(y)
            self.log('section_title(1): ', section_title)
            if section_title == "":
              section_title = u'Thai Feed'
            self.log('section_title(2): ', section_title)
            articles = []
            for post in section.findAll('a', href=True):
                #self.log('--> p: ', post)
                url = post['href']
                #self.log('--> u: ', url)
                if url.startswith('n'):
                  url = 'http://www.naewna.com/'+url
                  #self.log('--> u: ', url)
                  title = self.tag_to_string(post)
                  #self.log('--> t: ', title)
                  if str(post).find('class=') > 0:
                    klass = post['class']
                    #self.log('--> k: ', klass)
                    if klass == "style4 style15":
                      self.log()
                      self.log('--> post:  ', post)
                      self.log('--> url:   ', url)
                      self.log('--> title: ', title)
                      self.log('--> class: ', klass)
                      articles.append({'title':title, 'url':url})
            if articles:
                feeds.append((section_title, articles))
        return feeds
03-09-2011, 01:39 AM   #13
bthoven
Wow, thanks a lot. I suspect the 874 code page could be the problem too. Let me try it and give you feedback.

Thanks a lot for your help. Really appreciated.
03-09-2011, 03:12 AM   #14
miwie
More suggestions for babelia.recipe.

The metadata could be enhanced with the following changes:

Code:
    publisher  = u'Ediciones El Pa\xeds SL'
    description = u'El Pa\xeds Babelia'
    category = u'El Pa\xeds Babelia, Noticias, News, Newsfeed'

    conversion_options = {'publisher': publisher,
                          'language' : language,
                          'tags'     : category,
                          'creator'  : publisher
                        }
Unfortunately, calibre overrides creator with its own name. The only way around this that I know of is to post-process the resulting EPUB file with ebook-meta file -a real-author.

A special cover page would be nice, but I don't know any freely accessible.
03-09-2011, 05:37 AM   #15
bthoven
Hi oneillpt,

I tried to fetch the news using your script; here is the error on my side. Not sure what to do next:


calibre, version 0.7.48
ERROR: Conversion Error: <b>Failed</b>: Fetch news from Jermsak_Naewna

Fetch news from Jermsak_Naewna
Resolved conversion options
calibre version: 0.7.48
{'asciiize': False,
'author_sort': None,
'authors': None,
'base_font_size': 0,
'book_producer': None,
'change_justification': 'original',
'chapter': None,
'chapter_mark': 'pagebreak',
'comments': None,
'cover': None,
'debug_pipeline': None,
'dehyphenate': True,
'delete_blank_paragraphs': True,
'disable_font_rescaling': False,
'dont_compress': False,
'dont_download_recipe': False,
'enable_heuristics': False,
'extra_css': None,
'fix_indents': True,
'font_size_mapping': None,
'format_scene_breaks': True,
'html_unwrap_factor': 0.4,
'input_encoding': None,
'input_profile': <calibre.customize.profiles.InputProfile object at 0x04F32F50>,
'insert_blank_line': False,
'insert_metadata': False,
'isbn': None,
'italicize_common_cases': True,
'keep_ligatures': False,
'language': None,
'level1_toc': None,
'level2_toc': None,
'level3_toc': None,
'line_height': 0,
'linearize_tables': False,
'lrf': False,
'margin_bottom': 5.0,
'margin_left': 5.0,
'margin_right': 5.0,
'margin_top': 5.0,
'markup_chapter_headings': True,
'max_toc_links': 50,
'minimum_line_height': 120.0,
'mobi_ignore_margins': False,
'no_chapters_in_toc': False,
'no_inline_navbars': True,
'no_inline_toc': False,
'output_profile': <calibre.customize.profiles.KindleOutput object at 0x04F38290>,
'page_breaks_before': None,
'password': None,
'personal_doc': '[PDOC]',
'prefer_author_sort': False,
'prefer_metadata_cover': False,
'pretty_print': False,
'pubdate': None,
'publisher': None,
'rating': None,
'read_metadata_from_opf': None,
'remove_first_image': False,
'remove_paragraph_spacing': False,
'remove_paragraph_spacing_indent_size': 1.5,
'renumber_headings': True,
'replace_scene_breaks': '',
'rescale_images': False,
'series': None,
'series_index': None,
'smarten_punctuation': False,
'sr1_replace': '',
'sr1_search': '',
'sr2_replace': '',
'sr2_search': '',
'sr3_replace': '',
'sr3_search': '',
'tags': None,
'test': False,
'timestamp': None,
'title': None,
'title_sort': None,
'toc_filter': None,
'toc_threshold': 6,
'toc_title': None,
'unwrap_lines': True,
'use_auto_toc': False,
'username': None,
'verbose': 2}
InputFormatPlugin: Recipe Input running
z <td valign="middle" background="images/fa04.gif" class="box1">
<font size="4" face="Arial, Helvetica, sans-serif"><strong>
����ѡ���ͤԴ���¤�
</strong></font>
</td>
x <font size="4" face="Arial, Helvetica, sans-serif"><strong>
����ѡ���ͤԴ���¤�
</strong></font>
y <strong>
����ѡ���ͤԴ���¤�
</strong>
section_title(1): ����ѡ���ͤԴ���¤�
section_title(2): ����ѡ���ͤԴ���¤�

--> post: <a href="news.asp?ID=252152" class="style4 style15" target="_blank">��ҹ��������´</a>
--> url: http://www.naewna.com/news.asp?ID=252152
--> title: ��ҹ��������´
--> class: style4 style15

--> post: <a href="news.asp?ID=251132" class="style4 style15" target="_blank">��ҹ��������´</a>
--> url: http://www.naewna.com/news.asp?ID=251132
--> title: ��ҹ��������´
--> class: style4 style15

--> post: <a href="news.asp?ID=250112" class="style4 style15" target="_blank">��ҹ��������´</a>
--> url: http://www.naewna.com/news.asp?ID=250112
--> title: ��ҹ��������´
--> class: style4 style15

--> post: <a href="news.asp?ID=249084" class="style4 style15" target="_blank">��ҹ��������´</a>
--> url: http://www.naewna.com/news.asp?ID=249084
--> title: ��ҹ��������´
--> class: style4 style15

--> post: <a href="news.asp?ID=248080" class="style4 style15" target="_blank">��ҹ��������´</a>
--> url: http://www.naewna.com/news.asp?ID=248080
--> title: ��ҹ��������´
--> class: style4 style15

--> post: <a href="news.asp?ID=247031" class="style4 style15" target="_blank">��ҹ��������´</a>
--> url: http://www.naewna.com/news.asp?ID=247031
--> title: ��ҹ��������´
--> class: style4 style15

--> post: <a href="news.asp?ID=246048" class="style4 style15" target="_blank">��ҹ��������´</a>
--> url: http://www.naewna.com/news.asp?ID=246048
--> title: ��ҹ��������´
--> class: style4 style15

--> post: <a href="news.asp?ID=245090" class="style4 style15" target="_blank">��ҹ��������´</a>
--> url: http://www.naewna.com/news.asp?ID=245090
--> title: ��ҹ��������´
--> class: style4 style15

--> post: <a href="news.asp?ID=244073" class="style4 style15" target="_blank">��ҹ��������´</a>
--> url: http://www.naewna.com/news.asp?ID=244073
--> title: ��ҹ��������´
--> class: style4 style15

--> post: <a href="news.asp?ID=243150" class="style4 style15" target="_blank">��ҹ��������´</a>
--> url: http://www.naewna.com/news.asp?ID=243150
--> title: ��ҹ��������´
--> class: style4 style15

--> post: <a href="news.asp?ID=242429" class="style4 style15" target="_blank">��ҹ��������´</a>
--> url: http://www.naewna.com/news.asp?ID=242429
--> title: ��ҹ��������´
--> class: style4 style15

--> post: <a href="news.asp?ID=241411" class="style4 style15" target="_blank">��ҹ��������´</a>
--> url: http://www.naewna.com/news.asp?ID=241411
--> title: ��ҹ��������´
--> class: style4 style15
Python function terminated unexpectedly
'class' (Error Code: 1)
Traceback (most recent call last):
File "site.py", line 103, in main
File "site.py", line 85, in run_entry_point
File "site-packages\calibre\utils\ipc\worker.py", line 110, in main
File "site-packages\calibre\gui2\convert\gui_conversion.py", line 25, in gui_convert
File "site-packages\calibre\ebooks\conversion\plumber.py", line 904, in run
File "site-packages\calibre\customize\conversion.py", line 204, in __call__
File "site-packages\calibre\web\feeds\input.py", line 105, in convert
File "site-packages\calibre\web\feeds\news.py", line 734, in download
File "site-packages\calibre\web\feeds\news.py", line 871, in build_index
File "c:\users\chotec~1\appdata\local\temp\calibre_0.7.48_tmp_bm8qsi\calibre_0.7.48_spw2ws_recipes\recipe0.py", line 55, in parse_index
klass = post['class']
File "site-packages\calibre\ebooks\BeautifulSoup.py", line 518, in __getitem__
KeyError: 'class'
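
The KeyError at the end of that traceback comes from post['class']: the str(post).find('class=') guard can be satisfied by 'class=' appearing anywhere in the tag's HTML (for instance on a child tag) while post itself has no class attribute, and then the subscript raises. Tag.get(key, default) behaves like dict.get and never raises, so klass = post.get('class', '') is the safer form. A minimal illustration, using plain dicts as stand-ins for the soup tags:

```python
# post['class'] raises KeyError when the tag has no class attribute;
# post.get('class', '') returns a default instead. Plain dicts stand
# in for BeautifulSoup Tag objects here (both support .get).
with_class = {'href': 'news.asp?ID=241411', 'class': 'style4 style15'}
without_class = {'href': 'allnews.asp?ID=79'}      # no class attribute

klass_ok = with_class.get('class', '')
klass_missing = without_class.get('class', '')     # '' instead of KeyError
```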