03-07-2011, 04:47 AM | #1 |
Evangelist
Posts: 475
Karma: 590
Join Date: Aug 2009
Location: Bangkok, Thailand
Device: Kindle Paperwhite
|
How to convert newspapers which do not have an RSS feed?
With Calibre, we can easily convert newspapers that have RSS feeds into e-news.
As there are many newspapers which do not provide RSS feeds on their websites, is there any way to automatically generate feeds from such sites and then use Calibre to convert them to full-article e-news? |
03-07-2011, 09:14 AM | #2 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
|
|
|
03-07-2011, 09:41 AM | #3 | |
Connoisseur
Posts: 62
Karma: 46
Join Date: Feb 2011
Device: Kindle 3 (cracked screen!); PW1; Oasis
|
Quote:
As a simpler example may be helpful, I have added a recipe for Babelia in El Pais, recently requested in this forum, at the end of this reply, and I have added comments immediately below to help you understand the process (note that indentation is important in Python; the snippets below show only local indentation, so see the full recipe at the end for the complete structure). As the site does not return any duplicate links, I have kept the recipe simple by not checking for duplicates. See some of the built-in recipes for how duplicate checking can be carried out; a minimal sketch of one approach also appears after the full recipe below. I hope this helps:

(1) Import the basic recipe class and the needed parts of BeautifulSoup:

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

(2) Declare your class, derived from BasicNewsRecipe, and set the variable INDEX to the URL of the site page that holds the article links:

class ElPaisBabelia(BasicNewsRecipe):
    title = 'El Pais Babelia'
    __author__ = 'oneillpt'
    description = 'El Pais Babelia'
    INDEX = 'http://www.elpais.com/suple/babelia/'
    language = 'es'

(3) Examining the page source of the individual article pages, we find that the text, along with some material we do not want, is contained in a DIV with class="estructura_2col". keep_tags specifies that we work with this section; remove_tags_before removes some links which would otherwise appear before the article. Note that we deal with article extraction here, before we deal with link extraction later by overriding parse_index:

remove_tags_before = dict(name='div', attrs={'class':'estructura_2col'})
keep_tags = [dict(name='div', attrs={'class':'estructura_2col'})]

(4) remove_tags removes the additional material not required for the article. Add these entries after examining the generated article output and identifying the unwanted matter in the original page source:

remove_tags = [dict(name='div', attrs={'class':'votos estirar'}),
    dict(name='div', attrs={'id':'utilidades'}),
    dict(name='div', attrs={'class':'info_relacionada'}),
    dict(name='div', attrs={'class':'mod_apoyo'}),
    dict(name='div', attrs={'class':'contorno_f'}),
    dict(name='div', attrs={'class':'pestanias'}),
    dict(name='div', attrs={'class':'otros_webs'}),
    dict(name='div', attrs={'id':'pie'})
    ]

(5) You will probably want to remove JavaScript, and may want to disable loading of stylesheets. Here this makes little difference, so I have retained the stylesheet line for future use but made it a comment with "#":

#no_stylesheets = True
remove_javascript = True

(6) parse_index finds the article links, starting from the INDEX variable and looking for links in a DIV with class="contenedor_nuevo". No cover image is specified. All subsequent lines belong to parse_index; see the full code for the correct indentation structure:

def parse_index(self):
    articles = []
    soup = self.index_to_soup(self.INDEX)
    cover = None
    feeds = []
    for section in soup.findAll('div', attrs={'class':'contenedor_nuevo'}):
        section_title = self.tag_to_string(section.find('h1'))
        articles = []

(7) All article links have an "href" attribute:

for post in section.findAll('a', href=True):
    url = post['href']

(8) Other links may also have an "href" attribute, but article links start with "/" and need the base URL prepended:

if url.startswith('/'):
    url = 'http://www.elpais.es'+url
    title = self.tag_to_string(post)

(9) We may still have too many links, but all article links have a class attribute. The value of this class attribute varies, so we only check for its existence, not its value. Two points to note: the variable has been named klass because class is a reserved word in this context, and post['class'] will cause an error if there is no class attribute. So we first convert the post soup to a string and check whether it contains "class=":

if str(post).find('class=') > 0:
    klass = post['class']
    if klass != "":

(10) You may find it useful to log output to see what is happening. This output appears in the job details when the recipe is run from Calibre. Remember that you can also perform a manual fetch from a command prompt:

ebook-convert ElPaisBabelia.recipe ELPB --test -vv

In this case you can examine the HTML source of the two articles extracted under the ELPB folder structure.

self.log()
self.log('--> post: ', post)
self.log('--> url: ', url)
self.log('--> title: ', title)
self.log('--> class: ', klass)

(11) Build the list of article links:

articles.append({'title':title, 'url':url})

(12) Finally, if any article links have been found, append the article list to the feed list, which is returned:

if articles:
    feeds.append((section_title, articles))
return feeds

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

class ElPaisBabelia(BasicNewsRecipe):
    title = 'El Pais Babelia'
    __author__ = 'oneillpt'
    description = 'El Pais Babelia'
    INDEX = 'http://www.elpais.com/suple/babelia/'
    language = 'es'
    remove_tags_before = dict(name='div', attrs={'class':'estructura_2col'})
    keep_tags = [dict(name='div', attrs={'class':'estructura_2col'})]
    remove_tags = [dict(name='div', attrs={'class':'votos estirar'}),
        dict(name='div', attrs={'id':'utilidades'}),
        dict(name='div', attrs={'class':'info_relacionada'}),
        dict(name='div', attrs={'class':'mod_apoyo'}),
        dict(name='div', attrs={'class':'contorno_f'}),
        dict(name='div', attrs={'class':'pestanias'}),
        dict(name='div', attrs={'class':'otros_webs'}),
        dict(name='div', attrs={'id':'pie'})
        ]
    #no_stylesheets = True
    remove_javascript = True

    def parse_index(self):
        articles = []
        soup = self.index_to_soup(self.INDEX)
        cover = None
        feeds = []
        for section in soup.findAll('div', attrs={'class':'contenedor_nuevo'}):
            section_title = self.tag_to_string(section.find('h1'))
            articles = []
            for post in section.findAll('a', href=True):
                url = post['href']
                if url.startswith('/'):
                    url = 'http://www.elpais.es'+url
                    title = self.tag_to_string(post)
                    if str(post).find('class=') > 0:
                        klass = post['class']
                        if klass != "":
                            self.log()
                            self.log('--> post: ', post)
                            self.log('--> url: ', url)
                            self.log('--> title: ', title)
                            self.log('--> class: ', klass)
                            articles.append({'title':title, 'url':url})
            if articles:
                feeds.append((section_title, articles))
        return feeds
|
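As for the duplicate checking mentioned above, here is a minimal sketch of one common approach (my illustration, not part of the posted recipe): keep a set of URLs already queued and only append an article when its URL is new. The helper name and its use are hypothetical; the built-in recipes typically do this inline in parse_index.

Code:
def append_unique(articles, seen_urls, title, url):
    """Append an article dict only if this URL has not been queued already.

    articles  -- the list being built inside parse_index
    seen_urls -- a set shared across all sections of the index page
    """
    if url in seen_urls:
        return False            # duplicate link: skip it
    seen_urls.add(url)
    articles.append({'title': title, 'url': url})
    return True

Inside parse_index you would create seen_urls = set() before the section loop and call append_unique(articles, seen_urls, title, url) in place of the plain articles.append call.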
|
03-07-2011, 06:53 PM | #4 |
Evangelist
Posts: 475
Karma: 590
Join Date: Aug 2009
Location: Bangkok, Thailand
Device: Kindle Paperwhite
|
Wow. Thanks a lot. I'll try and let you know.
|
03-07-2011, 07:20 PM | #5 |
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2011
Device: kindle
|
Wonderful example. Thank you! |
|
03-07-2011, 09:27 PM | #6 |
Evangelist
Posts: 475
Karma: 590
Join Date: Aug 2009
Location: Bangkok, Thailand
Device: Kindle Paperwhite
|
Just wondering: both The New York Times and El Pais Babelia have RSS pages,
http://www.nytimes.com/services/xml/rss/index.html
http://www.elpais.com/rss/index.html
so why don't we start from there? The newspaper I'm interested in does not have RSS at all; just to confirm, can I still use your example above? |
03-07-2011, 10:25 PM | #7 | |
Connoisseur
Posts: 62
Karma: 46
Join Date: Feb 2011
Device: Kindle 3 (cracked screen!); PW1; Oasis
|
Quote:
If your newspaper does not have RSS, you need a recipe similar to mine (or one of the more involved built-in recipes if the HTML structure is more complicated), and you can modify my example to help you get started. |
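For contrast with the parse_index approach, here is a minimal sketch of what a recipe looks like when the paper does provide RSS: BasicNewsRecipe handles the link extraction itself once the feeds attribute lists the feed URLs. The class name and feed URL below are placeholders, not real addresses; substitute the entries from the paper's RSS index page.

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class FeedBasedExample(BasicNewsRecipe):
    # Minimal sketch of a feed-based recipe: when a paper publishes RSS,
    # no parse_index override is needed; BasicNewsRecipe reads the feeds itself.
    title             = 'Feed-based example'
    language          = 'en'
    oldest_article    = 7       # only fetch articles up to a week old
    max_articles_per_feed = 25
    no_stylesheets    = True
    remove_javascript = True

    # (section name, feed URL) pairs; the URL below is a placeholder only.
    feeds = [
        ('Front Page', 'http://example.com/rss/frontpage.xml'),
    ]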
|
03-07-2011, 10:33 PM | #8 |
Evangelist
Posts: 475
Karma: 590
Join Date: Aug 2009
Location: Bangkok, Thailand
Device: Kindle Paperwhite
|
oneillpt... thanks a lot. That's clear to me now.
So far I have only used some simple tag expressions, so this is quite a big step for me. However, it's worth trying. |
03-08-2011, 02:00 AM | #9 |
Connoisseur
Posts: 76
Karma: 12
Join Date: Nov 2010
Device: Android, PB Pro 602
|
Two suggestions for improvement:
Good work! |
03-08-2011, 02:21 AM | #10 |
Evangelist
Posts: 475
Karma: 590
Join Date: Aug 2009
Location: Bangkok, Thailand
Device: Kindle Paperwhite
|
I'm trying to extract articles from this page (sorry, the content is in Thai):
http://www.naewna.com/allnews.asp?ID=79
Viewing the source, I need to extract the article content from the article links on lines 418-717. Each article link looks something like http://www.naewna.com/news.asp?ID=241411 (with varying ID numbers). Could you guide me? Thanks in advance. |
03-08-2011, 03:45 PM | #11 | |
Connoisseur
Posts: 62
Karma: 46
Join Date: Feb 2011
Device: Kindle 3 (cracked screen!); PW1; Oasis
|
Quote:
I've also addressed the "Unknown feed" issue by replacing a missing section title with "Babelia Feed". The revised recipe, with logging for the section title and URL extraction, is now:

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

class ElPaisBabelia(BasicNewsRecipe):
    title = 'El Pais Babelia'
    __author__ = 'oneillpt'
    description = 'El Pais Babelia'
    masthead_url = 'http://www.elpais.com/im/tit_logo_int.gif'
    INDEX = 'http://www.elpais.com/suple/babelia/'
    language = 'es'
    remove_tags_before = dict(name='div', attrs={'class':'estructura_2col'})
    keep_tags = [dict(name='div', attrs={'class':'estructura_2col'})]
    remove_tags = [dict(name='div', attrs={'class':'votos estirar'}),
        dict(name='div', attrs={'id':'utilidades'}),
        dict(name='div', attrs={'class':'info_relacionada'}),
        dict(name='div', attrs={'class':'mod_apoyo'}),
        dict(name='div', attrs={'class':'contorno_f'}),
        dict(name='div', attrs={'class':'pestanias'}),
        dict(name='div', attrs={'class':'otros_webs'}),
        dict(name='div', attrs={'id':'pie'})
        ]
    no_stylesheets = True
    remove_javascript = True

    def parse_index(self):
        articles = []
        soup = self.index_to_soup(self.INDEX)
        cover = None
        feeds = []
        seen_titles = set([])
        for section in soup.findAll('div', attrs={'class':'contenedor_nuevo'}):
            section_title = self.tag_to_string(section.find('h1'))
            self.log('section_title(1): ', section_title)
            if section_title == "":
                section_title = "Babelia Feed"
            self.log('section_title(2): ', section_title)
            articles = []
            for post in section.findAll('a', href=True):
                url = post['href']
                if url.startswith('/'):
                    url = 'http://www.elpais.es'+url
                    title = self.tag_to_string(post)
                    if str(post).find('class=') > 0:
                        klass = post['class']
                        if klass != "":
                            self.log()
                            self.log('--> post: ', post)
                            self.log('--> url: ', url)
                            self.log('--> title: ', title)
                            self.log('--> class: ', klass)
                            articles.append({'title':title, 'url':url})
            if articles:
                feeds.append((section_title, articles))
        return feeds
|
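One small observation on the revised recipe: seen_titles is created but never consulted. If the intent was to skip repeated section titles, a hedged sketch of how it could be used inside the section loop (my guess at the intent, not the author's confirmed plan):

Code:
# Sketch only: one possible use for the otherwise unused seen_titles set,
# skipping any section whose title has already been processed.
if section_title in seen_titles:
    continue                      # duplicate section heading: skip it
seen_titles.add(section_title)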
|
03-08-2011, 11:05 PM | #12 | |
Connoisseur
Posts: 62
Karma: 46
Join Date: Feb 2011
Device: Kindle 3 (cracked screen!); PW1; Oasis
|
Quote:
I took a look at your Thai source and modified my recipe to extract your links. I found a problem, however: the Thai text is not rendered correctly. I can view the resulting e-book in MobiPocket Reader, and it looks like the desired e-book (the images in the articles appear correct), but the text is not proper Unicode. The e-book crashes the Calibre EPUB viewer and causes errors on my Kindle. You may be able to use the recipe below on a computer running a Thai version of the operating system (I use English-language Windows 7 Professional), but I suspect you will see the same text problem, because it appears to be caused by the encoding of the source web pages, content="text/html; charset=windows-874". The source for http://www.naewna.com/allnews.asp?ID=79 starts with:

Code:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-874">

For comparison, an RSS feed declaring UTF-8 encoding begins:

Code:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>...

Please post the result of your test. If the problem is the encoding of the source pages, it may be worth submitting this as an enhancement request/bug report. Similar problems would probably arise for other languages where a multi-byte non-Unicode encoding is used. The recipe (note the warning above regarding text rendering problems and crashing of the Calibre EPUB viewer):

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class thai(BasicNewsRecipe):
    title = u'thai'
    __author__ = 'oneillpt'
    #masthead_url = 'http://www.elpais.com/im/tit_logo_int.gif'
    INDEX = 'http://www.naewna.com/allnews.asp?ID=79'
    language = 'th'
    #remove_tags_before = dict(name='div', attrs={'class':'estructura_2col'})
    #keep_tags = [dict(name='div', attrs={'class':'estructura_2col'})]
    #remove_tags = [dict(name='div', attrs={'class':'votos estirar'}),
    #dict(name='div', attrs={'id':'utilidades'}),
    #dict(name='div', attrs={'class':'info_relacionada'}),
    #dict(name='div', attrs={'class':'mod_apoyo'}),
    #dict(name='div', attrs={'class':'contorno_f'}),
    #dict(name='div', attrs={'class':'pestanias'}),
    #dict(name='div', attrs={'class':'otros_webs'}),
    #dict(name='div', attrs={'id':'pie'})
    #]
    no_stylesheets = True
    remove_javascript = True

    def parse_index(self):
        articles = []
        soup = self.index_to_soup(self.INDEX)
        cover = None
        feeds = []
        for section in soup.findAll('body'):
            section_title = self.tag_to_string(section.find('h1'))
            z = section.find('td', attrs={'background':'images/fa04.gif'})
            self.log('z', z)
            x = z.find('font')
            self.log('x', x)
            y = x.find('strong')
            self.log('y', y)
            section_title = self.tag_to_string(y)
            self.log('section_title(1): ', section_title)
            if section_title == "":
                section_title = u'Thai Feed'
            self.log('section_title(2): ', section_title)
            articles = []
            for post in section.findAll('a', href=True):
                #self.log('--> p: ', post)
                url = post['href']
                #self.log('--> u: ', url)
                if url.startswith('n'):
                    url = 'http://www.naewna.com/'+url
                    #self.log('--> u: ', url)
                    title = self.tag_to_string(post)
                    #self.log('--> t: ', title)
                    if str(post).find('class=') > 0:
                        klass = post['class']
                        #self.log('--> k: ', klass)
                        if klass == "style4 style15":
                            self.log()
                            self.log('--> post: ', post)
                            self.log('--> url: ', url)
                            self.log('--> title: ', title)
                            self.log('--> class: ', klass)
                            articles.append({'title':title, 'url':url})
            if articles:
                feeds.append((section_title, articles))
        return feeds
|
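If the garbled text really does come from the windows-874 pages being mis-decoded, one thing that may be worth trying (an untested suggestion, not something from the posted recipe) is telling the recipe what encoding the source uses: BasicNewsRecipe has an encoding attribute that is applied when the downloaded pages are decoded. A minimal sketch, assuming Python's cp874 codec matches the site's charset; the extra attribute would be merged into the full Thai recipe above rather than used as a standalone class:

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class ThaiEncodingSketch(BasicNewsRecipe):
    # Sketch only: declare the source encoding so the downloaded pages are
    # decoded as Thai (windows-874/cp874) instead of Calibre's default guess.
    title    = u'thai (encoding sketch)'
    INDEX    = 'http://www.naewna.com/allnews.asp?ID=79'
    language = 'th'
    encoding = 'cp874'      # Python codec name for the windows-874 charset
    no_stylesheets    = True
    remove_javascript = True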
|
03-09-2011, 12:39 AM | #13 |
Evangelist
Posts: 475
Karma: 590
Join Date: Aug 2009
Location: Bangkok, Thailand
Device: Kindle Paperwhite
|
Wow, thanks a lot. I suspect the 874 code page could be the problem too. Let me try it and give you feedback.
Thanks a lot for your help. Really appreciate it. |
03-09-2011, 02:12 AM | #14 |
Connoisseur
Posts: 76
Karma: 12
Join Date: Nov 2010
Device: Android, PB Pro 602
|
More suggestions for babelia.recipe.
The metadata could be enhanced with the following changes:

Code:
publisher = u'Ediciones El Pa\xeds SL'
description = u'El Pa\xeds Babelia'
category = u'El Pa\xeds Babelia, Noticias, News, Newsfeed'
conversion_options = {'publisher': publisher,
                      'language' : language,
                      'tags'     : category,
                      'creator'  : publisher
                     }

A special cover page would be nice, but I don't know of any freely accessible image. |
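On the cover point, here is a hedged sketch of how a cover could be wired in, should a suitable image turn up. BasicNewsRecipe supports a cover_url attribute and, alternatively, a get_cover_url() method; the image URL below is a placeholder, not a known El Pais resource.

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class BabeliaCoverSketch(BasicNewsRecipe):
    # Sketch only: attach a cover once a usable, freely accessible image exists.
    # The URL below is a placeholder, not a real El Pais asset.
    title     = 'El Pais Babelia (cover sketch)'
    cover_url = 'http://example.com/babelia_cover.jpg'

    # Alternative: compute the cover address at fetch time, e.g. from the date.
    #def get_cover_url(self):
    #    return 'http://example.com/babelia_cover.jpg'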
03-09-2011, 04:37 AM | #15 |
Evangelist
Posts: 475
Karma: 590
Join Date: Aug 2009
Location: Bangkok, Thailand
Device: Kindle Paperwhite
|
Hi oneillpt,
I tried to fetch the news using your script; here is the error on my side. Not sure what to do next:

calibre, version 0.7.48
ERROR: Conversion Error: <b>Failed</b>: Fetch news from Jermsak_Naewna

Fetch news from Jermsak_Naewna
Resolved conversion options
calibre version: 0.7.48
{'asciiize': False, 'author_sort': None, 'authors': None, 'base_font_size': 0, 'book_producer': None,
 'change_justification': 'original', 'chapter': None, 'chapter_mark': 'pagebreak', 'comments': None,
 'cover': None, 'debug_pipeline': None, 'dehyphenate': True, 'delete_blank_paragraphs': True,
 'disable_font_rescaling': False, 'dont_compress': False, 'dont_download_recipe': False,
 'enable_heuristics': False, 'extra_css': None, 'fix_indents': True, 'font_size_mapping': None,
 'format_scene_breaks': True, 'html_unwrap_factor': 0.4, 'input_encoding': None,
 'input_profile': <calibre.customize.profiles.InputProfile object at 0x04F32F50>,
 'insert_blank_line': False, 'insert_metadata': False, 'isbn': None, 'italicize_common_cases': True,
 'keep_ligatures': False, 'language': None, 'level1_toc': None, 'level2_toc': None, 'level3_toc': None,
 'line_height': 0, 'linearize_tables': False, 'lrf': False, 'margin_bottom': 5.0, 'margin_left': 5.0,
 'margin_right': 5.0, 'margin_top': 5.0, 'markup_chapter_headings': True, 'max_toc_links': 50,
 'minimum_line_height': 120.0, 'mobi_ignore_margins': False, 'no_chapters_in_toc': False,
 'no_inline_navbars': True, 'no_inline_toc': False,
 'output_profile': <calibre.customize.profiles.KindleOutput object at 0x04F38290>,
 'page_breaks_before': None, 'password': None, 'personal_doc': '[PDOC]', 'prefer_author_sort': False,
 'prefer_metadata_cover': False, 'pretty_print': False, 'pubdate': None, 'publisher': None,
 'rating': None, 'read_metadata_from_opf': None, 'remove_first_image': False,
 'remove_paragraph_spacing': False, 'remove_paragraph_spacing_indent_size': 1.5,
 'renumber_headings': True, 'replace_scene_breaks': '', 'rescale_images': False, 'series': None,
 'series_index': None, 'smarten_punctuation': False, 'sr1_replace': '', 'sr1_search': '',
 'sr2_replace': '', 'sr2_search': '', 'sr3_replace': '', 'sr3_search': '', 'tags': None, 'test': False,
 'timestamp': None, 'title': None, 'title_sort': None, 'toc_filter': None, 'toc_threshold': 6,
 'toc_title': None, 'unwrap_lines': True, 'use_auto_toc': False, 'username': None, 'verbose': 2}

InputFormatPlugin: Recipe Input running
z <td valign="middle" background="images/fa04.gif" class="box1"> <font size="4" face="Arial, Helvetica, sans-serif"><strong> ����ѡ���ͤԴ���¤� </strong></font> </td>
x <font size="4" face="Arial, Helvetica, sans-serif"><strong> ����ѡ���ͤԴ���¤� </strong></font>
y <strong> ����ѡ���ͤԴ���¤� </strong>
section_title(1): ����ѡ���ͤԴ���¤�
section_title(2): ����ѡ���ͤԴ���¤�

--> post: <a href="news.asp?ID=252152" class="style4 style15" target="_blank">��ҹ��������´</a>
--> url: http://www.naewna.com/news.asp?ID=252152
--> title: ��ҹ��������´
--> class: style4 style15

--> post: <a href="news.asp?ID=251132" class="style4 style15" target="_blank">��ҹ��������´</a>
--> url: http://www.naewna.com/news.asp?ID=251132
--> title: ��ҹ��������´
--> class: style4 style15

--> post: <a href="news.asp?ID=250112" class="style4 style15" target="_blank">��ҹ��������´</a>
--> url: http://www.naewna.com/news.asp?ID=250112
--> title: ��ҹ��������´
--> class: style4 style15

--> post: <a href="news.asp?ID=249084" class="style4 style15" target="_blank">��ҹ��������´</a>
--> url: http://www.naewna.com/news.asp?ID=249084
--> title: ��ҹ��������´
--> class: style4 style15
--> post: <a href="news.asp?ID=248080" class="style4 style15" target="_blank">��ҹ��������´</a> --> url: http://www.naewna.com/news.asp?ID=248080 --> title: ��ҹ��������´ --> class: style4 style15 --> post: <a href="news.asp?ID=247031" class="style4 style15" target="_blank">��ҹ��������´</a> --> url: http://www.naewna.com/news.asp?ID=247031 --> title: ��ҹ��������´ --> class: style4 style15 --> post: <a href="news.asp?ID=246048" class="style4 style15" target="_blank">��ҹ��������´</a> --> url: http://www.naewna.com/news.asp?ID=246048 --> title: ��ҹ��������´ --> class: style4 style15 --> post: <a href="news.asp?ID=245090" class="style4 style15" target="_blank">��ҹ��������´</a> --> url: http://www.naewna.com/news.asp?ID=245090 --> title: ��ҹ��������´ --> class: style4 style15 --> post: <a href="news.asp?ID=244073" class="style4 style15" target="_blank">��ҹ��������´</a> --> url: http://www.naewna.com/news.asp?ID=244073 --> title: ��ҹ��������´ --> class: style4 style15 --> post: <a href="news.asp?ID=243150" class="style4 style15" target="_blank">��ҹ��������´</a> --> url: http://www.naewna.com/news.asp?ID=243150 --> title: ��ҹ��������´ --> class: style4 style15 --> post: <a href="news.asp?ID=242429" class="style4 style15" target="_blank">��ҹ��������´</a> --> url: http://www.naewna.com/news.asp?ID=242429 --> title: ��ҹ��������´ --> class: style4 style15 --> post: <a href="news.asp?ID=241411" class="style4 style15" target="_blank">��ҹ��������´</a> --> url: http://www.naewna.com/news.asp?ID=241411 --> title: ��ҹ��������´ --> class: style4 style15 Python function terminated unexpectedly 'class' (Error Code: 1) Traceback (most recent call last): File "site.py", line 103, in main File "site.py", line 85, in run_entry_point File "site-packages\calibre\utils\ipc\worker.py", line 110, in main File "site-packages\calibre\gui2\convert\gui_conversion.py", line 25, in gui_convert File "site-packages\calibre\ebooks\conversion\plumber.py", line 904, in run File "site-packages\calibre\customize\conversion.py", line 204, in __call__ File "site-packages\calibre\web\feeds\input.py", line 105, in convert File "site-packages\calibre\web\feeds\news.py", line 734, in download File "site-packages\calibre\web\feeds\news.py", line 871, in build_index File "c:\users\chotec~1\appdata\local\temp\calibre_0.7. 48_tmp_bm8qsi\calibre_0.7.48_spw2ws_recipes\recipe 0.py", line 55, in parse_index klass = post['class'] File "site-packages\calibre\ebooks\BeautifulSoup.py", line 518, in __getitem__ KeyError: 'class' |
|