03-07-2011, 04:47 AM | #1 |
Evangelist
Posts: 475
Karma: 590
Join Date: Aug 2009
Location: Bangkok, Thailand
Device: Kindle Paperwhite
|
How to convert newspapers which do not have an RSS feed?
With Calibre, we can easily convert newspapers that have RSS feeds into e-news.
As there are many newspapers which do not provide RSS feeds on their websites, is there any way to automatically generate feeds from such sites and then use Calibre to convert them to full-article e-news? |
03-07-2011, 09:14 AM | #2 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
|
|
|
03-07-2011, 09:41 AM | #3 | |
Connoisseur
Posts: 62
Karma: 46
Join Date: Feb 2011
Device: Kindle 3 (cracked screen!); PW1; Oasis
|
Quote:
As a simpler example may be helpful, I have added a recipe for Babelia in El Pais, recently requested in this forum, at the end of this reply, and I have added comments immediately below to help you understand the process (note that indentation is important in Python; the snippets below show only local indentation, so see the full recipe at the end for the complete structure). As the site does not return any duplicate links, I have kept the recipe simple by not checking for duplicates. See some of the built-in recipes for how duplicate checking can be carried out; a minimal sketch of one approach also appears after the full recipe below. I hope this helps:

(1) Import the basic recipe class and the needed parts of BeautifulSoup:

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

(2) Declare your class, derived from BasicNewsRecipe, and set the variable INDEX to the URL of the site page that holds the article links:

class ElPaisBabelia(BasicNewsRecipe):
    title = 'El Pais Babelia'
    __author__ = 'oneillpt'
    description = 'El Pais Babelia'
    INDEX = 'http://www.elpais.com/suple/babelia/'
    language = 'es'

(3) Examining the page source of the individual article pages, we find that the text, along with some material we do not want, is contained in a DIV with class="estructura_2col". keep_tags specifies that we work with this section; remove_tags_before removes some links which would otherwise appear before the article. Note that we deal with article extraction here, before we deal with link extraction later by overriding parse_index:

remove_tags_before = dict(name='div', attrs={'class':'estructura_2col'})
keep_tags = [dict(name='div', attrs={'class':'estructura_2col'})]

(4) remove_tags removes the additional material not required for the article. Add these entries after examining the generated article output and identifying the unwanted matter in the original page source:

remove_tags = [dict(name='div', attrs={'class':'votos estirar'}),
    dict(name='div', attrs={'id':'utilidades'}),
    dict(name='div', attrs={'class':'info_relacionada'}),
    dict(name='div', attrs={'class':'mod_apoyo'}),
    dict(name='div', attrs={'class':'contorno_f'}),
    dict(name='div', attrs={'class':'pestanias'}),
    dict(name='div', attrs={'class':'otros_webs'}),
    dict(name='div', attrs={'id':'pie'})
    ]

(5) You will probably want to remove JavaScript, and may want to disable loading of stylesheets. Here this makes little difference, so I have retained the stylesheet line for future use but made it a comment with "#":

#no_stylesheets = True
remove_javascript = True

(6) parse_index finds the article links, starting from the INDEX variable and looking for links in a DIV with class="contenedor_nuevo". No cover image is specified. All subsequent lines belong to parse_index; see the full code for the correct indentation structure:

def parse_index(self):
    articles = []
    soup = self.index_to_soup(self.INDEX)
    cover = None
    feeds = []
    for section in soup.findAll('div', attrs={'class':'contenedor_nuevo'}):
        section_title = self.tag_to_string(section.find('h1'))
        articles = []

(7) All article links have an "href" attribute:

for post in section.findAll('a', href=True):
    url = post['href']

(8) Other links may also have an "href" attribute, but article links start with "/" and need the base URL prepended:

if url.startswith('/'):
    url = 'http://www.elpais.es'+url
    title = self.tag_to_string(post)

(9) We may still have too many links, but all article links have a class attribute. The value of this class attribute varies, so we only check for its existence, not its value. Two points to note: the variable has been named klass because class is a reserved word in this context, and post['class'] will cause an error if there is no class attribute. So we first convert the post soup to a string and check whether it contains "class=":

if str(post).find('class=') > 0:
    klass = post['class']
    if klass != "":

(10) You may find it useful to log output to see what is happening. This output appears in the job details when the recipe is run from Calibre. Remember that you can also perform a manual fetch from a command prompt:

ebook-convert ElPaisBabelia.recipe ELPB --test -vv

In this case you can examine the HTML source of the two articles extracted under the ELPB folder structure.

self.log()
self.log('--> post: ', post)
self.log('--> url: ', url)
self.log('--> title: ', title)
self.log('--> class: ', klass)

(11) Build the list of article links:

articles.append({'title':title, 'url':url})

(12) Finally, if any article links have been found, append the article list to the feed list, which is returned:

if articles:
    feeds.append((section_title, articles))
return feeds

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

class ElPaisBabelia(BasicNewsRecipe):
    title = 'El Pais Babelia'
    __author__ = 'oneillpt'
    description = 'El Pais Babelia'
    INDEX = 'http://www.elpais.com/suple/babelia/'
    language = 'es'
    remove_tags_before = dict(name='div', attrs={'class':'estructura_2col'})
    keep_tags = [dict(name='div', attrs={'class':'estructura_2col'})]
    remove_tags = [dict(name='div', attrs={'class':'votos estirar'}),
        dict(name='div', attrs={'id':'utilidades'}),
        dict(name='div', attrs={'class':'info_relacionada'}),
        dict(name='div', attrs={'class':'mod_apoyo'}),
        dict(name='div', attrs={'class':'contorno_f'}),
        dict(name='div', attrs={'class':'pestanias'}),
        dict(name='div', attrs={'class':'otros_webs'}),
        dict(name='div', attrs={'id':'pie'})
        ]
    #no_stylesheets = True
    remove_javascript = True

    def parse_index(self):
        articles = []
        soup = self.index_to_soup(self.INDEX)
        cover = None
        feeds = []
        for section in soup.findAll('div', attrs={'class':'contenedor_nuevo'}):
            section_title = self.tag_to_string(section.find('h1'))
            articles = []
            for post in section.findAll('a', href=True):
                url = post['href']
                if url.startswith('/'):
                    url = 'http://www.elpais.es'+url
                    title = self.tag_to_string(post)
                    if str(post).find('class=') > 0:
                        klass = post['class']
                        if klass != "":
                            self.log()
                            self.log('--> post: ', post)
                            self.log('--> url: ', url)
                            self.log('--> title: ', title)
                            self.log('--> class: ', klass)
                            articles.append({'title':title, 'url':url})
            if articles:
                feeds.append((section_title, articles))
        return feeds
|
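As for the duplicate checking mentioned above, here is a minimal sketch of one common approach (my illustration, not part of the posted recipe): keep a set of URLs already queued and only append an article when its URL is new. The helper name and its use are hypothetical; the built-in recipes typically do this inline in parse_index.

Code:
def append_unique(articles, seen_urls, title, url):
    """Append an article dict only if this URL has not been queued already.

    articles  -- the list being built inside parse_index
    seen_urls -- a set shared across all sections of the index page
    """
    if url in seen_urls:
        return False            # duplicate link: skip it
    seen_urls.add(url)
    articles.append({'title': title, 'url': url})
    return True

Inside parse_index you would create seen_urls = set() before the section loop and call append_unique(articles, seen_urls, title, url) in place of the plain articles.append call.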
|
03-07-2011, 06:53 PM | #4 |
Evangelist
Posts: 475
Karma: 590
Join Date: Aug 2009
Location: Bangkok, Thailand
Device: Kindle Paperwhite
|
Wow. Thanks a lot. I'll try and let you know.
|
03-07-2011, 07:20 PM | #5 |
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2011
Device: kindle
|
Wonderful example. Thank you! |
|
03-07-2011, 09:27 PM | #6 |
Evangelist
Posts: 475
Karma: 590
Join Date: Aug 2009
Location: Bangkok, Thailand
Device: Kindle Paperwhite
|
Just wondering: both The New York Times and El Pais Babelia have RSS pages,
http://www.nytimes.com/services/xml/rss/index.html
http://www.elpais.com/rss/index.html
so why don't we start from there? The newspaper I'm interested in does not have RSS at all; just to confirm, can I still use your example above? |
03-07-2011, 10:25 PM | #7 | |
Connoisseur
Posts: 62
Karma: 46
Join Date: Feb 2011
Device: Kindle 3 (cracked screen!); PW1; Oasis
|
Quote:
If your newspaper does not have RSS, you need a recipe similar to mine (or one of the more involved built-in recipes if the HTML structure is more complicated), and you can modify my example to help you get started. |
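For contrast with the parse_index approach, here is a minimal sketch of what a recipe looks like when the paper does provide RSS: BasicNewsRecipe handles the link extraction itself once the feeds attribute lists the feed URLs. The class name and feed URL below are placeholders, not real addresses; substitute the entries from the paper's RSS index page.

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class FeedBasedExample(BasicNewsRecipe):
    # Minimal sketch of a feed-based recipe: when a paper publishes RSS,
    # no parse_index override is needed; BasicNewsRecipe reads the feeds itself.
    title             = 'Feed-based example'
    language          = 'en'
    oldest_article    = 7       # only fetch articles up to a week old
    max_articles_per_feed = 25
    no_stylesheets    = True
    remove_javascript = True

    # (section name, feed URL) pairs; the URL below is a placeholder only.
    feeds = [
        ('Front Page', 'http://example.com/rss/frontpage.xml'),
    ]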
|
03-07-2011, 10:33 PM | #8 |
Evangelist
Posts: 475
Karma: 590
Join Date: Aug 2009
Location: Bangkok, Thailand
Device: Kindle Paperwhite
|
oneillpt... thanks a lot. That's clear to me now.
So far I have only used some simple tag expressions, so this is quite a big step for me. However, it's worth trying. |
03-08-2011, 02:00 AM | #9 |
Connoisseur
Posts: 76
Karma: 12
Join Date: Nov 2010
Device: Android, PB Pro 602
|
Two suggestions for improvement:
Good work! |
03-08-2011, 02:21 AM | #10 |
Evangelist
Posts: 475
Karma: 590
Join Date: Aug 2009
Location: Bangkok, Thailand
Device: Kindle Paperwhite
|
I'm trying to extract articles from this page (sorry, the content is in Thai):
http://www.naewna.com/allnews.asp?ID=79
Viewing the source, I need to extract the article content from the article links on lines 418-717. Each article link looks something like http://www.naewna.com/news.asp?ID=241411 (with varying ID numbers). Could you guide me? Thanks in advance. |
03-08-2011, 03:45 PM | #11 | |
Connoisseur
Posts: 62
Karma: 46
Join Date: Feb 2011
Device: Kindle 3 (cracked screen!); PW1; Oasis
|
Quote:
I've also addressed the "Unknown feed" issue by replacing a missing section title with "Babelia Feed". The revised recipe, with logging for the section title and URL extraction, is now:

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

class ElPaisBabelia(BasicNewsRecipe):
    title = 'El Pais Babelia'
    __author__ = 'oneillpt'
    description = 'El Pais Babelia'
    masthead_url = 'http://www.elpais.com/im/tit_logo_int.gif'
    INDEX = 'http://www.elpais.com/suple/babelia/'
    language = 'es'
    remove_tags_before = dict(name='div', attrs={'class':'estructura_2col'})
    keep_tags = [dict(name='div', attrs={'class':'estructura_2col'})]
    remove_tags = [dict(name='div', attrs={'class':'votos estirar'}),
        dict(name='div', attrs={'id':'utilidades'}),
        dict(name='div', attrs={'class':'info_relacionada'}),
        dict(name='div', attrs={'class':'mod_apoyo'}),
        dict(name='div', attrs={'class':'contorno_f'}),
        dict(name='div', attrs={'class':'pestanias'}),
        dict(name='div', attrs={'class':'otros_webs'}),
        dict(name='div', attrs={'id':'pie'})
        ]
    no_stylesheets = True
    remove_javascript = True

    def parse_index(self):
        articles = []
        soup = self.index_to_soup(self.INDEX)
        cover = None
        feeds = []
        seen_titles = set([])
        for section in soup.findAll('div', attrs={'class':'contenedor_nuevo'}):
            section_title = self.tag_to_string(section.find('h1'))
            self.log('section_title(1): ', section_title)
            if section_title == "":
                section_title = "Babelia Feed"
            self.log('section_title(2): ', section_title)
            articles = []
            for post in section.findAll('a', href=True):
                url = post['href']
                if url.startswith('/'):
                    url = 'http://www.elpais.es'+url
                    title = self.tag_to_string(post)
                    if str(post).find('class=') > 0:
                        klass = post['class']
                        if klass != "":
                            self.log()
                            self.log('--> post: ', post)
                            self.log('--> url: ', url)
                            self.log('--> title: ', title)
                            self.log('--> class: ', klass)
                            articles.append({'title':title, 'url':url})
            if articles:
                feeds.append((section_title, articles))
        return feeds
|
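One small observation on the revised recipe: seen_titles is created but never consulted. If the intent was to skip repeated section titles, a hedged sketch of how it could be used inside the section loop (my guess at the intent, not the author's confirmed plan):

Code:
# Sketch only: one possible use for the otherwise unused seen_titles set,
# skipping any section whose title has already been processed.
if section_title in seen_titles:
    continue                      # duplicate section heading: skip it
seen_titles.add(section_title)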
|
03-08-2011, 11:05 PM | #12 | |
Connoisseur
Posts: 62
Karma: 46
Join Date: Feb 2011
Device: Kindle 3 (cracked screen!); PW1; Oasis
|
Quote:
I took a look at your Thai source and modified my recipe to extract your links. I found a problem, however: the Thai text is not rendered correctly. I can view the resulting e-book in MobiPocket Reader, and it looks like the desired e-book (the images in the articles appear correct), but the text is not proper Unicode. The e-book crashes the Calibre EPUB viewer and causes errors on my Kindle. You may be able to use the recipe below on a computer running a Thai version of the operating system (I use English-language Windows 7 Professional), but I suspect you will see the same text problem, because it appears to be caused by the encoding of the source web pages, content="text/html; charset=windows-874". The source for http://www.naewna.com/allnews.asp?ID=79 starts with:

Code:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-874">

For comparison, an RSS feed declaring UTF-8 encoding begins:

Code:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>...

Please post the result of your test. If the problem is the encoding of the source pages, it may be worth submitting this as an enhancement request/bug report. Similar problems would probably arise for other languages where a multi-byte non-Unicode encoding is used. The recipe (note the warning above regarding text rendering problems and crashing of the Calibre EPUB viewer):

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class thai(BasicNewsRecipe):
    title = u'thai'
    __author__ = 'oneillpt'
    #masthead_url = 'http://www.elpais.com/im/tit_logo_int.gif'
    INDEX = 'http://www.naewna.com/allnews.asp?ID=79'
    language = 'th'
    #remove_tags_before = dict(name='div', attrs={'class':'estructura_2col'})
    #keep_tags = [dict(name='div', attrs={'class':'estructura_2col'})]
    #remove_tags = [dict(name='div', attrs={'class':'votos estirar'}),
    #dict(name='div', attrs={'id':'utilidades'}),
    #dict(name='div', attrs={'class':'info_relacionada'}),
    #dict(name='div', attrs={'class':'mod_apoyo'}),
    #dict(name='div', attrs={'class':'contorno_f'}),
    #dict(name='div', attrs={'class':'pestanias'}),
    #dict(name='div', attrs={'class':'otros_webs'}),
    #dict(name='div', attrs={'id':'pie'})
    #]
    no_stylesheets = True
    remove_javascript = True

    def parse_index(self):
        articles = []
        soup = self.index_to_soup(self.INDEX)
        cover = None
        feeds = []
        for section in soup.findAll('body'):
            section_title = self.tag_to_string(section.find('h1'))
            z = section.find('td', attrs={'background':'images/fa04.gif'})
            self.log('z', z)
            x = z.find('font')
            self.log('x', x)
            y = x.find('strong')
            self.log('y', y)
            section_title = self.tag_to_string(y)
            self.log('section_title(1): ', section_title)
            if section_title == "":
                section_title = u'Thai Feed'
            self.log('section_title(2): ', section_title)
            articles = []
            for post in section.findAll('a', href=True):
                #self.log('--> p: ', post)
                url = post['href']
                #self.log('--> u: ', url)
                if url.startswith('n'):
                    url = 'http://www.naewna.com/'+url
                    #self.log('--> u: ', url)
                    title = self.tag_to_string(post)
                    #self.log('--> t: ', title)
                    if str(post).find('class=') > 0:
                        klass = post['class']
                        #self.log('--> k: ', klass)
                        if klass == "style4 style15":
                            self.log()
                            self.log('--> post: ', post)
                            self.log('--> url: ', url)
                            self.log('--> title: ', title)
                            self.log('--> class: ', klass)
                            articles.append({'title':title, 'url':url})
            if articles:
                feeds.append((section_title, articles))
        return feeds
|
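If the garbled text really does come from the windows-874 pages being mis-decoded, one thing that may be worth trying (an untested suggestion, not something from the posted recipe) is telling the recipe what encoding the source uses: BasicNewsRecipe has an encoding attribute that is applied when the downloaded pages are decoded. A minimal sketch, assuming Python's cp874 codec matches the site's charset; the extra attribute would be merged into the full Thai recipe above rather than used as a standalone class:

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class ThaiEncodingSketch(BasicNewsRecipe):
    # Sketch only: declare the source encoding so the downloaded pages are
    # decoded as Thai (windows-874/cp874) instead of Calibre's default guess.
    title    = u'thai (encoding sketch)'
    INDEX    = 'http://www.naewna.com/allnews.asp?ID=79'
    language = 'th'
    encoding = 'cp874'      # Python codec name for the windows-874 charset
    no_stylesheets    = True
    remove_javascript = True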
|
03-09-2011, 12:39 AM | #13 |
Evangelist
Posts: 475
Karma: 590
Join Date: Aug 2009
Location: Bangkok, Thailand
Device: Kindle Paperwhite
|
Wow, thanks a lot. I suspect the 874 code page could be the problem too. Let me try it and give you feedback.
Thanks a lot for your help. Really appreciate it. |
03-09-2011, 02:12 AM | #14 |
Connoisseur
Posts: 76
Karma: 12
Join Date: Nov 2010
Device: Android, PB Pro 602
|
More suggestions for babelia.recipe.
The metadata could be enhanced with the following changes:

Code:
publisher = u'Ediciones El Pa\xeds SL'
description = u'El Pa\xeds Babelia'
category = u'El Pa\xeds Babelia, Noticias, News, Newsfeed'
conversion_options = {'publisher': publisher,
                      'language' : language,
                      'tags'     : category,
                      'creator'  : publisher
                     }

A special cover page would be nice, but I don't know of any freely accessible image. |
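On the cover point, here is a hedged sketch of how a cover could be wired in, should a suitable image turn up. BasicNewsRecipe supports a cover_url attribute and, alternatively, a get_cover_url() method; the image URL below is a placeholder, not a known El Pais resource.

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class BabeliaCoverSketch(BasicNewsRecipe):
    # Sketch only: attach a cover once a usable, freely accessible image exists.
    # The URL below is a placeholder, not a real El Pais asset.
    title     = 'El Pais Babelia (cover sketch)'
    cover_url = 'http://example.com/babelia_cover.jpg'

    # Alternative: compute the cover address at fetch time, e.g. from the date.
    #def get_cover_url(self):
    #    return 'http://example.com/babelia_cover.jpg'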
03-09-2011, 04:37 AM | #15 |
Evangelist
Posts: 475
Karma: 590
Join Date: Aug 2009
Location: Bangkok, Thailand
Device: Kindle Paperwhite
|
Hi oneillpt,
I tried to fetch the news using your script; here is the error on my side. Not sure what to do next:

calibre, version 0.7.48
ERROR: Conversion Error: <b>Failed</b>: Fetch news from Jermsak_Naewna

Fetch news from Jermsak_Naewna
Resolved conversion options
calibre version: 0.7.48
{'asciiize': False, 'author_sort': None, 'authors': None, 'base_font_size': 0, 'book_producer': None,
 'change_justification': 'original', 'chapter': None, 'chapter_mark': 'pagebreak', 'comments': None,
 'cover': None, 'debug_pipeline': None, 'dehyphenate': True, 'delete_blank_paragraphs': True,
 'disable_font_rescaling': False, 'dont_compress': False, 'dont_download_recipe': False,
 'enable_heuristics': False, 'extra_css': None, 'fix_indents': True, 'font_size_mapping': None,
 'format_scene_breaks': True, 'html_unwrap_factor': 0.4, 'input_encoding': None,
 'input_profile': <calibre.customize.profiles.InputProfile object at 0x04F32F50>,
 'insert_blank_line': False, 'insert_metadata': False, 'isbn': None, 'italicize_common_cases': True,
 'keep_ligatures': False, 'language': None, 'level1_toc': None, 'level2_toc': None, 'level3_toc': None,
 'line_height': 0, 'linearize_tables': False, 'lrf': False, 'margin_bottom': 5.0, 'margin_left': 5.0,
 'margin_right': 5.0, 'margin_top': 5.0, 'markup_chapter_headings': True, 'max_toc_links': 50,
 'minimum_line_height': 120.0, 'mobi_ignore_margins': False, 'no_chapters_in_toc': False,
 'no_inline_navbars': True, 'no_inline_toc': False,
 'output_profile': <calibre.customize.profiles.KindleOutput object at 0x04F38290>,
 'page_breaks_before': None, 'password': None, 'personal_doc': '[PDOC]', 'prefer_author_sort': False,
 'prefer_metadata_cover': False, 'pretty_print': False, 'pubdate': None, 'publisher': None,
 'rating': None, 'read_metadata_from_opf': None, 'remove_first_image': False,
 'remove_paragraph_spacing': False, 'remove_paragraph_spacing_indent_size': 1.5,
 'renumber_headings': True, 'replace_scene_breaks': '', 'rescale_images': False, 'series': None,
 'series_index': None, 'smarten_punctuation': False, 'sr1_replace': '', 'sr1_search': '',
 'sr2_replace': '', 'sr2_search': '', 'sr3_replace': '', 'sr3_search': '', 'tags': None, 'test': False,
 'timestamp': None, 'title': None, 'title_sort': None, 'toc_filter': None, 'toc_threshold': 6,
 'toc_title': None, 'unwrap_lines': True, 'use_auto_toc': False, 'username': None, 'verbose': 2}

InputFormatPlugin: Recipe Input running
z <td valign="middle" background="images/fa04.gif" class="box1"> <font size="4" face="Arial, Helvetica, sans-serif"><strong> ����ѡ���ͤԴ���¤� </strong></font> </td>
x <font size="4" face="Arial, Helvetica, sans-serif"><strong> ����ѡ���ͤԴ���¤� </strong></font>
y <strong> ����ѡ���ͤԴ���¤� </strong>
section_title(1): ����ѡ���ͤԴ���¤�
section_title(2): ����ѡ���ͤԴ���¤�

--> post: <a href="news.asp?ID=252152" class="style4 style15" target="_blank">��ҹ��������´</a>
--> url: http://www.naewna.com/news.asp?ID=252152
--> title: ��ҹ��������´
--> class: style4 style15

--> post: <a href="news.asp?ID=251132" class="style4 style15" target="_blank">��ҹ��������´</a>
--> url: http://www.naewna.com/news.asp?ID=251132
--> title: ��ҹ��������´
--> class: style4 style15

--> post: <a href="news.asp?ID=250112" class="style4 style15" target="_blank">��ҹ��������´</a>
--> url: http://www.naewna.com/news.asp?ID=250112
--> title: ��ҹ��������´
--> class: style4 style15

--> post: <a href="news.asp?ID=249084" class="style4 style15" target="_blank">��ҹ��������´</a>
--> url: http://www.naewna.com/news.asp?ID=249084
--> title: ��ҹ��������´
--> class: style4 style15
--> post: <a href="news.asp?ID=248080" class="style4 style15" target="_blank">��ҹ��������´</a> --> url: http://www.naewna.com/news.asp?ID=248080 --> title: ��ҹ��������´ --> class: style4 style15 --> post: <a href="news.asp?ID=247031" class="style4 style15" target="_blank">��ҹ��������´</a> --> url: http://www.naewna.com/news.asp?ID=247031 --> title: ��ҹ��������´ --> class: style4 style15 --> post: <a href="news.asp?ID=246048" class="style4 style15" target="_blank">��ҹ��������´</a> --> url: http://www.naewna.com/news.asp?ID=246048 --> title: ��ҹ��������´ --> class: style4 style15 --> post: <a href="news.asp?ID=245090" class="style4 style15" target="_blank">��ҹ��������´</a> --> url: http://www.naewna.com/news.asp?ID=245090 --> title: ��ҹ��������´ --> class: style4 style15 --> post: <a href="news.asp?ID=244073" class="style4 style15" target="_blank">��ҹ��������´</a> --> url: http://www.naewna.com/news.asp?ID=244073 --> title: ��ҹ��������´ --> class: style4 style15 --> post: <a href="news.asp?ID=243150" class="style4 style15" target="_blank">��ҹ��������´</a> --> url: http://www.naewna.com/news.asp?ID=243150 --> title: ��ҹ��������´ --> class: style4 style15 --> post: <a href="news.asp?ID=242429" class="style4 style15" target="_blank">��ҹ��������´</a> --> url: http://www.naewna.com/news.asp?ID=242429 --> title: ��ҹ��������´ --> class: style4 style15 --> post: <a href="news.asp?ID=241411" class="style4 style15" target="_blank">��ҹ��������´</a> --> url: http://www.naewna.com/news.asp?ID=241411 --> title: ��ҹ��������´ --> class: style4 style15 Python function terminated unexpectedly 'class' (Error Code: 1) Traceback (most recent call last): File "site.py", line 103, in main File "site.py", line 85, in run_entry_point File "site-packages\calibre\utils\ipc\worker.py", line 110, in main File "site-packages\calibre\gui2\convert\gui_conversion.py", line 25, in gui_convert File "site-packages\calibre\ebooks\conversion\plumber.py", line 904, in run File "site-packages\calibre\customize\conversion.py", line 204, in __call__ File "site-packages\calibre\web\feeds\input.py", line 105, in convert File "site-packages\calibre\web\feeds\news.py", line 734, in download File "site-packages\calibre\web\feeds\news.py", line 871, in build_index File "c:\users\chotec~1\appdata\local\temp\calibre_0.7. 48_tmp_bm8qsi\calibre_0.7.48_spw2ws_recipes\recipe 0.py", line 55, in parse_index klass = post['class'] File "site-packages\calibre\ebooks\BeautifulSoup.py", line 518, in __getitem__ KeyError: 'class' |
|