Multipage questions (Sueddeutsche Magazin)

aerodynamik · 04-25-2011, 05:59 AM

Hello,

Besides the Sueddeutsche Zeitung newspaper there is also a magazine.
Some of its articles span multiple pages. A print version is available for some articles, but only inconsistently.

I looked at the "Adventure Gamers" multipage example and adapted the code. Here all subsequent pages are linked from the first page, so I skipped the recursion and implemented it the simpler, iterative way.
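
As a rough illustration of the iterative idea (not the recipe itself), the sketch below merges pages without recursion. The PAGES dictionary, the URLs, and the merge_pages helper are made up for the example; a real recipe would fetch each page with self.index_to_soup(url) and work on the soup instead of using regular expressions.

```python
import re

# Hypothetical stand-in for the site: URL -> HTML of that page.
PAGES = {
    '/a/1': '<ul class="blaettern"><li>1</li>'
            '<li><a href="/a/2">2</a></li>'
            '<li><a href="/a/3">3</a></li></ul>'
            '<div id="artikel">first</div>',
    '/a/2': '<div id="artikel">second</div>',
    '/a/3': '<div id="artikel">third</div>',
}

ARTICLE = re.compile(r'<div id="artikel">(.*?)</div>', re.DOTALL)
PAGE_LINK = re.compile(r'<li><a href="([^"]+)">')


def merge_pages(first_url, fetch=PAGES.get):
    """Join the article bodies of all pages linked from the first page."""
    first = fetch(first_url)
    bodies = [ARTICLE.search(first).group(1)]
    # Every follow-up page is linked on page one, so a plain loop is enough;
    # there is no need to follow "next" links recursively.
    for url in PAGE_LINK.findall(first):
        bodies.append(ARTICLE.search(fetch(url)).group(1))
    return '\n'.join(bodies)


merged = merge_pages('/a/1')
```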

Here is my current code.

Spoiler:

Code:

#!/usr/bin/env  python

__license__   = 'GPL v3'
__copyright__ = '2011, Nikolas Mangold <nmangold at gmail.com>'
'''
sz-magazin.de
'''
from calibre.web.feeds.news import BasicNewsRecipe
from calibre import strftime
from BeautifulSoup import Comment

class SueddeutscheZeitungMagazin(BasicNewsRecipe):
    title                  = 'Sueddeutsche Zeitung Magazin'
    __author__             = 'Nikolas Mangold'
    description            = 'Sueddeutsche Zeitung Magazin'
    category               = 'Germany'
    no_stylesheets         = True
    encoding               = 'cp1252'
    remove_empty_feeds     = True
    delay                  = 1
    PREFIX                 = 'http://sz-magazin.sueddeutsche.de'
    INDEX                  = PREFIX + '/hefte'
    use_embedded_content   = False
    masthead_url = 'http://sz-magazin.sueddeutsche.de/img/general/logo.gif'
    language               = 'de'
    publication_type       = 'magazine'
    extra_css              = ' body{font-family: Arial,Helvetica,sans-serif} '
    timefmt = '%W %Y'

    remove_tags_before =  dict(attrs={'class':'vorspann'})
    remove_tags_after  =  dict(attrs={'id':'commentsContainer'})
    remove_tags = [dict(name='ul', attrs={'class':'textoptions'}),dict(name='div', attrs={'class':'BannerBug'}),dict(name='div', attrs={'id':'commentsContainer'}),dict(name='div', attrs={'class':'plugin-linkbox'})]
        
    def parse_index(self):
        # determine current issue
        index = self.index_to_soup(self.INDEX)
        year_index = index.find('ul', attrs={'class':'hefte-jahre'})
        week_index = index.find('ul', attrs={'class':'heftindex'})
        year = self.tag_to_string(year_index.find('li')).strip()
        tmp = week_index.find('li').a
        week = self.tag_to_string(tmp)
        aktuelles_heft = self.PREFIX + tmp['href']

        # set cover
        self.cover_url = 'http://sz-magazin.sueddeutsche.de/img/hefte/thumbs_l/{0}{1}.jpg'.format(year,week)

        # find articles and add to main feed
        soup = self.index_to_soup(aktuelles_heft)
        content = soup.find('div',{'id':'maincontent'})
        feed = 'SZ Magazin {0}/{1}'.format(week, year)

        feeds = []
        articles = []

        for article in content.findAll('li'):
            txt = article.find('div',{'class':'text-holder'})
            if txt is None:
                continue
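            link = txt.find('a')
            desc = txt.find('p')
            # calibre expects the key 'description' holding plain text, not a tag
            description = self.tag_to_string(desc).strip() if desc is not None else ''
            title = self.tag_to_string(link).strip()
            self.log('Found article ', title)
            url = self.PREFIX + link['href']
            articles.append({'title' : title, 'date' : strftime(self.timefmt), 'url' : url, 'description' : description})

        feeds.append((feed,articles))
        return feeds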

    def preprocess_html(self, soup):
        # determine if multipage, if not bail out
        multipage = soup.find('ul',attrs={'class':'blaettern'})
        if multipage is None:
            return soup

        # get all subsequent pages and delete multipage links
        next_pages = []
        for item in multipage.findAll('li'):
            # the entry for the current page has no link
            if item.a is None:
                continue
            nexturl = item.a['href']
            nexttitle = self.tag_to_string(item).strip()
            next_pages.append((self.PREFIX + nexturl,nexttitle))
        multipage.extract()

        # extract article from subsequent pages and insert at end of first page article
        firstpage_article = soup.find('div',attrs={'id':'artikel'})
        for url, title in next_pages:
            next_soup = self.index_to_soup(url)
            next_article = next_soup.find('div',attrs={'id':'artikel'})
            banner = next_article.find('div',attrs={'class':'BannerBug'})
            if banner:
                banner.extract()
            # each page is inserted as a single element, so append at the current end
            firstpage_article.insert(len(firstpage_article.contents), next_article)

        return firstpage_article

Things look pretty good right now, except for one issue with HTML comments in preprocess_html. Comments such as

Code:

<!-- ad tag -->

end up wrapped in a second comment:

Code:

<!--<!-- ad tag -->-->

I tried extracting the comment with

Code:

comments = next_article.findAll(text=lambda text:isinstance(text, Comment))
for comment in comments:
    comment.extract()

but I think they are no longer recognized as comments.

Why do the HTML comments get wrapped in comment markers a second time?

miwie · 04-25-2011, 07:20 AM

Really nice work for "Süddeutsche Magazin"!

Though I cannot give any hints on the question itself, let me suggest the following improvements:

Use Unicode strings for metadata (e.g. title) by prefixing the string literal with 'u' (and use umlauts in the text itself, of course)
Set the correct language metadata by using something like conversion_options = {'language' : language}
Set the publisher in the metadata, e.g. publisher = u'Magazin Verlagsgesellschaft / Süddeutsche Zeitung mbH / 81677 München'

+Karma!

aerodynamik · 04-25-2011, 08:21 AM

Quote:

Originally Posted by miwie

Really nice work for "Süddeutsche Magazin"!

Though I cannot give any hints on the question itself, let me suggest the following improvements:

Use Unicode strings for metadata (e.g. title) by prefixing the string literal with 'u' (and use umlauts in the text itself, of course)
Set the correct language metadata by using something like conversion_options = {'language' : language}
Set the publisher in the metadata, e.g. publisher = u'Magazin Verlagsgesellschaft / Süddeutsche Zeitung mbH / 81677 München'

+Karma!

Thanks for the feedback and the karma.

I added the conversion options, the publisher and the Unicode text with umlauts for the title etc.

I also took another look at the comments in preprocess_html. The comments were actually still correct when I logged them there, so apparently they are modified (incorrectly?) only after preprocess_html.

After removing the banner ad, the only comment left was the google_ads one. Removing comments as described in the BeautifulSoup documentation did not work because the comments were not found. I was able to find and remove them with this code:

Code:

            comments = next_article.findAll(text=re.compile('google_ad'))
            for comment in comments:
                comment.extract()

This is my current version.

Spoiler:

Code:

#!/usr/bin/env  python

__license__   = 'GPL v3'
__copyright__ = '2011, Nikolas Mangold <nmangold at gmail.com>'
'''
sz-magazin.de
'''
from calibre.web.feeds.news import BasicNewsRecipe
from calibre import strftime
import re

class SueddeutscheZeitungMagazin(BasicNewsRecipe):
    title                  = u'Süddeutsche Zeitung Magazin'
    __author__             = 'Nikolas Mangold'
    description            = u'Süddeutsche Zeitung Magazin'
    publisher              = u'Magazin Verlagsgesellschaft / Süddeutsche Zeitung mbH / 81677 München'
    category               = 'Germany'
    no_stylesheets         = True
    encoding               = 'cp1252'
    remove_empty_feeds     = True
    delay                  = 1
    PREFIX                 = 'http://sz-magazin.sueddeutsche.de'
    INDEX                  = PREFIX + '/hefte'
    use_embedded_content   = False
    masthead_url = 'http://sz-magazin.sueddeutsche.de/img/general/logo.gif'
    language               = 'de'
    publication_type       = 'magazine'
    extra_css              = ' body{font-family: Arial,Helvetica,sans-serif} '
    timefmt = '%W %Y'

    conversion_options = {
                          'comment'          : description
                        , 'tags'             : category
                        , 'publisher'        : publisher
                        , 'language'         : language
                        , 'linearize_tables' : True
                        }

    remove_tags_before =  dict(attrs={'class':'vorspann'})
    remove_tags_after  =  dict(attrs={'id':'commentsContainer'})
    remove_tags = [dict(name='ul', attrs={'class':'textoptions'}),dict(name='div', attrs={'class':'BannerBug'}),dict(name='div', attrs={'id':'commentsContainer'}),dict(name='div', attrs={'class':'plugin-linkbox'})]
        
    def parse_index(self):
        feeds = []

        # determine current issue
        index = self.index_to_soup(self.INDEX)
        year_index = index.find('ul', attrs={'class':'hefte-jahre'})
        week_index = index.find('ul', attrs={'class':'heftindex'})
        year = self.tag_to_string(year_index.find('li')).strip()
        tmp = week_index.find('li').a
        week = self.tag_to_string(tmp)
        aktuelles_heft = self.PREFIX + tmp['href']

        # set cover
        self.cover_url = '{0}/img/hefte/thumbs_l/{1}{2}.jpg'.format(self.PREFIX,year,week)

        # find articles and add to main feed
        soup = self.index_to_soup(aktuelles_heft)
        content = soup.find('div',{'id':'maincontent'})
        mainfeed = 'SZ Magazin {0}/{1}'.format(week, year)
        articles = []
        for article in content.findAll('li'):
            txt = article.find('div',{'class':'text-holder'})
            if txt is None:
                continue
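            link = txt.find('a')
            desc = txt.find('p')
            # calibre expects the key 'description' holding plain text, not a tag
            description = self.tag_to_string(desc).strip() if desc is not None else ''
            title = self.tag_to_string(link).strip()
            self.log('Found article ', title)
            url = self.PREFIX + link['href']
            articles.append({'title' : title, 'date' : strftime(self.timefmt), 'url' : url, 'description' : description})
        feeds.append((mainfeed,articles))

        return feeds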

    def preprocess_html(self, soup):
        # determine if multipage, if not bail out
        multipage = soup.find('ul',attrs={'class':'blaettern'})
        if multipage is None:
            return soup

        # get all subsequent pages and delete multipage links
        next_pages = []
        for item in multipage.findAll('li'):
            # the entry for the current page has no link
            if item.a is None:
                continue
            nexturl = item.a['href']
            nexttitle = self.tag_to_string(item).strip()
            next_pages.append((self.PREFIX + nexturl,nexttitle))
        multipage.extract()

        # extract article from subsequent pages and insert at end of first page article
        firstpage_article = soup.find('div',attrs={'id':'artikel'})
        for url, title in next_pages:
            next_soup = self.index_to_soup(url)
            next_article = next_soup.find('div',attrs={'id':'artikel'})

            # remove banner ad
            banner = next_article.find('div',attrs={'class':'BannerBug'})
            if banner:
                banner.extract()

            # remove the remaining google_ad HTML comments
            for comment in next_article.findAll(text=re.compile('google_ad')):
                comment.extract()

            # each page is inserted as a single element, so append at the current end
            firstpage_article.insert(len(firstpage_article.contents), next_article)

        return firstpage_article

The following could still be done:

Fix image galleries; the website again has at least two different ways of implementing them
Add blogs and 'Kolumnen' (columns); again, blogs are formatted differently from 'Kolumnen'
Remove some extra line breaks
Some articles don't display the headline

I'll take a look at these later. For me it is already very good as it is.

aerodynamik · 04-25-2011, 09:24 AM

[obsolete]

aerodynamik · 04-25-2011, 01:42 PM

Okay, I think this should work now. There is still room for improvement (images, line breaks, additional blogs).

The issue regarding HTML comments is still unclear to me.

In addition, I understand that remove_tags is applied after preprocess_html. Is there a smart way to re-implement remove_tags? Is there a way to process the subsequent pages the same way as any other downloaded page?

But for now, have fun with this, let me know if it works for you as well.

Spoiler:

Code:

#!/usr/bin/env  python

__license__   = 'GPL v3'
__copyright__ = '2011, Nikolas Mangold <nmangold at gmail.com>'
'''
sz-magazin.de
'''
from calibre.web.feeds.news import BasicNewsRecipe
from calibre import strftime
import re

class SueddeutscheZeitungMagazin(BasicNewsRecipe):
    title                  = u'Süddeutsche Zeitung Magazin'
    __author__             = 'Nikolas Mangold'
    description            = u'Süddeutsche Zeitung Magazin'
    publisher              = u'Magazin Verlagsgesellschaft Süddeutsche Zeitung mbH'
    category               = 'Germany'
    no_stylesheets         = True
    encoding               = 'cp1252'
    remove_empty_feeds     = True
    delay                  = 1
    PREFIX                 = 'http://sz-magazin.sueddeutsche.de'
    INDEX                  = PREFIX + '/hefte'
    use_embedded_content   = False
    masthead_url = 'http://sz-magazin.sueddeutsche.de/img/general/logo.gif'
    language               = 'de'
    publication_type       = 'magazine'
    extra_css              = ' body{font-family: Arial,Helvetica,sans-serif} '
    timefmt = '%W %Y'

    conversion_options = {
                          'comment'          : description
                        , 'tags'             : category
                        , 'publisher'        : publisher
                        , 'language'         : language
                        , 'linearize_tables' : True
                        }

    remove_tags_before =  dict(attrs={'class':'vorspann'})
    remove_tags_after  =  dict(attrs={'id':'commentsContainer'})
    remove_tags = [
         dict(name='ul',attrs={'class':'textoptions'}),
         dict(name='div',attrs={'class':'BannerBug'}),
         dict(name='div',attrs={'id':'commentsContainer'}),
         dict(name='div',attrs={'class':'plugin-linkbox'}), #not working
         dict(name='div',attrs={'id':'galleryInfo0'}),
         dict(name='div',attrs={'class':'control'})
    ]
        
    def parse_index(self):
        feeds = []

        # determine current issue
        index = self.index_to_soup(self.INDEX)
        year_index = index.find('ul', attrs={'class':'hefte-jahre'})
        week_index = index.find('ul', attrs={'class':'heftindex'})
        year = self.tag_to_string(year_index.find('li')).strip()
        tmp = week_index.find('li').a
        week = self.tag_to_string(tmp)
        aktuelles_heft = self.PREFIX + tmp['href']

        # set cover
        self.cover_url = '{0}/img/hefte/thumbs_l/{1}{2}.jpg'.format(self.PREFIX,year,week)

        # find articles and add to main feed
        soup = self.index_to_soup(aktuelles_heft)
        content = soup.find('div',{'id':'maincontent'})
        mainfeed = 'SZ Magazin {0}/{1}'.format(week, year)
        articles = []
        for article in content.findAll('li'):
            txt = article.find('div',{'class':'text-holder'})
            if txt is None:
                continue
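            link = txt.find('a')
            desc = txt.find('p')
            # calibre expects the key 'description' holding plain text, not a tag
            description = self.tag_to_string(desc).strip() if desc is not None else ''
            title = self.tag_to_string(link).strip()
            self.log('Found article ', title)
            url = self.PREFIX + link['href']
            articles.append({'title' : title, 'date' : strftime(self.timefmt), 'url' : url, 'description' : description})
        feeds.append((mainfeed,articles))

        return feeds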

    def preprocess_html(self, soup):
        # determine if multipage, if not bail out
        multipage = soup.find('ul',attrs={'class':'blaettern'})
        if multipage is None:
            return soup

        # get all subsequent pages and delete multipage links
        next_pages = []
        for item in multipage.findAll('li'):
            # the entry for the current page has no link
            if item.a is None:
                continue
            nexturl = item.a['href']
            nexttitle = self.tag_to_string(item).strip()
            next_pages.append((self.PREFIX + nexturl,nexttitle))
        multipage.extract()

        # extract article from subsequent pages and insert at end of first page article
        firstpage = soup.find('body')
        firstpage_header = firstpage.find('div',attrs={'class':'vorspann'})
        firstpage_article = firstpage.find('div',attrs={'id':'artikel'})
        firstpage_header.insert(len(firstpage_header.contents),firstpage_article)

        for url, title in next_pages:
            next_soup = self.index_to_soup(url)
            next_article = next_soup.find('div',attrs={'id':'artikel'})

            # remove banner ad
            banner = next_article.find('div',attrs={'class':'BannerBug'})
            if banner:
                banner.extract()

            # remove the remaining google_ad HTML comments
            for comment in next_article.findAll(text=re.compile('google_ad')):
                comment.extract()

            firstpage_header.insert(len(firstpage_header.contents), next_article)

        return firstpage_header

kovidgoyal · 04-25-2011, 01:54 PM

The method you are looking for is called

postprocess_html

aerodynamik · 04-25-2011, 02:02 PM

Quote:

Originally Posted by kovidgoyal

The method you are looking for is called
postprocess_html

The obvious choice

This part in the documentation threw me off: "after it is parsed for links and images". I was hoping to parse image galleries at a later point and wasn't sure if the images would still be there correctly. I'll give it a try.

Can you shed some light on the HTML comments issue I have had and worked around, i.e. <!-- ad tag --> becomes <!--<!-- ad tag -->-->? Thanks in advance.

kovidgoyal · 04-25-2011, 02:15 PM

That can happen in various ways when you are manipulating the HTML. To avoid it, I typically just strip all comments with a regexp in preprocess_regexps
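
A minimal sketch of that suggestion, assuming the usual shape of preprocess_regexps: a list of pairs, each a compiled pattern and a replacement callable. The apply_regexps helper and the sample string are hypothetical and only show the effect outside calibre; in a recipe the list would simply be a class attribute.

```python
import re

# Same shape as a recipe's preprocess_regexps attribute:
# (compiled pattern, callable that returns the replacement text).
preprocess_regexps = [
    # Non-greedy match so each comment is removed on its own;
    # DOTALL lets a comment span several lines.
    (re.compile(r'<!--.*?-->', re.DOTALL), lambda match: ''),
]


def apply_regexps(html, rules):
    """Hypothetical helper: run every rule over the raw HTML in order."""
    for pattern, replacement in rules:
        html = pattern.sub(replacement, html)
    return html


sample = '<p>Text</p><!-- ad tag --><p>More</p><!--\ngoogle_ad\n-->'
cleaned = apply_regexps(sample, preprocess_regexps)
```

The rules act on the raw HTML before it is parsed, so the comments are gone before any soup manipulation can wrap them again. Whether they also reach pages fetched by hand with index_to_soup is worth checking.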

aerodynamik · 04-25-2011, 02:20 PM

Quote:

Originally Posted by kovidgoyal

That can happen in various ways when you are manipulating the HTML. To avoid it, I typically just strip all comments with a regexp in preprocess_regexps

Okay.

I gave postprocess_html a quick try. As expected, it receives the already processed page that was downloaded because it was added to the feeds. However, the additional pages that I download myself within this method are not processed with remove_tags.

I'm not sure I understood your original comment correctly. Did you mean that I should implement "remove all tags in remove_tags" in postprocess_html, since the pages I download in preprocess_html would then also be processed in postprocess_html?

kovidgoyal · 04-25-2011, 02:30 PM

All pages are processed by postprocess_html and all pages have remove_tags applied to them.

aerodynamik · 04-26-2011, 04:22 PM

Quote:

Originally Posted by kovidgoyal

All pages are processed by postprocess_html and all pages have remove_tags applied to them.

I did some re-tests and am sorry to say I cannot confirm this.

For additional pages that I download in preprocess_html or postprocess_html with self.index_to_soup(url), remove_tags is not applied. (In my case, a certain div is not removed.)

If I log the soup given to preprocess_html, remove_tags has already been applied. (In my case, that certain div is already removed.)

If I download additional pages with self.index_to_soup(url) in preprocess_html and add them to the original first page with "insert", the combined page is then processed by postprocess_html, but remove_tags is not re-applied to it. (In my case, that certain div is not removed from the combined page.)

I'm not complaining here; this sounds more than logical.

I am just curious whether there is any way to process a page downloaded within preprocess_html the same way as any other downloaded page. I only have the remove_tags issue now and can certainly re-implement it in preprocess_html, but that doesn't sound like a smart way to do it.

Cheers,
- aero

kovidgoyal · 04-26-2011, 04:50 PM

Look at the function get_soup in fetch/simple.py. This function is called by process_links to create soup for every link that is followed. And it explicitly applies remove_tags and co.
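
Assuming get_soup applies remove_tags by calling findAll with each entry and extracting the matches, the same loop could be run by hand on a page fetched with index_to_soup. In the sketch below apply_remove_tags is a hypothetical helper, and StubSoup and StubTag are stand-ins that exist only so the example runs without calibre; in a recipe the soup returned by self.index_to_soup(url) would be passed instead.

```python
def apply_remove_tags(soup, remove_tags):
    """Extract every element that matches one of the remove_tags specs."""
    for spec in remove_tags:
        # Each spec is a dict such as dict(name='div', attrs={'class': 'BannerBug'})
        # and can be passed straight to findAll as keyword arguments.
        for tag in soup.findAll(**spec):
            tag.extract()
    return soup


# --- Minimal stand-ins so the sketch runs without calibre or BeautifulSoup ---
class StubTag(object):
    def __init__(self, soup, name, attrs):
        self.soup, self.name, self.attrs = soup, name, attrs

    def extract(self):
        self.soup.tags.remove(self)
        return self


class StubSoup(object):
    def __init__(self, tags):
        self.tags = [StubTag(self, name, attrs) for name, attrs in tags]

    def findAll(self, name=None, attrs=None):
        attrs = attrs or {}
        # Returns a new list, so extracting while iterating is safe.
        return [t for t in self.tags
                if (name is None or t.name == name)
                and all(t.attrs.get(k) == v for k, v in attrs.items())]


remove_tags = [
    dict(name='div', attrs={'class': 'BannerBug'}),
    dict(name='div', attrs={'class': 'plugin-linkbox'}),
]
page = StubSoup([
    ('div', {'class': 'BannerBug'}),
    ('p', {}),
    ('div', {'class': 'plugin-linkbox'}),
    ('div', {'id': 'artikel'}),
])
apply_remove_tags(page, remove_tags)
remaining = [(t.name, t.attrs) for t in page.tags]
```

Inside preprocess_html this would amount to something like apply_remove_tags(next_soup, self.remove_tags) right after each index_to_soup call, so the recipe keeps a single list of unwanted elements.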
