#1 | Member | Posts: 13 | Karma: 10 | Join Date: Jun 2013 | Device: LG G-Pad 8.3
Parsing Chron.com with Beautiful Soup
Hello,
First-time poster, and relatively new to Python, so please bear with me. I have a simple Python script that scrapes Chron.com for useful links, which I want to use to create an epub; the script is posted below. I'm using BeautifulSoup v4 and parsing the pages as XML, since leaving out the "xml" option gives poor results. BeautifulStoneSoup is deprecated in version 4, but it gives the same results as the "xml" option in BeautifulSoup, as I'd expect. It appears to me that calibre is using version 3 of BeautifulSoup, and it's not giving me the same results as the script. How can I address this? If the answer is "use lxml", what is the proper import statement for the recipe? Thanks, Dale
Code:
from bs4 import BeautifulSoup
from bs4 import Tag
import re
import urllib2

print 'go to Chron sites, scrape them for useful links'
baseUrl = 'http://www.chron.com'
pages = {'news': '/news/houston-texas/', 'business': '/business/',
         'opinion': '/opinion/', 'sports': '/sports/'}
page_links = dict()
for page in pages.keys():
    url = urllib2.urlopen(baseUrl + pages[page])
    content = url.read()
    soup = BeautifulSoup(content, "xml")
    divs = soup.findAll('div', attrs={'class': re.compile('simplelist|scp-feature')})
    links_dict = {}
    for div in divs:
        print 'Page: ', page, ' div: ', div['class'], ' Number of Children: ', len(div.findChildren())
        for element in div.descendants:
            if isinstance(element, Tag) and element.name == u'a' and len(element['href']) > 10:
                if len(element.contents[0]) > 10:
                    links_dict[baseUrl + element['href']] = element.contents[0]
    page_links[page] = links_dict
print 'Here is the result of the web scrape'
for page in page_links.keys():
    links_dict = page_links[page]
    for link in links_dict:
        print page, " | ", links_dict[link], " | ", link
#2 | creator of calibre | Posts: 45,231 | Karma: 27110894 | Join Date: Oct 2006 | Location: Mumbai, India | Device: Various
The import statement is the same as for using lxml in any Python script. The recipe system does not care what you use to parse HTML; all it cares about is that parse_index() returns the correct data structure, as documented.
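For illustration, a minimal sketch of a parse_index() built on lxml; the class name and the XPath are only carried over from the chron.com markup in the script above and may need adjusting:

```python
import urllib2
from lxml import html
from calibre.web.feeds.recipes import BasicNewsRecipe

class HoustonChronicleLxml(BasicNewsRecipe):
    title = u'The Houston Chronicle (lxml sketch)'

    def parse_index(self):
        baseUrl = 'http://www.chron.com'
        raw = urllib2.urlopen(baseUrl + '/news/houston-texas/').read()
        root = html.fromstring(raw)
        articles = []
        # the XPath stands in for the findAll(...'simplelist|scp-feature'...) calls above
        for a in root.xpath("//div[contains(@class, 'simplelist') or contains(@class, 'scp-feature')]//a[@href]"):
            title = a.text_content().strip()
            href = a.get('href')
            if len(title) > 10 and len(href) > 10:
                articles.append({'title': title,
                                 'url': href if href.startswith('http') else baseUrl + href,
                                 'description': '', 'date': ''})
        # the only contract: a list of (section title, list of article dicts)
        return [('news', articles)]
```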
#3 | Member | Posts: 13 | Karma: 10 | Join Date: Jun 2013 | Device: LG G-Pad 8.3
Thanks for the prompt reply. Okay, I think I've got it: the problem was the descendants attribute; when I used findChildren instead, it all worked. I've posted the recipe below, will test it for a while, and then submit it for inclusion in calibre.
There's no description on the sites from which I'm obtaining the feeds, but there is a description on the feed destination. Is there any established way to handle this, other than by grabbing the text from a call within parse_index? Code:
#!/usr/bin/env python
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
__license__ = 'GPL v3'
__copyright__ = '2013, Dale Furrow dkfurrow@gmail.com'
'''
chron.com
'''
import re
import urllib2
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag


class HoustonChronicle(BasicNewsRecipe):

    title = u'The Houston Chronicle'
    description = 'News from Houston, Texas'
    __author__ = 'Dale Furrow'
    language = 'en'
    no_stylesheets = True
    # use_embedded_content = False
    remove_attributes = ['style']
    auto_cleanup = True

    def parse_index(self):
        self.timefmt = ' [%a, %d %b, %Y]'
        baseUrl = 'http://www.chron.com'
        pages = [('news', '/news/houston-texas/'), ('business', '/business/'),
                 ('opinion', '/opinion/'), ('sports', '/sports/')]
        feeds = []
        totalLinks = 0
        for page in pages:
            articles = []
            section_links = set()
            url = urllib2.urlopen(baseUrl + page[1])
            content = url.read()
            soup = BeautifulSoup(content)
            divs = soup.findAll('div', attrs={'class': re.compile('scp-feature|simplelist')})
            for div in divs:
                self.log('Page: ', page[0], ' div: ', div['class'],
                         ' Number of Children: ', len(div.findChildren()))
                for child in div.findChildren():
                    if isinstance(child, Tag) and child.name == u'a' and len(child['href']) > 10:
                        if len(child.contents[0]) > 10 and child['href'] not in section_links:
                            section_links.add(child['href'])
                            if child['href'].find('http') == -1:
                                link = baseUrl + child['href']
                            else:
                                link = child['href']
                            title = child.contents[0]
                            totalLinks += 1
                            self.log('\tFound article ', totalLinks, ' at ', title, 'at', link)
                            articles.append({'title': title, 'url': link,
                                             'description': '', 'date': ''})
            if articles:
                feeds.append((page[0], articles))
        self.log('Found ', totalLinks, ' articles --returning feeds')
        return feeds
#4 | creator of calibre | Posts: 45,231 | Karma: 27110894 | Join Date: Oct 2006 | Location: Mumbai, India | Device: Various
populate_article_metadata()
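That is, roughly along these lines; a sketch only, where the 210-character cutoff and the single-paragraph grab are illustrative rather than the final recipe:

```python
from calibre.web.feeds.recipes import BasicNewsRecipe

class HoustonChronicle(BasicNewsRecipe):
    # ... parse_index() as in the post above ...

    def populate_article_metadata(self, article, soup, first):
        # soup here is the downloaded article page, so a missing description
        # can be filled in from the article body
        if not first:
            return
        if not article.text_summary or not article.text_summary.strip():
            p = soup.find('p')
            if p is not None:
                article.summary = article.text_summary = \
                    self.tag_to_string(p, use_alt=False).strip()[:210]
```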
#5 | Member | Posts: 13 | Karma: 10 | Join Date: Jun 2013 | Device: LG G-Pad 8.3
Yes, that works; I was able to populate the summary from the article body. I'd like to populate the date as well, but it appears that the soup passed to populate_article_metadata has already been stripped down to the basic article body, which removes the tags I'm interested in. I tried to use the keep_only_tags feature to add the appropriate tags to the article body, but it didn't work. I see another poster has the same issue with that feature, so I'll just watch that thread. Recipe posted below:
Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
__license__ = 'GPL v3'
__copyright__ = '2013, Dale Furrow dkfurrow@gmail.com'
'''
chron.com
'''
import re, string
import urllib2
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag


class HoustonChronicle(BasicNewsRecipe):

    title = u'The Houston Chronicle'
    description = 'News from Houston, Texas'
    __author__ = 'Dale Furrow'
    language = 'en'
    no_stylesheets = True
    # use_embedded_content = False
    remove_attributes = ['style']
    auto_cleanup = True

    def parse_index(self):
        self.timefmt = ' [%a, %d %b, %Y]'
        baseUrl = 'http://www.chron.com'
        pages = [('news', '/news/houston-texas/'), ('business', '/business/'),
                 ('opinion', '/opinion/'), ('sports', '/sports/')]
        feeds = []
        totalLinks = 0
        for page in pages:
            articles = []
            section_links = set()
            url = urllib2.urlopen(baseUrl + page[1])
            content = url.read()
            soup = BeautifulSoup(content)
            divs = soup.findAll('div', attrs={'class': re.compile('scp-feature|simplelist|scp-item')})
            for div in divs:
                self.log('Page: ', page[0], ' div: ', div['class'],
                         ' Number of Children: ', len(div.findChildren()))
                for child in div.findChildren():
                    if isinstance(child, Tag) and child.name == u'a' and len(child['href']) > 10:
                        if len(child.contents[0]) > 10 and child['href'] not in section_links:
                            section_links.add(child['href'])
                            if child['href'].find('http') == -1:
                                link = baseUrl + child['href']
                            else:
                                link = child['href']
                            title = child.contents[0]
                            totalLinks += 1
                            self.log('\tFound article ', totalLinks, ' at ', title, 'at', link)
                            articles.append({'title': title, 'url': link,
                                             'description': '', 'date': ''})
            if articles:
                feeds.append((page[0], articles))
        self.log('Found ', totalLinks, ' articles --returning feeds')
        return feeds

    def populate_article_metadata(self, article, soup, first):
        if not first:
            return
        outputParagraph = ""
        max_length = 210  # approximately three lines of text
        try:
            if len(article.text_summary.strip()) == 0:
                articlebody = soup.find('body')
                if articlebody:
                    paras = articlebody.findAll('p')
                    for p in paras:
                        refparagraph = self.tag_to_string(p, use_alt=False).strip()
                        # account for blank paragraphs and short paragraphs by appending them to longer ones
                        outputParagraph += (" " + refparagraph)
                        if len(outputParagraph) > max_length:
                            article.summary = article.text_summary = outputParagraph.strip()[0:max_length]
                            return
            else:
                article.summary = article.text_summary = article.text_summary
        except:
            self.log("Error creating article descriptions")
            return
#6 | creator of calibre | Posts: 45,231 | Karma: 27110894 | Join Date: Oct 2006 | Location: Mumbai, India | Device: Various
If you want to use keep_only_tags, you have to disable auto_cleanup first.
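In other words, something along these lines; the tag filters are just copied from the recipe above to show the interaction:

```python
import re
from calibre.web.feeds.recipes import BasicNewsRecipe

class HoustonChronicle(BasicNewsRecipe):
    auto_cleanup = False  # keep_only_tags has no effect while auto_cleanup is True
    keep_only_tags = [
        dict(name='div', attrs={'class': re.compile('hentry')}),
        dict(name='h5', attrs={'class': re.compile('timestamp')}),
        dict(name='span', attrs={'class': re.compile('post-date|entry-date')}),
    ]
```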
#7 | creator of calibre | Posts: 45,231 | Karma: 27110894 | Join Date: Oct 2006 | Location: Mumbai, India | Device: Various
Also note that when converting a URL to soup, use:
soup = self.index_to_soup(baseUrl + page[1])
#8 | Member | Posts: 13 | Karma: 10 | Join Date: Jun 2013 | Device: LG G-Pad 8.3
Okay, thanks very much for your help. I'm making progress here; code posted below. keep_only_tags is working, and the correct date is attached to each article.
I plan on removing old articles in populate_article_metadata, but haven't done that yet. I know there are some tags that still need to be removed to make things cleaner, but I have a question about a repeated error message:

    Traceback (most recent call last):
      File "site-packages\calibre\web\fetch\simple.py", line 431, in process_images
      File "site-packages\calibre\utils\magick\__init__.py", line 132, in load
    Exception: no decode delegate for this image format `' @ error/blob.c/BlobToImage/360

It doesn't halt execution, but it's difficult for me to tell where the error is coming from. Is there an attribute I can set to ignore images which can't be processed? Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
__license__ = 'GPL v3'
__copyright__ = '2013, Dale Furrow dkfurrow@gmail.com'
'''
chron.com
'''
import re, string, time
import urllib2
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag
from calibre.utils.date import dt_factory, utcnow, local_tz


def getRegularTimestamp(dateString):
    try:
        outDate = time.strptime(dateString, "%Y-%m-%dT%H:%M:%SZ")
        return outDate
    except:
        return None

regextest = '(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|\
Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?) \
[0-9]{1,2}, 20[01][0-9]'


def GetDateFromString(inText):
    match = re.findall(regextest, inText)
    if match:
        try:
            outDate = time.strptime(match[0], "%B %d, %Y")
            return outDate
        except:
            return None
    else:
        return None


def getTimestampFromSoup(soup):
    timestampEle = soup.find('h5', attrs={'class': re.compile('timestamp')})
    if timestampEle is not None:
        try:
            timestampText = timestampEle['title']
            return getRegularTimestamp(timestampText)
        except:
            return None
    else:
        timestampEle = soup.find('span', attrs={'class': re.compile('post-date|entry-date')})
        if timestampEle is not None:
            try:
                timestampText = timestampEle.string
                return GetDateFromString(timestampText)
            except:
                return None
        else:
            return None


class HoustonChronicle(BasicNewsRecipe):

    title = u'The Houston Chronicle'
    description = 'News from Houston, Texas'
    __author__ = 'Dale Furrow'
    language = 'en'
    no_stylesheets = True
    # use_embedded_content = False
    remove_attributes = ['style']
    # auto_cleanup = True
    # dict(name='div', attrs={'class':re.compile('toolsList')})
    # keep_only_tags = [dict(id=['content', 'heading'])]
    # auto_cleanup_keep = '//*[@class="timestamp"]|//span[@class="entry-date"]|//span[@class="post-date"]'
    keep_only_tags = [dict(name='div', attrs={'class': re.compile('hentry')}),
                      dict(name='span', attrs={'class': re.compile('post-date|entry-date')}),
                      dict(name='h5', attrs={'class': re.compile('timestamp')}),
                      dict(name='div', attrs={'id': re.compile('post-')})]

    def parse_index(self):
        self.timefmt = ' [%a, %d %b, %Y]'
        baseUrl = 'http://www.chron.com'
        pages = [('business', '/business/')]
        feeds = []
        totalLinks = 0
        for page in pages:
            articles = []
            section_links = set()
            # url = urllib2.urlopen(baseUrl + page[1])
            # content = url.read()
            soup = self.index_to_soup(baseUrl + page[1])
            divs = soup.findAll('div', attrs={'class': re.compile('scp-feature|simplelist|scp-item')})
            for div in divs:
                # self.log('Page: ', page[0], ' div: ', div['class'], ' Number of Children: ', len(div.findChildren()))
                for child in div.findChildren():
                    if isinstance(child, Tag) and child.name == u'a' and len(child['href']) > 10:
                        if len(child.contents[0]) > 10 and child['href'] not in section_links:
                            section_links.add(child['href'])
                            if child['href'].find('http') == -1:
                                link = baseUrl + child['href']
                            else:
                                link = child['href']
                            title = child.contents[0]
                            totalLinks += 1
                            self.log('\tFound article ', totalLinks, ' at ', title, 'at', link)
                            articles.append({'title': title, 'url': link, 'description': '', 'date': ''})
            if articles:
                feeds.append((page[0], articles))
        self.log('Found ', totalLinks, ' articles --returning feeds')
        return feeds

    def populate_article_metadata(self, article, soup, first):
        if not first:
            return
        outputParagraph = ""
        max_length = 210  # approximately three lines of text
        # self.log('printing article: ', article.title)  # remove after debug
        # self.log(soup.prettify())  # remove after debug
        try:
            articleDate = getTimestampFromSoup(soup)  # remove after debug
        except Exception as inst:  # remove after debug
            self.log('Exception: ', article.title)  # remove after debug
            self.log(type(inst))  # remove after debug
            self.log(inst)  # remove after debug
        if articleDate is not None:
            dateText = time.strftime('%Y-%m-%d', articleDate)
            self.log(article.title, ' has timestamp of ', dateText)
            # self.log('Article Date is of type: ', type(article.date))  # remove after debug
            # self.log('Derived time is of type: ', type(articleDate))  # remove after debug
            try:
                article.date = articleDate
                article.utctime = dt_factory(articleDate, assume_utc=True, as_utc=True)
                article.localtime = article.utctime.astimezone(local_tz)
            except Exception as inst:  # remove after debug
                self.log('Exception: ', article.title)  # remove after debug
                self.log(type(inst))  # remove after debug
                self.log(inst)  # remove after debug
        else:
            dateText = time.strftime('%Y-%m-%d', time.gmtime())
            self.log(article.title, ' has no timestamp')
            # article.date = strftime('%a, %d %b')  # remove after debug
        try:
            if len(article.text_summary.strip()) == 0:
                articlebody = soup.find('body')
                if articlebody:
                    paras = articlebody.findAll('p')
                    for p in paras:
                        refparagraph = self.tag_to_string(p, use_alt=False).strip()
                        # account for blank paragraphs and short paragraphs by appending them to longer ones
                        outputParagraph += (" " + refparagraph)
                        if len(outputParagraph) > max_length:
                            article.summary = article.text_summary = outputParagraph.strip()[0:max_length]
                            return
            else:
                article.summary = article.text_summary = article.text_summary
        except:
            self.log("Error creating article descriptions")
            return
#9 | creator of calibre | Posts: 45,231 | Karma: 27110894 | Join Date: Oct 2006 | Location: Mumbai, India | Device: Various
Don't worry about that error; images that cannot be processed don't affect the download, they are simply ignored. If you really want to remove them, use preprocess_html().
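For reference, a minimal sketch of that approach; it strips every img tag, whereas a real recipe would probably filter on the src or file type instead:

```python
from calibre.web.feeds.recipes import BasicNewsRecipe

class HoustonChronicle(BasicNewsRecipe):

    def preprocess_html(self, soup):
        # drop <img> tags before the fetcher tries to decode them
        for img in soup.findAll('img'):
            img.extract()
        return soup
```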
#10 | Member | Posts: 13 | Karma: 10 | Join Date: Jun 2013 | Device: LG G-Pad 8.3
I've got the tags cleaned up tolerably well. The only thing I haven't been able to do is delete specific articles, after parsing, based on date (the isWithinDays check in populate_article_metadata: I'd like to delete articles more than 2 days old). I've attached the correct date to the article in populate_article_metadata; is it possible to delete the article there? If so, what's the correct syntax? I couldn't seem to make anything work to do that.
Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
__license__ = 'GPL v3'
__copyright__ = '2013, Dale Furrow dkfurrow@gmail.com'
'''
chron.com
'''
import re, string, time
import urllib2
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag
from calibre.utils.date import dt_factory, utcnow, local_tz
from datetime import datetime, timedelta


def getRegularTimestamp(dateString):
    try:
        outDate = time.strptime(dateString, "%Y-%m-%dT%H:%M:%SZ")
        return outDate
    except:
        return None

regextest = '(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|\
Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?) \
[0-9]{1,2}, 20[01][0-9]'


def GetDateFromString(inText):
    match = re.findall(regextest, inText)
    if match:
        try:
            outDate = time.strptime(match[0], "%B %d, %Y")
            return outDate
        except:
            return None
    else:
        return None


def isWithinDays(inTT, daysAgo):
    daysAgoDateTime = datetime.now() - timedelta(days=daysAgo)
    DaysAgoDateTime = datetime(inTT[0], inTT[1], inTT[2], inTT[3], inTT[4], inTT[5])
    return DaysAgoDateTime > daysAgoDateTime


def getTimestampFromSoup(soup):
    timestampEle = soup.find('h5', attrs={'class': re.compile('timestamp')})
    if timestampEle is not None:
        try:
            timestampText = timestampEle['title']
            return getRegularTimestamp(timestampText)
        except:
            return None
    else:
        timestampEle = soup.find('span', attrs={'class': re.compile('post-date|entry-date')})
        if timestampEle is not None:
            try:
                timestampText = timestampEle.string
                return GetDateFromString(timestampText)
            except:
                return None
        else:
            return None


class HoustonChronicle(BasicNewsRecipe):

    title = u'The Houston Chronicle'
    description = 'News from Houston, Texas'
    __author__ = 'Dale Furrow'
    language = 'en'
    no_stylesheets = True
    # use_embedded_content = False
    remove_attributes = ['style']
    remove_empty_feeds = True
    keep_only_tags = [dict(name='div', attrs={'class': re.compile('hentry')}),
                      dict(name='span', attrs={'class': re.compile('post-date|entry-date')}),
                      dict(name='h5', attrs={'class': re.compile('timestamp')}),
                      dict(name='div', attrs={'id': re.compile('post-')})]
    remove_tags = [dict(name='div', attrs={'class': 'socialBar'}),
                   dict(name='div', attrs={'class': re.compile('post-commentmeta')}),
                   dict(name='div', attrs={'class': re.compile('slideshow_wrapper')})]

    def parse_index(self):
        self.timefmt = ' [%a, %d %b, %Y]'
        baseUrl = 'http://www.chron.com'
        pages = [('business', '/business/')]
        feeds = []
        totalLinks = 0
        for page in pages:
            articles = []
            section_links = set()
            # url = urllib2.urlopen(baseUrl + page[1])
            # content = url.read()
            soup = self.index_to_soup(baseUrl + page[1])
            divs = soup.findAll('div', attrs={'class': re.compile('scp-feature|simplelist|scp-item')})
            for div in divs:
                # self.log('Page: ', page[0], ' div: ', div['class'], ' Number of Children: ', len(div.findChildren()))
                for child in div.findChildren():
                    if isinstance(child, Tag) and child.name == u'a' and len(child['href']) > 10:
                        if len(child.contents[0]) > 10 and child['href'] not in section_links:
                            section_links.add(child['href'])
                            if child['href'].find('http') == -1:
                                link = baseUrl + child['href']
                            else:
                                link = child['href']
                            title = child.contents[0]
                            totalLinks += 1
                            self.log('\tFound article ', totalLinks, ' at ', title, 'at', link)
                            articles.append({'title': title, 'url': link, 'description': '', 'date': ''})
            if articles:
                feeds.append((page[0], articles))
        self.log('Found ', totalLinks, ' articles --returning feeds')
        return feeds

    def populate_article_metadata(self, article, soup, first):
        if not first:
            return
        outputParagraph = ""
        max_length = 210  # approximately three lines of text
        # self.log('printing article: ', article.title)  # remove after debug
        # self.log(soup.prettify())  # remove after debug
        try:
            articleDate = getTimestampFromSoup(soup)  # remove after debug
        except Exception as inst:  # remove after debug
            self.log('Exception: ', article.title)  # remove after debug
            self.log(type(inst))  # remove after debug
            self.log(inst)  # remove after debug
        if articleDate is not None:
            dateText = time.strftime('%Y-%m-%d', articleDate)
            # self.log(article.title, ' has timestamp of ', dateText)
            # self.log('Article Date is of type: ', type(article.date))  # remove after debug
            # self.log('Derived time is of type: ', type(articleDate))  # remove after debug
            try:
                article.date = articleDate
                article.utctime = dt_factory(articleDate, assume_utc=True, as_utc=True)
                article.localtime = article.utctime.astimezone(local_tz)
                if not isWithinDays(articleDate, 2):
                    print 'Article: ', article.title, ' is more than 2 days old'
            except Exception as inst:  # remove after debug
                self.log('Exception: ', article.title)  # remove after debug
                self.log(type(inst))  # remove after debug
                self.log(inst)  # remove after debug
        else:
            dateText = time.strftime('%Y-%m-%d', time.gmtime())
            self.log(article.title, ' has no timestamp')
            # article.date = strftime('%a, %d %b')  # remove after debug
        try:
            if len(article.text_summary.strip()) == 0:
                articlebody = soup.find('body')
                if articlebody:
                    paras = articlebody.findAll('p')
                    for p in paras:
                        refparagraph = self.tag_to_string(p, use_alt=False).strip()
                        # account for blank paragraphs and short paragraphs by appending them to longer ones
                        outputParagraph += (" " + refparagraph)
                        if len(outputParagraph) > max_length:
                            article.summary = article.text_summary = outputParagraph.strip()[0:max_length]
                            return
            else:
                article.summary = article.text_summary = article.text_summary
        except:
            self.log("Error creating article descriptions")
            return
#11 | Member | Posts: 13 | Karma: 10 | Join Date: Jun 2013 | Device: LG G-Pad 8.3
Latest Chronicle recipe attached.
So, I think the probable answer to my last post is "it can't be done": if you want to exclude an article, you have to make sure it doesn't get returned from parse_index in the first place.
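A sketch of how that filtering can sit inside parse_index(), reusing getTimestampFromSoup() and isWithinDays() from the recipe in post #10 above; collect_section_links() is a hypothetical helper standing in for the div/anchor walk already shown there:

```python
    # inside class HoustonChronicle(BasicNewsRecipe):
    def parse_index(self):
        baseUrl = 'http://www.chron.com'
        feeds = []
        for section, path in [('business', '/business/')]:
            articles = []
            section_soup = self.index_to_soup(baseUrl + path)
            for title, link in collect_section_links(section_soup, baseUrl):  # hypothetical helper
                # fetching each article page just to read its timestamp is what
                # makes this slow, but parse_index is the only place an article
                # can still be excluded from the download
                stamp = getTimestampFromSoup(self.index_to_soup(link))
                if stamp is not None and not isWithinDays(stamp, 2):
                    self.log('Skipping ', title, ' -- more than 2 days old')
                    continue
                articles.append({'title': title, 'url': link, 'description': '', 'date': ''})
            if articles:
                feeds.append((section, articles))
        return feeds
```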
The latest Houston Chronicle recipe is attached; I'm comfortable with this as a submission for the next build. It's somewhat slow (over 4 minutes on my machine), because it parses every article page (with lxml) in parse_index in order to populate the metadata and remove old articles. It does seem strange to me that the date argument to the Article constructor doesn't populate the finished date in the ebook; I had to revisit Article.date in populate_article_metadata. I see that the API allows saving content to a temporary file, and there's an example in LeMonde. If I have time I'll see if I can figure out how to apply that here; it might speed things up a bit, but it's unclear to me how embedded pictures would be handled. I would be happy to take any suggestions for improvement. Thanks, Dale
#12 | Zealot | Posts: 103 | Karma: 10 | Join Date: Sep 2013 | Device: Kindle Paperwhite (2012)
BeautifulSoup version
The original question asks which version of BeautifulSoup calibre is using; it appears to be version 3. Is there a way to use version 4? If not, when will calibre start using the new version?
#13 | Zealot | Posts: 103 | Karma: 10 | Join Date: Sep 2013 | Device: Kindle Paperwhite (2012)
BeautifulSoup is no longer the recommended way to develop recipes; see: https://bugs.launchpad.net/calibre/+bug/1247222
Tags: beautifulsoup, calibre, chron.com, parser, recipe
Thread | Thread Starter | Forum | Replies | Last Post |
Beautiful soup findAll doesn't seem to work | Steven630 | Recipes | 13 | 08-19-2012 02:44 AM |
HTML5 parsing | nickredding | Conversion | 8 | 08-09-2012 09:50 AM |
Parsing Index | Steven630 | Recipes | 0 | 07-06-2012 04:53 AM |
iPad PageList parsing using Javascript. | Oh.Danny.Boy | Apple Devices | 0 | 05-17-2012 05:24 PM |
Parsing Titles | cgraving | Calibre | 3 | 01-17-2011 02:52 AM |