Old 09-03-2011, 05:56 PM   #1
macpablus
Enthusiast
Posts: 25
Karma: 1896
Join Date: Aug 2011
Device: Kindle 3
Bad DOCTYPE declaration causes BS to crash

After some investigation, I discovered that this DOCTYPE declaration is causing my recipe to fail:

Code:
<!DOCTYPE html 
	PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN
	"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
As you can see, the public identifier after PUBLIC is never closed: the quote that should follow EN is missing.

So far, I've tried to solve the matter with this:

Code:
preprocess_regexps = [
(re.compile(r'<!DOCTYPE html .*strict.dtd">', re.DOTALL|re.IGNORECASE),
lambda match: '<!DOCTYPE html>'),
]
and this:

Code:
    def parse_declaration(self, i):
        """Treat a bogus SGML declaration as raw data. Treat a CDATA
        declaration as a CData object."""
        j = None
        if self.rawdata[i:i+9] == '<![CDATA[':
             k = self.rawdata.find(']]>', i)
             if k == -1:
                 k = len(self.rawdata)
             data = self.rawdata[i+9:k]
             j = k+3
             self._toStringSubclass(data, CData)
        else:
            try:
                j = SGMLParser.parse_declaration(self, i)
            except SGMLParseError:
                # Could not parse the DOCTYPE declaration
                # Try to just skip the actual declaration
                match = re.search(r'<!DOCTYPE([^>]*?)>', self.rawdata,
                re.MULTILINE)
                if match:
                    toHandle = self.rawdata[i:match.end()]
                else:
                    toHandle = self.rawdata[i:]
                self.handle_data(toHandle)
                j = i + len(toHandle)
        return j
But the result's the same:

Quote:
Python function terminated unexpectedly
No articles found, aborting (Error Code: 1)
Any ideas?
Old 09-03-2011, 07:44 PM   #2
kovidgoyal
creator of calibre
Posts: 45,598
Karma: 28548962
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Try this

preprocess_regexps= [(re.compile(r'<!DOCTYPE[^>]+>', re.I), '')]

and note that you can also define preprocess_raw_html() in your recipe to remove the doctype programmatically if you have trouble with regexps.
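For readers unsure how that hook fits into a recipe, here is a minimal sketch. It assumes the `preprocess_raw_html(self, raw_html, url)` signature documented for `BasicNewsRecipe`; the helper name `strip_doctype` is made up for this example, and the hook itself is shown only as a comment so the snippet runs without calibre installed.

```python
import re

# Matches a whole DOCTYPE declaration, including ones that span several
# lines or contain unbalanced quotes, because [^>] also matches newlines.
DOCTYPE_RE = re.compile(r'<!DOCTYPE[^>]+>', re.IGNORECASE)


def strip_doctype(raw_html):
    """Return raw_html with every DOCTYPE declaration removed.

    strip_doctype is a hypothetical helper name, not part of calibre.
    """
    return DOCTYPE_RE.sub('', raw_html)


# Assumed wiring inside the recipe class (not executed here):
#
#     def preprocess_raw_html(self, raw_html, url):
#         return strip_doctype(raw_html)

if __name__ == '__main__':
    # The malformed declaration from the first post.
    broken = ('<!DOCTYPE html \n'
              '\tPUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN\n'
              '\t"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
              '<html><body>ok</body></html>')
    print(strip_doctype(broken))
```

Keeping the regex in a plain function makes it easy to try against a saved copy of the page before running the whole recipe.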
Old 09-04-2011, 02:26 AM   #3
macpablus
Enthusiast
Thanks, Kovid! (for everything).

Quote:
Originally Posted by kovidgoyal
Try this

preprocess_regexps= [(re.compile(r'<!DOCTYPE[^>]+>', re.I), '')]

No luck with that, either. But perhaps I'm not using it the proper way...
Quote:
...and note that you can also define preprocess_raw_html() in your recipe to remove the doctype programmatically if you have trouble with regexps.
Mmmmmm... not sure how to use it exactly, and unfortunately I didn't find any example in the built-in recipes.

Maybe I should clarify that until today I had zero experience with recipes, and I only know a bit about HTML and JavaScript. But I managed to make the recipe work with a local file by manually removing the DOCTYPE declaration from the index file.

BTW, here's the recipe:

Code:
#!/usr/bin/env  python

__license__   = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
'''
pagina12.com.ar
'''
import re

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

class Pagina12(BasicNewsRecipe):

    title      = 'Pagina12'
    __author__ = 'Pablo Marfil'
    description = 'Diario argentino'
    #INDEX = 'http://www.pagina12.com.ar/diario/secciones/index.html'
    INDEX = 'file:///C:/Archivos%20de%20programa/Calibre2/pagina12.htm'
    language = 'es'
    encoding              = 'cp1252'
    remove_tags_before = dict(id='fecha')	
    remove_tags_after  = dict(id='fin')
    remove_tags        = [dict(id=['fecha', 'fin', 'pageControls'])]
    no_stylesheets = True

    #preprocess_regexps = [(re.compile(r'<!--.*?-->', re.DOTALL), lambda m: '')]
	
    #preprocess_regexps = [(re.compile(r'<!DOCTYPE.*dtd">', re.IGNORECASE),lambda x: '<!DOCTYPE html> '),]

    #def print_version(self, url):
    #    return url.replace('/archive/', '/print/')

    def parse_index(self):
        articles = []
        numero = 1
        soup = self.index_to_soup(self.INDEX)
        ts = soup.find(id='magazineTopStories')
        #ds = self.tag_to_string(ts.find('h1')).split(':')[-1]
        #self.timefmt = ' [%s]'%ds

        cover = soup.find('img', src=True, attrs={'class':'cover'})
        if cover is not None:
            self.cover_url = cover['src']

        feeds = []
        #feeds.append((u'ULTIMAS NOTICIAS',u'http://www.pagina12.com.ar/diario/rss/ultimas_noticias.xml'))		
        seen_titles = set([])
        for section in soup.findAll('div','seccionx'):
            numero+=1
            print (numero)
            section_title = self.tag_to_string(section.find('div','desplegable_titulo on_principal right'))
            self.log('Found section:', section_title)
            articles = []
            for post in section.findAll('h2'):
                h = post.find('a', href=True)
                title = self.tag_to_string(h)
                if title in seen_titles:
                    continue
                seen_titles.add(title)
                a = post.find('a', href=True)
                url = a['href']
                if url.startswith('/'):
                    url = 'http://pagina12.com.ar/imprimir'+url
                p = post.find('div', attrs={'h2'})
                desc = None
                self.log('\tFound article:', title, 'at', url)
                if p is not None:
                    desc = self.tag_to_string(p)
                    self.log('\t\t', desc)
                articles.append({'title':title, 'url':url, 'description':desc,
                    'date':''})
            if articles:
                feeds.append((section_title, articles))                       
        return feeds

    
    def postprocess_html(self, soup, first):
        for table in soup.findAll('table', align='right'):
            img = table.find('img')
            if img is not None:
                img.extract()
                caption = self.tag_to_string(table).strip()
                div = Tag(soup, 'div')
                div['style'] = 'text-align:center'
                div.insert(0, img)
                div.insert(1, Tag(soup, 'br'))
                if caption:
                    div.insert(2, NavigableString(caption))
                table.replaceWith(div)

        return soup


If you could tell me where and how to try your suggestions...

Meanwhile, I wrote to the newspaper's webmaster about the mistake. No answer as of today. ;-(
Old 09-04-2011, 02:30 AM   #4
kovidgoyal
creator of calibre
Just stick the regexp in your recipe as

Code:
preprocess_regexps= [(re.compile(r'<!DOCTYPE[^>]+>', re.I), lambda m:'')]
That should strip any doctype declarations from downloaded HTML.
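To check a pattern before running the full recipe, the list can be exercised on a sample string. The sketch below assumes each entry is a (compiled pattern, callable) pair and applies them in order with `pattern.sub`; the function name `apply_regexps` is invented here, and calibre's own processing may differ in detail.

```python
import re

preprocess_regexps = [
    (re.compile(r'<!DOCTYPE[^>]+>', re.I), lambda m: ''),
]


def apply_regexps(html, rules):
    """Apply each (pattern, replacement callable) pair in order.

    Hypothetical stand-in for testing patterns outside calibre.
    """
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html


# re.I makes the pattern match lower-case "doctype" as well.
sample = '<!doctype HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">\n<html></html>'
cleaned = apply_regexps(sample, preprocess_regexps)
print(repr(cleaned))
```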
Old 09-04-2011, 02:30 PM   #5
macpablus
Enthusiast
Quote:
Originally Posted by kovidgoyal
Just stick the regexp in your recipe as

Code:
preprocess_regexps= [(re.compile(r'<!DOCTYPE[^>]+>', re.I), lambda m:'')]
That should strip any doctype declarations from downloaded HTML.
Didn't work. Does "downloaded HTML" include the index file? Because that's the one causing the problem.
Old 09-04-2011, 02:44 PM   #6
kovidgoyal
creator of calibre
It includes all downloaded HTML.
Old 09-04-2011, 03:01 PM   #7
kovidgoyal
creator of calibre
Sorry, if you mean the index page, as in the page used in parse_index, then no, it doesn't apply. In that case you have to do it manually.

Code:
raw = self.index_to_soup(index_url, raw=True)
raw = re.sub(r'(?i)<!DOCTYPE[^>]+>', '', raw)
soup = self.index_to_soup(raw)
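One caveat worth sketching: depending on the calibre and Python version, the value returned with `raw=True` may be bytes rather than text, in which case a text pattern passed to `re.sub` raises a TypeError. The helper below is a sketch under that assumption; the name `strip_index_doctype` is illustrative, and the `cp1252` default is simply taken from the recipe's `encoding` setting. It handles both cases before the result is handed back to `index_to_soup`.

```python
import re


def strip_index_doctype(raw, encoding='cp1252'):
    """Remove DOCTYPE declarations from an index page given as bytes or str.

    Hypothetical helper; returns text in both cases.
    """
    if isinstance(raw, bytes):
        # Decode first so the text pattern below can be applied.
        raw = raw.decode(encoding, 'replace')
    return re.sub(r'(?i)<!DOCTYPE[^>]+>', '', raw)


# Assumed use inside parse_index (not executed here):
#
#     raw = self.index_to_soup(self.INDEX, raw=True)
#     soup = self.index_to_soup(strip_index_doctype(raw))

if __name__ == '__main__':
    print(repr(strip_index_doctype(b'<!DOCTYPE html>\n<html></html>')))
```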
Old 09-04-2011, 03:41 PM   #8
macpablus
Enthusiast
SOLVED!

I'll post the new recipe in the appropriate section.
All times are GMT -4.

