#1
Junior Member
Posts: 6
Karma: 10
Join Date: Jun 2013
Device: Kindle Touch
postprocess_html receives html string instead of soup
Hi,
I'm developing a recipe without feeds, so I use parse_index. Everything works fine, except extra_css disappears somewhere, so there are no styles in the end product. So instead of specifying my own styles with extra_css (the original html assigns styles based on id), I decided to replace <p id=headline>...</p> with <h1>...</h1>, <p id=quote>...</p> with <blockquote>...</blockquote>, and so on. So I do:
Code:
def postprocess_html(self, soup, first):
    for div in soup.findAll(id='headline'):
        div.name = 'h1'
    for div in soup.findAll(id='quote'):
        div.name = 'blockquote'

Thanks, TM
#2
creator of calibre
Posts: 45,265
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
postprocess_html is most definitely called with soup, not raw html. Post your complete recipe; without that it's impossible to know what you are doing.
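As a minimal sanity-check sketch (a hypothetical method body, just to illustrate the difference): a parsed soup supports findAll directly, whereas a raw html string would raise AttributeError on that call.
Code:
def postprocess_html(self, soup, first):
    # findAll only exists on a parsed BeautifulSoup tree; a plain html
    # string would raise AttributeError here
    print type(soup)
    print len(soup.findAll(True))  # number of tags in the document
    return soup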
#3
Junior Member
Posts: 6
Karma: 10
Join Date: Jun 2013
Device: Kindle Touch
Here's the recipe. If you could explain why extra_css doesn't do anything as well (when postprocess_html is commented out), that'd be awesome.
Thanks! Code:
import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class WV(BasicNewsRecipe):
    title = 'Workers Vanguard'
    __author__ = ''
    description = 'Current issue of WV'
    needs_subscription = False
    no_stylesheets = True
    extra_css = '#wvbody {font-size: 11pt; margin-left: 0px; margin-right: 0px; margin-bottom: 5px; margin-top: 5px; text-align: justify; text-indent: .2in} #wvbodyfl {font-size: 11pt; margin-left: 0px; margin-right: 0px; margin-bottom: 5px; margin-top: 5px; text-align: justify} #wvquote {font-size: 10pt; margin-left: 20px; margin-right: 00px; text-align: justify; margin-top: 13px; margin-bottom: 0px} #wvcite {font-size: 10pt; margin-left: 20px; margin-right: 0px; margin-top: 0px; margin-bottom: 5px} #wvdatecite {font-size: 11pt; margin-left: 0px; margin-right: 0px; margin-bottom: 13px; margin-top: 13px; text-align: right} #wvbodyctr {font-size: 11pt; margin-left: 0px; margin-right: 0px; margin-bottom: 13px; margin-top: 13px; text-align: center; text-indent: .2in} #headline {font-size: 20pt; font-weight: bolder; margin-bottom: 5px; margin-top: 0px; text-align: center} #kicker {font-size: 16pt; font-weight: bold; margin-bottom: 5px; margin-top: 0px; text-align: center} #nytimes {font-size: 12pt; font-weight: bold; margin-bottom: 5px; margin-top: 0px; text-align: center} #subhead {font-size: 11pt; font-weight: bold; text-align: left; margin-bottom: 14px; margin-top: 14px} #folio {font-size: 9pt} #smlheadline {font-size: 9pt; font-weight: bold; margin-bottom: 0px; margin-top: 0px} #smlkicker {font-size: 9pt; font-weight: bold; margin-bottom: 0px; margin-top: 0px} #smlfolio {font-size: 9pt; margin-bottom: 0px; margin-top: 0px} #smlarticletype {font-size: 7pt; margin-bottom: 0px; margin-top: 0px}'

    def print_version(self, url):
        return string.join(["http://www.spartacist.org/print/english/wv/", url], '')

    def parse_index(self):
        soup = self.index_to_soup('http://spartacist.org/english/wv/index.html')
        articles = []
        # get issue number and date.
        for div in soup.findAll(id='folio'):
            a = div.string
            if a:
                date = a
                print string.join(['Found date: ', date])
                self.timefmt = date
            else:
                issuenostring = div.i.findNextSibling(text=True)
                print string.join(['Found issue number string: ', issuenostring])
        # find print URL of main article in index page
        for div in soup.findAll(text=re.compile("Printable")):
            a = div.findParent('a', href=True)
            if not a:
                continue
            else:
                url1 = string.split(re.sub(r'\?.*', '', a['href']), '/')
                url = string.join([url1[-2], '/', url1[-1]], '')
        # find headline of main article in index page
        for div in soup.findAll(id='headline'):
            headline = div.string
            print(string.join(['Found article ', headline, 'at url', url]))
            articles.append({'title':headline, 'url':url, 'description':'', 'date':date})
        # find following articles (parsing Table of Contents at right of index page)
        for div in soup.findAll(id='smlheadline'):
            a = div.find('a', href=True)
            if not a:
                continue
            else:
                url = re.sub(r'\?.*', '', a['href'])
                headline = a.string
                print(string.join(['Found article', headline, 'at url', url]))
                articles.append({'title':headline, 'url':url, 'description':'', 'date':''})
        return [(string.join(['WV', issuenostring], ''), articles)]

    # Replace id-based styling by tag-based standard styling
    def postprocess_html(self, soup, first):
        print soup
        for div in soup.findAll(id='headline'):
            div.name = 'h1'
        for div in soup.findAll(id='kicker'):
            div.name = 'h2'
        for div in soup.findAll(id='subhead'):
            div.name = 'h3'
        for div in soup.findAll(id='wvquote'):
            div.name = 'blockquote'
        for div in soup.findAll(id='wvcite'):
            div.name = 'blockquote'
#4
Junior Member
Posts: 6
Karma: 10
Join Date: Jun 2013
Device: Kindle Touch
About the missing extra_css -- it is properly included in the input, parsed, and structure debug files, but at the processed stage the CSS selectors are renamed to calibreX whereas the original ids remain in the html articles.
#5
creator of calibre
Posts: 45,265
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
The soup is not html; when you print it, it is automatically converted to html. And you need to have a return soup at the end.
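A minimal sketch of what the corrected method could look like, reusing a couple of the ids from the recipe above (not the full set of replacements):
Code:
def postprocess_html(self, soup, first):
    # rename id-styled tags to standard structural tags
    for div in soup.findAll(id='headline'):
        div.name = 'h1'
    for div in soup.findAll(id='wvquote'):
        div.name = 'blockquote'
    # without this return, the method hands back None instead of the
    # modified soup
    return soup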
#6
Junior Member
Posts: 6
Karma: 10
Join Date: Jun 2013
Device: Kindle Touch
Oh OK... I got an error with a findAll, which I assumed came from inside postprocess_html, but it was actually later on... Should have looked at the trace a bit more carefully...
Anyway, I got a nice epub now. HOWEVER, if I use .mobi instead of .epub, all the articles but one disappear from the final ebook!!! Can you please look into that? Thanks a lot, TM.