Register Guidelines E-Books Today's Posts Search

Go Back   MobileRead Forums > E-Book Software > Calibre > Recipes

Notices

Reply
 
Thread Tools Search this Thread
Old 07-07-2013, 07:18 PM   #1
Rackamouth
Junior Member
Rackamouth began at the beginning.
 
Posts: 6
Karma: 10
Join Date: Jun 2013
Device: Kindle Touch
postprocess_html receives html string instead of soup

Hi,
I'm developing a recipe without feeds, so I use parse_index. Everything works find, except extra_css disappears somewhere, so no styles on the end product. So instead of specifying my own styles with extra_css (the original html specifies styles based on id), I decided to replace <p id=headline>...</p> by <h1>...</h1>, <p id=quote>...</p> by <blockquote>...</blockquote> and so on. so I do
Code:
def postprocess_html(self, soup, first):
		for div in soup.findAll(id='headline'):
			div.name = 'h1'
		for div in soup.findAll(id='quote'):
			div.name = 'blockquote'
BUT soup.findAll fails because, for some reason I can't fathom, soup isn't a BeautifulSoup object but a plain string. Am I missing something???

Thanks,
TM
Rackamouth is offline   Reply With Quote
Old 07-08-2013, 12:46 AM   #2
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.
 
kovidgoyal's Avatar
 
Posts: 45,265
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
postprocess_html is most definitely called with soup, not raw html. Post your complete recipe, without that it's impossible to know what you are doing.
kovidgoyal is online now   Reply With Quote
Advert
Old 07-08-2013, 09:57 AM   #3
Rackamouth
Junior Member
Rackamouth began at the beginning.
 
Posts: 6
Karma: 10
Join Date: Jun 2013
Device: Kindle Touch
Here's the recipe. If you could explain why extra_css doesn't do anything as well (when postprocess_html is commented out) that's be awesome.
Thanks!
Code:
import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class WV(BasicNewsRecipe):

	title       = 'Workers Vanguard'
	__author__  = ''
	description = 'Current issue of WV'
	needs_subscription = False
	no_stylesheets = True
	extra_css = '#wvbody {font-size: 11pt; margin-left: 0px; margin-right: 0px; margin-bottom: 5px; margin-top: 5px; text-align: justify; text-indent: .2in} #wvbodyfl {font-size: 11pt; margin-left: 0px; margin-right: 0px; margin-bottom: 5px; margin-top: 5px; text-align: justify} #wvquote {font-size: 10pt; margin-left: 20px; margin-right: 00px; text-align: justify; margin-top: 13px; margin-bottom: 0px} #wvcite {font-size: 10pt; margin-left: 20px; margin-right: 0px; margin-top: 0px; margin-bottom: 5px} #wvdatecite {font-size: 11pt; margin-left: 0px; margin-right: 0px; margin-bottom: 13px; margin-top: 13px; text-align: right} #wvbodyctr {font-size: 11pt; margin-left: 0px; margin-right: 0px; margin-bottom: 13px; margin-top: 13px; text-align: center; text-indent: .2in} #headline {font-size: 20pt; font-weight: bolder; margin-bottom: 5px; margin-top: 0px; text-align: center} #kicker {font-size: 16pt; font-weight: bold; margin-bottom: 5px; margin-top: 0px; text-align: center} #nytimes {font-size: 12pt; font-weight: bold; margin-bottom: 5px; margin-top: 0px; text-align: center} #subhead {font-size: 11pt; font-weight: bold; text-align: left; margin-bottom: 14px; margin-top: 14px} #folio {font-size: 9pt} #smlheadline {font-size: 9pt; font-weight: bold; margin-bottom: 0px; margin-top: 0px} #smlkicker {font-size: 9pt; font-weight: bold; margin-bottom: 0px; margin-top: 0px} #smlfolio {font-size: 9pt; margin-bottom: 0px; margin-top: 0px} #smlarticletype {font-size: 7pt; margin-bottom: 0px; margin-top: 0px}'
    
    
	def print_version(self, url):
		return string.join(["http://www.spartacist.org/print/english/wv/", url],'')

	def parse_index(self):
		soup = self.index_to_soup('http://spartacist.org/english/wv/index.html')
		articles = []
		
		# get issue number and date.
		for div in soup.findAll(id='folio'):
			a = div.string
			if a:
				date = a
				print string.join(['Found date: ', date])
				self.timefmt = date
			else:
				issuenostring = div.i.findNextSibling(text=True)
				print string.join(['Found issue number string: ', issuenostring]) 
		
		# find print URL of main article in index page
		for div in soup.findAll(text=re.compile("Printable")):
			a = div.findParent('a', href=True)
			if not a: continue 
			else: 
				url1 = string.split(re.sub(r'\?.*', '', a['href']), '/')
				url = string.join([url1[-2], '/', url1[-1]],'')
				
		# find headline of main article in index page
		for div in soup.findAll(id='headline'):
			headline = div.string
			print(string.join(['Found article ', headline, 'at url', url]))
			articles.append({'title':headline, 'url':url, 'description':'', 'date':date})		
		
		# find following articles articles (parsing Table of Content at right of index page)
		for div in soup.findAll(id='smlheadline'):
			a = div.find('a', href=True)
			if not a: continue 
			else: 
				url = re.sub(r'\?.*', '', a['href'])
				headline = a.string
				print(string.join(['Found article', headline, 'at url', url]))
				articles.append({'title':headline, 'url':url, 'description':'', 'date':''})
				
		return [(string.join(['WV', issuenostring], ''), articles)]
	
	# Replace id-based styling by tag-based standard styling
	def postprocess_html(self, soup, first):
		print soup
		for div in soup.findAll(id='headline'):
			div.name = 'h1'
		for div in soup.findAll(id='kicker'):
			div.name = 'h2'
		for div in soup.findAll(id='subhead'):
			div.name = 'h3'
		for div in soup.findAll(id='wvquote'):
			div.name = 'blockquote'
		for div in soup.findAll(id='wvcite'):
			div.name = 'blockquote'
Rackamouth is offline   Reply With Quote
Old 07-08-2013, 07:29 PM   #4
Rackamouth
Junior Member
Rackamouth began at the beginning.
 
Posts: 6
Karma: 10
Join Date: Jun 2013
Device: Kindle Touch
About the missing extra_css -- it is properly included in the input, parsed and structure debug files, but at the processed stage the css are renamed to calibreX whereas the original id remain in the html articles.
Rackamouth is offline   Reply With Quote
Old 07-09-2013, 12:10 AM   #5
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.
 
kovidgoyal's Avatar
 
Posts: 45,265
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
The soup is not html, when you print it it is automatically converted to html. And youneed to have a return soup at the end.
kovidgoyal is online now   Reply With Quote
Advert
Old 07-09-2013, 01:43 PM   #6
Rackamouth
Junior Member
Rackamouth began at the beginning.
 
Posts: 6
Karma: 10
Join Date: Jun 2013
Device: Kindle Touch
Oh OK... I got a error with a findAll, which I assumed came from inside postprocess_html, but it was actually later on... Should have looked at the trace a bit more carefully...

Anyway I got a nice epub now.

HOWEVER, if I use .mobi instead of .epub, all the articles but one disappear from the final ebook!!! Can you please look into that?

Thanks a lot,

TM.
Rackamouth is offline   Reply With Quote
Reply


Forum Jump

Similar Threads
Thread Thread Starter Forum Replies Last Post
Get article URL in postprocess_html rmflight Recipes 5 11-29-2012 11:37 AM
Mathch a string while ignoring some character in that string? ElMiko Sigil 12 12-01-2011 10:05 PM
postprocess_html marbs Recipes 20 11-03-2010 10:11 PM
Nemoptic's Sylen receives 2m Euro grant gnuuurff News 7 12-08-2007 06:56 AM
Pepper Pad receives a Plus upgrade Colin Dunstan Alternative Devices 2 01-08-2006 05:56 PM


All times are GMT -4. The time now is 07:26 AM.


MobileRead.com is a privately owned, operated and funded community.