#1
Junior Member
Posts: 6
Karma: 10
Join Date: Jun 2013
Device: Kindle Touch
postprocess_html receives html string instead of soup
Hi,
I'm developing a recipe without feeds, so I use parse_index. Everything works fine, except extra_css disappears somewhere, so there are no styles in the end product. So instead of specifying my own styles with extra_css (the original html assigns styles based on id), I decided to replace <p id=headline>...</p> with <h1>...</h1>, <p id=quote>...</p> with <blockquote>...</blockquote>, and so on. So I do:
Code:
def postprocess_html(self, soup, first):
    for div in soup.findAll(id='headline'):
        div.name = 'h1'
    for div in soup.findAll(id='quote'):
        div.name = 'blockquote'

Thanks, TM
#2
creator of calibre
Posts: 45,265
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
postprocess_html is most definitely called with soup, not raw html. Post your complete recipe; without that it's impossible to know what you are doing.
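As a minimal sanity-check sketch (a hypothetical method body, just to illustrate the difference): a parsed soup supports findAll directly, whereas a raw html string would raise AttributeError on that call.
Code:
def postprocess_html(self, soup, first):
    # findAll only exists on a parsed BeautifulSoup tree; a plain html
    # string would raise AttributeError here
    print type(soup)
    print len(soup.findAll(True))  # number of tags in the document
    return soup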
#3
Junior Member
Posts: 6
Karma: 10
Join Date: Jun 2013
Device: Kindle Touch
Here's the recipe. If you could explain why extra_css doesn't do anything as well (when postprocess_html is commented out), that'd be awesome.
Thanks! Code:
import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class WV(BasicNewsRecipe):
    title = 'Workers Vanguard'
    __author__ = ''
    description = 'Current issue of WV'
    needs_subscription = False
    no_stylesheets = True
    extra_css = '#wvbody {font-size: 11pt; margin-left: 0px; margin-right: 0px; margin-bottom: 5px; margin-top: 5px; text-align: justify; text-indent: .2in} #wvbodyfl {font-size: 11pt; margin-left: 0px; margin-right: 0px; margin-bottom: 5px; margin-top: 5px; text-align: justify} #wvquote {font-size: 10pt; margin-left: 20px; margin-right: 00px; text-align: justify; margin-top: 13px; margin-bottom: 0px} #wvcite {font-size: 10pt; margin-left: 20px; margin-right: 0px; margin-top: 0px; margin-bottom: 5px} #wvdatecite {font-size: 11pt; margin-left: 0px; margin-right: 0px; margin-bottom: 13px; margin-top: 13px; text-align: right} #wvbodyctr {font-size: 11pt; margin-left: 0px; margin-right: 0px; margin-bottom: 13px; margin-top: 13px; text-align: center; text-indent: .2in} #headline {font-size: 20pt; font-weight: bolder; margin-bottom: 5px; margin-top: 0px; text-align: center} #kicker {font-size: 16pt; font-weight: bold; margin-bottom: 5px; margin-top: 0px; text-align: center} #nytimes {font-size: 12pt; font-weight: bold; margin-bottom: 5px; margin-top: 0px; text-align: center} #subhead {font-size: 11pt; font-weight: bold; text-align: left; margin-bottom: 14px; margin-top: 14px} #folio {font-size: 9pt} #smlheadline {font-size: 9pt; font-weight: bold; margin-bottom: 0px; margin-top: 0px} #smlkicker {font-size: 9pt; font-weight: bold; margin-bottom: 0px; margin-top: 0px} #smlfolio {font-size: 9pt; margin-bottom: 0px; margin-top: 0px} #smlarticletype {font-size: 7pt; margin-bottom: 0px; margin-top: 0px}'

    def print_version(self, url):
        return string.join(["http://www.spartacist.org/print/english/wv/", url], '')

    def parse_index(self):
        soup = self.index_to_soup('http://spartacist.org/english/wv/index.html')
        articles = []
        # get issue number and date.
        for div in soup.findAll(id='folio'):
            a = div.string
            if a:
                date = a
                print string.join(['Found date: ', date])
                self.timefmt = date
            else:
                issuenostring = div.i.findNextSibling(text=True)
                print string.join(['Found issue number string: ', issuenostring])
        # find print URL of main article in index page
        for div in soup.findAll(text=re.compile("Printable")):
            a = div.findParent('a', href=True)
            if not a:
                continue
            else:
                url1 = string.split(re.sub(r'\?.*', '', a['href']), '/')
                url = string.join([url1[-2], '/', url1[-1]], '')
        # find headline of main article in index page
        for div in soup.findAll(id='headline'):
            headline = div.string
            print(string.join(['Found article ', headline, 'at url', url]))
            articles.append({'title':headline, 'url':url, 'description':'', 'date':date})
        # find following articles (parsing Table of Contents at right of index page)
        for div in soup.findAll(id='smlheadline'):
            a = div.find('a', href=True)
            if not a:
                continue
            else:
                url = re.sub(r'\?.*', '', a['href'])
                headline = a.string
                print(string.join(['Found article', headline, 'at url', url]))
                articles.append({'title':headline, 'url':url, 'description':'', 'date':''})
        return [(string.join(['WV', issuenostring], ''), articles)]

    # Replace id-based styling by tag-based standard styling
    def postprocess_html(self, soup, first):
        print soup
        for div in soup.findAll(id='headline'):
            div.name = 'h1'
        for div in soup.findAll(id='kicker'):
            div.name = 'h2'
        for div in soup.findAll(id='subhead'):
            div.name = 'h3'
        for div in soup.findAll(id='wvquote'):
            div.name = 'blockquote'
        for div in soup.findAll(id='wvcite'):
            div.name = 'blockquote'
#4
Junior Member
Posts: 6
Karma: 10
Join Date: Jun 2013
Device: Kindle Touch
About the missing extra_css -- it is properly included in the input, parsed, and structure debug files, but at the processed stage the CSS selectors are renamed to calibreX whereas the original ids remain in the html articles.
#5
creator of calibre
Posts: 45,265
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
The soup is not html; when you print it, it is automatically converted to html. And you need to have a return soup at the end.
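A minimal sketch of what the corrected method could look like, reusing a couple of the ids from the recipe above (not the full set of replacements):
Code:
def postprocess_html(self, soup, first):
    # rename id-styled tags to standard structural tags
    for div in soup.findAll(id='headline'):
        div.name = 'h1'
    for div in soup.findAll(id='wvquote'):
        div.name = 'blockquote'
    # without this return, the method hands back None instead of the
    # modified soup
    return soup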
#6
Junior Member
Posts: 6
Karma: 10
Join Date: Jun 2013
Device: Kindle Touch
Oh OK... I got an error with a findAll, which I assumed came from inside postprocess_html, but it was actually later on... Should have looked at the trace a bit more carefully...
Anyway, I got a nice epub now. HOWEVER, if I use .mobi instead of .epub, all the articles but one disappear from the final ebook!!! Can you please look into that? Thanks a lot, TM.