MobileRead Forums - View Single Post - Calibre Recipe HTML content differs from raw html of index.html.

krunk · 09-16-2010, 02:00 PM

First, many thanks for this excellent piece of software. I'm just learning to fully take advantage of its features.

The Question, then an explanation.

> Why would calibre be receiving different HTML content than wget or my browser (with JavaScript disabled)?

Explanation:

I'm writing a custom recipe for a blog I like, www.worldofweirdthings.com.

It should be pretty basic: I just want to pull down the daily article and strip out some of the social widget images.

I started with the auto-generated recipe, which looks like:

Code:

# vim:ft=python
from calibre.web.feeds.recipes import BasicNewsRecipe

class WorldofWeirdThingsBlog(BasicNewsRecipe):
    title          = u'World of Weird Things'
    oldest_article = 1
    max_articles_per_feed = 5

    feeds = [(u'World of Weird Things', 
                  u'feed://worldofweirdthings.com/feed/')]

Inspecting the HTML on the site, I see that the social widget image I wish to remove is inside a div container, e.g.:

Code:

<div class="addtoany_share_save_container">
  <ul class="addtoany_list">
    <li>
    <a class="a2a_dd addtoany_share_save" href="http://www.addtoany.com/share_save"><img src="http://worldofweirdthings.com/wp-content/plugins/add-to-any/share_save_256_24.png" width="256" height="24" alt="Share/Bookmark"/></a>
    </li>
  </ul>
  <script type="text/javascript"><!--
  var a2a_config = a2a_config || {};
  a2a_config.linkname="could a geocentrist even screw in a lightbulb?";
  a2a_config.linkurl="http://worldofweirdthings.com/2010/09/16/could-a-geocentrist-even-screw-in-a-lightbulb/";
  a2a_config.show_title=1;
  a2a_color_main="ffffff";a2a_color_border="000000";a2a_color_link_text="3c607b";a2a_color_link_text_hover="1c2a36";a2a_color_bg="ffffff";
  //-->
  </script>
  <script type="text/javascript" src="http://static.addtoany.com/menu/page.js"></script>
</div>

So I added a remove_tags directive, and now my recipe looks like:

Code:

from calibre.web.feeds.recipes import BasicNewsRecipe

class WorldofWeirdThingsBlog(BasicNewsRecipe):
    title          = u'World of Weird Things'
    oldest_article = 1
    max_articles_per_feed = 5
    remove_tags    = [dict(name='div', attrs={'class':'addtoany_share_save_container'})]

    feeds = [(u'World of Weird Things', u'feed://worldofweirdthings.com/feed/')]

After fetching the feed with:

Code:

ebook-convert my.recipe blog.mobi

The image was still there. So I decided to print out the HTML that was actually being parsed, by adding this method to the recipe class:

Code:

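def preprocess_html(self, soup):
    print(soup.prettify())  # dump the HTML calibre hands to the recipe
    return soup  # calibre expects this hook to return the soup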

The HTML that was printed did not have the div tag, but it still had the image, outside any container. It was almost as if the containing div had been stripped but the link and image were left intact. This is what I saw at the bottom:

Code:

  <p>
    <a class="a2a_dd addtoany_share_save" href="http://www.addtoany.com/share_save">
     <img src="http://worldofweirdthings.com/wp-content/plugins/add-to-any/share_save_256_24.png" width="256" height="24" alt="Share/Bookmark" />
    </a>
   </p>

I found this odd, since the documentation says this method should provide the raw HTML before any stripping has been done.
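
To compare the two versions more mechanically than by eye, here is a minimal sketch using only Python's standard library. It is not part of calibre: the WidgetProbe class and probe function are hypothetical helpers made up for illustration. It reports whether the container div is present in an HTML string, and whether the share link sits inside or outside it. It assumes reasonably well-formed HTML with matching close tags.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class WidgetProbe(HTMLParser):
    """Hypothetical helper: tracks whether the share link is inside the container div."""

    def __init__(self, container_class, link_class):
        super().__init__()
        self.container_class = container_class
        self.link_class = link_class
        self.stack = []  # one bool per open tag: True if that tag is the container
        self.container_seen = False
        self.links_inside = 0
        self.links_outside = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        is_container = tag == "div" and self.container_class in classes
        if is_container:
            self.container_seen = True
        if tag == "a" and self.link_class in classes:
            # Any True on the stack means an enclosing container div is still open.
            if any(self.stack):
                self.links_inside += 1
            else:
                self.links_outside += 1
        if tag not in VOID_TAGS:
            self.stack.append(is_container)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()


def probe(html,
          container_class="addtoany_share_save_container",
          link_class="addtoany_share_save"):
    """Return (container_seen, links_inside, links_outside) for an HTML string."""
    parser = WidgetProbe(container_class, link_class)
    parser.feed(html)
    parser.close()
    return parser.container_seen, parser.links_inside, parser.links_outside
```

Running probe on the page source saved with wget and on the output printed from preprocess_html should make the difference explicit: for markup like the site's, the link is counted as inside the container, while for the fragment calibre printed, no container is seen and the link is counted as outside.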