Old 09-16-2010, 02:00 PM   #1
krunk
Member
krunk began at the beginning.
 
Posts: 19
Karma: 10
Join Date: Feb 2010
Location: Los Angeles, CA
Device: Kindle 3
Calibre Recipe HTML content differs from raw html of index.html.

First, many thanks for this excellent piece of software. I'm just learning to fully take advantage of its features.

The Question, then an explanation.

> Why would calibre be receiving different HTML content than wget or my browser (with JavaScript disabled)?

Explanation:

I'm writing a custom recipe for a blog I like, www.worldofweirdthings.com.

It should be pretty basic, just want to pull down the daily article and strip some of the social widget images out.

I started with the auto-generated recipe, which looks like:
Code:
# vim:ft=python
from calibre.web.feeds.recipes import BasicNewsRecipe

class WorldofWeirdThingsBlog(BasicNewsRecipe):
    title          = u'World of Weird Things'
    oldest_article = 1
    max_articles_per_feed = 5

    feeds = [(u'World of Weird Things', 
                  u'feed://worldofweirdthings.com/feed/')]
Inspecting the HTML on the site, I see that the social widget image I wish to remove is inside a div container, e.g.:

Code:
<div class="addtoany_share_save_container">
  <ul class="addtoany_list">
    <li>
    <a class="a2a_dd addtoany_share_save" href="http://www.addtoany.com/share_save"><img src="http://worldofweirdthings.com/wp-content/plugins/add-to-any/share_save_256_24.png" width="256" height="24" alt="Share/Bookmark"/></a>
    </li>
  </ul>
  <script type="text/javascript"><!--
  var a2a_config = a2a_config || {};
  a2a_config.linkname="could a geocentrist even screw in a lightbulb?";
  a2a_config.linkurl="http://worldofweirdthings.com/2010/09/16/could-a-geocentrist-even-screw-in-a-lightbulb/";
  a2a_config.show_title=1;
  a2a_color_main="ffffff";a2a_color_border="000000";a2a_color_link_text="3c607b";a2a_color_link_text_hover="1c2a36";a2a_color_bg="ffffff";
  //-->
  </script>
  <script type="text/javascript" src="http://static.addtoany.com/menu/page.js"></script>
</div>
So I add a remove_tags directive, now my recipe looks like:

Code:
from calibre.web.feeds.recipes import BasicNewsRecipe

class WorldofWeirdThingsBlog(BasicNewsRecipe):
    title          = u'World of Weird Things'
    oldest_article = 1
    max_articles_per_feed = 5
    remove_tags    = [dict(name='div', attrs={'class':'addtoany_share_save_container'})]

    feeds = [(u'World of Weird Things', u'feed://worldofweirdthings.com/feed/')]
After fetching the feed with:

Code:
ebook-convert my.recipe blog.mobi
The image was still there. So, I decided to print out the html that was actually being parsed in the script.

Code:
def preprocess_html(self, soup):
    # Debug aid: dump the soup this hook receives, then hand it back.
    print(soup.prettify())
    return soup

The HTML that was printed did not have the div tag, but it still had the image, just outside of it. It was almost as if the containing div had been stripped but the link was left intact. This is what I saw at the bottom:

Code:
  <p>
    <a class="a2a_dd addtoany_share_save" href="http://www.addtoany.com/share_save">
     <img src="http://worldofweirdthings.com/wp-content/plugins/add-to-any/share_save_256_24.png" width="256" height="24" alt="Share/Bookmark" />
    </a>
   </p>
I found this odd, since the documentation says this method should provide the raw HTML before any stripping has been done.
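
One way to check which elements are really present in each version is to list every element that carries a class attribute and compare the two results. The sketch below uses only the Python standard library and does not touch calibre; `ClassLister` and `classes_in` are hypothetical names, and the two strings are trimmed copies of the snippets above.

```python
from html.parser import HTMLParser


class ClassLister(HTMLParser):
    """Collect (tag, class) pairs for every element that has a class attribute."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            if key == 'class':
                self.found.append((tag, value))


def classes_in(html):
    """Return the (tag, class) pairs found in an HTML string, in document order."""
    parser = ClassLister()
    parser.feed(html)
    parser.close()
    return parser.found


# Trimmed copy of the markup seen on the site.
page_html = ('<div class="addtoany_share_save_container"><ul class="addtoany_list"><li>'
             '<a class="a2a_dd addtoany_share_save" href="http://www.addtoany.com/share_save">'
             '</a></li></ul></div>')
# Trimmed copy of what preprocess_html printed.
soup_html = ('<p><a class="a2a_dd addtoany_share_save" '
             'href="http://www.addtoany.com/share_save"></a></p>')

print(classes_in(page_html))
print(classes_in(soup_html))
```

With these two inputs only the `<a>` shows up in the second list, so a remove_tags entry aimed at the `<div>` would have nothing to match in that document.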

Last edited by krunk; 09-17-2010 at 04:56 PM.
Old 09-16-2010, 04:31 PM   #2
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by krunk View Post
remove_tags = [dict(name='a', attrs={'class':'addtoany_share_save_container'})]
It looks to me like this won't remove what you want. The tag with class 'addtoany_share_save_container' is a <div> tag, not an <a> tag. I'm not sure if you want the <div> or the <a>, but the class has to be in the tag you're removing, and the <a> tag has class 'a2a_dd addtoany_share_save'.
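
The rule described here (the tag name and the class must both belong to the same element) can be sketched in plain Python. This is a simplified stand-in, not calibre's actual matching code; the helper name `spec_matches` and the exact-string comparison of attribute values are assumptions made for illustration.

```python
# Simplified model of a remove_tags-style spec: the tag name and every listed
# attribute must match on the SAME element. Hypothetical helper, not calibre's
# implementation.
def spec_matches(spec, tag_name, tag_attrs):
    if spec.get('name') != tag_name:
        return False
    wanted = spec.get('attrs', {})
    return all(tag_attrs.get(key) == value for key, value in wanted.items())


# The two elements from the page, reduced to (name, attributes).
div = ('div', {'class': 'addtoany_share_save_container'})
link = ('a', {'class': 'a2a_dd addtoany_share_save'})

quoted_spec = dict(name='a', attrs={'class': 'addtoany_share_save_container'})
div_spec = dict(name='div', attrs={'class': 'addtoany_share_save_container'})

print(spec_matches(quoted_spec, *div))   # False: the spec names 'a', the element is a div
print(spec_matches(quoted_spec, *link))  # False: the <a> carries a different class
print(spec_matches(div_spec, *div))      # True: name and class sit on the same element
```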

BTW, usually the recipe discussions go on in the Custom Recipe thread: https://www.mobileread.com/forums/sho...32543&page=183
It's inhabited by all the recipe junkies.

Last edited by Starson17; 09-16-2010 at 04:33 PM.
Old 09-17-2010, 04:57 PM   #3
krunk
Quote:
Originally Posted by Starson17 View Post
It looks to me like this won't remove what you want. The tag with class 'addtoany_share_save_container' is a <div> tag, not an <a> tag. I'm not sure if you want the <div> or the <a>, but the class has to be in the tag you're removing, and the <a> tag has class 'a2a_dd addtoany_share_save'.

BTW, usually the recipe discussions go on in the Custom Recipe thread: https://www.mobileread.com/forums/sho...32543&page=183
It's inhabited by all the recipe junkies.
Oops, that's a typo. I corrected it. The key is that the div appears in the html, but does not appear in the soup passed to preprocess_html().
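
If the difference comes from calibre using the article text embedded in the RSS feed rather than downloading the linked page (an assumption; nothing in this thread confirms it), two adjustments may be worth trying: target the `<a>` that is actually present in the soup, or set `use_embedded_content = False` so that the linked page is fetched. A sketch under that assumption, with a fallback import so it can also be loaded outside calibre:

```python
try:
    from calibre.web.feeds.recipes import BasicNewsRecipe
except ImportError:
    # Fallback so the sketch can be loaded outside calibre; inside calibre
    # the real base class is used.
    BasicNewsRecipe = object


class WorldofWeirdThingsBlog(BasicNewsRecipe):
    title = u'World of Weird Things'
    oldest_article = 1
    max_articles_per_feed = 5

    # Assumption: the feed embeds the full article text. False asks calibre
    # to download the linked page instead of using the embedded content.
    use_embedded_content = False

    remove_tags = [
        # The container as it appears on the downloaded page.
        dict(name='div', attrs={'class': 'addtoany_share_save_container'}),
        # The bare link as it appeared in the printed soup.
        dict(name='a', attrs={'class': 'a2a_dd addtoany_share_save'}),
    ]

    feeds = [(u'World of Weird Things', u'feed://worldofweirdthings.com/feed/')]
```

Listing both entries keeps the recipe working whichever of the two documents calibre ends up parsing.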

I'll check that thread, though things tend to get lost in epic threads.
Old 09-17-2010, 08:56 PM   #4
capidamonte
Not who you think I am...
capidamonte can even cheer up an android equipped with a defective Genuine Personality Prototype.

Posts: 374
Karma: 30283
Join Date: Jan 2010
Location: Honolulu
Device: PocketBook 360 -- Ivory
There should be a recipe sub-forum.
Old 09-20-2010, 09:48 PM   #5
krunk
Quote:
Originally Posted by capidamonte View Post
There should be a recipe sub-forum.
I think that'd be great.