Old 04-15-2011, 04:18 AM   #1
DarkElf
Junior Member
DarkElf began at the beginning.
 
Posts: 6
Karma: 10
Join Date: Apr 2011
Device: Kindle 3
skip_ad_pages & bad image links

Hi,
I have a problem with the skip_ad_pages method.
The feed I want to parse gives me a "wrong" article URL such as
"http://bad/advertisement/page/story01.htm"
which points to an advertisement page containing the right article URL, such as
"http://right/article/url/article.shtml"

I use the skip_ad_pages method to get the right page, and it works except for the img links in the real page.
Calibre prepends the wrong article URL to every img tag whose "src" attribute is relative, like "path/to/image.jpg", so that the final image URL is
"http://bad/advertisement/page/path/to/image.jpg"
and not
"http://right/article/url/path/to/image.jpg"

This causes calibre to fail when it tries to fetch the image, because it follows the wrong link.
Which is the best way to solve this?
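
To illustrate the symptom: a relative src is resolved against whatever URL is treated as the page's base. The following is only a minimal sketch using Python's standard urljoin with the placeholder URLs from this post. It mirrors the behaviour described; that calibre resolves links in an equivalent way is an assumption, not something checked against its source.

```python
from urllib.parse import urljoin  # on Python 2 this function lives in the urlparse module

# Placeholder URLs taken from the post above
ad_url = "http://bad/advertisement/page/story01.htm"
real_url = "http://right/article/url/article.shtml"
img_src = "path/to/image.jpg"

# Resolving against the ad page URL gives the broken image link
wrong = urljoin(ad_url, img_src)
# Resolving against the real article URL gives the link that should be fetched
right = urljoin(real_url, img_src)

print(wrong)
print(right)
```

The two results differ only in the base they were joined with, which is why the article text can be right while the images are not.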

Thank you all in advance
Old 04-15-2011, 09:20 AM   #2
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by DarkElf View Post
This causes calibre to fail when it tries to fetch the image, because it follows the wrong link.
Which is the best way to solve this?
I have no idea of the "best" way. I'd probably use postprocess_html (as I understand it, skip_ad_pages runs after preprocessing), then findAll the image links and fix them.
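
A rough sketch of the "findAll the image links and fix them" idea, written as a standalone helper so it can be tried outside calibre. The function name absolutize_image_links and the base_url argument are hypothetical; how the correct base URL reaches the hook is not addressed here, and whether postprocess_html runs early enough for the fix to matter is not something this sketch verifies.

```python
from urllib.parse import urljoin, urlparse  # Python 2: both come from the urlparse module


def absolutize_image_links(img_tags, base_url):
    """Rewrite relative 'src' values in place so they are absolute.

    img_tags is any iterable of dict-like objects supporting .get('src')
    and item assignment (BeautifulSoup tags behave this way).
    base_url is the URL of the real article (hypothetical parameter).
    """
    for img in img_tags:
        src = img.get('src')
        # Only touch links that have no scheme, i.e. relative ones
        if src and not urlparse(src).scheme:
            img['src'] = urljoin(base_url, src)
    return img_tags


# Plain dicts stand in for <img> tags here
imgs = [{'src': 'path/to/image.jpg'}, {'src': 'http://cdn/x.png'}, {'alt': 'no src'}]
absolutize_image_links(imgs, "http://right/article/url/article.shtml")
```

Inside a recipe this might be called as absolutize_image_links(soup.findAll('img'), real_url), where real_url is whatever the recipe knows to be the correct article address.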
Old 04-15-2011, 09:51 AM   #3
DarkElf
That is one of the solutions I was thinking about... but how can I "forward" the correct URL to the postprocess method?
I think the problem is that calibre keeps the wrong URL for the article in its parsed "internal" feed/article structure and doesn't replace it with the correct one.
Is there a way to perform this substitution? Perhaps just in the skip_ad_pages method itself?
Old 04-15-2011, 10:01 AM   #4
Starson17
Quote:
Originally Posted by DarkElf View Post
Is there a way to perform this substitution? Perhaps just in the skip_ad_pages method itself?
Perhaps, but I'd need to dig deeper than I have time for.

I've never needed skip_ad_pages, so I'm not familiar with it, and it's only used in two other builtin recipes I know of. I'm a bit surprised that you are finding this, as I would have expected it to have been seen in those other recipes.
FYI, here's the code for those other recipes:
Code:
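    # Note: this method calls re.sub, so the recipe needs "import re" at the top.
    def skip_ad_pages(self, soup):
        # Skip ad pages served before the actual article
        skip_tag = soup.find(True, {'name':'skip'})
        if skip_tag is not None:
            self.log.warn("Found forwarding link: %s" % skip_tag.parent['href'])
            # Drop the query string and request the full article on one page
            url = 'http://www.nytimes.com' + re.sub(r'\?.*', '', skip_tag.parent['href'])
            url += '?pagewanted=all'
            self.log.warn("Skipping ad to article at '%s'" % url)
            # Return the raw HTML of the real article in place of the ad page
            return self.index_to_soup(url, raw=True)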
Code:
    def skip_ad_pages(self, soup):
        # Skip ad pages served before actual article
        skip_tag = soup.find(name='img', attrs={'alt':'Cyanide and Happiness, a daily webcomic'})
        if skip_tag is None:
            return soup
        return None

Whatever solution you find, post it here.
Old 04-15-2011, 10:33 AM   #5
DarkElf
I think I didn't explain the problem clearly.
I already use the skip_ad_pages method in my recipe, in the same way as the first code you quoted, and it works except for the image links... so the text is fetched correctly but the images are not.
I'm looking for a way to replace the article URL calibre holds with the correct one: the final HTML is the right HTML, but the article URL in the internal structure is still the wrong one. For this reason (I think) calibre prepends the wrong link when it builds the image source URL.
I don't know if this is possible inside the skip_ad_pages method, because it only takes a "soup" and only returns a "soup".
I don't mind whether this happens inside skip_ad_pages or elsewhere, but I don't know which method I can use for this replacement or how to do it.
I hope it is clearer now... my English is very rusty...
Old 04-15-2011, 10:57 AM   #6
Starson17
Quote:
Originally Posted by DarkElf View Post
I think I didn't explain the problem clearly.
I think I understood it the first time. (If I didn't, then I still don't understand). Why do you think images worked with those two recipes? Do they not use relative links for images, or is there something different about your site? I'm just curious to know the answer to this, even if it doesn't help you.

I understand you'd like to know how to change the internal base url so that relative urls for images work correctly after the ad page is skipped. I don't know the answer, but I posted how I'd try. I'd change relative links for images to full links with postprocess_html, so the internal base url should be irrelevant. You asked how to pass the correct part. I'd have to think about it. Is it available in the soup of the page? If not, didn't you have it in skip_ad_pages method?
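
One way to act on "didn't you have it in skip_ad_pages": since that method knows the real URL at the moment it fetches the real page, the image links could be made absolute in the raw HTML before it is returned, so nothing needs to be passed on to a later hook. This is only a sketch under assumptions: the helper name absolutize_img_src is hypothetical, the regex handles only quoted src attributes, and the raw HTML is assumed to be a text string rather than bytes.

```python
import re
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

# Matches the src attribute of an <img> tag: group 1 is everything up to the
# opening quote, group 2 the quote character, group 3 the link itself.
IMG_SRC = re.compile(r'(<img\b[^>]*?\ssrc\s*=\s*)(["\'])(.*?)\2', re.IGNORECASE)


def absolutize_img_src(raw_html, base_url):
    """Return raw_html with every quoted <img src> resolved against base_url."""
    def repl(match):
        prefix, quote, src = match.groups()
        # urljoin leaves links that are already absolute unchanged
        return prefix + quote + urljoin(base_url, src) + quote
    return IMG_SRC.sub(repl, raw_html)


page = '<p><img alt="x" src="path/to/image.jpg"> <img src=\'http://cdn/abs.png\'></p>'
fixed = absolutize_img_src(page, "http://right/article/url/article.shtml")
print(fixed)
```

In a skip_ad_pages written like the first snippet quoted earlier in the thread, the last line could then become something like return absolutize_img_src(self.index_to_soup(url, raw=True), url), provided the raw result is text.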
Old 04-16-2011, 05:32 AM   #7
DarkElf
Quote:
I think I understood it the first time. (If I didn't, then I still don't understand). Why do you think images worked with those two recipes? Do they not use relative links for images, or is there something different about your site? I'm just curious to know the answer to this, even if it doesn't help you.
Perhaps (I haven't checked) those recipes have images with absolute links, while my site uses relative links. Anyway, I don't know whether images worked with those recipes...

Quote:
I understand you'd like to know how to change the internal base url so that relative urls for images work correctly after the ad page is skipped. I don't know the answer, but I posted how I'd try. I'd change relative links for images to full links with postprocess_html, so the internal base url should be irrelevant. You asked how to pass the correct part. I'd have to think about it. Is it available in the soup of the page?
No, it is not available in the final (correct) soup.

Quote:
If not, didn't you have it in skip_ad_pages method?
Yes, I have it in the soup of the first (wrong) page, therefore in the skip_ad_pages method.

Anyway, in the meantime I found two workarounds which solve my problem.
The first:
I discovered that the final correct link is also available in the feed, but inside the "guid" tag rather than the "link" tag, so I overrode the get_article_url method to extract the correct link directly, with no need to use skip_ad_pages.
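
A minimal sketch of what such an override might look like. The class name GuidUrlSketch is hypothetical and stands in for the recipe class. It is assumed that the article object passed in is a dict-like feed entry exposing "guid" and "link" keys, and the fallback to "link" is an addition for safety, not part of the workaround described above.

```python
class GuidUrlSketch(object):
    """Stand-in for a recipe class; only the URL-selection logic is shown."""

    def get_article_url(self, article):
        # Prefer the guid when it looks like a real web address
        guid = article.get('guid', None)
        if guid and guid.startswith(('http://', 'https://')):
            return guid
        # Otherwise fall back to the ordinary link element
        return article.get('link', None)


entry = {
    'guid': 'http://right/article/url/article.shtml',
    'link': 'http://bad/advertisement/page/story01.htm',
}
chosen = GuidUrlSketch().get_article_url(entry)
print(chosen)
```

Because the article is then fetched from the right address in the first place, relative image links resolve against the right base and skip_ad_pages is no longer needed.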

The second:
With some easy "reverse engineering" I worked out how to decode the wrong link into the right one, again by overriding the get_article_url method.

Either way, the images work...