Get article URL in postprocess_html

rmflight · 11-29-2012, 08:48 AM

I need to do some post-processing on an HTML page after it is downloaded, which includes finding divs that contain links to HTML tables. The links are relative, so they depend on the original URL of the downloaded article: this would be the "link" obtained from the actual RSS feed.

Is there a way to get this URL in the postprocess_html function?

kovidgoyal · 11-29-2012, 08:55 AM

Use get_obfuscated_article()

rmflight · 11-29-2012, 09:16 AM

That doesn't seem right. I'm using this recipe, and it gets the article actually pretty nicely, without any trouble.

I basically want access to the URL that gets returned by "print_version", so I can trim off the last bit and build the address of the table link defined in the article (see line 383 here for an example). Then I want to download that page, soupify it, extract the table, and insert it into the article directly.

To do that, I need the original URL that was used to download the article. It doesn't seem like it should be hard to do.

Are you telling me to use "get_obfuscated_article()" to just write a full custom defined method for this article? It doesn't seem like I should have to do that, because as I said, some simple tweaks to the recipe seem to get 99% of the content I need just fine. I want to do this in post-processing because not necessarily every article will have tables, but many of them will have images, which the basic recipe seems to do very nicely.

rmflight · 11-29-2012, 09:33 AM

Or is there a way to tell Calibre that, in addition to the image links, I want to download the table files and do some processing on them as well?

Sorry, still new to Python in general, and the way the recipes work specifically.

kovidgoyal · 11-29-2012, 11:04 AM

There's no big deal in implementing get_obfuscated_article: just get the soup with self.index_to_soup, passing in the url, and make whatever changes you want. Save the soup to a temp file and return the path to the temp file. The rest of the recipe processing will then take place as normal on the contents of the temp file.
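A minimal sketch of the steps described above, written for Python 3. The class name, the `table-link` div class, and the `resolve_table_url` helper are hypothetical and only for illustration; the real selectors depend on the site. It also assumes that the recipe sets `articles_are_obfuscated = True` so that calibre calls the method, and that a plain `tempfile` path is acceptable as the return value.

```python
import tempfile
from urllib.parse import urljoin

try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Fallback so the sketch can be imported outside calibre.
    BasicNewsRecipe = object


def resolve_table_url(article_url, relative_link):
    """Hypothetical helper: resolve a relative table link against the article URL."""
    return urljoin(article_url, relative_link)


class TableInliningRecipe(BasicNewsRecipe):
    title = 'Hypothetical table-inlining recipe'
    # Assumption: this flag is what makes calibre call get_obfuscated_article.
    articles_are_obfuscated = True

    def get_obfuscated_article(self, url):
        # url is the address the article is being fetched from.
        soup = self.index_to_soup(url)
        # 'table-link' is a made-up class name; adjust to the real markup.
        for div in soup.find_all('div', attrs={'class': 'table-link'}):
            link = div.find('a', href=True)
            if link is None:
                continue
            table_soup = self.index_to_soup(resolve_table_url(url, link['href']))
            table = table_soup.find('table')
            if table is not None:
                # Swap the linking div for the downloaded table itself.
                div.replace_with(table)
        # Save the modified page and hand its path back to calibre.
        with tempfile.NamedTemporaryFile(
                mode='w', suffix='.html', delete=False, encoding='utf-8') as f:
            f.write(str(soup))
        return f.name
```

Because `urljoin` drops the last path segment of the base URL before appending a relative link, it covers the "trim off the last bit" step without manual string handling. Articles with no matching divs pass through unchanged, so images and the rest of the recipe processing should behave as before.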

But if for some reason implementing get_obfuscated_article freaks you out, you can re-purpose populate_article_metadata to do what you want. Though IMO, that's a lot more hackish than using get_obfuscated_article.

Or you can implement preprocess_raw_html(), which has the URL available.
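A rough sketch of this last option, assuming the hook meant here is the one calibre's BasicNewsRecipe documents as `preprocess_raw_html(self, raw_html, url)`. The class name and the `absolutize_links` helper are hypothetical. The idea is to rewrite relative links to absolute ones while the URL is still known, so that later steps such as postprocess_html no longer need the article URL.

```python
import re
from urllib.parse import urljoin

try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Fallback so the sketch can be imported outside calibre.
    BasicNewsRecipe = object

# Simplification: only matches double-quoted href attributes.
HREF_RE = re.compile(r'href="([^"]*)"')


def absolutize_links(raw_html, base_url):
    """Hypothetical helper: make every double-quoted href absolute."""
    def fix(match):
        return 'href="%s"' % urljoin(base_url, match.group(1))
    return HREF_RE.sub(fix, raw_html)


class AbsoluteLinksRecipe(BasicNewsRecipe):
    title = 'Hypothetical absolute-links recipe'

    def preprocess_raw_html(self, raw_html, url):
        # url is the address the raw HTML was downloaded from.
        return absolutize_links(raw_html, url)
```

Links that are already absolute are left as they are by `urljoin`, so the rewrite should be safe to apply to every article, including those without tables.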

rmflight · 11-29-2012, 11:37 AM

It doesn't really freak me out; I was just trying to figure out where and how to do it. I will give this a try when I get a chance, which probably won't be for a little bit.

Thank you.
