11-29-2012, 08:48 AM | #1 |
Junior Member
Posts: 6
Karma: 10
Join Date: Nov 2012
Device: Kindle
|
Get article URL in postprocess_html
I need to do some post-processing on an HTML page after it is downloaded, which includes finding divs that contain links to HTML tables. The links are relative, so they depend on the original URL of the downloaded article: this would be the "link" obtained from the actual RSS feed.
Is there a way to get this URL in the postprocess_html function? |
11-29-2012, 08:55 AM | #2 |
creator of calibre
Posts: 43,853
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Use get_obfuscated_article()
|
11-29-2012, 09:16 AM | #3 |
Junior Member
Posts: 6
Karma: 10
Join Date: Nov 2012
Device: Kindle
|
That doesn't seem right. I'm using this recipe, and it actually gets the article pretty nicely, without any trouble.
I basically want to take the URL returned by "print_version", trim off the last bit, and use it to resolve the link to a table that is defined in the article (see line 383 here for an example). Then I would download that page, soupify it, extract the table, and insert it directly into the article. To do that, I need the original URL that was used to download the article, which doesn't seem like it should be hard to get. Are you telling me to use "get_obfuscated_article()" to write a fully custom download method for this article? It doesn't seem like I should have to, because, as I said, some simple tweaks to the recipe already get 99% of the content I need. I want to do this in post-processing because not every article will have tables, but many of them will have images, which the basic recipe handles very nicely. |
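For illustration, the "trim off the last bit" step can be sketched with the standard library's urljoin, which resolves a relative link against the URL of the page it came from. The article URL and table filename below are made-up placeholders, and resolve_table_url is a hypothetical helper name, not part of calibre.

```python
from urllib.parse import urljoin


def resolve_table_url(article_url, relative_href):
    """Resolve a relative link found in an article against the article's URL.

    urljoin drops the last path segment of article_url before appending
    relative_href, which is the same as trimming off the last bit by hand.
    """
    return urljoin(article_url, relative_href)


# Placeholder values, for illustration only.
article_url = 'http://example.com/content/12/3/456.full'
print(resolve_table_url(article_url, 'T1.expansion.html'))
# prints http://example.com/content/12/3/T1.expansion.html
```

This only covers building the absolute URL; the recipe still needs some way to learn article_url at the point where the links are processed.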
11-29-2012, 09:33 AM | #4 |
Junior Member
Posts: 6
Karma: 10
Join Date: Nov 2012
Device: Kindle
|
Or is there a way to tell Calibre that, in addition to the image links, I want to download the table files and do some processing on them as well?
Sorry, still new to Python in general, and the way the recipes work specifically. |
11-29-2012, 11:04 AM | #5 |
creator of calibre
Posts: 43,853
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
There's no big deal in implementing get_obfuscated_article: just get the soup with self.index_to_soup, passing in the URL, and make whatever changes you want. Save the soup to a temp file and return the path to the temp file. The rest of the recipe processing will then take place as normal on the contents of the temp file.
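A minimal sketch of those steps, assuming a Python 3 based calibre. The class name, title, and the edit_soup helper are hypothetical; setting articles_are_obfuscated = True is what makes calibre call get_obfuscated_article with each article URL. The try/except around the import is only there so the snippet can be loaded outside calibre, and the feeds definition is left out.

```python
import tempfile

try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Fallback so the sketch can be imported outside calibre.
    # Inside calibre the real base class is used.
    BasicNewsRecipe = object


class TableRecipe(BasicNewsRecipe):
    title = 'Example recipe'  # hypothetical
    # Without this flag calibre does not call get_obfuscated_article.
    articles_are_obfuscated = True

    def edit_soup(self, soup, url):
        """Hypothetical hook: change the soup in place.

        The article URL is available here, so relative links to tables
        could be resolved and the tables fetched and inserted.
        """
        return soup

    def get_obfuscated_article(self, url):
        # Download and parse the article ourselves.
        soup = self.index_to_soup(url)
        soup = self.edit_soup(soup, url)
        # Save the modified HTML and hand calibre the path to the file.
        with tempfile.NamedTemporaryFile(
                mode='w', suffix='.html', delete=False,
                encoding='utf-8') as f:
            f.write(str(soup))
        return f.name
```

The temp file is created with delete=False because calibre reads it after the method returns.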
But if for some reason implementing get_obfuscated_article freaks you out, you can re-purpose populate_article_metadata to do what you want. Though IMO, that's a lot more hackish than using get_obfuscated_article. Or you can implement preprocess_raw_html(), which has the URL available. |
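As a sketch of the last option: calibre calls preprocess_raw_html(raw_html, url) with the source of each downloaded page and the URL it came from, so relative links can be made absolute there, before postprocess_html ever sees them. The class name and title are hypothetical, and the regular expression is a deliberately crude assumption that only handles quoted href attributes. The try/except around the import is only there so the snippet can be loaded outside calibre.

```python
import re
from urllib.parse import urljoin

try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Fallback so the sketch can be imported outside calibre.
    BasicNewsRecipe = object

# Matches href="..." or href='...' (crude; unquoted values are ignored).
HREF_RE = re.compile(r'''href=(["'])(.*?)\1''', re.IGNORECASE)


class AbsoluteLinksRecipe(BasicNewsRecipe):
    title = 'Example recipe'  # hypothetical

    def preprocess_raw_html(self, raw_html, url):
        def absolutize(match):
            quote, href = match.group(1), match.group(2)
            # Resolve the link against the URL the page was fetched from.
            return 'href=%s%s%s' % (quote, urljoin(url, href), quote)

        return HREF_RE.sub(absolutize, raw_html)
```

With the links already absolute, later processing no longer needs the article URL to find the table pages.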
11-29-2012, 11:37 AM | #6 |
Junior Member
Posts: 6
Karma: 10
Join Date: Nov 2012
Device: Kindle
|
It doesn't really freak me out; I was just trying to figure out where and how to do it. I will give this a try when I get a chance, which probably won't be for a little while.
Thank you. |
Tags |
postprocess_html, url |
|