11-29-2012, 08:48 AM   #1
rmflight
Junior Member
Posts: 6
Karma: 10
Join Date: Nov 2012
Device: Kindle
Get article URL in postprocess_html

I need to do some post-processing on an HTML page after it is downloaded, which includes finding divs that link to HTML tables. The links are relative, so they depend on the original URL of the downloaded article: the "link" obtained from the RSS feed itself.

Is there a way to get this URL in the postprocess_html function?
11-29-2012, 08:55 AM   #2
kovidgoyal
creator of calibre
Posts: 43,853
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Use get_obfuscated_article()
11-29-2012, 09:16 AM   #3
rmflight
That doesn't seem right. I'm using this recipe, and it actually gets the article pretty nicely, without any trouble.

Basically, I want to take the URL returned by "print_version", trim off the last bit, and use it to build the URL of the table linked from the article (see line 383 here for an example). Then I want to download that page, soupify it, extract the table, and insert it directly into the article.

To do that, I need the original URL that was used to download the article. It doesn't seem like it should be hard to do.
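
A minimal sketch of the URL arithmetic described above, assuming Python 3's standard `urllib.parse` (a recipe written in 2012 would have used the Python 2 `urlparse` module instead). The URLs and the helper name `resolve_table_url` are hypothetical and are not taken from the recipe.

```python
from urllib.parse import urljoin


def resolve_table_url(article_url, relative_href):
    """Resolve a relative link found in the article against the article's URL.

    urljoin drops the last path segment of article_url (the "last bit")
    before appending relative_href, the same way a browser would.
    """
    return urljoin(article_url, relative_href)


# Hypothetical URLs, for illustration only.
print_url = "http://example.com/content/13/1/42/print"
print(resolve_table_url(print_url, "table/T1"))
```

Using `urljoin` rather than manual string slicing also handles hrefs that are already absolute or that start with `/`.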

Are you telling me to use "get_obfuscated_article()" to write a fully custom method for this article? It doesn't seem like I should have to, because, as I said, a few simple tweaks to the recipe get 99% of the content I need just fine. I want to do this in post-processing because not every article will have tables, but many will have images, which the basic recipe handles very nicely.
11-29-2012, 09:33 AM   #4
rmflight
Or is there a way to tell Calibre that, in addition to the image links, I want to download the table files and do some processing on them as well?

Sorry, I'm still new to Python in general, and to the way recipes work specifically.
11-29-2012, 11:04 AM   #5
kovidgoyal
There's no big deal in implementing get_obfuscated_article: just get the soup with self.index_to_soup, passing in the URL, and make whatever changes you want. Save the soup to a temp file and return the path to the temp file. The rest of the recipe processing will then take place as normal on the contents of the temp file.
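
A rough sketch of those steps, not a tested recipe. It assumes calibre's documented `articles_are_obfuscated` flag and `get_obfuscated_article(url)` hook. The class name, the `title`, and the `add_tables` helper are hypothetical placeholders, and the `ImportError` fallback exists only so the snippet can be loaded outside calibre.

```python
import tempfile

try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Fallback so the sketch can be imported outside calibre.
    BasicNewsRecipe = object


class TableInliningRecipe(BasicNewsRecipe):  # hypothetical recipe name
    title = 'Example journal'  # placeholder
    # Tells calibre to fetch each article through get_obfuscated_article().
    articles_are_obfuscated = True

    def add_tables(self, soup, url):
        """Hypothetical hook: fetch the tables linked relative to url and
        splice them into soup. Left as a no-op in this sketch."""
        return soup

    def get_obfuscated_article(self, url):
        # url is the address calibre is about to download for this article.
        soup = self.index_to_soup(url)
        soup = self.add_tables(soup, url)
        # Save the modified page; calibre continues processing from this file.
        with tempfile.NamedTemporaryFile(suffix='.html', delete=False) as f:
            f.write(str(soup).encode('utf-8'))
        return f.name
```

Recipes bundled with calibre usually write the file with `PersistentTemporaryFile` from `calibre.ptempfile`; the standard-library `tempfile` is used here only to keep the sketch self-contained.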

But if for some reason implementing get_obfuscated_article freaks you out, you can repurpose populate_article_metadata to do what you want. Though IMO, that's a lot more hackish than using get_obfuscated_article.

Or you can implement preprocess_raw_html(), which has the URL available.
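
As far as I know, the hook in current calibre that receives both the raw page source and its URL is `preprocess_raw_html(self, raw_html, url)`. The sketch below assumes that signature and uses it to turn relative links into absolute ones before the page is parsed. The class name, `title`, and the `absolutize_links` helper are hypothetical, and the regular expression is deliberately crude: it only handles double-quoted `href` attributes.

```python
import re
from urllib.parse import urljoin

try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Fallback so the sketch can be imported outside calibre.
    BasicNewsRecipe = object

# Matches double-quoted href attributes only; a real recipe may need more.
HREF_RE = re.compile(r'href="([^"]+)"')


def absolutize_links(raw_html, url):
    """Rewrite every double-quoted href in raw_html as an absolute URL."""
    def repl(match):
        return 'href="%s"' % urljoin(url, match.group(1))
    return HREF_RE.sub(repl, raw_html)


class AbsoluteLinksRecipe(BasicNewsRecipe):  # hypothetical recipe name
    title = 'Example journal'  # placeholder

    def preprocess_raw_html(self, raw_html, url):
        # url is the address this page was downloaded from.
        return absolutize_links(raw_html, url)
```

With the links made absolute at this stage, later steps such as postprocess_html should, in principle, no longer need to know the article URL.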
11-29-2012, 11:37 AM   #6
rmflight
It doesn't really freak me out; I was just trying to figure out where and how to do it. I will give this a try when I get a chance, though that probably won't be for a little while.

Thank you.