|03-28-2011, 09:34 AM||#1|
Join Date: Dec 2010
Device: Kindle 3 Wifi only
Replacing tags after using them
I'm trying to clean up a really messy HTML page from a newspaper site. They use tables heavily.
In my recipe I was able to find the needed content and extract it via keep_only_tags and remove_tags.
But the articles sit in an inner table (thead, tr, td), which doesn't look good when I convert the recipe to MOBI for my Kindle. Actually, only the first screen is filled with text, and the second page is empty.
So I tried to get rid of the unnecessary tags, but without luck.
I tried postprocess_html:
But it gave me a TypeError:
Then I tried preprocess_regexps, but that gave me empty article pages.
The recipe in its actual state (which works fine if you are creating e.g. PDF output) can be reached here: https://github.com/zsoltika/.hu-reci...0_1_nap.recipe
So my question is: after cleaning up the article's HTML via keep_only_tags and remove_tags, how does one replace some tags (in my case table, thead, tfoot, tr, td) with another tag name (e.g. </?span>), changing ONLY the tag names and not their contents?
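For reference, the regex route asked about here could look roughly like the sketch below. It is only an illustration, not the poster's recipe: TABLE_TAG, to_span and apply_regexps are hypothetical names, and apply_regexps merely mimics applying a list of (compiled pattern, callable) pairs, which is the shape preprocess_regexps takes. It also assumes no attribute value contains a literal '>'.

```python
import re

# Hypothetical pattern: matches opening and closing table-structure tags,
# attributes included, but never the text between the tags.
TABLE_TAG = re.compile(
    r'<(/?)(?:table|thead|tbody|tfoot|tr|td|th)\b[^>]*>',
    re.IGNORECASE,
)


def to_span(match):
    # Keep the slash of a closing tag, drop any attributes.
    return '<%sspan>' % match.group(1)


# Same (pattern, callable) shape as a preprocess_regexps entry in a recipe.
preprocess_regexps = [(TABLE_TAG, to_span)]


def apply_regexps(html, rules):
    # Stand-in for whatever applies the rules; NOT calibre's own code.
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html


sample = '<table class="x"><tr><td>Hello</td></tr></table>'
result = apply_regexps(sample, preprocess_regexps)
print(result)
```

Because the pattern only ever matches the tags themselves, the article text between them passes through unchanged.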
One more thing came to mind: wouldn't it be nicer if the various API callables, overrides, etc. at http://calibre-ebook.com/user_manual/news_recipe.html were numbered? I can't tell which of ['preprocess_html', 'preprocess_regexps', 'keep_only_tags', 'remove_tags'] is applied earlier in the process.
Thanks for any help!
|03-28-2011, 10:04 AM||#2|
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
'linearize_tables' : True
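def postprocess_html(self, soup, first_fetch):
    # Rename the table tags in place; their contents are left untouched.
    for t in soup.findAll(['table', 'tr', 'td']):
        t.name = 'div'
    # The hook has to hand the modified soup back.
    return soup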
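To try the rename-in-place idea outside calibre, here is a minimal sketch that uses the standard library's ElementTree as a stand-in for the BeautifulSoup object a recipe receives. The helper name rename_tags and the sample markup are assumptions for illustration, and ElementTree needs well-formed markup, unlike real newspaper HTML.

```python
import xml.etree.ElementTree as ET


def rename_tags(root, names, new_name='div'):
    # Visit every element (root included) and change only its name;
    # text, attributes and children stay exactly as they were.
    for el in root.iter():
        if el.tag in names:
            el.tag = new_name
    return root


snippet = '<body><table><tr><td>First paragraph</td></tr></table></body>'
root = rename_tags(ET.fromstring(snippet), {'table', 'tr', 'td'})
flattened = ET.tostring(root, encoding='unicode')
print(flattened)
```

The same loop shape applies to the soup version: only the tag name is reassigned, so the nesting and the article text survive.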
|recipes, replacewith, tables|