Replacing tags after using them

hiperlink · 03-28-2011, 08:34 AM

Hi All!

I'm trying to clean up the really messy HTML of a newspaper site's pages. They rely heavily on tables.

In my recipe I was able to find the needed content and extract it via keep_only_tags and remove_tags.

Spoiler:

But the articles sit in an inner table (thead, tr/td), which doesn't look good when I convert the recipe to MOBI for my Kindle: only the first screen is filled with text, and the second page is empty.

So I tried to get rid of the unnecessary tags, but without luck.

I tried postprocess_html:

Spoiler:

But it gave me a TypeError:

Spoiler:

Then I tried preprocess_regexps, but it gave me empty article pages:

Spoiler:

The recipe in its actual state (which works fine if you are creating e.g. PDF output) can be reached here: https://github.com/zsoltika/.hu-reci...0_1_nap.recipe

So my question is: after cleaning up the article's HTML via keep_only_tags and remove_tags, how does one replace some tags (in my case table, thead, tfoot, tr, td; only the tag names, not their contents!) with another tag name (e.g. </?span>)?

And one more thing popped into my mind: wouldn't it be nicer if the various API callables/overrides etc. at http://calibre-ebook.com/user_manual/news_recipe.html were numbered? I mean, I can't tell which of ['preprocess_html', 'preprocess_regexps', 'keep_only_tags', 'remove_tags'] is applied earlier in the process.

Thanks for any help!

Starson17 · 03-28-2011, 09:04 AM

Quote:

Originally Posted by hiperlink

So my question is: after cleaning up the article's HTML via keep_only_tags and remove_tags, how does one replace some tags (in my case table, thead, tfoot, tr, td; only the tag names, not their contents!) with another tag name (e.g. </?span>)?

The simplest way is to linearize the tables via the recipe's conversion_options:

Code:

conversion_options = {'linearize_tables': True}

Alternatively:

Code:

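    def postprocess_html(self, soup, first_fetch):
        # Rename the table tags in place; their contents stay untouched.
        for t in soup.findAll(['table', 'tr', 'td']):
            t.name = 'div'
        # calibre expects the (modified) soup to be returned.
        return soup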
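
For the preprocess_regexps route mentioned in the question, here is a minimal sketch of a rule that renames only the tag names and leaves the contents alone. It assumes the documented shape of preprocess_regexps (a list of tuples of a compiled regex and a callable that takes a match object); the names TABLE_TAG, rename_table_tags and apply_regexps are hypothetical, and apply_regexps only imitates the substitution step so the rule can be tried outside calibre. It has not been run against the recipe linked above.

```python
import re

# Matches opening and closing table-related tags, including any attributes.
# The \b keeps it from matching longer tag names such as <track> or <title>.
TABLE_TAG = re.compile(
    r'<(/?)(?:table|thead|tbody|tfoot|tr|td|th)\b[^>]*>',
    re.IGNORECASE,
)


def rename_table_tags(match):
    # group(1) is '/' for a closing tag and '' for an opening tag,
    # so <td class="x"> becomes <div> and </td> becomes </div>.
    return '<' + match.group(1) + 'div>'


# Same shape as the recipe attribute: (compiled regex, callable) tuples.
preprocess_regexps = [(TABLE_TAG, rename_table_tags)]


def apply_regexps(html, rules):
    # Stand-in for the substitution step, for testing outside calibre.
    for pattern, repl in rules:
        html = pattern.sub(repl, html)
    return html


sample = '<table class="x"><tr><td>Hello <b>world</b></td></tr></table>'
print(apply_regexps(sample, preprocess_regexps))
# prints: <div><div><div>Hello <b>world</b></div></div></div>
```

Note that a regex like this drops all attributes of the renamed tags and will misfire on an attribute value that contains a literal '>', so the soup-based postprocess_html version is the more robust of the two.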

hiperlink · 03-28-2011, 10:23 AM

Worked like a charm, thank you!
