MobileRead Forums - View Single Post

hiperlink · 03-28-2011, 08:34 AM

Hi All!

I'm trying to clean up the really messy HTML of a newspaper site's pages. They use tables heavily.

In my recipe I was able to find the needed content and extract it via keep_only_tags and remove_tags.

Spoiler:

But the articles sit in an inner table (thead, tr/td), which doesn't look good when I convert the recipe to MOBI for my Kindle: only the first screen is filled with text, and the second page is empty.

So I tried to get rid of the unnecessary tags, but without luck.

I tried postprocess_html:

Spoiler:

But it gave me a TypeError:

Spoiler:

Then I tried preprocess_regexps, but it gave me empty article pages:

Spoiler:

The recipe in its current state (which works fine if you are creating e.g. PDF output) can be found here: https://github.com/zsoltika/.hu-reci...0_1_nap.recipe

So my question is: after cleaning up the article HTML via keep_only_tags and remove_tags, how does one replace some tags (in my case table, thead, tfoot, tr, td) with another tag name (e.g. </?span>), changing only the tag names and not their contents?
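
To make the question concrete, here is a minimal sketch of the kind of transformation being asked about, written as a plain regex pass over the HTML text. The helper name rename_table_tags and the sample markup are hypothetical and not taken from the recipe. The sketch drops any attributes on the renamed tags and assumes no attribute value contains a literal ">".

```python
import re

# Matches opening and closing forms of the table-related tags.
# The \b keeps look-alikes such as <track> from matching.
TABLE_TAG_RE = re.compile(
    r'<(/?)(?:table|thead|tbody|tfoot|tr|td|th)\b[^>]*>',
    re.IGNORECASE,
)


def rename_table_tags(html, new_name='span'):
    """Swap table-related tags for new_name, leaving their contents alone.

    Attributes on the original tags are discarded.
    """
    # group(1) is '/' for a closing tag and '' for an opening tag.
    return TABLE_TAG_RE.sub(
        lambda m: '<%s%s>' % (m.group(1), new_name), html
    )


sample = ('<table width="100%"><tr><td class="a">'
          'Hello <b>world</b></td></tr></table>')
print(rename_table_tags(sample))
# prints: <span><span><span>Hello <b>world</b></span></span></span>
```

In a recipe, the same compiled pattern and callable could presumably be supplied as one (pattern, callable) pair in preprocess_regexps, since the calibre manual describes that list as pairs of a compiled regular expression and a function of the match object. One caveat, if the regex pass runs before the tag filters: any keep_only_tags or remove_tags rule that selects on table, tr or td would no longer match after the rename, which might leave nothing behind.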

One more thing that came to mind: wouldn't it be nicer if the various API callables/overrides at http://calibre-ebook.com/user_manual/news_recipe.html were numbered? I mean, I can't tell which of ['preprocess_html', 'preprocess_regexps', 'keep_only_tags', 'remove_tags'] is applied earlier in the process.

Thanks for any help!