Code:
import re

# Class-name patterns for the parts of the page to keep or strip
articleHeaderRegex = '^.*contentHeader__wrapper.*$'
editorLetterHeaderRegex = "^.*contentHeader--vertical__wrapper.*$"
articleContentRegex = "^.*contentbody__wrapper.*$"
imagePlaceHolderRegex = "^.*image__placeholder.*$"
advertisementRegex = "^.*sliderAd__wrapper.*$"
keep_only_tags = [
    dict(name='header', attrs={'class': re.compile(editorLetterHeaderRegex, re.IGNORECASE)}),
    dict(name='header', attrs={'class': re.compile(articleHeaderRegex, re.IGNORECASE)}),
    dict(name='div', attrs={'class': re.compile(articleContentRegex, re.IGNORECASE)})
]
remove_tags = [
    dict(name="aside"),
    dict(name="svg"),
    dict(name="blockquote"),
    dict(name="img", attrs={'class': re.compile(imagePlaceHolderRegex, re.IGNORECASE)}),
    dict(name="div", attrs={'class': re.compile(advertisementRegex, re.IGNORECASE)}),
]
https://github.com/kovidgoyal/calibre/blob/3dd95981398777f3c958e733209f3583e783b98c/recipes/mit_technology_review.recipe
Only contentBody__wrapper works; it matches the body, which is most of the article.
contentHeader__wrapper needs to be changed, but from what I found, different articles use different header classes:
contentArticleHeader--fullBleed__intro--30Y0q
contentArticleHeader__title--rp01p
contentArticleHeader--vertical__intro--2soVS
Can anyone help me find an easier way to do this?
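One possible approach, sketched under assumptions: the three class names listed above all start with contentArticleHeader, so a single pattern on that shared prefix could cover them instead of one entry per variant. This assumes every header variant keeps that prefix and that the trailing hash (for example --30Y0q) can change between builds; I have not checked which tag carries each class, so the header entry below omits name and matches any tag. The samples list is only there to check the pattern against the names from this post.

```python
import re

# Assumption: all header variants share the 'contentArticleHeader' prefix.
# The pattern keeps the '^.*...*$' style of the existing recipe.
articleHeaderRegex = '^.*contentArticleHeader.*$'
articleContentRegex = '^.*contentbody__wrapper.*$'

header_pattern = re.compile(articleHeaderRegex, re.IGNORECASE)
content_pattern = re.compile(articleContentRegex, re.IGNORECASE)

keep_only_tags = [
    # No 'name' key: match the class on any tag (header, div, h1, ...)
    dict(attrs={'class': header_pattern}),
    dict(name='div', attrs={'class': content_pattern}),
]

# Quick check against the class names quoted in this post
samples = [
    'contentArticleHeader--fullBleed__intro--30Y0q',
    'contentArticleHeader__title--rp01p',
    'contentArticleHeader--vertical__intro--2soVS',
]
matches = [bool(header_pattern.match(name)) for name in samples]
print(matches)
```

Because the pattern matches on the prefix, elements nested inside an already matched header may match too, so it is worth checking the generated output for duplicated titles or intros.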