01-02-2022, 02:51 AM   #1
unkn0wn
MIT Technology Review: the recipe still works, but the articles now come out without their header content.

Code:
    articleHeaderRegex = '^.*contentHeader__wrapper.*$'
    editorLetterHeaderRegex = "^.*contentHeader--vertical__wrapper.*$"
    articleContentRegex = "^.*contentbody__wrapper.*$"
    imagePlaceHolderRegex = "^.*image__placeholder.*$"
    advertisementRegex = "^.*sliderAd__wrapper.*$"

    keep_only_tags = [
        dict(name='header',  attrs={'class': re.compile(editorLetterHeaderRegex, re.IGNORECASE)}),
        dict(name='header',  attrs={'class': re.compile(articleHeaderRegex, re.IGNORECASE)}),
        dict(name='div',  attrs={'class': re.compile(articleContentRegex, re.IGNORECASE)})
    ]
    remove_tags = [
        dict(name="aside"),
        dict(name="svg"),
        dict(name="blockquote"),
        dict(name="img", attrs={'class': re.compile(imagePlaceHolderRegex, re.IGNORECASE)}),
        dict(name="div", attrs={'class': re.compile(advertisementRegex, re.IGNORECASE)}),
    ]

https://github.com/kovidgoyal/calibre/blob/3dd95981398777f3c958e733209f3583e783b98c/recipes/mit_technology_review.recipe


Only the contentBody__wrapper match still works, which covers the body, i.e. most of the article.

The contentHeader__wrapper pattern needs to be changed, but from what I found, different articles use different header classes:

contentArticleHeader--fullBleed__intro--30Y0q
contentArticleHeader__title--rp01p
contentArticleHeader--vertical__intro--2soVS
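
One idea, as an untested sketch: all three class names above start with contentArticleHeader, so a single pattern on that prefix might cover them. This assumes every header variant keeps that prefix, which I have only seen on these three. The keep_header entry is hypothetical and leaves out the tag name, since I don't know which tags carry these classes; it may need narrowing if the matching elements turn out to be nested inside each other.

```python
import re

# Assumption: every header variant shares the "contentArticleHeader" prefix,
# as the three class names observed so far do.
articleHeaderRegex = r'^.*contentArticleHeader.*$'
header_pattern = re.compile(articleHeaderRegex, re.IGNORECASE)

# Class names copied from the articles checked so far.
observed_classes = [
    'contentArticleHeader--fullBleed__intro--30Y0q',
    'contentArticleHeader__title--rp01p',
    'contentArticleHeader--vertical__intro--2soVS',
]

# Hypothetical keep_only_tags entry: no tag name is given, so any element
# whose class matches the pattern would be kept.
keep_header = dict(attrs={'class': header_pattern})

# Quick check that the pattern accepts the headers but not the body wrapper.
# The "--abc" suffix on the body class is made up for the test.
header_matches = [bool(header_pattern.match(c)) for c in observed_classes]
body_match = bool(header_pattern.match('contentBody__wrapper--abc'))

print(header_matches, body_match)
```

If that holds up, keep_header would replace the two name='header' entries in keep_only_tags, with the contentBody entry left as it is.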


Can anyone help find an easier way to do this?