MobileRead Forums - View Single Post - Is there a way to see all HTML tags used in an ebook?

lomkiri · 04-10-2025, 08:04 AM

Quote:

Originally Posted by roger64

Your first function -the only one I tried

The two functions give exactly the same result if we keep the default parameters (excl=() and max_it=None).

The second form just lets you filter out uninteresting tags such as <title>, <html> or <body>, and/or catch tags with few occurrences (as Karellen wished).

Quote:

we then can use the Reports Calibre tool

It's a great tool, and much better integrated than this regex-function, which is only a workaround. First of all it is clickable, and it points you to the definition of the tag or the class in the CSS. It also gives you the count for some tag combinations (e.g. "div, p") when they appear in the CSS.

But it doesn't give the same results. For example, if you have an <img> tag in your text that is not in the CSS, it will appear with my function but not in the report. So this function is a better fit for Karellen's needs (finding a mistyped tag, for example), or for when we need a raw, dumb list of all tags, as Urnoev was asking for.

Quote:

If we wish to refine further, for example look after all these p tags

It is also possible to add an "incl" parameter to this function, to select only some tags. If not empty, this parameter takes precedence over excl (if both incl AND excl are defined, only incl is considered).

If incl = excl = () and max_it is 0 or None, no filter is applied.
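The precedence rule can be tried in isolation, outside calibre. This is only a minimal sketch: select_tags and the sample counts are hypothetical names made up for illustration, but the filter expressions mirror the ones used in the function.

```python
def select_tags(counts, incl=(), excl=(), max_it=None):
    """Return the tags kept by the incl / excl / max_it filters (hypothetical helper)."""
    def under_limit(n):
        # max_it = None or 0 means "no limit"
        return not max_it or n <= max_it

    if incl:
        # incl has precedence: excl is ignored entirely
        return {k: v for k, v in counts.items() if k in incl and under_limit(v)}
    return {k: v for k, v in counts.items() if k not in excl and under_limit(v)}


# Made-up counts, only for the demonstration
counts = {'html': 1, 'body': 1, 'p': 120, 'img': 3, 'svg': 1}

print(select_tags(counts))                                 # no filter
print(select_tags(counts, excl=('html', 'body')))          # exclusion only
print(select_tags(counts, incl=('img', 'svg'), excl=('img',)))  # incl wins over excl
print(select_tags(counts, max_it=3))                       # only tags with few occurrences
```

In the third call, img is kept even though it is also listed in excl, because a non-empty incl switches the exclusion off.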

Code:

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    r"""
    Count the number of occurrences of every HTML tag in an epub.
    May be filtered by tag name and by max number of occurrences.
    Search regex: <(\w+)
    """
    # last pass (extra call after the final match)
    if match is None:

        #### Parameters for Filters: ####
        # No filter at all if excl = incl = () and max_it is None or 0
        # Include only some tags (if defined, this deactivates any exclusion), e.g.:
        # incl = ('img', 'svg')    # () for no inclusion
        # Exclude some tags, e.g.:
        # excl = ('html', 'meta', 'body', 'title', 'div', 'p')    # () for no exclusion
        # Hide tags with more occurrences than max_it, e.g.:
        # max_it = 5    # None or 0 for no limit

        incl = ()        # () for no filter
        excl = ()        # () for no filter
        max_it = None    # 0 or None for no filter
        sort = 'name'    # None | 'name' | 'number' (any other value will sort by name)
        reverse = False  # Reverse sorting if True
        #####

        if incl:
            my_tags = {k: v for k, v in data.items() if k in incl and (not max_it or v <= max_it)}
        else:
            my_tags = {k: v for k, v in data.items() if k not in excl and (not max_it or v <= max_it)}

        print(f'Found a total of {number} tags, with {len(data)} different tags')
        if incl and excl:
            print('You have defined inclusions AND exclusions. Only inclusions have been treated')
        if len(my_tags) < len(data):
            print(f'Selected a total of {sum(my_tags.values())} tags, with {len(my_tags)} different tags')
        
        if sort is None:
            ind = my_tags.keys()    # order of appearance
        elif sort.lower() == 'number':
            ind = sorted(my_tags, key=(lambda k: my_tags[k]), reverse=reverse)
        else:    # ordered by name
            ind = sorted(my_tags, reverse=reverse)
        
        for key in ind:
            print(f'{key} : {my_tags[key]}')
        return

    # normal pass: count the tag and leave the text unchanged
    tag = match[1]
    data[tag] = data.get(tag, 0) + 1
    return match[0]

replace.call_after_last_match = True    # Ask for one extra call after the last match

Edit: I've added two parameters to sort the output by name or by number, reversed or not.
With the default parameters, this version gives the same result as the first version.
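
If you want to check the counting on a snippet outside the editor, something like the following should do. It is just a rough sketch under my own assumptions: count_tags, TAG_RE and the sample string are invented for the example, it does not go through calibre at all, and it only applies the same search regex with the standard re module.

```python
import re
from collections import Counter

# Same search regex as in the function: an opening "<" followed by the tag name.
# Closing tags like </p> are not matched, because "/" is not a word character.
TAG_RE = re.compile(r'<(\w+)')


def count_tags(documents):
    """Count opening tags over a list of HTML strings (hypothetical helper)."""
    counts = Counter()
    for text in documents:
        counts.update(m[1] for m in TAG_RE.finditer(text))
    return counts


# Made-up sample standing in for the text files of a book
sample = ['<html><body><p>One</p><p>Two <img src="a.png"/></p></body></html>']
counts = count_tags(sample)

print(f'Found a total of {sum(counts.values())} tags, with {len(counts)} different tags')
for name in sorted(counts):
    print(f'{name} : {counts[name]}')
```

Like the regex-function, this is a raw text search, so anything that looks like "<word" is counted, whether or not it is a valid tag.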