MobileRead Forums - View Single Post - Is there a way to see all HTML tags used in an ebook?

lomkiri · 04-03-2025, 08:48 AM

@Karellen: From your other thread, I see this:

Quote:

I am surprised that you think there is too little use for a tag report, as users can very quickly spot tags that need investigating. Personally, I have been caught out so many times- that tag which is used once, and makes you wonder why it is even there.

In that case, you can easily print only the tags that have no more than x occurrences, or keep an exclusion list for common tags such as body, div, p, and so on.

Code:

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    r"""
    Count the occurrences of every HTML tag in an epub.
    The output may be filtered by tag name and by a maximum number of occurrences.
    Search regex: <(\w+)
    """
    # last pass: extra call after the final match, with match set to None
    if match is None:

        # Exclusions:
        # excl = ('html', 'meta', 'body', 'title', 'div', 'p')    # or () for no exclusion
        # max_it = 5    # tags with more occurrences than this are not printed; None or 0 for no limit
        excl = ()
        max_it = None

        my_tags = {k: v for k, v in data.items() if k not in excl and (not max_it or v <= max_it)}
        print(f'Found a total of {number} tags, with {len(data)} different tags')
        if len(my_tags) < len(data):
            print(f'Selected a total of {sum(my_tags.values())} tags, with {len(my_tags)} different tags')
        for key in sorted(my_tags):
            print(f'{key}: {my_tags[key]}')
        return

    # normal pass: count the tag name and leave the text unchanged
    tag = match[1]
    data[tag] = data.get(tag, 0) + 1
    return match[0]

replace.call_after_last_match = True    # ask for one extra call after the last match
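
If you want to check the search regex before running it on a book, here is a minimal standalone sketch in plain Python, outside calibre. The name count_tags and the sample string are made up for illustration; it only assumes the regex <(\w+), which matches opening tags and skips closing tags, comments and processing instructions.

```python
import re
from collections import Counter

# '<' directly followed by word characters: opening tags only.
# '</p>', '<!-- ... -->' and '<?xml ...?>' do not match.
TAG_RE = re.compile(r'<(\w+)')


def count_tags(html):
    """Return a Counter mapping each opening-tag name to its number of occurrences."""
    return Counter(m[1] for m in TAG_RE.finditer(html))


# Hypothetical sample, not taken from a real book
sample = '<html><body><div><p>One</p><p>Two <i>words</i></p></div></body></html>'
counts = count_tags(sample)
for name in sorted(counts):
    print(f'{name}: {counts[name]}')
```

This is only a quick check of the pattern on one string; the function above is still what does the counting across all files of the book in the editor.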

Note: You can add a "min_it" filter in the same way, although I don't see a use for it.
Note: I've removed the "if not data" error test, since a valid epub must contain at least an html tag.
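
For what it's worth, the "min_it" filter could look like the sketch below. It is written as a separate helper so it can be tried outside calibre; select_tags and the sample counts are hypothetical, and I keep the same convention as max_it (None or 0 means no limit).

```python
def select_tags(counts, excl=(), min_it=None, max_it=None):
    """Keep tags not listed in excl whose count is between min_it and max_it.

    None or 0 disables the corresponding bound.
    """
    return {
        tag: n for tag, n in counts.items()
        if tag not in excl
        and (not min_it or n >= min_it)
        and (not max_it or n <= max_it)
    }


# Made-up counts, for illustration only
counts = {'body': 1, 'div': 40, 'p': 350, 'span': 12, 'u': 1}

rare = select_tags(counts, excl=('body',), max_it=5)   # tags worth investigating
frequent = select_tags(counts, min_it=20)              # the opposite view
print(rare)
print(frequent)
```

In the function above, the same two conditions would simply go into the dict comprehension that builds my_tags.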