MobileRead Forums - View Single Post

lomkiri · 04-11-2025, 09:49 PM

Original discussions here and here.

In those discussions, someone asked for a calibre feature that reports the number of occurrences of each HTML tag in the text files. The "Reports" tool doesn't give this information if a tag is not in the CSS. Since Kovid doesn't consider it a useful feature, I proposed a simple workaround with a regex function; it is not as practical as the "Reports" tool, of course. I'm publishing it here in case it is of use to others.

I'm posting two versions. The first is very simple: it gives the result for all tags found, ordered by name, and its logic is easy to follow.
The second is slightly more complex and has some parameters (filters by tag name or by maximum count, choice of sort order). With the default parameter values, it gives the same results as the simpler one.
The "max_it" parameter is there to make it easier to locate erroneous tag names (see the example given by Karellen in his feature request to Kovid).

find : <(\w+)
replace : one of the two functions below (prefer the second one, which is more powerful)
Click on "Replace all" so that all the tags of the epub are processed.
The dialog box reports a number of modifications (which is the total number of matches for all tags), but the files are not modified (although the "Save epub" button will be enabled).
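
To see what the search expression captures, here is a minimal sketch that runs the same regex outside calibre with Python's standard `re` module. The sample markup and the variable names are made up for illustration; inside calibre the search runs over the text files of the book instead.

```python
import re
from collections import Counter

# Hypothetical sample markup, only for illustration.
sample = '<html><body><p class="a">One<br/>two</p><p>Three</p></body></html>'

# Same expression as in the "find" field: '<' followed by word characters.
# Closing tags such as </p> are skipped because '/' is not a word character.
tag_counts = Counter(re.findall(r'<(\w+)', sample))

for tag in sorted(tag_counts):
    print(f'{tag}: {tag_counts[tag]}')
```

Since `\w` matches neither `:` nor `-`, a namespaced or hyphenated tag name would only be captured up to that character.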

The bare function, without parameters (it prints the number of occurrences for every tag):

Code:

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    r"""
    Count the number of occurrences of every HTML tag in an epub.
    Search regex: <(\w+)
    """

    # Final call, made once after the last match: print the report
    if match is None:
        print(f'Found a total of {number} tags, with {len(data)} different tags\n')
        for key in sorted(data):
            print(f'{key}: {data[key]}')
        return

    # Normal call, once per match: count the tag and leave the text unchanged
    tag = match[1]
    data[tag] = data.get(tag, 0) + 1
    return match[0]

replace.call_after_last_match = True    # Ask for last passage
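
For readers who want to try the logic outside calibre, the calling sequence can be imitated with a small driver. This is only a sketch: `simulate_replace_all` and the sample files are hypothetical and are not calibre's real code. It assumes that `data` is one dict shared by all calls and that the extra final call receives `match = None` together with the total number of matches, which is how the function above uses them.

```python
import re

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # Compact copy of the counting function above.
    if match is None:
        for key in sorted(data):
            print(f'{key}: {data[key]}')
        return
    data[match[1]] = data.get(match[1], 0) + 1
    return match[0]

replace.call_after_last_match = True

def simulate_replace_all(func, pattern, files):
    """Hypothetical stand-in for "Replace all"; not calibre's real code."""
    data = {}      # assumed to be shared by every call during one run
    number = 0
    for file_name, text in files.items():
        for match in re.finditer(pattern, text):
            number += 1
            func(match, number, file_name, None, None, data, None)
    # One extra call with match=None if the function asked for it.
    if getattr(func, 'call_after_last_match', False):
        func(None, number, '', None, None, data, None)
    return data

# Made-up file contents for the demonstration.
files = {
    'ch1.xhtml': '<body><p>Hello</p><p>World</p></body>',
    'ch2.xhtml': '<body><div><img src="a.png"/></div></body>',
}
counts = simulate_replace_all(replace, r'<(\w+)', files)
```

Because the function returns `match[0]` on every normal call, each match is replaced by itself, which is why the files stay unchanged.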

The same function, but with some filtering and sorting options:

Code:

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    r"""
    Count the number of occurrences of every HTML tag in an epub.
    The report may be filtered by tag name and by maximum number of occurrences.
    Search regex: <(\w+)
    """

    def plural(word, n):
        return word + ('s' if n > 1 else '')

    # Final call, made once after the last match: print the report
    if match is None:

        #### Filter parameters: incl, excl, max_it. Ordering: sort, reverse
        # No filtering at all if incl = excl = [] and max_it = None.
        # incl: report only these tags (if non-empty, excl is ignored), e.g.:
        #     incl = ['img', 'svg']    # [] for no inclusion filter
        # excl: leave these tags out of the report, e.g.:
        #     excl = ['html', 'meta', 'body', 'title', 'div', 'p']    # [] for no exclusion
        # max_it: hide tags that have more occurrences than this, e.g.:
        #     max_it = 5    # None or 0 for no limit
        # sort: None | 'name' | 'number' (any other string sorts by name), e.g.:
        #     sort = 'number'
        # reverse: False for ascending order, True for descending

        incl = []        # [] for no filter, ['div'] for a single tag
        excl = []        # [] for no filter
        max_it = None    # 0 or None for no limit
        sort = 'name'    # None for no sorting
        reverse = False
        #####
        
        # Prepare the print of the filters (if any), for information:
        print_param = []
        if incl:
            print_param.append('Include only those tags: ' + ', '.join(incl))
        if excl:
            print_param.append('Exclude those tags: ' + ', '.join(excl))
        if max_it:
            print_param.append(f"Don't print tags with more than {max_it} {plural('occurrence', max_it)}")
            
        # counting by tag
        if incl:
            my_tags = {k: v for k, v in data.items() if k in incl and (not max_it or v <= max_it)}
        else:
            my_tags = {k: v for k, v in data.items() if k not in excl and (not max_it or v <= max_it)}

        # print headers
        print(f'Found a total of {number} {plural("occurrence", number)} and {len(data)} different {plural("tag", len(data))}')
        if print_param:
            print(6*' ' + '\n      '.join(print_param))
        if incl and excl:
            print('You have defined inclusions AND exclusions. Only inclusions have been treated')
        if len(my_tags) == 0:
            print('No occurrences found with those criteria')
        elif len(my_tags) < len(data):
            ntags = sum(my_tags.values())
            print(f'Selected a total of {ntags} {plural("occurrence", ntags)} and {len(my_tags)} different {plural("tag", len(my_tags))}')
        print('')
        
        if sort is None:
            ind = my_tags.keys()
        elif sort.lower() == 'number':
            ind = sorted(my_tags, key=(lambda k: my_tags[k]), reverse=reverse)
        else:
            ind = sorted(my_tags, reverse=reverse)
 
        # Print the occurrences by tag
        for key in ind:
            print(f'{key} : {my_tags[key]}')
        return
    # End of the final call

    # Normal call, once per match: count the tag and leave the text unchanged
    tag = match[1]
    data[tag] = data.get(tag, 0) + 1
    return match[0]

replace.call_after_last_match = True    # Ask for last passage
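
The filtering and ordering rules can also be tried on their own, outside calibre. The sketch below restates them as a standalone helper; `select_tags` and the sample counts are hypothetical names and values used only for illustration, and the helper returns the selected pairs instead of printing them.

```python
def select_tags(counts, incl=(), excl=(), max_it=None, sort='name', reverse=False):
    """Return (tag, count) pairs after filtering and ordering."""
    def keep(tag, n):
        # max_it of None or 0 means "no limit".
        if max_it and n > max_it:
            return False
        # An inclusion list takes precedence over any exclusion list.
        return tag in incl if incl else tag not in excl

    kept = {t: n for t, n in counts.items() if keep(t, n)}
    if sort is None:
        order = list(kept)                      # keep insertion order
    elif sort.lower() == 'number':
        order = sorted(kept, key=kept.get, reverse=reverse)
    else:
        order = sorted(kept, reverse=reverse)   # any other string: by name
    return [(t, kept[t]) for t in order]

# Made-up counts; 'dvi' stands for a mistyped 'div'.
counts = {'p': 120, 'div': 56, 'a': 41, 'img': 3, 'dvi': 1}

# Showing only rare tags makes a misspelled tag name stand out.
rare = select_tags(counts, max_it=5, sort='number')
print(rare)
```

This is the use case for "max_it": a mistyped tag name normally occurs only a few times, so hiding the frequent tags leaves a short list to check.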

The result (with the default parameters):

Code:

Found a total of 2613 occurrences and 14 different tags

a : 41
body : 20
br : 7
div : 56
h1 : 11
[etc.]