Is there a way to see all HTML tags used in an ebook?

Urnoev · 04-02-2025, 09:47 AM

Hello,

I would like to be able to look at a list of all HTML tags which have been used in an ebook, ideally with the number of occurrences.
The Reports feature/tool of the editor provides something similar for CSS style rules and classes, words, links etc., but not for HTML tags.

I wasn't able to find such functionality and would like to avoid using regex for this.

Thanks!

Karellen · 04-02-2025, 02:30 PM

There is no report to generate this list.
A couple of years ago I did ask @kovidgoyal if he could implement such a list, but he felt there was no need for it, which I didn't really agree with.
See here... https://www.mobileread.com/forums/sh...d.php?t=357312

Urnoev · 04-02-2025, 02:38 PM

Ah, a shame. At least I am not alone: the problem you describe in your post is exactly the one I'm having, and it is what motivated me to ask here.

I'm assuming you have found some kind of solution of your own for this, but just in case anyone's interested:
I've built an (ugly) command chain for myself to check for such tags in my extracted EPUB files:

Code:

grep -r -n -I -P "<[^\s/]*>" | grep -P "\.xhtml" | grep -P -v "nav\.xhtml" | grep -P -v "<head>" | grep -P -v "<title>" | grep -P -v "<body>" | grep -P -v "<p>" | grep -P -v "<li>" | grep -P -v "<tbody>" | grep -P -v "<tr>" | grep -P -v "<td>" | awk '{print $0,"\n"}' | head -n -1
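For anyone who prefers a script to a grep chain, here is a minimal Python sketch of the same idea: walk the extracted EPUB folder, skip nav.xhtml, and count opening tags. The names `count_tags_in_text` and `count_tags_in_dir` are made up for this example, and the simple pattern `<(\w+)` is an assumption about what counts as a tag (it ignores closing tags and comments, and stops at the first non-word character of a tag name).

```python
import re
from collections import Counter
from pathlib import Path

# Matches the name of an opening tag: "<p", "<div", "<h1" ...
# Closing tags ("</p>") and comments ("<!--") do not match.
TAG_RE = re.compile(r'<(\w+)')

def count_tags_in_text(text):
    """Return a Counter of opening-tag names found in one markup string."""
    return Counter(TAG_RE.findall(text))

def count_tags_in_dir(root, skip=('nav.xhtml',)):
    """Sum the tag counts of every .xhtml file below root, skipping some file names."""
    total = Counter()
    for path in sorted(Path(root).rglob('*.xhtml')):
        if path.name in skip:
            continue
        total += count_tags_in_text(path.read_text(encoding='utf-8', errors='replace'))
    return total
```

Usage, from the folder where the EPUB was extracted: `for tag, n in sorted(count_tags_in_dir('.').items()): print(f'{tag}: {n}')`.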

lomkiri · 04-02-2025, 09:51 PM

What about a search/replace on the whole epub, using a regex function?

Find: <(\w+)
Replace: the function below
Do a "Replace all", so you'll get all the tags of the epub.
The number of replacements shown in the dialog box is the total number of tags, but no change is made to the epub.

Code:

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    
    # last passage
    if match is None:
        if not data:
            print('No tag found')
        else:
            print(f'Found a total of {number} tags, with {len(data)} different tags\n')
            for key in sorted(data):
                print(f'{key}: {data[key]}')
        return
    
    # normal passage
    tag = match[1]
    data[tag] = data.get(tag, 0) + 1
    return match[0]

replace.call_after_last_match = True    # Ask for last passage

The result will be :

Code:

Debug output from __count tags

Found a total of 12605 tags, with 22 different tags

a: 6
body: 78
br: 14
div: 143
em: 45
figure: 2
h1: 7
h2: 64
[etc.]
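To see why "Replace all" counts every tag yet leaves the book untouched, here is a minimal sketch that imitates the editor's calls outside calibre. `run_like_editor` is a hypothetical helper, not calibre code; it assumes the function is called once per match with a running `number`, then one extra time with `match=None` when `call_after_last_match` is set, which is what the function above relies on.

```python
import re

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # Same counting logic as the function above, trimmed for the demo
    if match is None:
        for key in sorted(data):
            print(f'{key}: {data[key]}')
        return
    data[match[1]] = data.get(match[1], 0) + 1
    return match[0]    # hand the matched text back unchanged

replace.call_after_last_match = True

def run_like_editor(func, pattern, files):
    """Hypothetical stand-in for 'Replace all' over several files (not calibre code)."""
    data, number, out = {}, 0, {}
    for name, text in files.items():
        def repl(m, name=name):
            nonlocal number
            number += 1    # running count of matches, shared across files
            return func(m, number, name, None, None, data, None)
        out[name] = re.sub(pattern, repl, text)
    if getattr(func, 'call_after_last_match', False):
        func(None, number, None, None, None, data, None)    # the extra "last passage"
    return out, data

files = {'ch1.xhtml': '<p>one</p><p>two <em>x</em></p>'}
out, data = run_like_editor(replace, r'<(\w+)', files)
```

Here `out` is identical to `files`, because every match is replaced by itself, while `data` ends up as `{'p': 2, 'em': 1}`.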

Karellen · 04-03-2025, 01:46 AM

Thanks @lomkiri
That is a great workaround.

Urnoev · 04-03-2025, 04:43 AM

Yes, thank you, I prefer your workaround.

DrChiper · 04-03-2025, 04:45 AM

@Kovid: Maybe an addition for the already existing Reports function (Editor>Tools>Reports)?

Karellen · 04-03-2025, 05:30 AM

Quote:

Originally Posted by DrChiper

@Kovid: Maybe an addition for the already existing Reports function (Editor>Tools>Reports)?

Yea, I already asked for that. Would be nice...
See my link in the second post of this thread.

DrChiper · 04-03-2025, 05:36 AM

Ah, I missed that. Well, Kovid seems to have made his point already then.

lomkiri · 04-03-2025, 07:48 AM

@Karellen : From your other thread, I see this :

Quote:

I am surprised that you think there is too little use for a tag report, as users can very quickly spot tags that need investigating. Personally, I have been caught out so many times- that tag which is used once, and makes you wonder why it is even there.

In that case, you can easily print only the tags that have fewer than x entries, or keep an exclusion list for common tags such as body, div, p, and so on.

Code:

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    r"""
    Count the number of occurrences of every html tag in an epub.
    May be filtered by tag name and by max number of occurrences.
    search regex: <(\w+)
    """
    # last passage
    if match is None:

        # Exclusions:
        # excl = ('html', 'meta', 'body', 'title', 'div', 'p')    # or () for no exclusion
        # max_it = 5    # no print if more occ than this. None or 0 for no limit
        excl = ()
        max_it = None

        my_tags = {k: v for k, v in data.items() if k not in excl and (not max_it or v <= max_it)}
        print(f'Found a total of {number} tags, with {len(data)} different tags')
        if len(my_tags) < len(data):
            print(f'Selected a total of {sum(my_tags.values())} tags, with {len(my_tags)} different tags')
        for key in sorted(my_tags):
            print(f'{key}: {my_tags[key]}')
        return

    # normal passage
    tag = match[1]
    data[tag] = data.get(tag, 0) + 1
    return match[0]

replace.call_after_last_match = True    # Ask for last passage

Note: You can add a "min_it" filter in the same way, although I don't see a use for it.
Note: I've removed the "if not data" error test, since a valid epub must contain at least one html tag.
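As an illustration of the "min_it" idea, here is a small sketch. `select_tags` is a hypothetical helper that isolates the filtering line of the function above so it can be tried outside calibre; inside the function you would only extend the dict comprehension with the extra condition.

```python
def select_tags(data, excl=(), min_it=None, max_it=None):
    """Keep tags that are not excluded and whose count lies within [min_it, max_it].

    As with max_it in the function above, None or 0 means "no limit".
    """
    return {
        k: v for k, v in data.items()
        if k not in excl
        and (not min_it or v >= min_it)
        and (not max_it or v <= max_it)
    }

counts = {'p': 100, 'div': 10, 'span': 1}
rare = select_tags(counts, max_it=5)                 # tags used at most 5 times
middle = select_tags(counts, min_it=5, max_it=50)    # tags used between 5 and 50 times
```

With these sample counts, `rare` is `{'span': 1}` and `middle` is `{'div': 10}`.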

lomkiri · 04-03-2025, 02:36 PM

Modified: "max_it" may be set to 0 or None for no limit of occurrences.
Not really necessary, but it's cleaner than putting max_it = 1000000 :-)

Karellen · 04-04-2025, 03:16 PM

Thanks @lomkiri, I'll check this out also.

roger64 · 04-09-2025, 09:30 PM

Thanks @lomkiri

Your first function -the only one I tried- works quite fine. I added it to my saved searches.

If we wish to refine further, for example look into all these p tags, we then can use the Reports Calibre tool which gives us the number of classes.

lomkiri · 04-10-2025, 08:04 AM

Quote:

Originally Posted by roger64

Your first function -the only one I tried

The two functions give exactly the same result if we keep the default parameters (excl=() and max_it=None).

The second form just gives you the ability to filter out uninteresting tags such as <title>, <html> or <body>, and/or to catch tags with few occurrences (as Karellen wanted).

Quote:

we then can use the Reports Calibre tool

It's a great tool, and much better integrated than this regex function, which is only a workaround. To begin with, its entries are clickable and point you to the definition of the tag or the class in the css. It also gives you the count for chains of tags (e.g. "div, p") if they are in the css.

But it doesn't give the same results. For example, if you have an <img> tag in your text but not in the css, it will appear with my function but not in the report. So this function is a better fit for Karellen's needs (a mistyped tag, for example), or for when we need a raw, dumb list of all tags, as Urnoev was asking for.

Quote:

If we wish to refine further, for example look into all these p tags

It is also possible to add a parameter "incl" to this function, to select only some tags. This parameter, if not empty, takes precedence over excl (if incl AND excl are defined, only incl will be considered).

If incl = excl = () and max_it = 0, no filter will be applied.

Code:

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    r"""
    Count the number of occurrences of every html tag in an epub.
    May be filtered by tag name and by max number of occurrences.
    search regex: <(\w+)
    """
    # last passage
    if match is None:

        #### Parameters for Filters: ####
        # No filter at all if excl = incl = (), and if max_it = None
        # Include only some tags (if defined, will deactivate any exclusion). E.g.:
        # incl = ('img', 'svg')    # () for no inclusion
        # Exclusion of some tags, e.g. : 
        # excl = ('html', 'meta', 'body', 'title', 'div', 'p')    # () for no exclusion
        # no display if more occ than max_it, e.g:
        # max_it = 5    # None or 0 for no limit

        incl = ()        # () for no filter
        excl = ()        # () for no filter
        max_it = None    # 0 or None for no filter
        sort = 'name'    # None | 'name' | 'number' (any other value will sort by name)
        reverse = False  # Reverse sorting if True
        #####

        if incl:
            my_tags = {k: v for k, v in data.items() if k in incl and (not max_it or v <= max_it)}
        else:
            my_tags = {k: v for k, v in data.items() if k not in excl and (not max_it or v <= max_it)}

        print(f'Found a total of {number} tags, with {len(data)} different tags')
        if incl and excl:
            print('You have defined inclusions AND exclusions. Only inclusions have been treated')
        if len(my_tags) < len(data):
            print(f'Selected a total of {sum(my_tags.values())} tags, with {len(my_tags)} different tags')
        
        if sort is None:
            ind = my_tags.keys()    # order of appearance
        elif sort.lower() == 'number':
            ind = sorted(my_tags, key=(lambda k: my_tags[k]), reverse=reverse)
        else:    # ordered by name
            ind = sorted(my_tags, reverse=reverse)
        
        for key in ind:
            print(f'{key} : {my_tags[key]}')
        return

    # normal passage
    tag = match[1]
    data[tag] = data.get(tag, 0) + 1
    return match[0]

replace.call_after_last_match = True    # Ask for last passage

Edit: I've added 2 parameters to sort the output by name or number, reversed or not.
With the default parameters, this version gives the same result as the first version.

lomkiri · 04-10-2025, 12:20 PM

Added a param. for sorting by name (default), by number of occurrences, or by order of appearance
Added a param. for reversed sorting
With the default parameters, the function behaves as the first version.
