| 
			
			 | 
		#1 | 
| 
			
			
			
			 Junior Member 
			
			![]() Posts: 6 
				Karma: 10 
				Join Date: Mar 2024 
				
				
				
				Device: Kobo Libra 2 
				
				
				 | 
	
	
	
		
		
			
			 
				
				Is there a way to see all HTML tags used in an ebook?
			 
			
			
			Hello, 
		
	
		
		
		
		
		
		
		
		
		
		
	
	I would like to be able to look at a list of all HTML tags which have been used in an ebook, ideally with the number of occurrences. The Reports feature/tool of the editor provides something similar for CSS style rules and classes, words, links etc., but not for HTML tags. I wasn't able to find such functionality and would like to avoid using regex for this. Thanks!  | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#2 | 
| 
			
			
			
			 Wizard 
			
			![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() Posts: 1,688 
				Karma: 9500498 
				Join Date: Sep 2021 
				Location: Australia 
				
				
				Device: Kobo Libra 2 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			There is no report to generate this list. 
		
	
		
		
		
		
		
		
		
		
		
		
	
	A couple of years ago I did ask @kovidgoyal if he could implement such a list, but he felt there was no need for it, which I didn't really agree with. See here... https://www.mobileread.com/forums/sh...d.php?t=357312  | 
| 
		
 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#3 | 
| 
			
			
			
			 Junior Member 
			
			![]() Posts: 6 
				Karma: 10 
				Join Date: Mar 2024 
				
				
				
				Device: Kobo Libra 2 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Ah, a shame. At least I am not alone: the problem you describe in your post is exactly the one I'm having, and it is what motivated me to ask here. 
		
	
		
		
		
		
		
		
		
		
		
		
	
	I'm assuming you have found some kind of solution of your own for this, but just in case anyone's interested: I've built an (ugly) command chain for myself to check for such tags in my extracted EPUB files: Code: 
	grep -r -n -I -P "<[^\s/]*>" | grep -P "\.xhtml" | grep -P -v "nav\.xhtml" | grep -P -v "<head>" | grep -P -v "<title>" | grep -P -v "<body>" | grep -P -v "<p>" | grep -P -v "<li>" | grep -P -v "<tbody>" | grep -P -v "<tr>" | grep -P -v "<td>" | awk '{print $0,"\n"}' | head -n -1
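For anyone who wants the plain list with counts outside the editor and without regex, here is a minimal Python sketch, assuming the EPUB has already been extracted to a folder as above. It uses only the standard library's html.parser. The names TagCounter and count_tags are made up for this example, and skipping nav.xhtml simply mirrors the grep chain. Note that html.parser lowercases tag names and is lenient about malformed markup, so this is a rough count, not a validator.

```python
from collections import Counter
from html.parser import HTMLParser
from pathlib import Path


class TagCounter(HTMLParser):
    """Count every opening tag; self-closing tags such as <br/> count once."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Also reached for <br/>-style tags via the default handle_startendtag.
        self.counts[tag] += 1


def count_tags(root, skip=("nav.xhtml",)):
    """Return a Counter of tag names over all .xhtml files below root."""
    total = Counter()
    for path in sorted(Path(root).rglob("*.xhtml")):
        if path.name in skip:
            continue
        parser = TagCounter()
        parser.feed(path.read_text(encoding="utf-8"))
        parser.close()
        total.update(parser.counts)
    return total
```

Calling count_tags on the extracted folder and printing the sorted items gives one "tag: count" line per tag; most_common() on the result lists the most frequent tags first.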
 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#4 | 
| 
			
			
			
			 Groupie 
			
			![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() Posts: 173 
				Karma: 1497966 
				Join Date: Jul 2021 
				
				
				
				Device: N/A 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			What about a search/replace on the whole epub, using a regex function? 
		
	
		
		
		
		
		
		
		
		
		
		
		
			Find: <(\w+)
			Replace: the function below
			Do a "Replace all" so that every tag in the epub gets counted. The number of replacements shown in the dialog box is the total number of tags, but nothing is changed in the epub, because the function returns each match unmodified. Code: 
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # Final call: made once after the last match, with match set to None
    if match is None:
        if not data:
            print('No tag found')
        else:
            print(f'Found a total of {number} tags, with {len(data)} different tags\n')
            for key in sorted(data):
                print(f'{key}: {data[key]}')
        return

    # Normal call: count the tag name and leave the matched text unchanged
    tag = match[1]
    data[tag] = data.get(tag, 0) + 1
    return match[0]

replace.call_after_last_match = True    # Ask for the final call
Code: 
	Debug output from __count tags
	Found a total of 12605 tags, with 22 different tags

	a: 6
	body: 78
	br: 14
	div: 143
	em: 45
	figure: 2
	h1: 7
	h2: 64
	[etc.]
Last edited by lomkiri; 04-02-2025 at 11:05 PM.  | 
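To make the calling convention behind this function easier to follow, here is a small stand-alone sketch that imitates a "Replace all" run outside the editor. It is only an approximation under stated assumptions: run_replace_all is a made-up helper, not part of calibre, and it assumes that data is one dict shared by all calls, that number is the running match count, and that one extra call with match set to None happens after the last match when call_after_last_match is set. The arguments the function does not use are passed as None.

```python
import re


def run_replace_all(func, pattern, files):
    """Imitate a Replace all over several files (hypothetical helper)."""
    data = {}      # assumed: shared by every call of func during the run
    number = 0     # assumed: running count of matches so far
    results = {}

    def call(match, file_name):
        nonlocal number
        number += 1
        return func(match, number, file_name, None, None, data, None)

    for file_name, text in files.items():
        results[file_name] = re.sub(
            pattern, lambda m, fn=file_name: call(m, fn), text)

    # Extra final call with match=None, only if the function asks for it.
    if getattr(func, "call_after_last_match", False):
        func(None, number, None, None, None, data, None)
    return results, data


def count_tags(match, number, file_name, metadata, dictionaries, data,
               functions, *args, **kwargs):
    if match is None:              # final call: print the report
        for key in sorted(data):
            print(f"{key}: {data[key]}")
        return None
    data[match[1]] = data.get(match[1], 0) + 1
    return match[0]                # unchanged text, so nothing is modified


count_tags.call_after_last_match = True

files = {"ch1.xhtml": "<p>One<br/>two</p>", "ch2.xhtml": "<p><em>Three</em></p>"}
results, data = run_replace_all(count_tags, r"<(\w+)", files)
```

Since the pattern <(\w+) cannot match "</", closing tags are never counted, and because every match is returned as-is the texts in results are identical to the inputs.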
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#5 | 
| 
			
			
			
			 Wizard 
			
			![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() Posts: 1,688 
				Karma: 9500498 
				Join Date: Sep 2021 
				Location: Australia 
				
				
				Device: Kobo Libra 2 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Thanks @lomkiri  
		
	
		
		
		
		
		
		
		
		
		
		
	
	That is a great workaround.  | 
| 
		
 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#6 | 
| 
			
			
			
			 Junior Member 
			
			![]() Posts: 6 
				Karma: 10 
				Join Date: Mar 2024 
				
				
				
				Device: Kobo Libra 2 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Yes, thank you, I prefer your workaround.
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#7 | 
| 
			
			
			
			 Bookish 
			
			![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() Posts: 1,049 
				Karma: 2006208 
				Join Date: Jun 2011 
				
				
				
				Device: PC, t1, t2, t3, Clara BW, Clara HD, Libra 2, Libra Color, Nxtpaper 11 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			@Kovid: Maybe an addition for the already existing Reports function (Editor>Tools>Reports)?
		 
		
	
		
		
			 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#8 | 
| 
			
			
			
			 Wizard 
			
			![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() Posts: 1,688 
				Karma: 9500498 
				Join Date: Sep 2021 
				Location: Australia 
				
				
				Device: Kobo Libra 2 
				
				
				 | 
	
	|
| 
		
 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#9 | 
| 
			
			
			
			 Bookish 
			
			![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() Posts: 1,049 
				Karma: 2006208 
				Join Date: Jun 2011 
				
				
				
				Device: PC, t1, t2, t3, Clara BW, Clara HD, Libra 2, Libra Color, Nxtpaper 11 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Ah, I missed that. Well, Kovid seems to have made his point already then.
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#10 | |
| 
			
			
			
			 Groupie 
			
			![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() Posts: 173 
				Karma: 1497966 
				Join Date: Jul 2021 
				
				
				
				Device: N/A 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			@Karellen : From your other thread, I see this :  
		
	
		
		
		
		
		
		
		
		
		
		
		
			Quote: 
	
 Code: 
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    r"""
    Count the number of occurrences of every html tag in an epub.
    May be filtered by tag name and by max number of occurrences.
    Search regex: <(\w+)
    """
    # Final call (after the last match)
    if match is None:
        # Exclusions:
        # excl = ('html', 'meta', 'body', 'title', 'div', 'p')    # or () for no exclusion
        # max_it = 5    # no print if more occurrences than this. None or 0 for no limit
        excl = ()
        max_it = None
        my_tags = {k: v for k, v in data.items() if k not in excl and (not max_it or v <= max_it)}
        print(f'Found a total of {number} tags, with {len(data)} different tags')
        if len(my_tags) < len(data):
            print(f'Selected a total of {sum(my_tags.values())} tags, with {len(my_tags)} different tags')
        for key in sorted(my_tags):
            print(f'{key}: {my_tags[key]}')
        return
    # Normal call
    tag = match[1]
    data[tag] = data.get(tag, 0) + 1
    return match[0]

replace.call_after_last_match = True    # Ask for the final call
Note: I've removed the "if not data" error test, since a valid epub must contain at least one html tag.
Last edited by lomkiri; 04-04-2025 at 08:14 AM. Reason: max_it can be 0 or None for no limit of occurrences  | 
|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#11 | 
| 
			
			
			
			 Groupie 
			
			![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() Posts: 173 
				Karma: 1497966 
				Join Date: Jul 2021 
				
				
				
				Device: N/A 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Modified: "max_it" may be set to 0 or None for no limit of occurrences. 
		
	
		
		
		
		
		
		
		
		
		
		
		
			Not really necessary, but it's cleaner than putting max_it = 1000000 :-)
Last edited by lomkiri; 04-03-2025 at 03:40 PM.  | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#12 | 
| 
			
			
			
			 Wizard 
			
			![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() Posts: 1,688 
				Karma: 9500498 
				Join Date: Sep 2021 
				Location: Australia 
				
				
				Device: Kobo Libra 2 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Thanks @lomkiri, I'll check this out also.
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		
 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#13 | 
| 
			
			
			
			 Wizard 
			
			![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() Posts: 2,625 
				Karma: 3120635 
				Join Date: Jan 2009 
				
				
				
				Device: Kindle PW3 (wifi) 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Thanks @lomkiri 
		
	
		
		
		
		
		
		
		
		
		
		
		
			Your first function (the only one I tried) works fine. I added it to my saved searches. If we wish to refine further, for example to look into all these p tags, we can then use the Calibre Reports tool, which gives us the number of classes.
Last edited by roger64; 04-09-2025 at 10:38 PM.  | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#14 | ||
| 
			
			
			
			 Groupie 
			
			![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() Posts: 173 
				Karma: 1497966 
				Join Date: Jul 2021 
				
				
				
				Device: N/A 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			The two functions give exactly the same result if we keep the default parameters (excl=() and max_it=None).  
		
	
		
		
		
		
		
		
		
		
		
		
		
			The second form just gives you the ability to filter out tags of no interest, such as <title>, <html> or <body>, and/or to catch tags with few occurrences (as Karellen wished).
Quote: 

But it doesn't give the same results. For example, if you have an <img> tag in your text but not in the css, it will appear with my function but not in the report. So this function is a better fit for Karellen's needs (a misspelled tag, for example), or for when we need a raw, unfiltered list of all tags, as Urnoev was asking.
Quote: 

If incl = excl = () and max_it = 0, no filter will be applied.
Code: 
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    r"""
    Count the number of occurrences of every html tag in an epub.
    May be filtered by tag name and by max number of occurrences.
    Search regex: <(\w+)
    """
    # Final call (after the last match)
    if match is None:
        #### Parameters for filters: ####
        # No filter at all if excl = incl = () and max_it = None
        # Include only some tags (if defined, this deactivates any exclusion). E.g.:
        # incl = ('img', 'svg')    # () for no inclusion
        # Exclusion of some tags, e.g.:
        # excl = ('html', 'meta', 'body', 'title', 'div', 'p')    # () for no exclusion
        # No display if more occurrences than max_it, e.g.:
        # max_it = 5    # None or 0 for no limit
        incl = ()        # () for no filter
        excl = ()        # () for no filter
        max_it = None    # 0 or None for no filter
        sort = 'name'    # None | 'name' | 'number' (any other value will sort by name)
        reverse = False  # Reverse sorting if True
        #####
        if incl:
            my_tags = {k: v for k, v in data.items() if k in incl and (not max_it or v <= max_it)}
        else:
            my_tags = {k: v for k, v in data.items() if k not in excl and (not max_it or v <= max_it)}
        print(f'Found a total of {number} tags, with {len(data)} different tags')
        if incl and excl:
            print('You have defined inclusions AND exclusions. Only inclusions have been applied')
        if len(my_tags) < len(data):
            print(f'Selected a total of {sum(my_tags.values())} tags, with {len(my_tags)} different tags')

        if sort is None:
            ind = my_tags.keys()    # order of appearance
        elif sort.lower() == 'number':
            ind = sorted(my_tags, key=(lambda k: my_tags[k]), reverse=reverse)
        else:    # ordered by name
            ind = sorted(my_tags, reverse=reverse)

        for key in ind:
            print(f'{key} : {my_tags[key]}')
        return
    # Normal call
    tag = match[1]
    data[tag] = data.get(tag, 0) + 1
    return match[0]

replace.call_after_last_match = True    # Ask for the final call
With the default parameters, this version gives the same result as the first version.
Last edited by lomkiri; 04-10-2025 at 01:16 PM. Reason: Added: possibility to sort the result by number of occ.  | 
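The filter and sort rules of this last version can be tried out separately from the editor. The sketch below pulls them into a plain function working on a dict of counts. The name select_tags and the sample counts are invented for illustration; the parameters follow the ones in the function above (incl wins over excl, max_it of 0 or None means no limit, sort is None, 'name' or 'number').

```python
def select_tags(counts, incl=(), excl=(), max_it=None, sort="name", reverse=False):
    """Return (tag, count) pairs after filtering and ordering.

    counts: dict mapping tag name to number of occurrences.
    """
    def keep(tag):
        # Inclusions, when given, take precedence and exclusions are ignored.
        return tag in incl if incl else tag not in excl

    chosen = {t: n for t, n in counts.items()
              if keep(t) and (not max_it or n <= max_it)}

    if sort is None:
        order = list(chosen)                    # order of appearance
    elif sort.lower() == "number":
        order = sorted(chosen, key=chosen.get, reverse=reverse)
    else:
        order = sorted(chosen, reverse=reverse)  # by name
    return [(t, chosen[t]) for t in order]


# Invented sample counts, only to exercise the function.
sample = {"p": 120, "div": 14, "img": 2, "b": 1}
rare = select_tags(sample, max_it=5, sort="number")
```

Here rare holds only the tags with at most five occurrences, least frequent first, which is the kind of list that helps spot a stray or misspelled tag.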
||
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#15 | 
| 
			
			
			
			 Groupie 
			
			![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() Posts: 173 
				Karma: 1497966 
				Join Date: Jul 2021 
				
				
				
				Device: N/A 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Added a param. for sorting by name (default), by number of occurrences, or by order of appearance 
		
	
		
		
		
		
		
		
		
		
		
		
	
	Added a param. for reversed sorting.
	With the default parameters, the function behaves like the first version.  | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 