#1 |
Junior Member
Posts: 6
Karma: 10
Join Date: Mar 2024
Device: Kobo Libra 2
|
Is there a way to see all HTML tags used in an ebook?
Hello,
I would like to see a list of all HTML tags that have been used in an ebook, ideally with the number of occurrences. The Reports tool of the editor provides something similar for CSS style rules and classes, words, links, etc., but not for HTML tags. I wasn't able to find such functionality and would like to avoid using regex for this. Thanks! |
#2 |
Wizard
Posts: 1,611
Karma: 9500498
Join Date: Sep 2021
Location: Australia
Device: Kobo Libra 2
|
There is no report to generate this list.
A couple of years ago I did ask @kovidgoyal if he could implement such a list, but he felt there was no need for it, which I didn't really agree with. See here... https://www.mobileread.com/forums/sh...d.php?t=357312 |
#3 |
Junior Member
Posts: 6
Karma: 10
Join Date: Mar 2024
Device: Kobo Libra 2
|
Ah, a shame. At least I am not alone: the problem you describe in your post is exactly the one I'm having, and it is what motivated me to ask here.
I'm assuming you have found some kind of solution of your own for this, but just in case anyone's interested: I've built an (ugly) command chain for myself to check for such tags in my extracted EPUB files:
Code:
# list opening tags without attributes in the .xhtml files, skipping nav.xhtml and the most common tags
grep -r -n -I -P "<[^\s/]*>" |
  grep -P "\.xhtml" |
  grep -P -v "nav\.xhtml" |
  grep -P -v "<head>" |
  grep -P -v "<title>" |
  grep -P -v "<body>" |
  grep -P -v "<p>" |
  grep -P -v "<li>" |
  grep -P -v "<tbody>" |
  grep -P -v "<tr>" |
  grep -P -v "<td>" |
  awk '{print $0,"\n"}' |
  head -n -1
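For anyone who wants per-tag counts from extracted EPUB files without a regex, here is a minimal sketch using only Python's standard library. The names `TagCounter` and `count_tags` are made up for this illustration, and it assumes the content files are UTF-8 encoded `.xhtml` files. Note that `html.parser` is a lenient HTML parser rather than an XML validator, and it reports tag names in lowercase. Unlike the command chain above, it counts every tag in every `.xhtml` file, `nav.xhtml` included.

```python
from collections import Counter
from html.parser import HTMLParser
from pathlib import Path


class TagCounter(HTMLParser):
    """Count every opening tag (including self-closing ones) fed to the parser."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <br/> also end up here, because the default
        # handle_startendtag() calls handle_starttag().
        self.counts[tag] += 1


def count_tags(root):
    """Return a Counter of tag names over all .xhtml files below `root`."""
    total = Counter()
    for path in sorted(Path(root).rglob("*.xhtml")):
        parser = TagCounter()
        parser.feed(path.read_text(encoding="utf-8"))
        parser.close()
        total += parser.counts
    return total
```

Calling `count_tags(".")` from the folder where the EPUB was extracted returns the counts; something like `for tag, n in sorted(count_tags(".").items()): print(f"{tag}: {n}")` prints them one per line.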
#4 |
Groupie
Posts: 167
Karma: 1497966
Join Date: Jul 2021
Device: N/A
|
What about a search/replace on the whole epub, using a regex function?
Find: <(\w+)
Replace: the function below
Do a "Replace all" and you'll get all the tags of the epub. The number of replacements shown in the dialog box is the total number of tags, but nothing is changed in the epub, because the function returns each match unmodified.
Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # final call, made once after the last match
    if match is None:
        if not data:
            print('No tag found')
        else:
            print(f'Found a total of {number} tags, with {len(data)} different tags\n')
            for key in sorted(data):
                print(f'{key}: {data[key]}')
        return

    # normal call: count the tag and put the matched text back unchanged
    tag = match[1]
    data[tag] = data.get(tag, 0) + 1
    return match[0]

replace.call_after_last_match = True    # ask for the final call
Code:
Debug output from __count tags
Found a total of 12605 tags, with 22 different tags

a: 6
body: 78
br: 14
div: 143
em: 45
figure: 2
h1: 7
h2: 64
[etc.]
Last edited by lomkiri; 04-02-2025 at 10:05 PM. |
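To see why the text stays unchanged while the counts accumulate, the function can be tried outside the editor with a small test harness. This is only a sketch: `simulate_replace_all` is a hypothetical stand-in written for this illustration, not calibre's real code, and it assumes that `data` is one dict shared by all calls, that `number` is a running match counter, and that the extra final call receives `match=None`. The sample file contents are made up.

```python
import re


def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    if match is None:                        # extra call after the last match
        print(f'Found a total of {number} tags, with {len(data)} different tags')
        for key in sorted(data):
            print(f'{key}: {data[key]}')
        return
    data[match[1]] = data.get(match[1], 0) + 1
    return match[0]                          # put the matched text back unchanged


def simulate_replace_all(func, pattern, files):
    """Hypothetical stand-in for "Replace all"; NOT calibre's real implementation."""
    data = {}       # shared by every call, like the `data` argument
    number = 0      # running match counter
    result = {}
    for name, text in files.items():
        def on_match(m, name=name):
            nonlocal number
            number += 1
            return func(m, number, name, None, None, data, None)
        result[name] = re.sub(pattern, on_match, text)
    func(None, number, None, None, None, data, None)   # the extra final call
    return result, data, number


# Made-up sample content standing in for two files of an epub
files = {"a.xhtml": "<p>one</p><p>two<br/></p>", "b.xhtml": "<div><p>x</p></div>"}
out, data, number = simulate_replace_all(replace, r"<(\w+)", files)
```

Closing tags such as `</p>` are not counted, because `<(\w+)` needs a word character right after the `<`.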
#5 |
Wizard
Posts: 1,611
Karma: 9500498
Join Date: Sep 2021
Location: Australia
Device: Kobo Libra 2
|
Thanks @lomkiri
That is a great workaround. |
#6 |
Junior Member
Posts: 6
Karma: 10
Join Date: Mar 2024
Device: Kobo Libra 2
|
Yes, thank you, I prefer your workaround.
|
#7 |
Bookish
Posts: 1,017
Karma: 2003162
Join Date: Jun 2011
Device: PC, t1, t2, t3, Clara BW, Clara HD, Libra 2, Libra Color, Nxtpaper 11
|
@Kovid: Maybe an addition for the already existing Reports function (Editor>Tools>Reports)?
|
#8 |
Wizard
Posts: 1,611
Karma: 9500498
Join Date: Sep 2021
Location: Australia
Device: Kobo Libra 2
|
|
#9 |
Bookish
Posts: 1,017
Karma: 2003162
Join Date: Jun 2011
Device: PC, t1, t2, t3, Clara BW, Clara HD, Libra 2, Libra Color, Nxtpaper 11
|
Ah, I missed that. Well, Kovid seems to have made his point already then.
|
#10 | |
Groupie
Posts: 167
Karma: 1497966
Join Date: Jul 2021
Device: N/A
|
@Karellen: From your other thread, I see this:
Quote:
Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    r"""
    Count the number of occurrences of every html tag in an epub.
    May be filtered by tag name and by max number of occurrences.
    Search regex: <(\w+)
    """
    # final call, made once after the last match
    if match is None:
        # Exclusions:
        # excl = ('html', 'meta', 'body', 'title', 'div', 'p')   # or () for no exclusion
        # max_it = 5     # no print if more occurrences than this. None or 0 for no limit
        excl = ()
        max_it = None

        my_tags = {k: v for k, v in data.items() if k not in excl and (not max_it or v <= max_it)}
        print(f'Found a total of {number} tags, with {len(data)} different tags')
        if len(my_tags) < len(data):
            print(f'Selected a total of {sum(my_tags.values())} tags, with {len(my_tags)} different tags')
        for key in sorted(my_tags):
            print(f'{key}: {my_tags[key]}')
        return

    # normal call: count the tag and put the matched text back unchanged
    tag = match[1]
    data[tag] = data.get(tag, 0) + 1
    return match[0]

replace.call_after_last_match = True    # ask for the final call
Note: I've removed the "if not data" error test, since a valid epub must contain at least one html tag.
Last edited by lomkiri; 04-04-2025 at 07:14 AM. Reason: max_it can be 0 or None for no limit of occurrences |
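The filtering step can be tried on its own. In this small sketch, `select_tags` is a helper name invented for the illustration; it wraps the same dict comprehension as the function above, and the counts are the ones from the sample debug output earlier in the thread.

```python
def select_tags(counts, excl=(), max_it=None):
    """Keep tags that are not excluded and occur at most max_it times.

    max_it = None or 0 means "no limit", as in the function above.
    """
    return {k: v for k, v in counts.items()
            if k not in excl and (not max_it or v <= max_it)}


# Counts taken from the sample debug output posted earlier in the thread
counts = {'a': 6, 'body': 78, 'br': 14, 'div': 143, 'em': 45, 'figure': 2, 'h1': 7, 'h2': 64}

rare = select_tags(counts, excl=('body', 'div'), max_it=10)
print(rare)   # {'a': 6, 'figure': 2, 'h1': 7}

everything = select_tags(counts)   # default parameters: nothing is filtered out
```

Because `not max_it` is true for both `None` and `0`, either value switches the occurrence limit off, which is what makes `max_it = 1000000` unnecessary.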
#11 |
Groupie
Posts: 167
Karma: 1497966
Join Date: Jul 2021
Device: N/A
|
Modified: "max_it" may be set to 0 or None for no limit of occurrences.
Not really necessary, but it's cleaner than putting max_it = 1000000 :-) Last edited by lomkiri; 04-03-2025 at 02:40 PM. |
#12 |
Wizard
Posts: 1,611
Karma: 9500498
Join Date: Sep 2021
Location: Australia
Device: Kobo Libra 2
|
Thanks @lomkiri, I'll check this out also.
|
#13 |
Wizard
Posts: 2,624
Karma: 3120635
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
|
Thanks @lomkiri
Your first function (the only one I tried) works fine. I added it to my saved searches. If we wish to refine further, for example to look into all those p tags, we can then use calibre's Reports tool, which gives us the number of classes. Last edited by roger64; 04-09-2025 at 09:38 PM. |
#14 | ||
Groupie
Posts: 167
Karma: 1497966
Join Date: Jul 2021
Device: N/A
|
The two functions give exactly the same result if we keep the default parameters (excl=() and max_it=None).
The second form just gives you the ability to filter out tags of no interest, such as <title>, <html> or <body>, and/or to catch tags with few occurrences (as Karellen wished). Quote:
But it doesn't give the same results. For example, if you have an <img> tag in your text but not in the css, it will appear with my function but not in the report. So this function is a better fit for Karellen's needs (finding a wrongly written tag, for example), or for anyone who needs a plain, raw list of all tags, as Urnoev was asking. Quote:
If incl = excl = () and max_it = 0, no filter will be applied. Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    r"""
    Count the number of occurrences of every html tag in an epub.
    May be filtered by tag name and by max number of occurrences.
    Search regex: <(\w+)
    """
    # final call, made once after the last match
    if match is None:
        #### Parameters for filters: ####
        # No filter at all if excl = incl = () and max_it = None
        # Include only some tags (if defined, this deactivates any exclusion), e.g.:
        # incl = ('img', 'svg')    # () for no inclusion
        # Exclude some tags, e.g.:
        # excl = ('html', 'meta', 'body', 'title', 'div', 'p')   # () for no exclusion
        # No display if more occurrences than max_it, e.g.:
        # max_it = 5     # None or 0 for no limit
        incl = ()           # () for no filter
        excl = ()           # () for no filter
        max_it = None       # 0 or None for no filter
        sort = 'name'       # None | 'name' | 'number' (any other value will sort by name)
        reverse = False     # Reverse sorting if True
        #####

        if incl:
            my_tags = {k: v for k, v in data.items() if k in incl and (not max_it or v <= max_it)}
        else:
            my_tags = {k: v for k, v in data.items() if k not in excl and (not max_it or v <= max_it)}

        print(f'Found a total of {number} tags, with {len(data)} different tags')
        if incl and excl:
            print('You have defined inclusions AND exclusions. Only inclusions have been treated')
        if len(my_tags) < len(data):
            print(f'Selected a total of {sum(my_tags.values())} tags, with {len(my_tags)} different tags')

        if sort is None:
            ind = my_tags.keys()     # order of appearance
        elif sort.lower() == 'number':
            ind = sorted(my_tags, key=(lambda k: my_tags[k]), reverse=reverse)
        else:       # ordered by name
            ind = sorted(my_tags, reverse=reverse)

        for key in ind:
            print(f'{key} : {my_tags[key]}')
        return

    # normal call: count the tag and put the matched text back unchanged
    tag = match[1]
    data[tag] = data.get(tag, 0) + 1
    return match[0]

replace.call_after_last_match = True    # ask for the final call
With the default parameters, this version gives the same result as the first version.
Last edited by lomkiri; 04-10-2025 at 12:16 PM.
Reason: Added: possibility to sort the result by number of occ. |
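The three sort orders can be compared on a small dict. This is only an illustration: `order_tags` is a helper name made up here that mirrors the `sort` / `reverse` branch of the function above, and the counts are invented sample values, stored in their order of first appearance.

```python
def order_tags(counts, sort='name', reverse=False):
    """Return the tag names in display order, mirroring the sort/reverse parameters."""
    if sort is None:
        return list(counts)                  # order of appearance (dict insertion order)
    if sort.lower() == 'number':
        return sorted(counts, key=lambda k: counts[k], reverse=reverse)
    return sorted(counts, reverse=reverse)   # by name; also the fallback for other values


# Invented sample counts, in the order the tags were first met
counts = {'p': 40, 'div': 143, 'a': 6, 'em': 45}

print(order_tags(counts, sort=None))                    # ['p', 'div', 'a', 'em']
print(order_tags(counts, sort='name'))                  # ['a', 'div', 'em', 'p']
print(order_tags(counts, sort='number'))                # ['a', 'p', 'em', 'div']
print(order_tags(counts, sort='number', reverse=True))  # ['div', 'em', 'p', 'a']
```

Sorting by number puts the rarest tags first, which is convenient when hunting for a mistyped tag that occurs only once or twice. Note that `reverse` has no effect when `sort` is `None`.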
#15 |
Groupie
Posts: 167
Karma: 1497966
Join Date: Jul 2021
Device: N/A
|
Added a parameter for sorting by name (default), by number of occurrences, or by order of appearance.
Added a parameter for reversed sorting.
With the default parameters, the function behaves like the first version. |