Quote:
Originally Posted by roger64
Your first function -the only one I tried
|
The two functions give exactly the same result if we keep the default parameters (excl=() and max_it=None).
The second form just adds the ability to filter out tags of no interest, such as <title>, <html> or <body>, and/or to catch tags with few occurrences (as Karellen wanted).
Quote:
we then can use the Reports Calibre tool
|
It's a great tool, and much better integrated than this regex-function, which is only a work-around. First of all it is clickable, and it points you to the definition of the tag or the class in the css. It also gives you the count for chained tags (e.g. "div, p") if they are in the css.
But it doesn't give the same results. For example, if you have an <img> tag in your text but not in the css, it will appear with my function but not in the report. So this function is a better fit for Karellen's needs (a mistyped tag, for example), or when we need a raw, dumb list of all tags, as Urnoev was asking for.
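To illustrate the point, here is a minimal sketch that runs outside Calibre: it counts every opening tag found in a raw string with the same regex, whether or not a css mentions it. The sample markup and the helper name count_tags are made up for the example.

```python
import re
from collections import Counter

TAG_RE = re.compile(r'<(\w+)')  # same search regex as the function

def count_tags(html):
    # Hypothetical helper: tally every opening tag name found in the text.
    # Closing tags like </p> are skipped because '/' is not matched by \w.
    return Counter(m[1] for m in TAG_RE.finditer(html))

# Made-up sample: the <img> tag is counted even though no css refers to it.
sample = '<html><body><p>One</p><p>Two <img src="a.png"/></p></body></html>'
counts = count_tags(sample)
for name in sorted(counts):
    print(f'{name} : {counts[name]}')
```

A mistyped tag would show up in such a list in the same way, as a name with a very low count.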
Quote:
If we wish to refine further, for example look after all these p tags
|
It is also possible to add an "incl" parameter to this function, to select only some tags. This parameter, if not empty, takes precedence over excl (if incl AND excl are both defined, only incl is considered).
If incl = excl = () and max_it is None or 0, no filter is applied.
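The precedence rule can be tried on a plain dict, outside Calibre. This is only a sketch: select_tags is a hypothetical helper (not part of the function below) and the counts are invented.

```python
def select_tags(data, incl=(), excl=(), max_it=None):
    # Hypothetical helper mirroring the filter rules described above.
    def wanted(tag):
        # A non-empty incl wins; excl is only consulted when incl is empty.
        return tag in incl if incl else tag not in excl

    # max_it of None or 0 means "no limit on the number of occurrences".
    return {tag: n for tag, n in data.items()
            if wanted(tag) and (not max_it or n <= max_it)}

counts = {'html': 1, 'body': 1, 'p': 120, 'img': 4, 'svg': 2}  # made-up counts

print(select_tags(counts))                                       # no filter
print(select_tags(counts, excl=('html', 'body')))                # exclusion only
print(select_tags(counts, incl=('img', 'svg'), excl=('img',)))   # excl is ignored
print(select_tags(counts, max_it=5))                             # rare tags only
```

In the third call 'img' is kept even though it is also listed in excl, because incl has precedence.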
Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    r"""
    Count the number of occurrences of every html tag in an epub.
    May be filtered by tag name and by max number of occurrences.
    search regex: <(\w+)
    """
    # last passage
    if match is None:
        #### Parameters for filters: ####
        # No filter at all if excl = incl = () and max_it is None or 0
        # Include only some tags (if defined, will deactivate any exclusion). E.g.:
        #   incl = ('img', 'svg')    # () for no inclusion
        # Exclusion of some tags, e.g.:
        #   excl = ('html', 'meta', 'body', 'title', 'div', 'p')    # () for no exclusion
        # No display if more occurrences than max_it, e.g.:
        #   max_it = 5    # None or 0 for no limit
        incl = ()           # () for no filter
        excl = ()           # () for no filter
        max_it = None       # 0 or None for no filter
        sort = 'name'       # None | 'name' | 'number' (any other value will sort by name)
        reverse = False     # Reverse sorting if True
        #####
        if incl:
            my_tags = {k: v for k, v in data.items() if k in incl and (not max_it or v <= max_it)}
        else:
            my_tags = {k: v for k, v in data.items() if k not in excl and (not max_it or v <= max_it)}
        print(f'Found a total of {number} tags, with {len(data)} different tags')
        if incl and excl:
            print('You have defined inclusions AND exclusions. Only inclusions have been treated')
        if len(my_tags) < len(data):
            print(f'Selected a total of {sum(my_tags.values())} tags, with {len(my_tags)} different tags')
        if sort is None:
            ind = my_tags.keys()    # order of appearance
        elif sort.lower() == 'number':
            ind = sorted(my_tags, key=(lambda k: my_tags[k]), reverse=reverse)
        else:    # ordered by name
            ind = sorted(my_tags, reverse=reverse)
        for key in ind:
            print(f'{key} : {my_tags[key]}')
        return
    # normal passage: count the tag and leave the text unchanged
    tag = match[1]
    data[tag] = data.get(tag, 0) + 1
    return match[0]

replace.call_after_last_match = True    # Ask for last passage
Edit: I've added 2 parameters to sort the output by name or by number, reversed or not.
With the default parameters, this version gives the same result as the first version.
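For anyone who wants to check the sort options without opening a book, the ordering step can be sketched on its own. order_tags is a hypothetical helper written for this example and the counts are invented; it follows the same three branches as the function above.

```python
def order_tags(tags, sort='name', reverse=False):
    # Hypothetical helper: returns the tag names in display order.
    if sort is None:
        return list(tags)  # order of appearance (dict insertion order)
    if sort.lower() == 'number':
        return sorted(tags, key=tags.get, reverse=reverse)
    return sorted(tags, reverse=reverse)  # any other value: sort by name

counts = {'p': 120, 'html': 1, 'img': 4}  # made-up counts

print(order_tags(counts, sort=None))                    # ['p', 'html', 'img']
print(order_tags(counts))                               # ['html', 'img', 'p']
print(order_tags(counts, sort='number'))                # ['html', 'img', 'p']
print(order_tags(counts, sort='number', reverse=True))  # ['p', 'img', 'html']
```

As in the function, reverse has no effect when sort is None.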