Old 04-02-2025, 09:47 AM   #1
Urnoev
Junior Member
Posts: 6
Karma: 10
Join Date: Mar 2024
Device: Kobo Libra 2
Is there a way to see all HTML tags used in an ebook?

Hello,

I would like to be able to look at a list of all HTML tags which have been used in an ebook, ideally with the number of occurrences.
The Reports feature/tool of the editor provides something similar for CSS style rules and classes, words, links etc., but not for HTML tags.

I wasn't able to find such functionality and would like to avoid using regex for this.

Thanks!
Old 04-02-2025, 02:30 PM   #2
Karellen
Wizard
Posts: 1,611
Karma: 9500498
Join Date: Sep 2021
Location: Australia
Device: Kobo Libra 2
There is no report to generate this list.
A couple of years ago I did ask @kovidgoyal if he could implement such a list, but he felt there was no need for it, which I didn't really agree with.
See here... https://www.mobileread.com/forums/sh...d.php?t=357312
Old 04-02-2025, 02:38 PM   #3
Urnoev
Ah, a shame. At least I am not alone: the problem you describe in your post is exactly the one I'm having, and it is what motivated me to ask here.

I'm assuming you have found some kind of solution of your own for this, but just in case anyone's interested:
I've built an (ugly) command chain for myself to check for such tags in my extracted EPUB files:
Code:
grep -r -n -I -P "<[^\s/]*>" | grep -P "\.xhtml" | grep -P -v "nav\.xhtml" | grep -P -v "<head>" | grep -P -v "<title>" | grep -P -v "<body>" | grep -P -v "<p>" | grep -P -v "<li>" | grep -P -v "<tbody>" | grep -P -v "<tr>" | grep -P -v "<td>" | awk '{print $0,"\n"}' | head -n -1
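For anyone who would rather avoid regular expressions altogether, the same kind of tag inventory can be sketched with Python's standard library. This is only a minimal sketch, not something posted in the thread: the names `TagCounter`, `count_tags_in_text` and `count_tags_in_epub` are made up for illustration, and it assumes the book's content files end in .xhtml, .html or .htm and are UTF-8 encoded.

```python
import zipfile
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count every opening tag the parser sees (names are lowercased)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <br/>.
        self.counts[tag] += 1


def count_tags_in_text(markup):
    """Return a Counter of tag names found in one HTML/XHTML string."""
    parser = TagCounter()
    parser.feed(markup)
    parser.close()
    return parser.counts


def count_tags_in_epub(epub_path, suffixes=('.xhtml', '.html', '.htm')):
    """Sum the tag counts over all content files of an EPUB (a zip archive)."""
    total = Counter()
    with zipfile.ZipFile(epub_path) as book:
        for name in book.namelist():
            if name.lower().endswith(suffixes):
                text = book.read(name).decode('utf-8', errors='replace')
                total += count_tags_in_text(text)
    return total
```

Calling `count_tags_in_epub('book.epub').most_common()` (with a hypothetical file name) would then list the tags from most to least frequent, without extracting the EPUB first.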
Old 04-02-2025, 09:51 PM   #4
lomkiri
Groupie
Posts: 167
Karma: 1497966
Join Date: Jul 2021
Device: N/A
What about a search/replace on the whole epub, using a regex-function?

find: <(\w+)
replace: the function below
Do a "Replace all", so you'll get all the tags of the epub.
The number of replacements shown in the dialog box is the total number of tags, but no change is made to the epub.

Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):

    # Final call, made once after the last match (match is None)
    if match is None:
        if not data:
            print('No tag found')
        else:
            print(f'Found a total of {number} tags, with {len(data)} different tags\n')
            for key in sorted(data):
                print(f'{key}: {data[key]}')
        return

    # Normal call, once per match: count the tag and leave the text unchanged
    tag = match[1]
    data[tag] = data.get(tag, 0) + 1
    return match[0]

replace.call_after_last_match = True    # Ask for the final call
The result will be:
Code:
Debug output from __count tags

Found a total of 12605 tags, with 22 different tags

a: 6
body: 78
br: 14
div: 143
em: 45
figure: 2
h1: 7
h2: 64
[etc.]
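To try the counting logic outside the editor, the calls can be imitated with plain `re`. This is a rough sketch under assumptions: `run_like_replace_all` is a hypothetical stand-in for the editor's "Replace all" loop, modelled only on what is described above (one call per match, then one final call with `match` set to `None`), and the final call returns the counts instead of printing them so they can be inspected.

```python
import re


def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # Same counting logic as above; the final call returns the sorted counts.
    if match is None:
        return dict(sorted(data.items()))
    tag = match[1]
    data[tag] = data.get(tag, 0) + 1
    return match[0]


def run_like_replace_all(files, pattern=r'<(\w+)'):
    """Hypothetical stand-in for "Replace all" over several files.

    files maps a file name to its text. Returns (total matches, counts).
    """
    data = {}      # shared between calls, like the editor's data argument
    number = 0     # running match number

    for file_name, text in files.items():
        def call(match):
            nonlocal number
            number += 1
            return replace(match, number, file_name, None, None, data, None)

        # The function returns match[0], so the text itself is left unchanged.
        re.sub(pattern, call, text)

    # Extra call after the last match, as requested by call_after_last_match.
    return number, replace(None, number, None, None, None, data, None)
```

Because the pattern only matches `<` followed by word characters, closing tags such as `</p>` are not counted.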

Last edited by lomkiri; 04-02-2025 at 10:05 PM.
Old 04-03-2025, 01:46 AM   #5
Karellen
Thanks @lomkiri
That is a great workaround.
Old 04-03-2025, 04:43 AM   #6
Urnoev
Yes, thank you, I prefer your workaround.
Old 04-03-2025, 04:45 AM   #7
DrChiper
Bookish
Posts: 1,017
Karma: 2003162
Join Date: Jun 2011
Device: PC, t1, t2, t3, Clara BW, Clara HD, Libra 2, Libra Color, Nxtpaper 11
@Kovid: Maybe an addition for the already existing Reports function (Editor>Tools>Reports)?
Attached thumbnail: Schermafbeelding 2025-04-03 104036.png (42.7 KB)
Old 04-03-2025, 05:30 AM   #8
Karellen
Quote:
Originally Posted by DrChiper View Post
@Kovid: Maybe an addition for the already existing Reports function (Editor>Tools>Reports)?
Yea, I already asked for that. Would be nice...
See my link in the second post of this thread.
Old 04-03-2025, 05:36 AM   #9
DrChiper
Ah, I missed that. Well, Kovid seems to have made his point already then.
Old 04-03-2025, 07:48 AM   #10
lomkiri
@Karellen: From your other thread, I see this:
Quote:
I am surprised that you think there is too little use for a tag report, as users can very quickly spot tags that need investigating. Personally, I have been caught out so many times- that tag which is used once, and makes you wonder why it is even there.
In that case, you can easily print only the tags that have fewer than x occurrences, or keep an exclusion list for common tags such as body, div, p, and so on.

Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    r"""
    Count the number of occurrences for every html tag in an epub
    May be filtered by tag name and by max number of occ.
    search regex: <(\w+)
    """
    # Final call, made once after the last match (match is None)
    if match is None:

        # Exclusions:
        # excl = ('html', 'meta', 'body', 'title', 'div', 'p')    # or () for no exclusion
        # max_it = 5    # no print if more occ than this. None or 0 for no limit
        excl = ()
        max_it = None

        # Keep the tags that are not excluded and do not exceed max_it
        my_tags = {k: v for k, v in data.items() if k not in excl and (not max_it or v <= max_it)}
        print(f'Found a total of {number} tags, with {len(data)} different tags')
        if len(my_tags) < len(data):
            print(f'Selected a total of {sum(my_tags.values())} tags, with {len(my_tags)} different tags')
        for key in sorted(my_tags):
            print(f'{key}: {my_tags[key]}')
        return

    # Normal call, once per match: count the tag and leave the text unchanged
    tag = match[1]
    data[tag] = data.get(tag, 0) + 1
    return match[0]

replace.call_after_last_match = True    # Ask for the final call
Note: You can add a "min_it" filter in the same way, although I don't see a use for it.
Note: I've removed the "if not data" error test, since a valid epub must contain at least one html tag.
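As an illustration of the "min_it" idea, the selection step could be pulled into a small helper. This is just a sketch: `select_tags` and its `min_it` argument are hypothetical names, not part of the function above, and `min_it` follows the same convention as `max_it` (None or 0 disables the bound).

```python
def select_tags(counts, excl=(), min_it=None, max_it=None):
    """Return the tags that pass the name filter and both occurrence bounds.

    counts maps a tag name to its number of occurrences.
    None or 0 for min_it / max_it means "no limit" on that side.
    """
    return {
        tag: n
        for tag, n in counts.items()
        if tag not in excl
        and (not min_it or n >= min_it)
        and (not max_it or n <= max_it)
    }


# Example: rarely used tags only, ignoring <body>
counts = {'p': 500, 'div': 143, 'figure': 2, 'h1': 7, 'body': 78}
rare = select_tags(counts, excl=('body',), max_it=10)
print(rare)
```

Inside the regex-function, the dictionary comprehension that builds `my_tags` would be the place to apply the same extra condition.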

Last edited by lomkiri; 04-04-2025 at 07:14 AM. Reason: max_it can be 0 or None for no limit of occurrences
Old 04-03-2025, 02:36 PM   #11
lomkiri
Modified: "max_it" may be set to 0 or None for no limit of occurrences.
Not really necessary, but it's cleaner than putting max_it = 1000000 :-)

Last edited by lomkiri; 04-03-2025 at 02:40 PM.
Old 04-04-2025, 03:16 PM   #12
Karellen
Thanks @lomkiri, I'll check this out also.
Old 04-09-2025, 09:30 PM   #13
roger64
Wizard
Posts: 2,624
Karma: 3120635
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
Thanks @lomkiri

Your first function -the only one I tried- works quite fine. I added it to my saved searches.

If we wish to refine further, for example to look into all these p tags, we can then use the calibre Reports tool, which gives us the number of classes.

Last edited by roger64; 04-09-2025 at 09:38 PM.
Old 04-10-2025, 08:04 AM   #14
lomkiri
Quote:
Originally Posted by roger64 View Post
Your first function -the only one I tried
The two functions give exactly the same result if we keep the default parameters (excl=() and max_it=None).

The second form just lets you filter out tags of no interest, such as <title>, <html> or <body>, and/or catch tags with few occurrences (as Karellen wished).

Quote:
we then can use the Reports Calibre tool
It's a great tool, and much better integrated than this regex-function, which is only a workaround. For a start, it is clickable, and points you to the definition of the tag or the class in the css. It also gives you the count for combinations of tags (e.g. "div, p") if they appear in the css.

But it doesn't give the same results. For example, if you have an <img> tag in your text but not in the css, it will appear with my function but not in the report. Thus, this function better fits Karellen's needs (a mistyped tag, for example), or the need for a raw, plain list of all tags, as Urnoev was asking for.

Quote:
If we wish to refine further, for example look after all these p tags
It is also possible to add an "incl" parameter to this function, to select only some tags. This parameter, if not empty, takes precedence over excl (if incl AND excl are both defined, only incl is considered).

If incl = excl = () and max_it = 0, no filter will be applied.

Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    r"""
    Count the number of occurrences for every html tag in an epub
    May be filtered by tag name and by max number of occ.
    search regex: <(\w+)
    """
    # Final call, made once after the last match (match is None)
    if match is None:

        #### Parameters for Filters: ####
        # No filter at all if excl = incl = (), and if max_it = None
        # Include only some tags (if defined, will deactivate any exclusion). E.g.:
        # incl = ('img', 'svg')    # () for no inclusion
        # Exclusion of some tags, e.g.:
        # excl = ('html', 'meta', 'body', 'title', 'div', 'p')    # () for no exclusion
        # no display if more occ than max_it, e.g.:
        # max_it = 5    # None or 0 for no limit

        incl = ()        # () for no filter
        excl = ()        # () for no filter
        max_it = None    # 0 or None for no filter
        sort = 'name'    # None | 'name' | 'number' (any other value will sort by name)
        reverse = False  # Reverse sorting if True
        #####

        if incl:
            my_tags = {k: v for k, v in data.items() if k in incl and (not max_it or v <= max_it)}
        else:
            my_tags = {k: v for k, v in data.items() if k not in excl and (not max_it or v <= max_it)}

        print(f'Found a total of {number} tags, with {len(data)} different tags')
        if incl and excl:
            print('You have defined inclusions AND exclusions. Only inclusions have been applied')
        if len(my_tags) < len(data):
            print(f'Selected a total of {sum(my_tags.values())} tags, with {len(my_tags)} different tags')

        if sort is None:
            ind = my_tags.keys()    # order of appearance
        elif sort.lower() == 'number':
            ind = sorted(my_tags, key=(lambda k: my_tags[k]), reverse=reverse)
        else:    # ordered by name
            ind = sorted(my_tags, reverse=reverse)

        for key in ind:
            print(f'{key} : {my_tags[key]}')
        return

    # Normal call, once per match: count the tag and leave the text unchanged
    tag = match[1]
    data[tag] = data.get(tag, 0) + 1
    return match[0]

replace.call_after_last_match = True    # Ask for the final call
Edit: I've added 2 parameters to sort the output by name or number, reversed or not.
With the default parameters, this version gives the same result as the first version.
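The three sort orders can also be tried in isolation on a plain list of tag names. The following is a small sketch using `collections.Counter`; the helper name `ordered_tags` is made up for illustration, and, as in the function above, `reverse` is ignored when `sort` is None.

```python
from collections import Counter


def ordered_tags(tags, sort='name', reverse=False):
    """Return (tag, count) pairs in the requested order.

    sort: None for order of first appearance, 'number' for number of
    occurrences, anything else for alphabetical order by name.
    """
    counts = Counter(tags)    # a Counter remembers first-seen order, like a dict
    if sort is None:
        keys = list(counts)
    elif sort.lower() == 'number':
        keys = sorted(counts, key=counts.get, reverse=reverse)
    else:
        keys = sorted(counts, reverse=reverse)
    return [(key, counts[key]) for key in keys]


tags = ['p', 'div', 'p', 'em', 'p', 'div']
for name, n in ordered_tags(tags, sort='number', reverse=True):
    print(f'{name} : {n}')
```

Sorting by number with `reverse=False` puts the rarest tags first, which is the order most useful for spotting a tag that is used only once.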

Last edited by lomkiri; 04-10-2025 at 12:16 PM. Reason: Added: possibility to sort the result by number of occ.
Old 04-10-2025, 12:20 PM   #15
lomkiri
Added a param. for sorting by name (default), by number of occurrences, or by order of appearance
Added a param. for reversed sorting
With the default parameters, the function behaves as the first version.