Old 02-21-2025, 05:13 AM   #76
moldy
Enthusiast
moldy began at the beginning.
 
Posts: 43
Karma: 10
Join Date: Oct 2015
Device: Kindle
Quote:
Originally Posted by maxthegold View Post
Slight problem with the code on the previous post, here is the corrected version.

Code:
        # Do it again for capitalised words    
        m, o = regex.subn(rgx.format(key.capitalize()), new_name.capitalize(), m)
        if do_count and o:
            if n == 0:
                data['counters'][key] = 0
            data['total'] += o
            data['counters'][key] = data['counters'].get(key) + o
Rookie error, not used to Python.
Thanks for your input on this maxthegold. Your improvement saves a lot of time when adding entries to my list and now I only have to run the one function.
Old 02-22-2025, 04:27 PM   #77
lomkiri
Groupie
lomkiri ought to be getting tired of karma fortunes by now.
Posts: 167
Karma: 1497966
Join Date: Jul 2021
Device: N/A
To maxthegold and moldy:

Done that way, capitalize() is called many times and two more tests are added inside the loop. It is more efficient to modify the dict directly, automatically adding the capitalized form of each word to it. That way the function is called only once per word, at initialization, which is quicker on big ebooks.

And, as I made capitalization a parameter (True by default), I prefer to put this test in the init rather than in the loop, for the same reason.

Furthermore, if "Paul" (for example) is in your dict, your code will search for it twice ("Paul" == "Paul".capitalize()), which is not the case in the code I propose.

Of course, the regex for those words, if provided, will be the same for the lower-case and the capitalized forms. If you don't want that for some words, put both forms explicitly in the dict.

I have modified the code in message 63 to include this feature. I also added a counter giving the number of entries in the dict, including the capitalized words.
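
A minimal sketch of the idea described above, assuming the replacements live in a plain dict of word -> replacement. The helper name expand_with_capitalized is only illustrative, and this is not the actual code of message 63:

```python
def expand_with_capitalized(words, do_capitalize=True):
    """Return a copy of `words` that also holds the capitalized form of each entry.

    Hypothetical helper: it runs once at initialization, so capitalize()
    is never called inside the per-match loop.
    """
    expanded = dict(words)
    if not do_capitalize:
        return expanded
    for key, new_name in words.items():
        cap_key = key.capitalize()
        # Skip words that are already capitalized ("Paul") and forms the user
        # has put explicitly in the dict, so nothing is searched twice.
        if cap_key != key and cap_key not in words:
            expanded[cap_key] = new_name.capitalize()
    return expanded


# Sample entries, invented for the example
words = {'paul': 'keith', 'Paul': 'Keith R.', 'John': 'Mick', 'ringo': 'charlie'}
expanded = expand_with_capitalized(words)
print(len(words), 'entries ->', len(expanded), 'entries')
```

Note that str.capitalize() also lower-cases the rest of the word, so a name with inner capitals would still need both forms written explicitly in the dict.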

Last edited by lomkiri; 02-22-2025 at 07:51 PM.
Old 02-28-2025, 06:59 AM   #78
moldy
Enthusiast
moldy began at the beginning.
 
Posts: 43
Karma: 10
Join Date: Oct 2015
Device: Kindle
Thanks once again for your input on this lomkiri (and also to maxthegold).
I'm away for 2 weeks so can't try it out but I'll let you know how it goes when I get back home.
Old 04-11-2025, 09:49 PM   #79
lomkiri
Groupie
lomkiri ought to be getting tired of karma fortunes by now.
Posts: 167
Karma: 1497966
Join Date: Jul 2021
Device: N/A
Display the number of occurrences of each html tag in all text files

Original discussions here and here.

In those discussions, a feature was requested for calibre: showing the number of occurrences of each html tag in the text files. The "Reports" tool doesn't give this information if a tag is not in the css. As Kovid thinks it's not a useful feature, I proposed a simple workaround with a regex function; it is not as practical as the "Reports" tool, of course. I publish it here as it may be of some use to others.

I give two versions. The first is very simple and gives the result for all tags found, ordered by name; its logic is easy to follow.
The second is slightly more complex, with some parameters (filters on tags or on a maximum count, choice of sort order). With the default parameter values, it gives the same results as the simpler one.
The parameter "max_it" makes it easier to locate erroneous tag names (see the example given by Karellen in his feature request to Kovid).

find : <(\w+)
replace : one of the two functions below (prefer the second one, more powerful)
Click on "Replace all", so you'll get all the tags of the epub.
The dialog box reports a number of modifications (which is the total number of tags found), but the files are not modified (although the "save epub" button will be enabled).

The bare function, without parameters (will print all occ. for all tags):
Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    r"""
    Count the number of occurrences of every html tag in an epub
    search regex: <(\w+)
    """

    # last passage: called once with match=None after the last match
    if match is None:
        print(f'Found a total of {number} tags, with {len(data)} different tags\n')
        for key in sorted(data):
            print(f'{key}: {data[key]}')
        return

    # normal passage: count the tag and leave the text unchanged
    tag = match[1]
    data[tag] = data.get(tag, 0) + 1
    return match[0]

replace.call_after_last_match = True    # Ask for last passage
The same function, but with some possibility of filter and sorting:
Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    r"""
    Count the number of occurrences of every html tag in an epub
    May be filtered by tag name and by max number of occ.
    search regex: <(\w+)
    """

    def plural(word, n):
        return word + ('s' if n > 1 else '')

    # last passage
    if match is None:

        #### Parameters for filters: incl, excl, max_it. Also: sort, reverse
        # No filter at all if excl = incl = [] and max_it = None
        # Include only some tags (if defined, deactivates any exclusion). E.g.:
        # incl = ['img', 'svg']    # [] for no inclusion
        # Exclusion of some tags, e.g.:
        # excl = ['html', 'meta', 'body', 'title', 'div', 'p']    # [] for no exclusion
        # No display if more occurrences than max_it, e.g.:
        # max_it = 5    # None or 0 for no limit
        # Sort by name or by number of occ., e.g.:
        # sort = 'number'    # None | 'name' | 'number' (any other value will sort by name)
        # reverse = False    # reverse order: False or True

        incl = []        # [] for no filter, ['div'] for only one tag
        excl = []        # [] for no filter
        max_it = None    # 0 or None for no limit
        sort = 'name'    # None for no sorting
        reverse = False
        #####
        
        # Prepare the print of the filters (if any), for information:
        print_param = []
        if incl:
            print_param.append('Include only those tags: ' + ', '.join(incl))
        if excl:
            print_param.append('Exclude those tags: ' + ', '.join(excl))
        if max_it:
            print_param.append(f"Don't print tags with more than {max_it} {plural('occurrence', max_it)}")
            
        # counting by tag
        if incl:
            my_tags = {k: v for k, v in data.items() if k in incl and (not max_it or v <= max_it)}
        else:
            my_tags = {k: v for k, v in data.items() if k not in excl and (not max_it or v <= max_it)}

        # print headers
        print(f'Found a total of {number} {plural("occurrence", number)} and {len(data)} different {plural("tag", len(data))}')
        if print_param:
            print(6*' ' + '\n      '.join(print_param))
        if incl and excl:
            print('You have defined inclusions AND exclusions. Only inclusions have been treated')
        if len(my_tags) == 0:
            print('No occurrences found with those criteria')
        elif len(my_tags) < len(data):
            ntags = sum(my_tags.values())
            print(f'Selected a total of {ntags} {plural("occurrence", ntags)} and {len(my_tags)} different {plural("tag", len(my_tags))}')
        print('')
        
        if sort is None:
            ind = my_tags.keys()
        elif sort.lower() == 'number':
            ind = sorted(my_tags, key=(lambda k: my_tags[k]), reverse=reverse)
        else:
            ind = sorted(my_tags, reverse=reverse)
 
        # Print the occurrences by tag
        for key in ind:
            print(f'{key} : {my_tags[key]}')
        return
    # End of last passage

    # normal passage
    tag = match[1]
    data[tag] = data.get(tag, 0) + 1
    return match[0]

replace.call_after_last_match = True    # Ask for last passage
The result (with no parameter):
Code:
Found a total of 2613 occurrences and 14 different tags

a : 41
body : 20
br : 7
div : 56
h1 : 11
[etc.]
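
For checking such counts outside the editor, here is a small standalone sketch that applies the same search expression to a string with re and collections.Counter. The function name count_tags and the sample markup are invented for the example; it does not use calibre at all:

```python
import re
from collections import Counter

TAG_RE = re.compile(r'<(\w+)')   # same expression as in the "find" field


def count_tags(html):
    """Count the opening tags of a string of HTML, keyed by tag name."""
    return Counter(m.group(1) for m in TAG_RE.finditer(html))


# Sample markup, only for illustration
sample = '<html><body><div><p>One</p><p>Two<br/></p></div></body></html>'
counts = count_tags(sample)
print(f'Found a total of {sum(counts.values())} tags, with {len(counts)} different tags\n')
for tag in sorted(counts):
    print(f'{tag} : {counts[tag]}')
```

Closing tags are not counted, because the "/" after "<" is not matched by \w.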

Last edited by lomkiri; 04-12-2025 at 07:07 AM. Reason: typos
Old 04-13-2025, 08:40 AM   #80
lomkiri
Groupie
lomkiri ought to be getting tired of karma fortunes by now.
Posts: 167
Karma: 1497966
Join Date: Jul 2021
Device: N/A
Display the number of occurrences of each html tag in all text files

New version with an option (in the parameters) for printing the list of the impacted files below each tag.
It tries to mimic the "Reports" tool, but of course the list is not clickable.

N.B.: Each file list starts with the currently edited file, so if you ask for the file lists, it's a good idea to display the first file before clicking "Replace all".

Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    r""" 2025-04-13
    Counts the number of occurrences of every html tag in an epub
    May be filtered by tag name and by max number of occ.
    Option for listing the impacted files

    search regex: <(\w+)
    """

    def plural(word, n):
        return word + ('s' if n > 1 else '')

    # last passage
    if match is None:

        #### Parameters ###
        # No filter at all if excl = incl = [] and max_it = None (or 0)
        #
        # Include only some tags (if defined, deactivates any exclusion). E.g.:
        #   incl = ['img', 'svg']  # [] for no inclusion
        # Exclusion of some tags, e.g.:
        #   excl = ['html', 'meta', 'body', 'title', 'div', 'p']    # [] for no exclusion
        # No display if more occurrences than max_it, e.g.:
        #   max_it = 5        # None or 0 for no limit
        # Sorting:
        #   sort = 'name'     # 'name' | 'number' | None or '' (any other value will sort by name)
        #   reverse = False   # Reverse sorting if True
        # Optional file list:
        #   showfiles = True  # For each tag, show the affected files with the number of occ.
        #                     # This list starts with the file currently displayed

        incl = []           # [] for no filter, ['div'] for only one tag
        excl = []           # [] for no filter
        max_it = 0          # 0 or None for no filter
        sort = 'name'       # None or '' for no sorting
        reverse = False
        showfiles = False   # False for no file list
        #####
        
        # Prepare the print of the parameters, if any:
        print_param = []
        sorting = 'List ordered by ' + ('natural order' if not sort
                                        else 'number of occurrences' if sort.lower() == 'number'
                                        else 'name')
        sorting += ' (reversed order)' if sort and reverse else ''
        if incl:
            print_param.append('Include only those tags: ' + ', '.join(incl))
        if excl:
            print_param.append('Exclude those tags: ' + ', '.join(excl))
        if max_it:
            print_param.append(f"Don't print tags with more than {max_it} {plural('occurrence', max_it)}")
        if  showfiles:
            print_param.append('Print also the list of the impacted files (starts at the displayed file)')
            
        # counting by tag
        if incl:
            my_tags = {k: d for k, d in data.items() if k in incl and (not max_it or d['numtags'] <= max_it)}
        else:
            my_tags = {k: d for k, d in data.items() if k not in excl and (not max_it or d['numtags'] <= max_it)}

        # print headers
        print(f'Found a total of {number} {plural("occurrence", number)} and {len(data)} different {plural("tag", len(data))}')
        if print_param:
            print(6*' ' + '\n      '.join(print_param))
        if incl and excl:
            print('You have defined inclusions AND exclusions. Only inclusions have been treated')
        if len(my_tags) == 0:
            print('No occurrences found with those criteria')
        elif len(my_tags) < len(data):
            ntags = sum(d['numtags'] for d in my_tags.values())
            print(f'Selected a total of {ntags} {plural("occurrence", ntags)} and {len(my_tags)} different {plural("tag", len(my_tags))}')
        print(sorting)
        print('')
        
        if not sort:
            ind = my_tags.keys()
        elif sort.lower() == 'number':
            ind = sorted(my_tags, key=(lambda k: my_tags[k]['numtags']), reverse=reverse)
        else:
            ind = sorted(my_tags, reverse=reverse)
 
        # Print the occurrences by tag
        for key in ind:
            print(f'{key} : {my_tags[key]["numtags"]}')
            if showfiles:
                for f in my_tags[key]["files"]:
                    print(f'{6*" "} {f} : {my_tags[key]["files"][f]}')
        return
    # End of last passage

    # normal passage
    tag = match[1]
    # One entry per tag: total count plus a per-file count
    entry = data.setdefault(tag, {'numtags': 0, 'files': {}})
    entry['numtags'] += 1
    entry['files'][file_name] = entry['files'].get(file_name, 0) + 1
    return match[0]

replace.call_after_last_match = True    # Ask for last passage
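
To make the layout of "data" in this version easier to see, here is a small standalone sketch that fills the same kind of structure (a total per tag plus a count per file) from two made-up files. The helper name tally and the file contents are invented for the example, and nothing here depends on calibre:

```python
import re


def tally(data, tag, file_name):
    """Record one occurrence of `tag` found in `file_name`."""
    entry = data.setdefault(tag, {'numtags': 0, 'files': {}})
    entry['numtags'] += 1
    entry['files'][file_name] = entry['files'].get(file_name, 0) + 1


# Hypothetical file contents, only for illustration
files = {
    'text/ch1.xhtml': '<body><p>One</p><p>Two</p></body>',
    'text/ch2.xhtml': '<body><p>Three</p><img src="a.png"/></body>',
}

data = {}
for name, html in files.items():
    for m in re.finditer(r'<(\w+)', html):
        tally(data, m.group(1), name)

# Same kind of listing as with showfiles = True
for tag in sorted(data):
    print(f"{tag} : {data[tag]['numtags']}")
    for name, n in data[tag]['files'].items():
        print(f'       {name} : {n}')
```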

Last edited by lomkiri; 04-14-2025 at 11:14 AM.
Old 04-18-2025, 07:45 AM   #81
maxthegold
Member
maxthegold began at the beginning.
 
Posts: 10
Karma: 10
Join Date: Mar 2011
Location: Weston-super-Mare, U.K.
Device: Kobo Libra 2
I absolutely see your point, but at the time, I had over a thousand entries in my JSON file, and I am lazy. I wasn't particularly concerned about the added processing time. I like your solution, though, as I now have about 1500 JSON entries. I'll give it a try. Keep up the excellent work.
Old 04-18-2025, 07:35 PM   #82
maxthegold
Member
maxthegold began at the beginning.
 
Posts: 10
Karma: 10
Join Date: Mar 2011
Location: Weston-super-Mare, U.K.
Device: Kobo Libra 2
Quote:
Originally Posted by lomkiri View Post
To maxthegold and moldy:

Done that way, capitalize() is called many times and two more tests are added inside the loop. It is more efficient to modify the dict directly, automatically adding the capitalized form of each word to it. That way the function is called only once per word, at initialization, which is quicker on big ebooks.

And, as I made capitalization a parameter (True by default), I prefer to put this test in the init rather than in the loop, for the same reason.

Furthermore, if "Paul" (for example) is in your dict, your code will search for it twice ("Paul" == "Paul".capitalize()), which is not the case in the code I propose.

Of course, the regex for those words, if provided, will be the same for the lower-case and the capitalized forms. If you don't want that for some words, put both forms explicitly in the dict.

I have modified the code in message 63 to include this feature. I also added a counter giving the number of entries in the dict, including the capitalized words.
I absolutely see your point, but at the time, I had over a thousand entries in my JSON file, and I am lazy. I wasn't particularly concerned about the added processing time. I like your solution, though, particularly as I now have about 1500 JSON entries. I'll give it a try. Keep up the excellent work lomkiri.
Old 04-25-2025, 08:47 AM   #83
moldy
Enthusiast
moldy began at the beginning.
 
Posts: 43
Karma: 10
Join Date: Oct 2015
Device: Kindle
Hi lomkiri and maxthegold.
I've been successfully using the latest version of the function and it works brilliantly for me, saving a lot of time.
However, a recent experience has highlighted a missing feature that would be nice to have but is in no way essential.
In a lot of ebooks, the first 3 words of a chapter (and sometimes of a text break) are fully capitalised, e.g. JOHN AND PAUL wrote many songs .....
There are also often fully capitalised words in the text for various reasons. The function, as it is, ignores these words.
Would it be possible to add a feature to the function to include fully capitalised words?
Thanks once again for your work - moldy
Old 04-26-2025, 04:50 AM   #84
lomkiri
Groupie
lomkiri ought to be getting tired of karma fortunes by now.
Posts: 167
Karma: 1497966
Join Date: Jul 2021
Device: N/A
Quote:
Originally Posted by moldy View Post
Would it be possible to add a feature to the function to include fully capitalised words?
Done, the function in message 63 has been modified.
I also added the possibility of a case-insensitive search; this option obviously overrides any parameter for capitalisation or uppercase.

Last edited by lomkiri; 04-26-2025 at 03:40 PM.
Old 05-02-2025, 06:16 AM   #85
moldy
Enthusiast
moldy began at the beginning.
 
Posts: 43
Karma: 10
Join Date: Oct 2015
Device: Kindle
Thanks for the modified function lomkiri - your work and knowledge are much appreciated.
I tested the new version using the John => Mick example. I ignored mixed capitalisation (jOHn) because it occurs so rarely that it's not a problem. My test was:
Code:
<p>JOHN John</p>
and the desired return would have been:
Code:
<p>MICK Mick</p>
However the actual return was:
Code:
<p>mick mick</p>
Obviously where, say, the first 3 words of a text section are capitalised or, as another example, in a capitalised heading, this change would not be acceptable.
I'm guessing I did not make my change request sufficiently clear. - moldy

Last edited by moldy; 05-02-2025 at 06:28 AM. Reason: Typo
Old 05-02-2025, 11:59 AM   #86
lomkiri
Groupie
lomkiri ought to be getting tired of karma fortunes by now.
Posts: 167
Karma: 1497966
Join Date: Jul 2021
Device: N/A
Are you sure you've set the parameters to:
case_sensitive = True
do_capitalize = True
do_uppercase = True

I've just tested with those parameters and with "paul": "keith" in change_words.json:
"Paul PAUL paul" gives "Keith KEITH keith", as expected.

You have probably set case_sensitive = False; then you always obtain "keith" regardless of the case of the entry: "Paul PAUL paul" gives "keith keith keith".
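
To illustrate how these three switches can interact, here is a rough sketch. The parameter names are the ones above, but build_rules and apply_rules are hypothetical helpers written for this example and are not the code of message 63:

```python
import re


def build_rules(words, case_sensitive=True, do_capitalize=True, do_uppercase=True):
    """Compile (pattern, replacement) pairs for each entry of `words`.

    Hypothetical sketch: with case_sensitive=False one case-insensitive
    pattern is built per word, so the two other switches have no effect.
    """
    rules = []
    for old, new in words.items():
        if not case_sensitive:
            rules.append((re.compile(rf'\b{re.escape(old)}\b', re.IGNORECASE), new))
            continue
        variants = {old: new}
        if do_capitalize:
            variants.setdefault(old.capitalize(), new.capitalize())
        if do_uppercase:
            variants.setdefault(old.upper(), new.upper())
        for src, dst in variants.items():
            rules.append((re.compile(rf'\b{re.escape(src)}\b'), dst))
    return rules


def apply_rules(text, rules):
    """Apply every rule in turn (the replacement is used as a literal string)."""
    for pattern, replacement in rules:
        text = pattern.sub(lambda m, r=replacement: r, text)
    return text


words = {'paul': 'keith'}
print(apply_rules('Paul PAUL paul', build_rules(words)))
print(apply_rules('Paul PAUL paul', build_rules(words, case_sensitive=False)))
```

With the three parameters at True the first line printed is "Keith KEITH keith"; with case_sensitive = False the second is "keith keith keith", which is the behaviour described above.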
Old 05-03-2025, 04:49 AM   #87
moldy
Enthusiast
moldy began at the beginning.
 
Posts: 43
Karma: 10
Join Date: Oct 2015
Device: Kindle
My apologies lomkiri, I had misread the instructions.
With all parameters set to 'True' the function works perfectly.
Hmmmm...If only I could think of something else to add to the function
Tags
conversion, errors, function, ocr, spelling