Old 08-14-2025, 04:00 AM   #1
Phssthpok
Age improves with wine.
Posts: 596
Karma: 95229
Join Date: Nov 2014
Device: Kindle Oasis, Kobo Libra II
How to write regex function which uses dictionary?

I have a book which has hyphens instead of em dashes, and I'm trying to fix it. Using a regex like "-(and|but|with)" catches a few cases, but it would be better to find every "\w+-\w+" match that is not in the current dictionary. That would catch about 99% of all cases and leave the remainder to the proofreading stage.

How could I write a regex function to do this?
Old 08-14-2025, 04:19 AM   #2
kovidgoyal
creator of calibre
Posts: 45,497
Karma: 28548962
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
https://manual.calibre-ebook.com/fun...phenated-words
Old 08-15-2025, 02:21 AM   #3
Phssthpok
Quote:
Originally Posted by kovidgoyal View Post
Aha! Thank you!
Old 08-17-2025, 10:08 AM   #4
Phssthpok
Quote:
Originally Posted by Phssthpok View Post
Aha! Thank you!
Hmm, having done this I realise that what I really need is a way to *find* things which are not in the dictionary, not all hyphenated words -- since there are gazillions of those to wade through, and far fewer which are non-dictionary items.

Seems to be no way to do this. Plugin maybe? Back to the drawing board...
Old 08-18-2025, 04:45 PM   #5
lomkiri
Groupie
Posts: 172
Karma: 1497966
Join Date: Jul 2021
Device: N/A
Quote:
Originally Posted by Phssthpok View Post
Hmm, having done this I realise that what I really need is a way to *find* things which are not in the dictionary, not all hyphenated words.
Try replacing the sub-function replace_word() inside the example's code with this one, which does a replacement when the compound word is NOT in the dictionary:
Code:
    def replace_word(wmatch):
        # if word1-word2 is not recognized by the dictionary, replace the hyphen with an em dash
        with_em_dash = wmatch.group(1) + "—" + wmatch.group(2)
        if not dictionaries.recognized(wmatch.group()):
            return with_em_dash
        return wmatch.group()
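
For trying the idea outside calibre, here is a minimal standalone sketch of the same logic. It assumes a two-group pattern such as (\w+)-(\w+), which the snippet above implies through group(1) and group(2). StubDictionaries and fix_hyphens are hypothetical names invented for this illustration; the stub only imitates the recognized() method used above and is not calibre's real dictionary object.

```python
import re


class StubDictionaries:
    """Hypothetical stand-in for the `dictionaries` object used in the thread."""

    def __init__(self, words):
        self._words = {w.lower() for w in words}

    def recognized(self, word):
        # Case-insensitive membership test; a real spell checker does more
        return word.lower() in self._words


def fix_hyphens(text, dictionaries):
    """Replace the hyphen with an em dash in compounds the dictionary rejects."""

    def replace_word(wmatch):
        # Keep compounds the dictionary knows, e.g. "well-known"
        if dictionaries.recognized(wmatch.group()):
            return wmatch.group()
        # Otherwise assume the hyphen was meant to be an em dash (U+2014)
        return wmatch.group(1) + "\u2014" + wmatch.group(2)

    return re.sub(r"(\w+)-(\w+)", replace_word, text)


dictionaries = StubDictionaries(["well-known"])
sample = "A well-known trick-but it rarely works."
print(fix_hyphens(sample, dictionaries))
```

With this stub, "well-known" is left alone and "trick-but" gets the em dash. How well the approach works on a real book depends entirely on which compounds the active dictionary recognizes.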
Old 08-22-2025, 05:26 AM   #6
Phssthpok
Quote:
Originally Posted by lomkiri View Post
Try to replace the sub-function replace_word() that is inside the code of the example by this one, that does a replace when the compound word is NOT in the dict
No, I did this -- the problem with this code is that I still have to step through every occurrence of a hyphenated word and then decide whether to replace it or not.

What I really want is a way to FIND the ones not in the dictionary and THEN decide whether to replace them. Out of 500 hyphenated words, maybe only 50 will need to be looked at as candidates for replacement, instead of looking at all 500. But I can't see any way to do that, so I'll just have to look at all 500.
Old 08-26-2025, 01:49 PM   #7
lomkiri
Quote:
Originally Posted by Phssthpok View Post
What I really want is a way to FIND the ones not in the dictionary and THEN decide whether to replace them.
You could modify the regex-function so that it selects the candidates and writes them to a file (e.g. not_in_dict.txt) without changing anything in your text.

Then you open this file in a text editor and delete from it everything except the occurrences you want to correct.

Then you modify the regex-function to load this new file into a list and to check each occurrence against this list: if it's present in the list, correct it in the text.

Note: It could be more convenient to delete from the file only the occurrences you want to correct (deleting roughly 50 instead of 450). In that case you'll have to do the opposite: correct only the ones which are NOT in the list.
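
The two passes described above could be sketched roughly as follows, as plain Python that runs outside calibre. Every name here (StubDictionaries, collect_candidates, write_candidates, load_approved, apply_approved) is hypothetical and made up for the illustration, and the stub only imitates the recognized() method from the earlier post. Inside calibre the same logic would have to be adapted to the editor's regex-function mode.

```python
import re
import tempfile
from pathlib import Path

HYPHENATED = re.compile(r"(\w+)-(\w+)")


class StubDictionaries:
    """Hypothetical stand-in for the `dictionaries` object used in the thread."""

    def __init__(self, words):
        self._words = {w.lower() for w in words}

    def recognized(self, word):
        return word.lower() in self._words


def collect_candidates(text, dictionaries):
    """Pass 1: list unrecognized compounds once each, in first-seen order."""
    seen = {}
    for m in HYPHENATED.finditer(text):
        if not dictionaries.recognized(m.group()):
            seen.setdefault(m.group(), None)
    return list(seen)


def write_candidates(words, path):
    """Write one candidate per line so the file is easy to prune by hand."""
    Path(path).write_text("\n".join(words) + "\n", encoding="utf-8")


def load_approved(path):
    """Load the hand-edited file back as a set, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def apply_approved(text, approved):
    """Pass 2: put an em dash (U+2014) only in compounds left in the file."""

    def replace_word(m):
        if m.group() in approved:
            return m.group(1) + "\u2014" + m.group(2)
        return m.group()

    return HYPHENATED.sub(replace_word, text)


text = "He paused-then ran. A self-made man-but poor. He paused-then sat."
dictionaries = StubDictionaries(["self-made"])
candidates = collect_candidates(text, dictionaries)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "not_in_dict.txt"
    write_candidates(candidates, path)
    # In real use, edit the file by hand here before loading it again
    approved = load_approved(path)

print(candidates)
print(apply_approved(text, approved))
```

For the variant in the note, where the file holds the compounds to leave alone, the second pass would instead correct a compound when it is neither recognized by the dictionary nor present in the loaded set.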

Last edited by lomkiri; 08-26-2025 at 01:53 PM.
Old 08-29-2025, 05:57 AM   #8
Phssthpok
Quote:
Originally Posted by lomkiri View Post
You could modify the regex-function so that it selects the candidates and writes them to a file (e.g. not_in_dict.txt) without changing anything in your text.

Then you open this file in a text editor and delete from it everything except the occurrences you want to correct.

Then you modify the regex-function to load this new file into a list and to check each occurrence against this list: if it's present in the list, correct it in the text.

Note: It could be more convenient to delete from the file only the occurrences you want to correct (deleting roughly 50 instead of 450). In that case you'll have to do the opposite: correct only the ones which are NOT in the list.
Ah, that's a good idea! I'll give it a whirl.

Of course, for the book that made me think about this issue, I have already gone through the 500 instances by hand and corrected the 50 that were wrong... but I'm sure it'll happen to me again!