Old 01-19-2022, 08:43 PM   #5
davidfor
Grand Sorcerer
Posts: 24,905
Karma: 47303824
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
Quote:
Originally Posted by lomkiri View Post
- Selecting whole sentences:
You could add space and comma to your search string:
Code:
\b(\p{Lu}[\p{Lu}\s,-]+)\b
(note: \p{Lu} has the same meaning as [[:upper:]]; you may use either one)
In this case, words like JOHN or FIFA will be targeted and transformed.
If an acronym with dots (F.I.F.A.) is inside the sentence, the selection will stop when reaching it.
I just tried this and it picked up "I" by itself. I suppose that works if you are using title case for the matched words, but not if you are just changing them to lower case. Either way it feels wrong, as it gives a lot of false positives. I tried:

Code:
\b(\p{Lu}{2}[\p{Lu}\s,-]*)\b
That didn't pick up "I" by itself, but it missed "I'M". (The book I tested on had a few "I'm GOING TO..." with the action depending on exactly how angry they were. I didn't notice it when I read the book, as it worked.)
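For anyone who wants to see the difference outside the editor, here is a rough sketch using Python's standard re module. It is only an approximation: re does not support \p{Lu}, so [A-Z] stands in for it (ASCII capitals only), and find_caps is just a name made up for this demo.

```python
import re

# ASCII stand-ins for the two search strings above.
# Assumption: [A-Z] replaces \p{Lu}, because the stdlib "re" module
# has no Unicode property classes.
ONE_OR_MORE = r"\b([A-Z][A-Z\s,-]+)\b"     # first pattern
TWO_OR_MORE = r"\b([A-Z]{2}[A-Z\s,-]*)\b"  # second pattern


def find_caps(pattern, text):
    """Return every match of the pattern, with surrounding whitespace trimmed."""
    return [m.strip() for m in re.findall(pattern, text)]


sample = "I think I'M GOING TO win"

# The first pattern accepts "I" plus the following space as a match,
# and it also starts a match in the middle of "I'M".
print(find_caps(ONE_OR_MORE, sample))

# The second pattern needs two capitals in a row, so it skips "I"
# but also skips the whole of "I'M".
print(find_caps(TWO_OR_MORE, sample))
```

With that sample the first pattern returns "I" and "M GOING TO", while the second returns only "GOING TO", which is the behaviour described above.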
Quote:
- Excluding from the transformation the words not recognized by the dictionary:
Use the search string David gave you:
\b([[:upper:]]{2,})\b
with this regex-function:
Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):    
    word = match.group(1)
    if dictionaries.recognized(word):
        return word[0] + word[1:].lower()
    return word
This will transform only the recognized words. The last "return" leaves the non-recognized words as they are; it's up to you to treat them some other way.

- Another possibility is to write all the capitalized words that the dictionary does not recognize to a temp file, and then decide what you want to do with them (you can do that in a regex-function: store them in a Python set and write the set out on the last call of the function)

If you want a more refined treatment, you'll have to work out how to deal with the exceptions and translate that logic into your regex-function.
The issue I had there was that "USA" is in the dictionary. I had also added "FIFA" to the ignored words, which meant it was treated as being in the dictionary. And the book I tested on had "CPR", "SOS", "TV" and a few others.
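One way around that would be an explicit list of acronyms to leave alone, checked before the dictionary. This is only a sketch: KEEP_UPPER and its contents are something you would have to build per book, and FakeDictionaries and demo are stand-ins so the function can be tried outside calibre, where the real dictionaries object is passed in by the editor. The [A-Z] in the demo is again an ASCII substitute for [[:upper:]].

```python
import re

# Hypothetical per-book list of acronyms that must stay in capitals,
# even though the spelling dictionary recognizes them.
KEEP_UPPER = {"USA", "FIFA", "CPR", "SOS", "TV"}


def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    word = match.group(1)
    if word in KEEP_UPPER:
        return word  # known acronym: leave untouched
    if dictionaries.recognized(word):
        return word[0] + word[1:].lower()  # ordinary word: keep the first capital only
    return word  # unknown word: leave for manual review


class FakeDictionaries:
    """Stand-in for the object calibre supplies; only used by the demo below."""

    def __init__(self, words):
        self._words = {w.lower() for w in words}

    def recognized(self, word):
        return word.lower() in self._words


def demo(text):
    """Apply replace() to every run of two or more ASCII capitals."""
    dictionaries = FakeDictionaries({"stop", "usa", "tv"})
    pattern = re.compile(r"\b([A-Z]{2,})\b")
    return pattern.sub(
        lambda m: replace(m, 1, "demo.html", None, dictionaries, {}, None), text
    )


print(demo("STOP watching TV in the USA, XQZ"))
```

In the demo "STOP" becomes "Stop", "TV" and "USA" stay as they are because of the list, and "XQZ" is left alone because it is not recognized. The list still has to be maintained by hand, so it does not remove the need to look at the matches.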
Quote:
Suggestion: you could also surround the whole capitalized sentence with the tag <small>SENTENCE</small>; it will be much less aggressive, as small-caps are often used as an acceptable emphasis. You can do that by slightly modifying the regex-function I wrote above.
Or use a span with a transform to lower case or capitalize.
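A rough sketch of the span approach, leaving the text itself untouched and letting CSS do the work. The class name caps-lower and the function wrap_caps are made up for the example, the pattern is an ASCII-only approximation of the search strings in this thread, and the stylesheet would need a matching rule such as .caps-lower { text-transform: lowercase; }. Note that text-transform: capitalize on its own only raises the first letter of each word and does not lower the rest, so it would not visibly change text that is already all capitals.

```python
import re

# Runs of words with two or more ASCII capitals, joined by spaces,
# commas or hyphens. Assumption: [A-Z] stands in for \p{Lu}.
CAPS_RUN = re.compile(r"\b([A-Z]{2,}(?:[\s,-]+[A-Z]{2,})*)\b")


def wrap_caps(text, css_class="caps-lower"):
    """Wrap each all-caps run in a span so a stylesheet can restyle it."""
    return CAPS_RUN.sub(
        lambda m: '<span class="%s">%s</span>' % (css_class, m.group(1)), text
    )


print(wrap_caps("He said GET OUT now"))
```

The advantage is that the original capitals stay in the file, so the change can be undone or restyled later by editing one CSS rule.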


@Peter Blaise: As to automating this, I really don't think that is a good idea. There are far too many exceptions to the rule. Your best bet is not to do it from the spelling checker. Use the search, look at each word and then decide whether to change it or skip to the next one.

And for the record, this is purely a technical exercise to me. I don't think changing a book in this way makes sense. If the author does this, it should be deliberate and for the emphasis. If they overdo it, there are usually other problems in the book and they are generally worse.