Regex examples - Page 26

eschwartz · 08-08-2014, 12:33 PM

Quote:

Originally Posted by mzmm

is that working for you in sigil? my PCRE editor doesn't match

i'd probably use something like

find

Code:

(?<![.!?])( [A-Z])(?=[a-z])

replace

Code:

\U\1\E

but yes, some examples of exactly what you want to match would help, it's a little unclear

Yes, I forgot some things, like the part where lookbehinds need to look behind.

Find:

Code:

(?<![.!?])(?<=[ ])([A-Z])(?=[a-z]+)

Replace:

Keep in mind that \E -- end of modifier's action -- is not strictly necessary if the entire replacement is being flagged as lowercase:

Code:

\L\1

This time I actually checked in Sigil.

You lost a space. Also, you imitated my mistake of offering an uppercasing solution (for letters that are already uppercase) instead of lowercasing. I blame my dental surgery, what's your excuse?

(You can blame it on me, I did trick you.)

DiapDealer · 08-08-2014, 12:55 PM

Might want to replace those old-fashioned [A-Z][a-z] classes with something more unicode-friendly (such as \p{L} and its uppercase/lowercase variants). We're not regexing in an ascii-only world anymore. Even in English texts.

eschwartz · 08-08-2014, 01:05 PM

Quote:

Originally Posted by DiapDealer

Might want to replace those old-fashioned [A-Z][a-z] classes with something more unicode-friendly (such as \p{L} and its uppercase/lowercase variants). We're not regexing in an ascii-only world anymore. Even in English texts.

I am an american heathen who doesn't know anything about this "foreign language" stuff. ascii is all that matters.

I could do \w if you want.

DiapDealer · 08-08-2014, 01:28 PM

Quote:

Originally Posted by eschwartz

I could do \w if you want.

Which would only include non-ascii characters if regex's Unicode switch is turned on (and even then, it's going to include digits and underscores as well). No... if you're looking to match only letters--but even those letters used to spell naive and facade correctly--it's \p{L} you're gonna want.
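For anyone who wants to see the difference outside an editor, here is a rough sketch using Python's standard `re` module. It is a stand-in under stated assumptions, not what Sigil or calibre actually run: stdlib `re` has no \p{L}, so the letters-only class is approximated with the idiom [^\W\d_], and the sample string is made up.

```python
import re

# Made-up sample: two words with non-ASCII letters, plus an underscore and a digit.
sample = "na\u00efve fa\u00e7ade_1"

# \w restricted to ASCII: the accented letters split the words apart.
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)

# \w in Unicode mode (the default for str patterns in Python 3):
# accented letters match, but so do the underscore and the digit.
unicode_words = re.findall(r"\w+", sample)

# Letters only: "a word character that is neither a digit nor an underscore".
# This stands in for \p{L}, which stdlib re does not support.
letters_only = re.findall(r"[^\W\d_]+", sample)

print(ascii_words)
print(unicode_words)
print(letters_only)
```

The third list is the only one that keeps both words whole without also picking up the trailing underscore and digit.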

eschwartz · 08-08-2014, 01:37 PM

Quote:

Originally Posted by DiapDealer

Which would only include non-ascii characters if regex's Unicode switch is turned on (and even then, it's going to include digits and underscores as well). No... if you're looking to match only letters--but even those letters used to spell naive and facade correctly--it's \p{L} you're gonna want.

See above, under american heathen.

I will keep it in mind for next time.

DiapDealer · 08-08-2014, 01:44 PM

Hey, I'm American too! I just happened to notice the phrase "older German grammar" in the original request. Don't want to give advice that might cause them to miss the very stuff they were looking for do we?

eschwartz · 08-08-2014, 01:57 PM

Quote:

Originally Posted by DiapDealer

Hey, I'm American too! I just happened to notice the phrase "older German grammar" in the original request. Don't want to give advice that might cause them to miss the very stuff they were looking for do we?

And I learned something new myself, today.

Fortunately, the easiest part should be replacing the characters to search for with a bigger set. My expertise accurately targeted what I know about, which is the framework behind the search. On which note, we really need calibre editor macros already, since calibre doesn't seem to support the full PCRE.

DiapDealer · 08-08-2014, 03:18 PM

Quote:

Originally Posted by eschwartz

On which note, we really need calibre editor macros already, since calibre doesn't seem to support the full PCRE.

It supports a whole big bunch of it. Anything missing is mainly on the replacement side of things--like the (upper|lower)case thing (which--let's face it--is pretty specialized/gimmicky to begin with). The calibre editor's regex engine also gains us a few things that other regex flavors don't have, like variable length lookbehinds and the short-hand classes \m \M (which match the beginning and end of words respectively), as opposed to just the \b (word boundary).

The only thing I REALLY miss in calibre's regex ATM is the \K functionality. I'm not entirely sure why it's unavailable--since it's certainly part of the Barnett Python regex module that it's employing for its editor's S&R.

eschwartz · 08-08-2014, 03:30 PM

Never seen \K before.

Seems like it is useful mainly as a replacement for lookbehind assertions (while still capturing stuff!)

DiapDealer · 08-08-2014, 04:33 PM

Quote:

Originally Posted by eschwartz

Never seen \K before.

Seems like it is useful mainly as a replacement for lookbehind assertions (while still capturing stuff!)

Actually, I could be wrong. \K may not be a part of the regex module being used in calibre. I will miss it very much if so.

But then again... with variable-length lookbehind assertions allowed, it may not be all that hard to replicate \K's functionality!

I've just always hated remembering the lookbehind syntax:
When using the string 'hhhhhhhhhhhhhhhhhhhhhhd':
It was always easier to search for h+\Kd in Sigil (provided finding a 'd' that follows a potentially unknown number of 'h's was as vitally important to you as it is to me!). Besides the fact that (?<=h+)d wouldn't fly in Sigil, the (?<=) and (?<!) hokum of lookbehinds was (and still is) always difficult for me to remember on the fly. I find it terribly unintuitive.

But now that (?<=h+)d WILL work in calibre's editor ... the \K isn't AS vitally important to me--provided I get over the mental stumbling block of remembering the (positive|negative) look(ahead|behind)'s syntax.

So with the exception of the case alteration on replacements, what are you finding you can't do in calibre's regex S&R that you could in Sigil's?
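A sketch of the same "h's then a d" problem in Python's standard `re` module, which, as far as I know, offers neither \K nor variable-length lookbehind and so sits closer to the Sigil side of this comparison. The helper names `lookbehind_compiles` and `find_d_after_hs` are made up for illustration. A plain capturing group does the job of \K here: match the run of h's, but report only the span of the captured d.

```python
import re

text = "hhhhhhhhhhhhhhhhhhhhhhd"

def lookbehind_compiles(pattern):
    """Return True if stdlib re accepts the pattern, False if it rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

def find_d_after_hs(s):
    """Emulate h+\\Kd: require one or more h's, but report only the d."""
    m = re.search(r"h+(d)", s)
    return m.span(1) if m else None

# Variable-length lookbehind is rejected by stdlib re; fixed-length is accepted.
print(lookbehind_compiles(r"(?<=h+)d"))
print(lookbehind_compiles(r"(?<=h)d"))

# Span of the 'd' alone, even though the h's were part of the match.
print(find_d_after_hs(text))
```

In a search-and-replace box the same trick means capturing the prefix and putting it back in the replacement, which is the chore \K and variable-length lookbehind both save you.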

eschwartz · 08-08-2014, 04:36 PM

I hear ya!

I just forgot to make lookbehinds look behind today.

New tool for the Sigil toolkit, at least.

Leonatus · 08-10-2014, 03:13 AM

@eschwartz,
@mzmm:

as to the quotes: In a direct speech, for example:

"Du, Du willst doch nur ...",

the first upper case 'Du' should be maintained, whereas the second should be lower case.

But if the sentence is:

"Blah, blah, blah", sagte Er, "Du, Du willst doch nur...",

all the personal pronouns ('Er', 'Du') should be lower case, for the direct speech is only continued; in the first example it is starting the sentence.

BTW: Instead of " ", I use right and left angled quotation marks (guillemets).

Many thanks so far, and I hope it has become clearer!

eschwartz · 08-10-2014, 03:29 AM

My previous code assumed a word was preceded by a space, but I stuck in a check for NOT an opening guillemet. Also, using DiapDealer's Unicode property classes.

Find:

Code:

(?<![.!?«])(?<=[ ])(\p{Lu})(?=\p{Ll}+)

Replace:

Code:

\L\1

Finds a space, capital letter, lowercase letter, assuming it is not preceded by punctuation types:
.!?«

Does that work? If not, I'll try to come up with something more inspired in the morning.
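One thing worth checking before relying on the pattern above: two lookbehinds written back to back are both tested at the same position, so (?<![.!?«]) and (?<=[ ]) each inspect the single character right before the capital. Since that character has to be a space, the punctuation check can never fail. The Python sketch below compares that stacked form with a variant that moves the punctuation check one character further back. The explicit letter classes are an ASCII-plus-umlauts stand-in for \p{Lu} and \p{Ll} (which stdlib `re` lacks), the callback stands in for \L\1, and the sample sentence is invented.

```python
import re

UPPER = "[A-ZÄÖÜ]"      # stand-in for \p{Lu}
LOWER = "[a-zäöüß]"     # stand-in for \p{Ll}

# Stacked form: both lookbehinds look at the one character before the capital.
stacked = re.compile(r"(?<![.!?«])(?<= )(" + UPPER + r")(?=" + LOWER + r"+)")

# Shifted form: the negative lookbehind covers "punctuation, then space".
shifted = re.compile(r"(?<![.!?«] )(?<= )(" + UPPER + r")(?=" + LOWER + r"+)")

def lowercase_matches(pattern, text):
    """Lowercase the captured capital, like replacing with \\L\\1."""
    return pattern.sub(lambda m: m.group(1).lower(), text)

sample = "Er sah Ihn. Dann ging Er."
print(lowercase_matches(stacked, sample))
print(lowercase_matches(shifted, sample))
```

With the stacked form the sentence-initial "Dann" gets lowercased too; the shifted form leaves it alone. Both forms still lowercase every capitalized word they reach, nouns included.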

Leonatus · 08-10-2014, 04:52 AM

For weekend reasons, I don't have the text to be treated available here to test it, but some additional clarification might be necessary.

Does your proposal not match any uppercase letter in the respective context?

The point is that nouns in the German language have always been capitalized (at the beginning of the word, of course), still are today, and should remain so. In the former spelling, however, most words representing objects or persons, such as pronouns, were also capitalized, and these have to be written lowercase under the current spelling rules. So, in English it would be like this:

The black Panther was meant to attack Him immediately, but He jumped quickly aside beyond the Wall.

Thus, the "Panther" and the "Wall" should remain uppercase, but "He" and "Him" should turn lowercase.

I hope the problem I have has become clearer.

mzmm · 08-10-2014, 09:52 AM

Quote:

Originally Posted by Leonatus

... Does your proposal not match any uppercase letter in the respective context?

The point is, that nouns in the german language have always been spelled uppercase

ah, right, German Nouns. think i answered too quickly the first time.

but yes, the regex would match all uppercase words.

there's going to be some issues with a regex that only catches pronouns, for a few reasons i think; one is that the formal Sie/Ihnen should remain uppercase, whereas sie (she) or ihnen (them) should be converted to lower case.

also, if one is referring to God, i'm uncertain as to whether that would constitute an uppercase Du, or lowercase du, so you may have to be aware of the context there.

anyway, i'd maybe suggest trying something like this:

Code:

(?<![.!?])(\s«?)(Ich|Mich|Mir|Du|Dich|Dir|Er|Ihn|Ihm|Ihr|Es|Wir|Uns|Euch)\b

and then replacing with

Code:

\1\L\2

the first capturing group (\s«?) is looking for a space that may be followed by a «.

unfortunately, you'd then need to go through the text searching for

Code:

(?<![.!?])(\s«?)(Sie|Ihnen)\b

and replacing with

Code:

\1\L\2

or just skipping over it based on the context of the sentence (formal Sie or female sie)

also this wouldn't take into account reflexive or possessive pronouns, i.e. meines, deines, seines, ihres etc, but you didn't mention that these were also uppercased.

in case they are, then you'd want to add them into the second capturing group separated by a pipe | with the other words. the regex is going to get increasingly complex and brittle if you do need to include all relative, demonstrative, interrogative, etc pronouns, and may in the end not be possible to use reliably.

so, maybe that helps?

here's a link to an online editor in case you want to try some more stuff out

http://regex101.com/r/lI3yN2/2
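To try the pronoun list outside an editor, a minimal Python sketch follows. The pattern is the first one from this post; the callback replaces \1\L\2 because stdlib `re` has no \L in replacements, and the function name and sample sentences are invented for illustration.

```python
import re

PRONOUNS = "Ich|Mich|Mir|Du|Dich|Dir|Er|Ihn|Ihm|Ihr|Es|Wir|Uns|Euch"

# Same shape as the pattern above: no sentence-ending punctuation before the
# whitespace, an optional opening guillemet, then one of the listed pronouns.
pattern = re.compile(r"(?<![.!?])(\s«?)(" + PRONOUNS + r")\b")

def lowercase_pronouns(text):
    """Keep group 1 as is and lowercase group 2, like replacing with \\1\\L\\2."""
    return pattern.sub(lambda m: m.group(1) + m.group(2).lower(), text)

sample = "Der Panther wollte Ihn sofort angreifen, aber Er sprang zur Seite. Er entkam."
print(lowercase_pronouns(sample))
print(lowercase_pronouns("sagte Er, «Du, Du willst"))
```

Nouns such as "Panther" and "Seite" stay capitalized simply because they are not in the list, and the "Er" after the full stop is skipped by the negative lookbehind.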
