MobileRead Forums - View Single Post

eschwartz · 07-27-2016, 10:40 AM

Quote:

Originally Posted by Psymon

Hope it's okay for a veritable Regex newbie to post a query in this thread -- I'm only just beginning to learn about this stuff, but with any luck it'll eventually start sinking in.

Ah, that is exactly what this thread is for.

Certainly we don't expect the people who already know the answer to ask questions...

Quote:

I seem to have developed an affinity for doing up electronic versions of "ye olde bookes" -- for example, right now I'm doing up several Shakespeare plays in the original Elizabethan English, endeavouring to give it somewhat of the "look and feel" of early typographic styles, complete with use of the long-ess (i.e. "ſ", the character that looks like an "f" but without the crossbar, and is actually an "s"). Along with the unusual use of the "u" and "v" characters in early typography, where an "ſ" is used instead of "s" has to do with placement within a word, rather than the "sound" of the character or anything else like that.

Very often when I find digital transcriptions of these early texts, they've kept the "u" and "v" oddities, but for some reason have changed all the long-esses to just "s" instead -- and so I have to change them back. The rule for when this is supposed to occur is actually fairly simple (although not all early printers/typographers followed this, but the vast majority did): virtually every instance of "s" should be changed to "ſ" unless it falls at the end of the word, then it remains as "s."

So to fix my texts up, I've been searching for every instance of "s" and then changing it to "ſ" -- which right away causes all my HTML code to need to be fixed up first, because things like "css," "class," "span," etc. get screwed up in the process -- and then I do another series of searches, looking for instances of "ſ" (long-ess) plus a "." or "," or ":" or ";" or "?" or "!" or ")" or "[space]" or "[apostrophe -- curly or otherwise]", plus "<" should there be a closing </i> or </p> tag or something, i.e. wherever it might occur at the end of a word, and then changing it back to "s" again.

It's not that big a deal, actually, I can "correct" the long-esses in a whole book in, like, 5 or 10 minutes or so, but it would be totally cool to just whiz it off with one, single regex search, of course.

Oh, and it would have to be case-sensitive, of course -- all instances of upper-case "S" remain as "S."

[snip -- same thing with other character substitutions]

Case-sensitivity is a setting in the S&R box.

Using the power of lookaround zero-length assertions and word boundary zero-length assertions, the following regex will find a character-that-is-not-at-the-end-of-a-word (in this case "s") that is not inside HTML tags:

Find:

Code:

(?<=>[^<]*)s\B(?=[^>]*<)

Replace: (you guessed this one already, right?)

Code:

ſ

Explanation:

Just check for a tag closing character ">", followed by zero or more characters-that-aren't-a-tag-opener-"<"... wrapped in a lookbehind, so you don't clutter up the actual match.
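Followed by the literal character you are looking for -- in this case "s" -- followed by a negated word boundary zero-length assertion "\B". Since "s" is itself a word character, "\B" only succeeds when the next character is also a word character, i.e. when the "s" is not the last letter of the word.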
Followed by zero or more characters-that-aren't-a-tag-closer-">" followed by a tag opener "<"... again wrapped in a lookahead, so you don't clutter up the actual match.
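
If you ever want to try the same idea outside the S&R box, here is a minimal Python sketch. Note that it is not the Find pattern above: Python's built-in "re" module only accepts fixed-width lookbehind, so the variable-length "(?<=>[^<]*)" will not compile there. Instead, this sketch splits the markup on tags and applies the "s\B" part only to the text in between. The function name long_ess and the sample line are made up for illustration. One difference to be aware of: it also converts text that sits before the first tag or after the last one, which the lookaround version would skip.

```python
import re

# Matches a whole tag such as <p class="x"> or </span>. The capture group
# makes re.split() keep the tags in the resulting list.
TAG = re.compile(r"(<[^>]*>)")

# A lowercase "s" that is not the last letter of a word: "s" is a word
# character, so \B after it only succeeds if another word character follows.
NON_FINAL_S = re.compile(r"s\B")


def long_ess(html: str) -> str:
    """Replace non-final lowercase "s" with "ſ" in text, leaving tags alone."""
    parts = TAG.split(html)
    # With a single capture group, even indices hold text and odd indices
    # hold the tags themselves, so only the even ones are rewritten.
    for i in range(0, len(parts), 2):
        parts[i] = NON_FINAL_S.sub("ſ", parts[i])
    return "".join(parts)


sample = '<p class="css">Mistress is so sad, she says.</p>'
print(long_ess(sample))
# → <p class="css">Miſtreſs is ſo ſad, ſhe ſays.</p>
```

The pattern is compiled without re.IGNORECASE, so upper-case "S" is left as it is, and "class" and "css" inside the tag come through untouched.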