MobileRead Forums - View Single Post

Psymon · 07-14-2016, 05:21 AM

Hope it's okay for a veritable Regex newbie to post a query in this thread -- I'm only just beginning to learn about this stuff, but with any luck it'll eventually start sinking in.

I seem to have developed an affinity for doing up electronic versions of "ye olde bookes" -- for example, right now I'm doing up several Shakespeare plays in the original Elizabethan English, endeavouring to give them somewhat of the "look and feel" of early typographic styles, complete with use of the long-ess (i.e. "ſ", the character that looks like an "f" but without the crossbar, and is actually an "s"). As with the unusual use of the "u" and "v" characters in early typography, where an "ſ" is used instead of "s" has to do with placement within a word, rather than the "sound" of the character or anything else like that.

Very often when I find digital transcriptions of these early texts, they've kept the "u" and "v" oddities, but for some reason have changed all the long-esses to just "s" instead -- and so I have to change them back. The rule for when this is supposed to occur is actually fairly simple (not all early printers/typographers followed it, but the vast majority did): virtually every instance of "s" should be changed to "ſ" unless it falls at the end of the word, in which case it remains as "s."
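The rule above can be sketched as a single substitution in Python. This is only a rough sketch under stated assumptions: "end of the word" is taken to mean "not followed by another letter," and the names `NON_FINAL_S` and `to_long_s` are made up for illustration. It works on plain text only and knows nothing about HTML.

```python
import re

# An "s" that is followed by another letter is not word-final, so it becomes "ſ".
# [^\W\d_] is a common idiom for "any Unicode letter" (a word character that is
# neither a digit nor an underscore). The lookahead (?=...) only peeks at the
# next character without consuming it, so runs like "ss" are handled correctly.
# Matching is case-sensitive by default, so upper-case "S" is never touched.
NON_FINAL_S = re.compile(r"s(?=[^\W\d_])")

def to_long_s(text: str) -> str:
    """Replace every non-final lower-case 's' with long-s (hypothetical helper)."""
    return NON_FINAL_S.sub("ſ", text)

print(to_long_s("Sisters, kiss his mistress's glass."))
```

Because an "s" before punctuation, a space, or an apostrophe has no letter after it, it is left alone, so no second "change it back" pass should be needed under this assumption.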

So to fix my texts up, I've been searching for every instance of "s" and then changing it to "ſ" -- which right away causes all my HTML code to need to be fixed up first, because things like "css," "class," "span," etc. get screwed up in the process -- and then I do another series of searches, looking for instances of "ſ" (long-ess) plus a "." or "," or ":" or ";" or "?" or "!" or ")" or "[space]" or "[apostrophe -- curly or otherwise]," plus "<" should there be a closing </i> or </p> tag or something, i.e. wherever it might occur at the end of a word, and then changing it back to "s" again.
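One possible way to keep the HTML from getting screwed up is to split the file into markup and plain text and only run the substitution on the text. The sketch below assumes well-formed tags and uses hypothetical names (`MARKUP`, `to_long_s_html`); it is not a real HTML parser.

```python
import re

# Non-final lower-case "s": an "s" followed by another letter.
NON_FINAL_S = re.compile(r"s(?=[^\W\d_])")

# Tags such as <span class="..."> and entities such as &nbsp; or &#8217;.
# The capturing group makes re.split keep these pieces in the result,
# where they land at the odd-numbered positions.
MARKUP = re.compile(r"(<[^>]*>|&#?\w+;)")

def to_long_s_html(html: str) -> str:
    """Apply the long-s rule to text only, leaving tags and entities alone."""
    parts = MARKUP.split(html)
    for i in range(0, len(parts), 2):  # even positions are plain text
        parts[i] = NON_FINAL_S.sub("ſ", parts[i])
    return "".join(parts)

print(to_long_s_html('<p class="stage">Thus&nbsp;speaks his <i>mistress</i>.</p>'))
```

Two caveats with this approach: anything sitting between tags counts as text, so the contents of a <style> block would still be altered, and a word that is split by an inline tag (say, an italicised first letter) is treated as two separate words.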

It's not that big a deal, actually, I can "correct" the long-esses in a whole book in, like, 5 or 10 minutes or so, but it would be totally cool to just whiz it off with one, single regex search, of course.

Oh, and it would have to be case-sensitive, of course -- all instances of upper-case "S" remain as "S."

ALSO...

A similar S&R could also be done on the "u" and "v" characters, the early rules for which also had to do with placement -- although as I mentioned before, most digital transcriptions of early texts seem to have retained those. It could come in handy, though, if at some point I encounter a text that has "modernized" the typography (but not word-spelling) of something.

For those characters, lower-case "v" was used for both "u" and "v" at the start of a word, while "u" was used for both "u" and "v" elsewhere in the word -- thus, the word we spell as "uvula" (that thing that dangles at the back of your mouth/throat) would be spelled rather oddly as "vuula."

As for upper-case "U" and "V," there was only one character, "V" -- although this is very easy to change with a simple, regular S&R, of course.
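The "u"/"v" placement rule, together with the upper-case merge, could be sketched along the same lines. The assumption here is that "start of a word" means "not preceded by a letter"; the function name `to_early_uv` is invented for illustration, and like any plain-text substitution it would need to be kept away from HTML tags.

```python
import re

LETTER = r"[^\W\d_]"  # any Unicode letter

# A "v" preceded by a letter is not word-initial, so it becomes "u".
NON_INITIAL_V = re.compile(rf"(?<={LETTER})v")
# A "u" not preceded by a letter is word-initial, so it becomes "v".
INITIAL_U = re.compile(rf"(?<!{LETTER})u")

def to_early_uv(text: str) -> str:
    """Apply the early u/v placement rule (hypothetical helper)."""
    text = text.replace("U", "V")        # only one upper-case form, "V"
    text = NON_INITIAL_V.sub("u", text)  # "love" -> "loue"
    return INITIAL_U.sub("v", text)      # "unto" -> "vnto"

print(to_early_uv("uvula"))
```

With these two substitutions "uvula" comes out as "vuula," matching the example above.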

(Very often the upper-case "W" character -- and occasionally the lower-case "w," too -- would be written as "VV"/"vv," but most often not, it seems to have been essentially dependent on the font the printer had available and not based on any "rule." This is why, however, we call the "w" character "double-u," actually -- in case you ever wondered.)

Anyway, hope that's not too weird -- or, indeed, too basic -- a Regex question for me to ask here. The long-ess part of my query would certainly be really great to have a Regex expression for, though!

Thanks so much, in advance! And thanks for bearing with me here, too, of course, with my long question/explanation.

EDIT/POSTSCRIPT: I forgot about "i" and "j"! In early typography, there was only one character for both -- "i" -- although once again that's easy enough to fix up with a regular S&R, of course. The only time "j" was used was as a ligature. For example, in this Elizabethan Shakespeare text I'm working on, the word "allies" (in modern English) came up, which was spelled at that time as "alliis" -- and, hence, the "ii" became "ij" ("allijs"). If you look at how it looks, then you can see where we got the character "y" from.
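For completeness, a very rough sketch of the "i"/"j" handling as described above. It assumes that every doubled "ii" should become "ij" and that every other lower-case "j" should become "i"; upper-case letters are left alone because nothing is stated about them. The name `to_early_ij` is hypothetical, and real texts may well need exceptions.

```python
import re

# A "j" that does not directly follow an "i" is not part of the "ij" ligature.
NON_LIGATURE_J = re.compile(r"(?<!i)j")

def to_early_ij(text: str) -> str:
    """Apply the early i/j convention described above (hypothetical helper)."""
    text = text.replace("ii", "ij")      # doubled i becomes the "ij" ligature
    return NON_LIGATURE_J.sub("i", text) # every other "j" becomes "i"

print(to_early_ij("alliis"))
```

One known limitation of this sketch: a modern word that already contains "ij" keeps its "j," since it cannot be told apart from the ligature.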

Last edited by Psymon; 07-14-2016 at 05:35 AM.