Tyrannosaurus Regex

rogue_ronin · 05-20-2009, 06:04 AM

In a couple of other threads, there was talk about starting a thread for useful regex. Stuff we use all the time, or stuff to solve a difficult, but rare, problem.

Maybe we'd like to do that here.

If you don't know what regex is, it stands for "regular expressions". Perhaps someone below could offer a good explanation, and I'll edit this first post, replacing this paragraph.

I'll suggest the following format for submissions:
========================================

What it does: description of changes to text
Best used on: Text, HTML, Word, etc.
Regex Find:
Code:
```
Search regex code here.
```
Find Translation: Explain what the Find code stands for.
Regex Replace:
Code:
```
Substitution regex code here.
```
Replace Translation: Explain what the Replace code stands for.

Variants/Comments:

Code:

Similar regex for similar problems, or notes/warnings.

========================================

Here's one I use:

What it does: Finds right-single-quote entities (&rsquo;, rendered as ’) in contractions and replaces them with the &apos; entity name.
Best used on: HTML
Regex Find:
Code:
```
([a-zI])\&rsquo\;([a-z]+)
```
Find Translation: Find any one lower case letter or capital "I", followed by the string "&rsquo;", followed by at least one lower case letter.
Regex Replace:
Code:
```
$1&apos;$2
```
Replace Translation: Put that first lower case letter or "I" back, put the string "&apos;" in, then put back whatever lower case letters followed the original "&rsquo;".
Variants/Comments: For text files, you can use the actual characters instead of the entity names. This works the other way, too; swap the entity names, and you can put a rsquo where an apos (or a literal apostrophe) was.
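For readers without a PCRE editor, here is a minimal Python sketch of the same substitution. It assumes Python's standard `re` module, where group references in the replacement are written `\1` and `\2` rather than `$1` and `$2`. The function name `rsquo_to_apos` and the sample text are made up for illustration.

```python
import re

# Same pattern as the Find above; "&" and ";" need no escaping in Python.
CONTRACTION = re.compile(r"([a-zI])&rsquo;([a-z]+)")

def rsquo_to_apos(html: str) -> str:
    # \1 and \2 are Python's spelling of $1 and $2.
    return CONTRACTION.sub(r"\1&apos;\2", html)

sample = "<p>I&rsquo;m sure it isn&rsquo;t. &lsquo;Quoted&rsquo; text stays.</p>"
result = rsquo_to_apos(sample)
print(result)
```

The closing quote after "Quoted" is left alone because it is followed by a space, not a lower case letter.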

============================

Submitted Regex:

Swap apostrophes in contractions: Post #1
Change quotes using apostrophes to curly quotes: Post #3
Un-break hard-returns in a paragraph: Post #12
Format simple chapter headings: Post #14
Calibre: Import book metadata from filename: Post #17

Sweetpea · 05-20-2009, 07:28 AM

May I suggest you also note what tool you use? Not all tools use the same regex engine (I found out the hard way!)

pepak · 05-20-2009, 09:15 AM

Description: Replace apostrophes with curly quotes where appropriate.
Example-before: <p>'Wha' d'you mean,' he said, 'what's this regexp thing?'</p>
Example-after: <p>&lsquo;Wha' d'you mean,&rsquo; he said, &lsquo;what's this regexp thing?&rsquo;</p>
Requirements: This regular expression expects HTML and proper punctuation. The HTML requirement can be removed by a proper rewrite, so that the regexp works on plain text as well, but proper punctuation is required - it is used to determine what's an apostrophe and what is a quote disguised as an apostrophe.
Faults:
- With improper punctuation, anything can happen.
- Even with proper punctuation, some false-positives can occur; specifically, if a word starting with an apostrophe (e.g. 'tis) precedes actual apostrophe-quotes, the quotes are started at that word.
- Will fail if apostrophe immediately follows a non-paragraph tag. The regexp could be modified to work even then, but it would be a lot more difficult to read.
Regexp-find:

Code:

([>_])'(.*?[^a-z_])'([<_])

(note: use a space instead of underscore. If your editor supports that syntax, you can use \s [any blank character] instead of _)
Regexp-find-translation:
- space or end-of-tag. For plain text, you can use (^|_) (start-of-line or space), but then you will need to modify the replacement string
- apostrophe
- any character string, un-greedy (take as few as possible while maintaining match)
- any character except letters and space
- apostrophe
- space or begin-of-tag. For plaintext, you can use ($|_) (end-of-line or space)
Regexp-replace:

Code:

$1&lsquo;$2&rsquo;$3

Regexp-replace-translation:
- first parenthesis (character just preceding the quote)
- opening quote
- second parenthesis (content of the quote)
- closing quote
- third parenthesis (character following the quote)
Regexp-modifiers: case-insensitive, single-line, un-greedy
Regexp-syntax: FAR Manager's "Regular Expression Search and Replace" plugin. PHP's ereg/eregi needs to use \\1 instead of $1 in replacement string. PHP's preg needs "header" and "footer" ("~" regexp search "~igU")
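As a rough check of the pattern above, here is a Python sketch (standard `re` module assumed). It uses `\s` in place of the underscore placeholder, `re.I` for case-insensitive and `re.S` for single-line mode. No separate un-greedy flag is set, since `.*?` is already un-greedy in this syntax. The function name `curl_quotes` is hypothetical.

```python
import re

# pepak's pattern with \s substituted for the underscore placeholder.
# re.I = case-insensitive, re.S = "single-line" (dot also matches newline).
QUOTED = re.compile(r"([>\s])'(.*?[^a-z\s])'([<\s])", re.I | re.S)

def curl_quotes(html: str) -> str:
    # \1, \2, \3 are Python's spelling of $1, $2, $3.
    return QUOTED.sub(r"\1&lsquo;\2&rsquo;\3", html)

before = "<p>'Wha' d'you mean,' he said, 'what's this regexp thing?'</p>"
after = curl_quotes(before)
print(after)
```

The apostrophes in "Wha'", "d'you" and "what's" survive because each is either preceded by a letter or not followed by a space or tag.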

rogue_ronin · 05-20-2009, 09:42 AM

Quote:

Originally Posted by Sweetpea

May I suggest you also note what tool you use? Not all tools use the same regex engine (I found out the hard way!)

It's primarily perl (pcre) regex vs vim, right? Mine above is perl regex, and I believe pepak's are as well.

I use NoteTab, which uses pcre (perl compatible regular expressions.) I know near-to-nothing about vim, and I expect to keep it that way!

m a r

kurochka · 05-20-2009, 09:53 AM

Well there is MS regex engine, which is totally different. MS Word has a quasi-regex, which is nothing like regular regex. I use emeditor as my primary text editor, which uses something akin to perl engine I believe.

JSWolf · 05-20-2009, 10:30 AM

Will any of these regex expressions work in Notepad++?

rogue_ronin · 05-20-2009, 11:04 AM

A quick google search suggests that notepad++ looks compatible. The first difference that I noticed, though, was that in the Replace codes, you should use \ instead of $.

I think it's an older version of pcre.
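Since the `$1` versus `\1` difference comes up whenever a Replace string moves between tools, here is a small Python sketch of the conversion. Python's `re` module also uses the backslash form. The helper name `dollar_to_backslash` is made up, and it only handles single-digit group numbers.

```python
import re

def dollar_to_backslash(replacement: str) -> str:
    # Rewrite "$1"-style group references in the backslash style
    # (single-digit group numbers only).
    return re.sub(r"\$(\d)", r"\\\1", replacement)

find = r"([a-zI])&rsquo;([a-z]+)"
replace = dollar_to_backslash("$1&apos;$2")

print(replace)
print(re.sub(find, replace, "don&rsquo;t"))
```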

Work on a copy, check your results, and read that page I linked to. Basic regex is not terribly hard, once you get the concept.

Or switch to NoteTab.

(Dude, I like it so much I keep a virtual machine of Win2K just to run it...)

m a r

rogue_ronin · 05-20-2009, 11:08 AM

Quote:

Originally Posted by kurochka

Well there is MS regex engine, which is totally different. MS Word has a quasi-regex, which is nothing like regular regex. I use emeditor as my primary text editor, which uses something akin to perl engine I believe.

That's why I suggested where the regex is effective...HTML, text, Word (but maybe I should have been more explicit -- ie: in NoteTab, in Word, etc.)

And I hope that the "translation" will help with folks figuring out how to transform it into whatever their regex grammar happens to be.

m a r

Sweetpea · 05-20-2009, 12:56 PM

Quote:

Originally Posted by JSWolf

Will any of these regex expressions work in Notepad++?

I tried pepak's one in Notepad++. Didn't work. I finally got it to work with UltraEdit.

Sunlite · 05-20-2009, 02:46 PM

There are two problems with the regex feature in Notepad++:

1) reg expression can only search each line separately. There is no search for multi line.
--> This is no problem for the regex from pepak if you remove hard line breaks inside the paragraphs.

2) the non-greedy operator is missing.
--> This breaks pepak's regex. Without the non-greedy operator (aka ?) the expression finds only the first and last quotation mark. The given example would not be converted correctly.
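The greedy versus non-greedy difference described above can be seen in a short Python sketch (standard `re` module assumed; the sample sentence is made up):

```python
import re

text = "<p>'One,' he said, 'two.'</p>"

# Greedy: .* runs from the first apostrophe to the last one on the line.
greedy = re.findall(r"'(.*)'", text)

# Non-greedy: .*? stops at the nearest closing apostrophe.
lazy = re.findall(r"'(.*?)'", text)

print(greedy)
print(lazy)
```

The greedy version returns one long match that swallows "he said," while the non-greedy version returns the two quoted pieces separately.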

kurochka · 05-20-2009, 05:23 PM

Quote:

Originally Posted by Sunlite

There are two problems with the regex feature in Notepad++:

1) reg expression can only search each line separately. There is no search for multi line.
--> This is no problem for the regex from pepak if you remove hard line breaks inside the paragraphs.

O-o-oh! Multiline regex is a genie that is hard to keep in the bottle. I presume you are a pro and it's no problem for you, but for newbies in this thread I would like to put out a caution. Try to avoid (in most cases it is possible) multiline regex (where . or another wildcard can represent any other character including a new line) because it can bite and you will notice only when it is too late. I cannot even remember the last time I turned on or needed multiline regex. Even if you must use it, try to limit the number of lines searched to the minimum necessary, and make an audit after each such replacement to make sure that there are no undesired consequences. Alternatively, do not use replace all, but go through the text one search at a time.
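To illustrate the caution, here is a Python sketch (standard `re` module assumed). Python calls "dot matches newline" `re.DOTALL`; its `re.MULTILINE` flag is something else and only affects `^` and `$`. The sample text is made up.

```python
import re

text = "<i>first</i> line\nsecond <i>line</i>\n"

# Default: "." stops at a newline, so each match stays on its own line.
per_line = re.findall(r"<i>.*</i>", text)

# With re.DOTALL the same greedy pattern runs across the line break
# and swallows everything between the first <i> and the last </i>.
across_lines = re.findall(r"<i>.*</i>", text, flags=re.DOTALL)

print(per_line)
print(across_lines)
```

A replace-all with the second form would rewrite the plain text between the two italic runs as well, which is the kind of damage that is easy to miss.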

I was trying to think of some generalized regex that I can share, but it appears that all my strings are so specific to the problem at hand that I cannot think of anything with wide application. As an example, I have a text with both English and Ukrainian text and a complex structure. If it's an OCR, I may save it first to Word (it can search for formatting such as color, font size, italics, etc.). I search for symbols that are used in tags (e.g., <>) and replace them with (\<, \>), then I put in formatting tags in Word (italics, bold, fonts, color, if necessary). An alternative would be to save the OCR as HTML, but I have found that the HTML conversion often creates such a mess of text and tagging I don't need that I prefer to do it as described above.

Then I open the text in a text editor (emeditor in my case, it's the best out there). Typically, I analyze text before doing anything else, looking for patterns. I start with some simple replacements that would make the pattern more uniform. Even if there are tags or lines, etc. that will not be ultimately necessary, I try to keep them for now to see if they reveal something about the pattern that I can later use in my regexes. Given that I work with two languages English and Ukrainian, there are lots of OCR mistakes mixing Latin and Cyrillic so I use ranges such as [a-zàâçéèêëîïôûùüÿœæ] and [а-яґєії] to separate the two. At each step, I try not to make an irreversible mistake. For this reason, every once in a while I make a new version of the document and keep the old as a backup to be able to revert to it if I do screw up with something. It is easy to screw up when you have several hundred thousand lines.
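One possible use of the two ranges above is to flag words where a Latin letter sits directly next to a Cyrillic one, a typical OCR mix-up. This is only a Python sketch of that idea, not kurochka's actual workflow. The names `MIXED` and `mixed_script_words` are hypothetical, and the look-alike letters in the sample are written as escapes so they can be told apart.

```python
import re

LATIN = "a-zàâçéèêëîïôûùüÿœæ"   # range from the post
CYRILLIC = "а-яґєії"            # range from the post

# A word containing a Latin letter adjacent to a Cyrillic one (either order).
MIXED = re.compile(
    rf"\w*(?:[{LATIN}][{CYRILLIC}]|[{CYRILLIC}][{LATIN}])\w*",
    re.IGNORECASE,
)

def mixed_script_words(text: str) -> list[str]:
    return MIXED.findall(text)

# "K" + Cyrillic и, ї, в; and Cyrillic с + "hapter".
sample = "Kyiv is fine, but K\u0438\u0457\u0432 and \u0441hapter are suspect."
suspects = mixed_script_words(sample)
print(suspects)
```

Words that are entirely Latin or entirely Cyrillic are not reported. Deciding which script a flagged word should be in still needs a human eye.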

rogue_ronin · 05-20-2009, 11:10 PM

Well, one of the first things I do is to convert multi-line documents to single-line paragraphs. Then most of your problems go away. That's why the Buddha invented word-wrap.

I usually do it with a macro, because my text editor has a line-join function. But that shouldn't be too hard to code as regex, let's see...

I'll assume a text file, and I'll assume two or more carriage returns mark the end of a paragraph. If you don't have that, just replace the carriage returns with whatever marks the end of your paragraphs. (If you don't have anything -- no tab, or space, nothing -- there's very little you can do, you're working with a text blob.)

Thinking about it, it's probably safer/easier/smarter to do it in three runs:

==============================
Format: PCRE
What it does: Joins line broken paragraphs.
Best used on: Text.

Run #1
Find: \n\n+
Replace: |PARAGRAPH|

Run #2
Find: \n
Replace: \s

Run #3
Find: \|PARAGRAPH\|   (the bars are escaped here because an unescaped | means "or" in PCRE)
Replace: \n\n

Translation: Run #1 finds all sequences of multiple hard returns (2 or more) and replaces them with the marker "|PARAGRAPH|"

Run #2 finds all remaining returns and replaces them with a space. You need a space, as otherwise you will merge two words together. If you have multiple spaces, that is easily corrected -- spell-checking several hundred mis-combined words is not.

Run #3 finds all the "|PARAGRAPH|" markers and replaces them with two hard returns.

Variants/Comments: On some systems (like mine) a hard return is \r\n (carriage return, line feed) rather than \n. If you have 3 or 4 hard returns marking chapters or sections, etc., like with Gutenberg texts, mark those first, as the 3 or 4 returns will be collapsed into a paragraph marker. If you want to do this with HTML, and your paragraphs are already marked with </p>, just do Run #2 on selected text, and avoid both #1 and #3. (Be careful, #2 will replace every hard-return with a space -- if you have HTML with indenting and returns, and change the whole document, it will ruin it. Well, it will work, but be unreadable to edit.)
==================================
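The three runs above can be sketched in Python as follows (standard `re` module assumed). Run #2 uses a literal space as the replacement, and the `\r\n` normalisation is an extra step for Windows-style files. The function name `join_paragraph_lines` is made up for illustration.

```python
import re

MARKER = "|PARAGRAPH|"  # placeholder from the post; must not occur in the text

def join_paragraph_lines(text: str) -> str:
    text = text.replace("\r\n", "\n")      # normalise Windows line endings
    text = re.sub(r"\n\n+", MARKER, text)  # Run #1: 2+ returns -> marker
    text = text.replace("\n", " ")         # Run #2: remaining returns -> space
    return text.replace(MARKER, "\n\n")    # Run #3: marker -> two returns

sample = "The quick brown\nfox jumps.\n\n\nA second\nparagraph here."
joined = join_paragraph_lines(sample)
print(joined)
```

Runs #2 and #3 use plain string replacement, so the bars in the marker need no escaping there. A single trailing return at the end of a file would become a trailing space.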

@kurochka: you must have some things that you do repeatedly. Some things that you can generalize.

More later, like generic chapter headings.

m a r

pepak · 05-21-2009, 12:26 AM

Quote:

Originally Posted by rogue_ronin

Well, one of the first things I do is to convert multi-line documents to single-line paragraphs. Then most of your problems go away. That's why the Buddha invented word-wrap.

Indeed.

Quote:

I usually do it with a macro, because my text editor has a line-join function. But that shouldn't be too hard to code as regex, let's see...

It is not. It is exactly the case where you need multi-line regexps :-)

rogue_ronin · 05-21-2009, 12:29 AM

========================================

What it does: Finds generic chapter headings, and converts them to your chosen format.
Before-Text:

Code:

<h4 id="Chapter 01">Chapter 01</h4>
<p>  chaPtertWo  </p>
Chapter III

After-Text:

Code:

<h3 class="chapter">Chapter 01</h3>

<h3 class="chapter">Chapter tWo</h3>

Chapter III

Best used on: HTML

Regex Find:

Code:

^\<(p.*|h\d.*)\>\s*[Cc][Hh][Aa][Pp][Tt][Ee][Rr]\s*(.*)\</(p|h\d)\>\s*

Find Translation: Find any single line that starts with a paragraph or header, followed immediately by any number of spaces or tabs and the word "chapter" in upper or lower (or mixed) case, and may have some spaces and characters following it, then a close of the paragraph/header, and any number of spaces, tabs or hard-returns.
Regex Replace:
Code:
```
<h3 class="chapter">Chapter $2</h3>\n\n
```
Replace Translation: Whatever follows the word "cHapTer" (and any number of spaces) in the original line is placed into an <h3 class="chapter"> header, and preceded with the word "Chapter".
Variants/Comments: Change the Replace header/wrap to suit yourself (the $2 is whatever the name of the chapter is.) Leave the \n\n in your Replace; the \s* grabs the hard returns. If you want to use this with plain text lines, remove \<(p.*|h\d.*)\> and \</(p|h\d)\>, and change $2 to $1 in the Replace, because the chapter name then becomes the first group. Changing (.*) to (.*?) (un-greedy) causes selection to miss the closing tag on my machine. Anyone know why?

========================================
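Here is a Python sketch of the chapter-heading replacement (standard `re` module assumed). It is an adaptation, not the exact pattern above: `re.IGNORECASE` stands in for `[Cc][Hh]...`, `[^>]*` stands in for `.*` inside the opening tag, and an un-greedy `(.*?)\s*` trims trailing spaces from the chapter name. `re.MULTILINE` lets `^` match at the start of every line. The name `format_chapters` is hypothetical.

```python
import re

CHAPTER = re.compile(
    r"^<(p[^>]*|h\d[^>]*)>\s*chapter\s*(.*?)\s*</(p|h\d)>\s*",
    re.IGNORECASE | re.MULTILINE,
)

def format_chapters(html: str) -> str:
    # \2 is the chapter name, matching $2 in the Replace above.
    return CHAPTER.sub(r'<h3 class="chapter">Chapter \2</h3>\n\n', html)

before = (
    '<h4 id="Chapter 01">Chapter 01</h4>\n'
    "<p>  chaPtertWo  </p>\n"
    "Chapter III\n"
)
after = format_chapters(before)
print(after)
```

The bare "Chapter III" line is untouched because it does not start with a tag, as in the Before/After example.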

pepak · 05-21-2009, 01:05 AM

A suggestion:

Maintain a list of regexps (with links to their posts) in the first post. That way, when someone opens the thread, he/she can quickly find the needed regexp.


