#1 |
Banned
Posts: 475
Karma: 796
Join Date: Sep 2008
Location: Honolulu
Device: Nokia 770 (fbreader)
|
In a couple of other threads, there was talk about starting a thread for useful regex. Stuff we use all the time, or stuff to solve a difficult, but rare, problem.
Maybe we'd like to do that here. If you don't know what regex is, it stands for "regular expressions". Perhaps someone below could offer a good explanation, and I'll edit this first post, replacing this paragraph.
I'll suggest the following format for submissions:
========================================
Here's one I use:
Submitted Regex:
Last edited by rogue_ronin; 05-29-2009 at 08:46 PM.
#2 |
Grand Sorcerer
Posts: 9,707
Karma: 32763414
Join Date: Dec 2008
Location: Krewerd
Device: Pocketbook Inkpad 4 Color; Samsung Galaxy Tab S6
|
May I suggest you also note what tool you use? Not all tools use the same regex engine (I found out the hard way!)
|
#3 |
Guru
Posts: 610
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-T3, Kobo Libra H2O
|
Description: Replace apostrophes with curly quotes where appropriate.
Example-before: <p>'Wha' d'you mean,' he said, 'what's this regexp thing?'</p>
Example-after: <p>‘Wha' d'you mean,’ he said, ‘what's this regexp thing?’</p>
Requirements: This regular expression expects HTML and proper punctuation. The HTML requirement can be removed by rewriting the regexp, but proper punctuation is required: it is used to determine what is an apostrophe and what is a quote disguised as an apostrophe.
Faults:
- With improper punctuation, anything can happen.
- Even with proper punctuation, some false positives can occur; specifically, if a word starting with an apostrophe (e.g. 'tis) precedes actual apostrophe-quotes, the quotes are started at that word.
- Will fail if an apostrophe immediately follows a non-paragraph tag. The regexp could be modified to work even then, but it would be a lot more difficult to read.
Regexp-find:
Code:
([>_])'(.*?[^a-z_])'([<_])
Regexp-find-translation (_ stands for a space):
- space or end-of-tag. For plain text, you can use (^|_) (start-of-line or space), but then you will need to modify the replacement string
- apostrophe
- any character string, un-greedy (take as few as possible while maintaining match)
- any character except letters and space
- apostrophe
- space or begin-of-tag. For plain text, you can use ($|_) (end-of-line or space)
Regexp-replace:
Code:
$1‘$2’$3
Regexp-replace-translation:
- first parenthesis (character just preceding the quote)
- opening quote
- second parenthesis (content of the quote)
- closing quote
- third parenthesis (character following the quote)
Regexp-modifiers: case-insensitive, single-line, un-greedy
Regexp-syntax: FAR Manager's "Regular Expression Search and Replace" plugin. PHP's ereg/eregi needs to use \\1 instead of $1 in replacement string. PHP's preg needs "header" and "footer" ("~" regexp search "~igU")
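For anyone who wants to try the find/replace pair above outside an editor, here is a minimal sketch in Python. Python's `re` module is my assumption and is not a tool anyone in the thread uses. The name `curl_quotes` is made up for illustration, the apostrophes in the pattern are plain straight ones, a literal space stands in for `_`, and `\1` replaces `$1` because that is how Python writes group references.

```python
import re

# Hypothetical helper, not part of any tool mentioned in the thread.
# IGNORECASE makes [^a-z ] exclude upper-case letters too;
# DOTALL is Python's name for the "single-line" modifier.
QUOTE_PAIR = re.compile(r"([> ])'(.*?[^a-z ])'([< ])", re.IGNORECASE | re.DOTALL)


def curl_quotes(html: str) -> str:
    """Replace apostrophe-style quote pairs with curly quotes."""
    # \1 = character before the quote, \2 = quoted text, \3 = character after it
    return QUOTE_PAIR.sub(r"\1‘\2’\3", html)


before = "<p>'Wha' d'you mean,' he said, 'what's this regexp thing?'</p>"
after = curl_quotes(before)
print(after)
```

The lazy `.*?` already does the "take as few as possible" part here, so no separate un-greedy modifier is set.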
#4 | |
Banned
Posts: 475
Karma: 796
Join Date: Sep 2008
Location: Honolulu
Device: Nokia 770 (fbreader)
|
Quote:
I use NoteTab, which uses PCRE (Perl Compatible Regular Expressions). I know near-to-nothing about vim, and I expect to keep it that way!
m a r
|
#5 |
your neighbor
Posts: 20
Karma: 10
Join Date: Sep 2006
Device: PRS-500 lost (if you found it, I hope you enjoy it); DX on preorder
|
Well there is MS regex engine, which is totally different. MS Word has a quasi-regex, which is nothing like regular
#6 |
Resident Curmudgeon
Posts: 79,158
Karma: 144284184
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Will any of these regular expressions work in Notepad++?
|
#7 |
Banned
Posts: 475
Karma: 796
Join Date: Sep 2008
Location: Honolulu
Device: Nokia 770 (fbreader)
|
A quick Google search suggests that Notepad++ looks compatible. The first difference that I noticed, though, was that in the Replace codes, you should use \ instead of $ (so \1 rather than $1).
I think it's an older version of PCRE. Work on a copy, check your results, and read that page I linked to. Basic regex is not terribly hard, once you get the concept. Or switch to NoteTab.
m a r
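The \ versus $ difference in replacement strings shows up in other engines too. As a small illustration, assuming Python's `re` module (my choice, not one of the editors discussed here), group references are written `\1`, and `$1` is inserted as literal text:

```python
import re

text = "Chapter 12"

# Python writes group references in the replacement as \1, \2 (or \g<1>).
swapped = re.sub(r"(\w+) (\d+)", r"\2. \1", text)

# "$1" has no special meaning in Python's replacement strings,
# so it ends up in the output unchanged.
literal = re.sub(r"(\w+) (\d+)", "$2. $1", text)

print(swapped)
print(literal)
```

So a replacement copied from a `$1`-style tool has to be translated before it will work in a `\1`-style one, and vice versa.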
#8 | |
Banned
Posts: 475
Karma: 796
Join Date: Sep 2008
Location: Honolulu
Device: Nokia 770 (fbreader)
|
Quote:
And I hope that the "translation" will help folks figure out how to transform it into whatever their regex grammar happens to be.
m a r
|
#9 |
Grand Sorcerer
Posts: 9,707
Karma: 32763414
Join Date: Dec 2008
Location: Krewerd
Device: Pocketbook Inkpad 4 Color; Samsung Galaxy Tab S6
|
|
#10 |
Addict
Posts: 206
Karma: 547516
Join Date: Mar 2008
Location: Berlin, Germany
Device: KObo Clara, Kobo Aura, PRS-T1, PB602, CyBook Gen3
|
There are two problems with the regex feature in Notepad++:
1) Regular expressions can only search each line separately; there is no multi-line search. --> This is no problem for pepak's regex if you remove hard line breaks inside the paragraphs.
2) The non-greedy operator is missing. --> This breaks pepak's regex. Without the non-greedy operator (aka ?) the expression matches from the first quotation mark all the way to the last one. The given example would not be converted correctly.
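To make the greedy versus non-greedy point concrete, here is a small sketch in Python (my choice of language; the sample line is invented for illustration). The greedy `.*` swallows everything between the first and the last apostrophe, while the lazy `.*?` stops at the nearest closing one:

```python
import re

line = "<p>'One,' he said, 'two.'</p>"

# Greedy: .* grabs as much as it can, so the match runs from the
# first apostrophe to the last one on the line.
greedy = re.findall(r"'(.*)'", line)

# Non-greedy: .*? grabs as little as it can, so each quoted
# passage is matched separately.
lazy = re.findall(r"'(.*?)'", line)

print(greedy)
print(lazy)
```

An engine that lacks the `?` modifier only offers the first behaviour, which is why a pattern built around `.*?` cannot be carried over unchanged.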
#11 | |
your neighbor
Posts: 20
Karma: 10
Join Date: Sep 2006
Device: PRS-500 lost (if you found it, I hope you enjoy it); DX on preorder
|
Quote:
I was trying to think of some generalized regex that I can share, but it appears that all my strings are so specific to the problem at hand that I cannot think of anything with wide application.
As an example, I have a text with both English and Ukrainian text and a complex structure. If it's an OCR, I may save it first to Word (it can search for formatting such as color, font size, italics, etc.). I search for symbols that are used in tags (e.g., <>) and replace them with (\<, \>), then I put in formatting tags in Word (italics, bold, fonts, color, if necessary). An alternative would be to save the OCR into HTML, but I have found that the HTML conversion often creates such a mess of text and tagging I don't need that I prefer to do it as described above.
Then I open the text in a text editor (EmEditor in my case, it's the best out there). Typically, I analyze the text before doing anything else, looking for patterns. I start with some simple replacements that make the pattern more uniform. Even if there are tags or lines, etc. that will not ultimately be necessary, I try to keep them for now to see if they reveal something about the pattern that I can later use in my regexes.
Given that I work with two languages, English and Ukrainian, there are lots of OCR mistakes mixing Latin and Cyrillic, so I use ranges such as [a-zàâçéèêëîïôûùüÿœæ] and [а-яґєії] to separate the two.
At each step, I try not to make an irreversible mistake. For this reason, every once in a while I make a new version of the document and keep the old one as a backup, to be able to revert to it if I do screw something up. It is easy to screw up when you have several hundred thousand lines.
Last edited by kurochka; 05-20-2009 at 05:50 PM.
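One way the two character ranges could be put to work is to flag words that contain letters from both alphabets, which is a typical sign of an OCR mix-up. This is only a sketch under my own assumptions: Python is my choice, the function name `find_mixed_words` is hypothetical, and the sample words are invented. The ranges themselves are the ones quoted in the post.

```python
import re

# Character ranges quoted in the post above.
LATIN = "a-zàâçéèêëîïôûùüÿœæ"
CYRILLIC = "а-яґєії"

# A word is "mixed" if it contains at least one letter from each range.
# The two lookaheads check for each alphabet without consuming text.
MIXED_WORD = re.compile(
    rf"\b(?=\w*[{LATIN}])(?=\w*[{CYRILLIC}])\w+\b",
    re.IGNORECASE,
)


def find_mixed_words(text: str) -> list[str]:
    """Return words that mix Latin and Cyrillic letters (hypothetical helper)."""
    return MIXED_WORD.findall(text)


# Escapes are used so the look-alike letters are unambiguous:
# word 1 is fully Cyrillic, word 2 starts with a Latin "K",
# word 4 hides a Cyrillic "а" (U+0430) inside a Latin word.
sample = "\u041a\u0438\u0457\u0432 K\u0438\u0457\u0432 capital c\u0430pital"
mixed = find_mixed_words(sample)
print(mixed)
```

The list of hits can then be reviewed by hand before any replacement is made, which fits the "don't make an irreversible mistake" approach described above.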
|
#12 |
Banned
Posts: 475
Karma: 796
Join Date: Sep 2008
Location: Honolulu
Device: Nokia 770 (fbreader)
|
Well, one of the first things I do is to convert multi-line documents to single-line paragraphs. Then most of your problems go away. That's why the Buddha invented word-wrap.
I usually do it with a macro, because my text editor has a line-join function. But that shouldn't be too hard to code as regex, let's see...
I'll assume a text file, and I'll assume two or more carriage returns mark the end of a paragraph. If you don't have that, just replace the carriage returns with whatever marks the end of your paragraphs. (If you don't have anything -- no tab, or space, nothing -- there's very little you can do, you're working with a text blob.)
Thinking about it, it's probably safer/easier/smarter to do it in three runs:
==============================
Format: PCRE
What it does: Joins line-broken paragraphs.
Best used on: Text.
Run #1
Find: \n\n+
Replace: |PARAGRAPH|
Run #2
Find: \n
Replace: \s
Run #3
Find: \|PARAGRAPH\|
Replace: \n\n
Translation:
Run #1 finds all sequences of multiple hard returns (2 or more) and replaces them with the marker "|PARAGRAPH|".
Run #2 finds all remaining returns and replaces them with a space. You need a space, as otherwise you will merge two words together. If you have multiple spaces, that is easily corrected -- spell-checking several hundred mis-combined words is not.
Run #3 finds all the "|PARAGRAPH|" markers and replaces them with two hard returns. The pipes are escaped in the Find because an unescaped | means "or" in a regex.
Variants/Comments:
- On some systems, like mine, a line break is \r\n (carriage return, line feed) rather than \n.
- If you have 3 or 4 hard returns marking chapters or sections, etc., like with Gutenberg texts, mark those first, as the 3 or 4 returns will be collapsed into a paragraph marker.
- If you want to do this with HTML, and your paragraphs are already marked with </p>, just do Run #2 on selected text, and avoid both #1 and #3. (Be careful, #2 will replace every hard return with a space -- if you have HTML with indenting and returns, and change the whole document, it will ruin it. Well, it will work, but be unreadable to edit.)
==================================
@kurochka: you must have some things that you do repeatedly. Some things that you can generalize.
More later, like generic chapter headings.
m a r
Last edited by rogue_ronin; 05-21-2009 at 12:45 AM. Reason: Caution about destroying HTML formatting...
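The three runs for joining line-broken paragraphs can also be sketched as a short script. This is a rough Python version under my own assumptions: Python is not a tool from the thread, the function name `join_paragraph_lines` is made up, and the marker text is assumed not to occur anywhere in the document.

```python
import re

# Placeholder from the post; assumed not to appear in the real text.
MARKER = "|PARAGRAPH|"


def join_paragraph_lines(text: str) -> str:
    """Join hard-wrapped lines into one line per paragraph (hypothetical helper)."""
    # Treat \r\n line endings the same as \n.
    text = text.replace("\r\n", "\n")
    # Run 1: two or more hard returns become a paragraph marker.
    text = re.sub(r"\n\n+", MARKER, text)
    # Run 2: every remaining hard return becomes a space.
    text = text.replace("\n", " ")
    # Run 3: markers go back to two hard returns.
    # str.replace treats the marker literally, so the pipes need no escaping here.
    return text.replace(MARKER, "\n\n")


sample = "First line\nsecond line.\n\n\nNext paragraph\nwraps here."
joined = join_paragraph_lines(sample)
print(joined)
```

As with the editor version, runs of three or four returns collapse into an ordinary paragraph break, so chapter or section breaks would need to be marked before running it.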
#13 | ||
Guru
Posts: 610
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-T3, Kobo Libra H2O
|
Quote:
Quote:
|
||
#14 |
Banned
Posts: 475
Karma: 796
Join Date: Sep 2008
Location: Honolulu
Device: Nokia 770 (fbreader)
|
========================================
Last edited by rogue_ronin; 05-21-2009 at 12:34 AM. Reason: Forgot to let you have paragraphs with attributes... |
#15 |
Guru
Posts: 610
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-T3, Kobo Libra H2O
|
A suggestion:
Maintain a list of regexps (with links to their posts) in the first post. That way, when someone opens the thread, he/she can quickly find the needed regexp. |