Old 05-20-2009, 06:04 AM   #1
rogue_ronin
Banned
Posts: 475
Karma: 796
Join Date: Sep 2008
Location: Honolulu
Device: Nokia 770 (fbreader)
Tyrannosaurus Regex

In a couple of other threads, there was talk about starting a thread for useful regex. Stuff we use all the time, or stuff to solve a difficult, but rare, problem.

Maybe we'd like to do that here.

If you don't know what regex is, it stands for "regular expressions". Perhaps someone below could offer a good explanation, and I'll edit this first post, replacing this paragraph.

I'll suggest the following format for submissions:
========================================
  • What it does: description of changes to text
  • Best used on: Text, HTML, Word, etc.
  • Regex Find:
    Code:
    Search regex code here.
  • Find Translation: Explain what the Find code stands for.
  • Regex Replace:
    Code:
    Substitution regex code here.
  • Replace Translation: Explain what the Replace code stands for.
  • Variants/Comments:
    Code:
    Similar regex for similar problems, or notes/warnings.
========================================

Here's one I use:
  • What it does: Finds Right-Single-Quotes ( &rsquo; ) in contractions and replaces them with the &apos; entity-name.
  • Best used on: HTML
  • Regex Find:
    Code:
    ([a-zI])\&rsquo\;([a-z]+)
  • Find Translation: Find any one lower case letter or capital "I" followed by the string "&rsquo;" followed by at least one lower case letter
  • Regex Replace:
    Code:
    $1&apos;$2
  • Replace Translation: Put that first lower case letter or "I" back, put the string "&apos;" in, then put whatever one or more lower case letters followed the original "&rsquo;"
  • Variants/Comments: For text files, you can use the actual characters instead of the entity names. This works the other way, too; swap the entity names, and you can put a rsquo where an apos (or a literal apostrophe) was.
============================
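
For anyone scripting this instead of using an editor, here is a minimal Python sketch of the same find/replace, assuming the rsquo-to-apos direction described above. Python's re module writes backreferences as \1 and \2 rather than $1 and $2. The function name rsquo_to_apos and the sample sentence are made up for illustration.

```python
import re

# Same pattern as the Find code; the backslashes before & and ; are optional
# in Python, so they are left out here.
CONTRACTION = re.compile(r"([a-zI])&rsquo;([a-z]+)")


def rsquo_to_apos(html: str) -> str:
    """Replace &rsquo; with &apos; only where letters sit on both sides."""
    return CONTRACTION.sub(r"\1&apos;\2", html)


sample = "<p>I&rsquo;m sure it&rsquo;s &lsquo;fine&rsquo;, she said.</p>"
print(rsquo_to_apos(sample))
```

The closing quote after "fine" is left alone because it is followed by a comma, not a lower case letter.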

Submitted Regex:
  • Swap apostrophes in contractions: Post #1
  • Change quotes using apostrophes to curly quotes: Post #3
  • Un-break hard-returns in a paragraph: Post #12
  • Format simple chapter headings: Post #14
  • Calibre: Import book metadata from filename Post #17

Last edited by rogue_ronin; 05-29-2009 at 08:46 PM.
Old 05-20-2009, 07:28 AM   #2
Sweetpea
Grand Sorcerer
Posts: 7,934
Karma: 22621990
Join Date: Dec 2008
Location: Krewerd
Device: HTC Flyer; BBMini; Sony PRS650; Onyx Boox T68
May I suggest you also note what tool you use? Not all tools use the same regex engine (I found out the hard way!)
Old 05-20-2009, 09:15 AM   #3
pepak
Fanatic
Posts: 594
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-505
Description: Replace apostrophes with curly quotes where appropriate.
Example-before: <p>'Wha' d'you mean,' he said, 'what's this regexp thing?'</p>
Example-after: <p>&lsquo;Wha' d'you mean,&rsquo; he said, &lsquo;what's this regexp thing?&rsquo;</p>
Requirements: This regular expression expects HTML and proper punctuation. The HTML requirement can be lifted with a suitable rewrite, so that the regexp works on plain text as well, but proper punctuation is required - it is used to determine what's an apostrophe and what is a quote disguised as an apostrophe.
Faults:
- With improper punctuation, anything can happen.
- Even with proper punctuation, some false-positives can occur; specifically, if a word starting with an apostrophe (e.g. 'tis) precedes actual apostrophe-quotes, the quotes are started at that word.
- Will fail if apostrophe immediately follows a non-paragraph tag. The regexp could be modified to work even then, but it would be a lot more difficult to read.
Regexp-find:
Code:
([>_])'(.*?[^a-z_])'([<_])
(note: use a space instead of underscore. If your editor supports that syntax, you can use \s [any blank character] instead of _)
Regexp-find-translation:
- space or end-of-tag. For plain text, you can use (^|_) (start-of-line or space), but then you will need to modify the replacement string
- apostrophe
- any character string, un-greedy (take as few as possible while maintaining match)
- any character except letters and space
- apostrophe
- space or begin-of-tag. For plaintext, you can use ($|_) (end-of-line or space)
Regexp-replace:
Code:
$1&lsquo;$2&rsquo;$3
Regexp-replace-translation:
- first parenthesis (character just preceding the quote)
- opening quote
- second parenthesis (content of the quote)
- closing quote
- third parenthesis (character following the quote)
Regexp-modifiers: case-insensitive, single-line, un-greedy
Regexp-syntax: FAR Manager's "Regular Expression Search and Replace" plugin. PHP's ereg/eregi needs to use \\1 instead of $1 in replacement string. PHP's preg needs "header" and "footer" ("~" regexp search "~igU")
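
A minimal Python sketch of the same regexp, for checking it against the example above. It assumes straight apostrophes in the input, uses \s in place of the underscore placeholder, and lists upper and lower case letters explicitly instead of relying on a case-insensitive modifier. Python has no separate un-greedy modifier; the ? after .* already makes it lazy. The name curl_quotes is hypothetical.

```python
import re

# ([>\s])  character before the quote: end of a tag or whitespace
# '(...)'  the quoted text, lazy, which must end in a non-letter, non-space
# ([<\s])  character after the quote: start of a tag or whitespace
QUOTED = re.compile(r"([>\s])'(.*?[^A-Za-z\s])'([<\s])", re.DOTALL)


def curl_quotes(html: str) -> str:
    """Turn apostrophes used as quote marks into &lsquo; and &rsquo;."""
    return QUOTED.sub(r"\1&lsquo;\2&rsquo;\3", html)


before = "<p>'Wha' d'you mean,' he said, 'what's this regexp thing?'</p>"
print(curl_quotes(before))
```

The apostrophes in "Wha'", "d'you" and "what's" survive because none of them is both preceded by punctuation and followed by a space or tag.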
Old 05-20-2009, 09:42 AM   #4
rogue_ronin
Banned
Posts: 475
Karma: 796
Join Date: Sep 2008
Location: Honolulu
Device: Nokia 770 (fbreader)
Quote:
Originally Posted by Sweetpea View Post
May I suggest you also note what tool you use? Not all tools use the same regex engine (I found out the hard way!)
It's primarily perl (pcre) regex vs vim, right? Mine above is perl regex, and I believe pepak's are as well.

I use NoteTab, which uses pcre (perl compatible regular expressions.) I know near-to-nothing about vim, and I expect to keep it that way!

m a r
Old 05-20-2009, 09:53 AM   #5
kurochka
your neighbor
Posts: 20
Karma: 10
Join Date: Sep 2006
Device: PRS-500 lost (if you found it, I hope you enjoy it); DX on preorder
Well, there is also the MS regex engine, which is totally different. MS Word has a quasi-regex, which is nothing like regular regex. I use EmEditor as my primary text editor, which I believe uses something akin to the perl engine.
Old 05-20-2009, 10:30 AM   #6
JSWolf
Resident Curmudgeon
Posts: 35,949
Karma: 17041886
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Sony Reader PRS-650, iPad, nook STR
Will any of these regular expressions work in Notepad++?
Old 05-20-2009, 11:04 AM   #7
rogue_ronin
Banned
Posts: 475
Karma: 796
Join Date: Sep 2008
Location: Honolulu
Device: Nokia 770 (fbreader)
A quick Google search suggests that Notepad++ looks compatible. The first difference that I noticed, though, was that in the Replace codes you should use \ instead of $ (so \1 rather than $1).

I think it's an older version of pcre.

Work on a copy, check your results, and read that page I linked to. Basic regex is not terribly hard, once you get the concept.

Or switch to NoteTab. (Dude, I like it so much I keep a virtual machine of Win2K just to run it...)

m a r
Old 05-20-2009, 11:08 AM   #8
rogue_ronin
Banned
Posts: 475
Karma: 796
Join Date: Sep 2008
Location: Honolulu
Device: Nokia 770 (fbreader)
Quote:
Originally Posted by kurochka View Post
Well, there is also the MS regex engine, which is totally different. MS Word has a quasi-regex, which is nothing like regular regex. I use EmEditor as my primary text editor, which I believe uses something akin to the perl engine.
That's why I suggested noting where the regex is effective...HTML, text, Word (but maybe I should have been more explicit -- i.e. in NoteTab, in Word, etc.)

And I hope that the "translation" will help with folks figuring out how to transform it into whatever their regex grammar happens to be.

m a r
Old 05-20-2009, 12:56 PM   #9
Sweetpea
Grand Sorcerer
Posts: 7,934
Karma: 22621990
Join Date: Dec 2008
Location: Krewerd
Device: HTC Flyer; BBMini; Sony PRS650; Onyx Boox T68
Quote:
Originally Posted by JSWolf View Post
Will any of these regular expressions work in Notepad++?
I tried pepak's one in Notepad++. Didn't work. I finally got it to work with UltraEdit.
Old 05-20-2009, 02:46 PM   #10
Sunlite
Zealot
Posts: 119
Karma: 165124
Join Date: Mar 2008
Location: Berlin, Germany
Device: Kobo Aura, PRS-T1, PB602, CyBook Gen3
There are two problems with the regex feature in Notepad++:

1) A regular expression can only search each line separately. There is no multi-line search.
--> This is no problem for pepak's regex if you remove the hard line breaks inside the paragraphs.

2) The non-greedy operator is missing.
--> This breaks pepak's regex. Without the non-greedy operator (the ? after a quantifier), the expression matches from the first quotation mark all the way to the last one, so the given example would not be converted correctly.
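
To make the greedy versus non-greedy difference concrete, here is a small Python sketch on a made-up sentence. It only illustrates how the two quantifiers behave in an engine that supports both; it says nothing about what any particular Notepad++ version does.

```python
import re

text = "<p>'One,' he said, 'two.'</p>"

# Greedy: .* grabs as much as possible, so the match runs from the first
# apostrophe to the last one on the line.
greedy = re.search(r"'(.*)'", text).group(1)

# Non-greedy: .*? grabs as little as possible, so the match stops at the
# first closing apostrophe.
lazy = re.search(r"'(.*?)'", text).group(1)

print(greedy)  # One,' he said, 'two.
print(lazy)    # One,
```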
Old 05-20-2009, 05:23 PM   #11
kurochka
your neighbor
Posts: 20
Karma: 10
Join Date: Sep 2006
Device: PRS-500 lost (if you found it, I hope you enjoy it); DX on preorder
Quote:
Originally Posted by Sunlite View Post
There are two problems with the regex feature in Notepad++:

1) A regular expression can only search each line separately. There is no multi-line search.
--> This is no problem for pepak's regex if you remove the hard line breaks inside the paragraphs.
O-o-oh! Multiline regex is a genie that is hard to keep in the bottle. I presume you are a pro and it's no problem for you, but for newbies in this thread I would like to put out a caution. Try to avoid multiline regex (where . or another wildcard can match any character, including a newline); in most cases that is possible. It can bite, and you will notice only when it is too late. I cannot even remember the last time I turned on or needed multiline regex. Even if you must use it, try to limit the number of lines searched to the minimum necessary, and audit the text after each such replacement to make sure that there are no undesired consequences. Alternatively, do not use Replace All, but go through the text one match at a time.
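
A small Python sketch of how this can bite, on made-up HTML. One naming caveat: in Python the switch that lets . match a newline is called re.DOTALL (re.MULTILINE is a different flag that only changes ^ and $), so that is the flag assumed here.

```python
import re

text = "<p><i>First</i> paragraph.</p>\n<p>Second <i>one</i>.</p>"
pattern = r"<i>.*</i>"

# Default: '.' stops at each newline, so every line is matched on its own.
per_line = re.findall(pattern, text)

# With DOTALL: '.' may cross newlines, and the greedy .* swallows everything
# from the first <i> to the last </i>, paragraph break included.
across = re.findall(pattern, text, re.DOTALL)

print(per_line)  # two short matches
print(across)    # one long match spanning both paragraphs
```

A Replace All with the second form would silently delete or rewrite the text between the two italic runs, which is the kind of damage that is easy to miss in a long file.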

I was trying to think of some generalized regex that I can share, but it appears that all my strings are so specific to the problem at hand that I cannot think of anything with wide application. As an example, I have a text with both English and Ukrainian text and a complex structure. If it's an OCR, I may save it first to Word (it can search for formatting such as color, font size, italics, etc.). I would search for symbols that are used in tags (e.g., <>) and replace them with (\<, \>), then I'll put in formatting tags in Word (italics, bold, fonts, color, if necessary). An alternative would be to save the OCR as HTML, but I have found that the HTML conversion often creates such a mess of the text, with tagging I don't need, that I prefer to do it as described above.

Then I open the text in a text editor (emeditor in my case, it's the best out there). Typically, I analyze text before doing anything else, looking for patterns. I start with some simple replacements that would make the pattern more uniform. Even if there are tags or lines, etc. that will not be ultimately necessary, I try to keep them for now to see if they reveal something about the pattern that I can later use in my regexes. Given that I work with two languages English and Ukrainian, there are lots of OCR mistakes mixing Latin and Cyrillic so I use ranges such as [a-zÓÔšÚŔŕŰţ´˘ű¨Ř œŠ] and [а-яґєії] to separate the two. At each step, I try not to make an irreversible mistake. For this reason, every once in a while I make a new version of the document and keep the old as a backup to be able to revert to it if I do screw up with something. It is easy to screw up when you have several hundred thousand lines.
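
As one possible illustration of using character ranges to catch Latin/Cyrillic mix-ups, here is a minimal Python sketch that lists words containing letters from both scripts. The ranges are an assumption (basic Latin plus the Russian range and the four extra Ukrainian letters), not the ranges from the post, and the sample words are invented.

```python
import re

LATIN = r"A-Za-z"
# U+0410..U+044F is the basic Cyrillic range; the rest are the extra
# Ukrainian letters (upper and lower case) that fall outside it.
CYRILLIC = r"\u0410-\u044f\u0490\u0491\u0404\u0454\u0406\u0456\u0407\u0457"

# A whole word that contains at least one Latin and one Cyrillic letter.
MIXED = re.compile(rf"\b(?=\w*[{LATIN}])(?=\w*[{CYRILLIC}])\w+\b")

# The second word looks like "copy" but its 1st and 3rd letters are Cyrillic.
sample = "The word \u0441o\u0440y mixes scripts; \u0441\u043b\u043e\u0432\u043e and copy do not."

print(MIXED.findall(sample))
```

Listing the suspects first and fixing them by hand fits the advice above about not making irreversible changes.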

Last edited by kurochka; 05-20-2009 at 05:50 PM.
Old 05-20-2009, 11:10 PM   #12
rogue_ronin
Banned
Posts: 475
Karma: 796
Join Date: Sep 2008
Location: Honolulu
Device: Nokia 770 (fbreader)
Well, one of the first things I do is to convert multi-line documents to single-line paragraphs. Then most of your problems go away. That's why the Buddha invented word-wrap.

I usually do it with a macro, because my text editor has a line-join function. But that shouldn't be too hard to code as regex, let's see...

I'll assume a text file, and I'll assume two or more carriage returns mark the end of a paragraph. If you don't have that, just replace the carriage returns with whatever marks the end of your paragraphs. (If you don't have anything -- no tab, or space, nothing -- there's very little you can do, you're working with a text blob.)

Thinking about it, it's probably safer/easier/smarter to do it in three runs:

==============================
Format: PCRE
What it does: Joins line broken paragraphs.
Best used on: Text.

Run #1
Find: \n\n+
Replace: |PARAGRAPH|

Run #2
Find: \n
Replace: (one space character)

Run #3
Find: \|PARAGRAPH\|
Replace: \n\n

Translation: Run #1 finds all sequences of multiple hard returns (2 or more) and replaces them with the marker "|PARAGRAPH|"

Run #2 finds all remaining returns and replaces them with a space. You need a space, as otherwise you will merge two words together. If you have multiple spaces, that is easily corrected -- spell-checking several hundred mis-combined words is not.

Run #3 finds all the "|PARAGRAPH|" markers and replaces them with two hard returns.

Variants/Comments: On some systems, like mine, a hard return is \r\n (carriage return, line feed) rather than \n. If you have 3 or 4 hard returns marking chapters or sections, etc., as in Gutenberg texts, mark those first; otherwise the 3 or 4 returns will be collapsed into an ordinary paragraph marker. If you want to do this with HTML, and your paragraphs are already marked with </p>, just do Run #2 on selected text, and skip both #1 and #3. (Be careful, #2 will replace every hard return with a space. If you have HTML laid out with indenting and returns, and you change the whole document, it will ruin the layout. Well, the HTML will still work, but it will be unreadable to edit.)
==================================
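
The three runs can be sketched in Python like this. It is a minimal version under two assumptions: the marker text never occurs in the book itself, and Windows line endings are normalised to \n first. Runs #2 and #3 use plain string replacement, which avoids having to escape the | characters (in a regex Find field, | means "or", so the marker would need to be written \|PARAGRAPH\|). The name join_paragraphs is made up.

```python
import re

MARKER = "|PARAGRAPH|"  # assumed not to occur anywhere in the text


def join_paragraphs(text: str) -> str:
    """Join hard-wrapped lines; blank lines separate paragraphs."""
    text = text.replace("\r\n", "\n")       # normalise CR+LF line endings
    text = re.sub(r"\n\n+", MARKER, text)   # Run #1: mark paragraph breaks
    text = text.replace("\n", " ")          # Run #2: other returns become spaces
    return text.replace(MARKER, "\n\n")     # Run #3: restore paragraph breaks


wrapped = "It was a dark\nand stormy night.\n\n\nThe rain fell\nin torrents."
print(join_paragraphs(wrapped))
```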

@kurochka: you must have some things that you do repeatedly. Some things that you can generalize.

More later, like generic chapter headings.

m a r

Last edited by rogue_ronin; 05-21-2009 at 12:45 AM. Reason: Caution about destroying HTML formatting...
Old 05-21-2009, 12:26 AM   #13
pepak
Fanatic
Posts: 594
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-505
Quote:
Originally Posted by rogue_ronin View Post
Well, one of the first things I do is to convert multi-line documents to single-line paragraphs. Then most of your problems go away. That's why the Buddha invented word-wrap.
Indeed.

Quote:
I usually do it with a macro, because my text editor has a line-join function. But that shouldn't be too hard to code as regex, let's see...
It is not. It is exactly the case where you need multi-line regexps :-)
Old 05-21-2009, 12:29 AM   #14
rogue_ronin
Banned
Posts: 475
Karma: 796
Join Date: Sep 2008
Location: Honolulu
Device: Nokia 770 (fbreader)
========================================
  • What it does: Finds generic chapter headings, and converts them to your chosen format.
    Before-Text:
    Code:
    <h4 id="Chapter 01">Chapter 01</h4>
    <p>  chaPtertWo  </p>
    Chapter III
    After-Text:
    Code:
    <h3 class="chapter">Chapter 01</h3>
    
    <h3 class="chapter">Chapter tWo</h3>
    
    Chapter III
  • Best used on: HTML
  • Regex Find:
    Code:
    ^\<(p.*|h\d.*)\>\s*[Cc][Hh][Aa][Pp][Tt][Ee][Rr]\s*(.*)\</(p|h\d)\>\s*
  • Find Translation: Find any single line that starts with a paragraph or header, followed immediately by any number of spaces or tabs and the word "chapter" in upper or lower (or mixed) case, and may have some spaces and characters following it, then a close of the paragraph/header, and any number of spaces, tabs or hard-returns.
  • Regex Replace:
    Code:
    <h3 class="chapter">Chapter $2</h3>\n\n
  • Replace Translation: Whatever follows the word "cHapTer" (and any number of spaces) in the original line is placed into an <h3 class="chapter"> header, and preceded with the word "Chapter".
  • Variants/Comments: Change the Replace header/wrap to suit yourself (the $2 is whatever the name of the chapter is). Leave the \n\n in your Replace; the \s* grabs the hard returns. If you want to use this with plain text lines, remove \<(p.*|h\d.*)\> and \</(p|h\d)\>. Changing (.*) to (.*?) (un-greedy) causes selection to miss the closing tag on my machine. Anyone know why?
========================================
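
Here is a Python sketch of a close variant of this regex, run on the Before-Text above. It is not the exact pattern from the post: it assumes [^>]* for tag attributes instead of .*, uses the case-insensitive flag instead of [Cc][Hh]..., and uses a lazy capture followed by \s* so that trailing spaces inside the tag are dropped. Since only one group is captured, the replacement uses \1 where the post uses $2. The name format_chapters is hypothetical.

```python
import re

CHAPTER = re.compile(
    r"^<(?:p|h\d)[^>]*>"       # opening <p ...> or <hN ...> at line start
    r"\s*chapter\s*(.*?)\s*"   # the word "chapter", then the chapter name
    r"</(?:p|h\d)>\s*",        # closing tag plus trailing whitespace/returns
    re.IGNORECASE | re.MULTILINE,
)


def format_chapters(html: str) -> str:
    """Rewrite chapter lines as <h3 class="chapter"> headings."""
    return CHAPTER.sub(r'<h3 class="chapter">Chapter \1</h3>\n\n', html)


before = '<h4 id="Chapter 01">Chapter 01</h4>\n<p>  chaPtertWo  </p>\nChapter III\n'
print(format_chapters(before))
```

As in the After-Text, the plain "Chapter III" line is untouched because it has no surrounding tag.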

Last edited by rogue_ronin; 05-21-2009 at 12:34 AM. Reason: Forgot to let you have paragraphs with attributes...
Old 05-21-2009, 01:05 AM   #15
pepak
Fanatic
Posts: 594
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-505
A suggestion:

Maintain a list of regexps (with links to their posts) in the first post. That way, when someone opens the thread, he/she can quickly find the needed regexp.