Help with regular expression search/replace

bfollowell · 02-15-2011, 10:40 AM

When converting ebooks from any format for use on my Kindle, I always convert to epub first and do any editing or cleanup in Sigil before the final conversion to mobi. A common problem I run into when converting files from one format or another to epub is left or right quotes with no space between them and the preceding or following character. An extra space is easy to find but any other character is not so easy to find.

I'm thinking there should be a way to search for these occurrences with a regular expression but I'm not familiar enough with them to come up with one that works. I've tried and haven't had much luck so far.

Would anyone out there more familiar with regular expressions be able to assist me? Basically, I want to be able to find any string where anycharacterexceptspace/“ or ”/anycharacterexceptspace and be able to replace it with anycharacterexceptspace/ “ or ” /anycharacterexceptspace.

Any ideas?

Thanks.

- Byron

theducks · 02-15-2011, 10:49 AM

Quote:

Originally Posted by bfollowell

Would anyone out there more familiar with regular expressions be able to assist me? Basically, I want to be able to find any string where anycharacterexceptspace/“ or ”/anycharacterexceptspace and be able to replace it with anycharacterexceptspace/ “ or ” /anycharacterexceptspace.

Any ideas?

Thanks.

- Byron

My 'cheat sheet' says \S (capital S) matches any character except whitespace.

Ahmad Samir · 02-16-2011, 03:34 PM

IINM, \S will match the '<' from the </p> at the end of each paragraph.

I think \w should work, it matches any alphanumeric character (plus _ ).

So:
Find: ”([\w.,?!])
Replace with: ” \1

i.e. find ” followed by a word character OR . OR , OR ? OR !

And:
Find: ([\w.,?!])“
Replace with: \1 “
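For anyone who wants to try these two find/replace pairs outside Sigil, here is a minimal sketch in Python. It assumes Python's built-in re module behaves like Sigil's engine for these simple patterns. The names LEFT, RIGHT, space_curly_quotes and the sample sentence are made up for illustration.

```python
import re

LEFT = "\u201c"   # left curly quote
RIGHT = "\u201d"  # right curly quote

# The two patterns from the post: a closing quote glued to the next
# character, and an opening quote glued to the previous character.
AFTER_CLOSE = re.compile(RIGHT + r"([\w.,?!])")
BEFORE_OPEN = re.compile(r"([\w.,?!])" + LEFT)

def space_curly_quotes(text):
    """Insert the missing space on the outer side of each curly quote."""
    text = AFTER_CLOSE.sub(RIGHT + r" \1", text)
    return BEFORE_OPEN.sub(r"\1 " + LEFT, text)

# Hypothetical sample line with both problems in it.
sample = "<p>He said," + LEFT + "Stop!" + RIGHT + "and left.</p>"
fixed = space_curly_quotes(sample)
print(fixed)

# Why \S is risky: it also matches the '<' of a closing tag,
# while the \w-based class does not.
tail = "Stop!" + RIGHT + "</p>"
s_hits_tag = re.search(RIGHT + r"\S", tail) is not None
w_hits_tag = AFTER_CLOSE.search(tail) is not None
```

The character class can be widened or narrowed to taste. The point is only that it lists what may touch the quote instead of using \S.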

cybmole · 02-20-2011, 02:47 AM

my fix - just looks for A-Z or for a-z as needed
no space following quotes
find "([A-Z])
replace " \1

vary the above as needed. To ensure I search for the right sort of quote I copy / paste a quote mark from the code view into the find box

Toxaris · 02-20-2011, 04:10 AM

Just a question, why would you want a space between a quote mark and the text? You have to be careful that your xhtml tags are not changed as well.

cybmole · 02-20-2011, 04:17 AM

i think because "this" is "correct" grammar but"this"is"wrong".

Toxaris · 02-20-2011, 05:59 AM

Ah, I see. But you have to be careful you don't get things like:

Then he said: " What is this? "

or

<p class=" stylish " >

That is one reason why I use smart/curly quotes. Another is that I really like those quotes and feel that straight quotes have a different meaning.
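Both risks can be reproduced in a few lines of Python. This is only a sketch: it assumes Python's re module, and the names NAIVE and sub_outside_tags and the sample lines are invented for illustration. It is not a Sigil feature. The helper applies a replacement only to the text between tags, which protects attributes but cannot tell an opening straight quote from a closing one.

```python
import re

LEFT = "\u201c"   # left curly quote
RIGHT = "\u201d"  # right curly quote

# Straight-quote rule: a quote directly followed by a letter gets a space.
NAIVE = re.compile(r'"([A-Za-z])')
TAG = re.compile(r"(<[^>]*>)")

def sub_outside_tags(pattern, repl, markup):
    """Apply pattern.sub only to the text between tags."""
    parts = TAG.split(markup)
    # With one capturing group, re.split puts the tags at odd indexes.
    return "".join(
        part if i % 2 else pattern.sub(repl, part)
        for i, part in enumerate(parts)
    )

straight = '<p class="stylish">He said "hello" to her.</p>'

# Run over the whole line: the attribute value and the opening quote
# both pick up an unwanted space.
damaged = NAIVE.sub(r'" \1', straight)

# Restricted to text: the attribute survives, the opening quote does not.
tags_safe = sub_outside_tags(NAIVE, r'" \1', straight)

# With curly quotes the closing quote can be targeted on its own.
curly = '<p class="stylish">He said ' + LEFT + "hello" + RIGHT + "to her.</p>"
curly_fixed = sub_outside_tags(re.compile(RIGHT + r"(\w)"), RIGHT + r" \1", curly)
```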

Jellby · 02-20-2011, 06:12 AM

Note that Ahmad Samir used curly quotes in his expressions.

cybmole · 02-20-2011, 06:16 AM

Quote:

Originally Posted by Jellby

Note that Ahmad Samir used curly quotes in his expressions.

yes, the solutions given only work if open quote looks different to / can be distinguished from closing quote

and quotes within quotes is a whole new ball game!

Funslinger · 06-20-2013, 05:53 AM

Quote:

Originally Posted by cybmole

yes, the solutions given only work if open quote looks different to / can be distinguished from closing quote

and quotes within quotes is a whole new ball game!

This situation can be handled fairly easily using recursion to match opening and closing quotes.

I don't know if the regular expression engine in the text editor TextMate is the same as the one in Sigil. But the following regular expression will find a string consisting of an entire html element in TextMate.

<\?xml[^>]+>|<!DOCTYPE(?:[^\]]*]>|[^>]*>)|<[^/ >]+[^>]*/>|<(?<tagname>[^/ >]+)[^>]*>(?<!/>)(?<html>[^<]|<[^/ >]+[^>]*/>|<(?<tagname>[^/ >]+)[^>]*>(?<!/>)\g<html>*</\k<tagname+0>>)*</\k<tagname+0>>

example: take the following string of text.

This is an example paragraph.This is a second paragraph.

If the cursor is at the beginning of the text, the regular expression will match This is an example paragraph.. If the cursor is after the first < and not after the second <, it will match example. If the cursor is after the second <, it will match This is a second paragraph.

In other words, it matches the first opening html tag encountered with its appropriate closing tag. But it will only work on properly formatted html. For example, in this improperly formatted html string

This is the first paragraphThis is the second paragraph

it will not match the first paragraph because the first closing tag is missing.

The regular expression can handle tags that close themselves like or <div/> or <link href="my.css" type="text/css" rel="stylesheet"/> or <a name="chap4" id="chap4"/>.
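Python's built-in re module has no recursion, so the expression above cannot be carried over as written. The sketch below swaps in a different technique with similar behaviour: a simple tag tokenizer plus an explicit stack. It is not Funslinger's regex. The names TAG and first_element and the sample markup (including the <i> tag) are assumptions for illustration. Comments and attribute values containing '>' are not handled.

```python
import re

# Group 1 is "/" for a closing tag, group 2 is the tag name,
# group 3 is "/" for a self-closing tag such as <div/>.
TAG = re.compile(r"<(/?)([^/\s>]+)[^>]*?(/?)>")

def first_element(markup, start=0):
    """Return the first complete element found at or after start, or None."""
    stack = []
    begin = None
    for m in TAG.finditer(markup, start):
        closing, name, self_closing = m.groups()
        if name[0] in "?!":
            continue                  # skip <?xml ...?> and <!DOCTYPE ...>
        if closing:
            if not stack:
                continue              # stray closing tag before any opener
            if stack[-1] != name:
                return None           # improperly nested markup
            stack.pop()
            if not stack:
                return markup[begin:m.end()]
        elif self_closing:
            if not stack:
                return m.group(0)     # a self-closing tag is a whole element
        else:
            if not stack:
                begin = m.start()
            stack.append(name)
    return None                       # an opener was never closed

# Hypothetical sample; the inner <i> tag is an assumption.
sample = ("<p>This is an <i>example</i> paragraph.</p>"
          "<p>This is a second paragraph.</p>")
whole = first_element(sample)
inner = first_element(sample, 1)
second = first_element(sample, sample.index("<i>") + 1)

# A missing closing tag leaves the first paragraph unmatched.
broken = "<p>This is the first paragraph<p>This is the second paragraph</p>"
unmatched = first_element(broken)
```

The start argument plays the role of the cursor position in the description above.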

signum · 06-20-2013, 05:18 PM

This will do it:
[^ ]“
where the quote mark is a left curly quote. The key is that a leading caret inside square brackets means "anything but", so we have "match anything but a space, followed by a left curly quote", just as the OP asked.
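A quick way to see what this pattern catches is to run it in Python. This is a sketch that assumes Python's re module; the sample sentence and the names NO_SPACE_BEFORE_OPEN and STRICTER are invented. To use the pattern for replacing as well as finding, the non-space character has to be captured so it can be put back. Note that "anything but a space" also includes the '>' of a tag, so a quote that opens a paragraph is caught as well. The stricter variant is one possible workaround, not something from the post.

```python
import re

LEFT = "\u201c"   # left curly quote
RIGHT = "\u201d"  # right curly quote

# The posted pattern, with the non-space character captured.
NO_SPACE_BEFORE_OPEN = re.compile("([^ ])" + LEFT)

# Hypothetical sample: one quote opens the paragraph, one is glued to a comma.
text = ("<p>" + LEFT + "Stop," + RIGHT + " he said,"
        + LEFT + "or else." + RIGHT + "</p>")

hits = [m.group(1) for m in NO_SPACE_BEFORE_OPEN.finditer(text)]
fixed = NO_SPACE_BEFORE_OPEN.sub(r"\1 " + LEFT, text)

# Variant that also excludes '>' and other whitespace before the quote.
STRICTER = re.compile(r"([^\s>])" + LEFT)
fixed_strict = STRICTER.sub(r"\1 " + LEFT, text)
```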

JSWolf · 06-20-2013, 05:28 PM

Let's say the ePub has a line such as...

This will be a <span class="smallcaps">TEST</span>. This is a <span class="smallcaps">TEST</span>. This is no longer a <span class="smallcaps">TEST</span>.

Notice we have three spans. What I want to do is select each span individually. Can this be done? I want to take the contents of span #1 and span #3 and make them lowercase and leave span #2 alone.

mzmm · 06-20-2013, 07:36 PM

if all the spans are on the same line you could use the one below. if there are more than 3 spans it captures the first 3, then the next 3. if there are 5 it captures only the first 3.

i think i'd recommend not doing this with regex though.

Code:

find:
(?<=<span class="smallcaps">)([^<\n]+)(</span>[^<\n]*)(<span class="smallcaps">)([^<\n]+)(</span>[^<\n]*)(<span class="smallcaps">)([^<]+)(?=</span>)

replace:
\1\2\3\4\5\6\7

where you'd perform operations on \1, \4 and \7. so given

Code:

This will be a <span class="smallcaps">TEST</span>. This is a <span class="smallcaps">TEST</span>. This is no longer a <span class="smallcaps">TEST</span>.

Code:

first\2\3second\5\6third

This will be a <span class="smallcaps">first</span>. This is a <span class="smallcaps">second</span>. This is no longer a <span class="smallcaps">third</span>.

i'd initially grouped the first <span class="smallcaps"> in the lookbehind and then inserted it like \2\3\1\4\5\1\6 but apparently sigil didn't like reusing the backreference. possibly a bug?

--edit

also the ([^<\n]+) is strange to me in that i had to include the \n so that it didn't match across lines. not sure why this is, though.
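Outside Sigil, the "lowercase spans 1 and 3, leave span 2 alone" job is easier with a replacement function than with one long pattern. The sketch below assumes Python's re module; the names SPAN and lowercase_spans are made up. It matches one span at a time and counts the matches, so it does not depend on the line having exactly three spans.

```python
import re

SPAN = re.compile(r'(<span class="smallcaps">)([^<]+)(</span>)')

def lowercase_spans(line, which=(1, 3)):
    """Lowercase the contents of the numbered smallcaps spans (1-based)."""
    count = 0

    def repl(m):
        nonlocal count
        count += 1
        if count in which:
            return m.group(1) + m.group(2).lower() + m.group(3)
        return m.group(0)  # leave this span untouched

    return SPAN.sub(repl, line)

line = ('This will be a <span class="smallcaps">TEST</span>. '
        'This is a <span class="smallcaps">TEST</span>. '
        'This is no longer a <span class="smallcaps">TEST</span>.')
result = lowercase_spans(line)
print(result)

# A negated class such as [^<] matches every character except '<',
# and that includes the newline.
crosses_lines = re.fullmatch(r"[^<]+", "two\nlines") is not None
```

The last line is the likely explanation for the \n puzzle: in Python, and in PCRE-style engines generally, a negated character class matches a newline unless the newline is listed inside it, so [^<]+ will run across lines while [^<\n]+ will not.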



