#1 |
Fanatic
Posts: 541
Karma: 1152752
Join Date: Aug 2010
Location: Evansville, IN, USA
Device: Samsung Galaxy Tab 4 Nook & Samsung Galaxy Tab S 10.5
|
Help with regular expression search/replace
When converting ebooks from any format for use on my Kindle, I always convert to epub first and do any editing or cleanup in Sigil before the final conversion to mobi. A common problem I run into when converting files from one format or another to epub is left or right quotes with no space between them and the preceding or following character. An extra space is easy to find but any other character is not so easy to find.
I'm thinking there should be a way to search for these occurrences with a regular expression but I'm not familiar enough with them to come up with one that works. I've tried and haven't had much luck so far. Would anyone out there more familiar with regular expressions be able to assist me? Basically, I want to be able to find any string where anycharacterexceptspace/“ or ”/anycharacterexceptspace and be able to replace it with anycharacterexceptspace/ “ or ” /anycharacterexceptspace. Any ideas? Thanks.
- Byron
#2 | |
Well trained by Cats
Posts: 30,891
Karma: 59840954
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
\S matches any character except white space
|
#3 |
Zealot
Posts: 114
Karma: 5246
Join Date: Jul 2010
Device: none
|
IINM, \S will match '<' from </p> at the end of each paragraph.
I think \w should work, it matches any alphanumeric character (plus _ ). So:
Find: ”([\w.,?!])
Replace with: ” \1
i.e. find ” followed by a word character OR . OR , OR ? OR !
And:
Find: ([\w.,?!])“
Replace with: \1 “
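The two find/replace pairs above can be tried outside Sigil before running them on a book. The following is a minimal sketch in Python, assuming its `re` module treats these simple patterns the same way Sigil's engine does; the function name `space_curly_quotes` and the sample sentence are made up for illustration.

```python
import re

def space_curly_quotes(text):
    """Apply the two substitutions from the post, in order."""
    # Closing quote directly followed by a word character or . , ? !
    text = re.sub(r"”([\w.,?!])", r"” \1", text)
    # Word character or . , ? ! directly followed by an opening quote
    text = re.sub(r"([\w.,?!])“", r"\1 “", text)
    return text

sample = "<p>He said,“Stop.”Then he left.</p>"
print(space_curly_quotes(sample))
# → <p>He said, “Stop.” Then he left.</p>
```

Because `<` and `>` are not in the character class, a quote that touches a tag (as in `<p>“Hi.”</p>`) is left alone.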
#4 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
my fix - just looks for A-Z or for a-z as needed
no space following quotes
find "([A-Z])
replace " \1
vary the above as needed. To ensure I search for the right sort of quote I copy / paste a quote mark from the code view into the find box
#5 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
Just a question, why would you want a space between a quote mark and the text? You have to be careful that your xhtml tags are not changed as well.
|
#6 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
i think because "this" is "correct" grammar but"this"is"wrong".
|
#7 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
Ah, I see. But you have to be careful you don't get things like:
Then he said: " What is this? "
or
<p class=" stylish " >
That is one reason why I use smart/curly quotes. Another is that I really like those quotes and feel that straight quotes have a different meaning.
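One way to guard against the second hazard (spaces creeping into attribute values) is to run the substitution only on the text between tags. The following is a rough sketch in Python rather than in Sigil's find/replace box; the helper name `fix_text_only` and the sample markup are hypothetical, and the pattern is the straight-quote one from post #4 extended to lowercase letters.

```python
import re

TAG = re.compile(r"(<[^>]*>)")               # capturing group keeps tags in the split
STRAIGHT_QUOTE = re.compile(r'"([A-Za-z])')  # straight quote directly followed by a letter

def fix_text_only(markup):
    """Add a space after a straight quote, but never inside a tag."""
    parts = TAG.split(markup)
    # With one capturing group, even indexes are text and odd indexes are tags.
    for i in range(0, len(parts), 2):
        parts[i] = STRAIGHT_QUOTE.sub(r'" \1', parts[i])
    return "".join(parts)

sample = '<p class="stylish">"This"is wrong.</p>'
naive = STRAIGHT_QUOTE.sub(r'" \1', sample)
print(naive)                  # → <p class=" stylish">" This" is wrong.</p>
print(fix_text_only(sample))  # → <p class="stylish">" This" is wrong.</p>
```

The attribute is now safe, but the opening quote still gains an unwanted space, because a straight quote gives the pattern no way to tell opening from closing. That is the first hazard described above, and it does not arise with curly quotes.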
#8 |
frumious Bandersnatch
Posts: 7,543
Karma: 19001583
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
Note that Ahmad Samir used curly quotes in his expressions.
|
#9 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
|
#10 | |
Member
Posts: 10
Karma: 10
Join Date: Jun 2012
Device: Kobo Touch
|
Quote:
I don't know if the regular expression engine in the text editor TextMate is the same as the one in Sigil. But the following regular expression will find a string consisting of an entire html element in TextMate.
<\?xml[^>]+>|<!DOCTYPE(?:[^\]]*]>|[^>]*>)|<[^/ >]+[^>]*/>|<(?<tagname>[^/ >]+)[^>]*>(?<!/>)(?<html>[^<]|<[^/ >]+[^>]*/>|<(?<tagname>[^/ >]+)[^>]*>(?<!/>)\g<html>*</\k<tagname+0>>)*</\k<tagname+0>>
example: take the following string of text.
<p>This is an <i>example</i> paragraph.</p><p>This is a second paragraph.</p>
If the cursor is at the beginning of the text, the regular expression will match <p>This is an <i>example</i> paragraph.</p>. If the cursor is after the first < and not after the second <, it will match <i>example</i>. If the cursor is after the second <, it will match <p>This is a second paragraph.</p>
In other words, it matches the first opening html tag encountered with its appropriate closing tag. But it will only work on properly formatted html. For example, in this improperly formatted html string
<p>This is the first paragraph<p>This is the second paragraph</p>
it will not match the first paragraph because the first closing tag </p> is missing.
The regular expression can handle tags that close themselves like <p/> or <div/> or <link href="my.css" type="text/css" rel="stylesheet"/> or <a name="chap4" id="chap4"/>.
Last edited by Funslinger; 06-20-2013 at 05:58 AM.
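The recursive pattern above relies on named-group recursion, which Python's built-in `re` module does not offer. As a rough illustration of the same idea (first opening tag paired with its own closing tag), here is a sketch that swaps in a simple tag stack instead. It is not a port of the TextMate expression: the name `first_element` and the simplified `TAG` pattern are assumptions, and it ignores comments, the XML declaration, DOCTYPE, and attribute values that contain `>`.

```python
import re

# group 1: "/" for a closing tag, group 2: tag name, group 3: "/" for a self-closing tag
TAG = re.compile(r"<(/?)([A-Za-z][^\s/>]*)[^>]*?(/?)>")

def first_element(html, pos=0):
    """Return the first complete element that starts at or after pos, or None."""
    for start in TAG.finditer(html, pos):
        if start.group(1):            # a closing tag cannot start an element
            continue
        if start.group(3):            # self-closing tag such as <p/>
            return start.group(0)
        stack = [start.group(2)]      # names of the tags still open
        for tag in TAG.finditer(html, start.end()):
            closing, name, self_closing = tag.groups()
            if self_closing:
                continue
            if not closing:
                stack.append(name)
            elif stack[-1] != name:   # badly nested markup: give up on this start
                break
            else:
                stack.pop()
                if not stack:
                    return html[start.start():tag.end()]
    return None

text = "<p>This is an <i>example</i> paragraph.</p><p>This is a second paragraph.</p>"
print(first_element(text))       # → <p>This is an <i>example</i> paragraph.</p>
print(first_element(text, 1))    # → <i>example</i>
print(first_element(text, 15))   # → <p>This is a second paragraph.</p>
```

On the improperly formatted example, the first `<p>` never finds its partner, so the sketch skips it and returns only the second paragraph.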
|
#11 |
Zealot
Posts: 119
Karma: 64428
Join Date: Aug 2011
Device: none
|
This will do it:
[^ ]“
where the quote mark is a left curly quote. The key is that a leading caret inside square brackets means "anything but", so we have "match anything but a space, followed by a left curly quote", just as the OP asked.
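A quick check of what the negated class actually catches, sketched in Python with a made-up sample line. Note that "anything but a space" includes the `>` that ends a tag, which is the concern raised in post #3; the second pattern, which also excludes `>` and adds a capture group for the replacement, is an assumption added here and not part of the post.

```python
import re

sample = "<p>“Hi,” she said.“Bye.”</p>"

# The pattern from the post: any character except a space, then a left curly quote.
matches = re.findall(r"[^ ]“", sample)
print(matches)   # → ['>“', '.“']

# Hypothetical variant: also exclude ">" and capture the character so it can be kept.
fixed = re.sub(r"([^ >])“", r"\1 “", sample)
print(fixed)     # → <p>“Hi,” she said. “Bye.”</p>
```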
#12 |
Resident Curmudgeon
Posts: 79,018
Karma: 144284074
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Let's say the ePub has a line such as...
This will be a <span class="smallcaps">TEST</span>. This is a <span class="smallcaps">TEST</span>. This is no longer a <span class="smallcaps">TEST</span>.
Notice we have three spans. What I want to do is select each span individually. Can this be done? I want to take the contents of span #1 and span #3 and make them lowercase and leave span #2 alone.
#13 |
Groupie
Posts: 171
Karma: 86271
Join Date: Feb 2012
Device: iPad, Kindle Touch, Sony PRS-T1
|
if all the spans are on the same line you could use the one below. if there are more than 3 spans it captures the first 3, then the next 3. if there are 5 it captures only the first 3.
i think i'd recommend not doing this with regex though.
Code:
find: (?<=<span class="smallcaps">)([^<\n]+)(</span>[^<\n]*)(<span class="smallcaps">)([^<\n]+)(</span>[^<\n]*)(<span class="smallcaps">)([^<]+)(?=</span>)
replace: \1\2\3\4\5\6\7
Code:
This will be a <span class="smallcaps">TEST</span>. This is a <span class="smallcaps">TEST</span>. This is no longer a <span class="smallcaps">TEST</span>.
Code:
first\2\3second\5\6third
This will be a <span class="smallcaps">first</span>. This is a <span class="smallcaps">second</span>. This is no longer a <span class="smallcaps">third</span>.
--edit
also the ([^<\n]+) is strange to me in that i had to include the \n so that it didn't match across lines. not sure why this is, though.
Last edited by mzmm; 06-20-2013 at 07:41 PM.
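For comparison, here is how the original request (lowercase spans #1 and #3, leave #2 alone) might look as a small script outside Sigil. This is only a sketch in Python: it assumes the spans are written exactly as in the example line, and `lowercase_spans` and its `which` parameter are hypothetical names.

```python
import re
from itertools import count

SPAN = re.compile(r'(<span class="smallcaps">)([^<]+)(</span>)')

def lowercase_spans(line, which=(1, 3)):
    """Lowercase the contents of the numbered smallcaps spans (1-based) in line."""
    counter = count(1)

    def fix(match):
        n = next(counter)  # position of this span within the line
        text = match.group(2).lower() if n in which else match.group(2)
        return match.group(1) + text + match.group(3)

    return SPAN.sub(fix, line)

line = ('This will be a <span class="smallcaps">TEST</span>. '
        'This is a <span class="smallcaps">TEST</span>. '
        'This is no longer a <span class="smallcaps">TEST</span>.')
print(lowercase_spans(line))
```

A replacement function sees each span one at a time, so the count of spans on a line does not have to be fixed at three the way it is in the single long pattern.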