MobileRead Forums (https://www.mobileread.com/forums/index.php)
-   Sigil (https://www.mobileread.com/forums/forumdisplay.php?f=203)
-   -   Regex examples (https://www.mobileread.com/forums/showthread.php?t=167971)

Doitsu 08-11-2012 10:44 AM

A quick and dirty solution would be:

Find: (chapter) ([[:lower:]]+)
Replace: \u\1 \u\2

This requires Sigil 0.5.3 (or higher).

Gunnerp245 08-11-2012 10:03 PM

Quote:

Originally Posted by Gunnerp245 (Post 2181500)
I would like to change the capitalization of a particular phrase across a book, e.g. chapter one to Chapter One. I can detect the instances using (\D+) (\D+) and know the replacement would be \1 \2, but not how to change the capitalization.

Quote:

Originally Posted by Doitsu (Post 2181542)
A quick and dirty solution would be:
Find: (chapter) ([[:lower:]]+)
Replace: \u\1 \u\2
This requires Sigil 0.5.3 (or higher).

@Doitsu - I left out key information initially: I am attempting this in calibre's 'search and replace'. :o

Doitsu 08-12-2012 03:35 AM

Quote:

Originally Posted by Gunnerp245 (Post 2182012)
@Doitsu - I left out key information initially: I am attempting this in calibre's 'search and replace'. :o

Starting with version 0.5.3, Sigil uses Perl compatible regular expressions (PCRE) including the \u operator (which capitalizes the first letter of a string).
AFAIK, Calibre uses the Python regular expression library, which doesn't support the \u operator.
The expressions that I suggested will work in Sigil or any text editor with PCRE support.
Is there any particular reason why you want to use Calibre to replace the text?
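
For anyone who needs the same capitalization with Python's regex library, a replacement function can stand in for \u. This is only a minimal sketch for a plain Python script: the helper name capitalize_chapter is made up for illustration, and [a-z]+ replaces [[:lower:]]+ on the assumption that the text is ASCII.

```python
import re

# Same idea as Find: (chapter) ([[:lower:]]+) / Replace: \u\1 \u\2,
# but Python's re has no \u, so a function builds the replacement text.
CHAPTER = re.compile(r"(chapter) ([a-z]+)")

def capitalize_chapter(text):
    """Upper-case the first letter of 'chapter' and of the word after it."""
    return CHAPTER.sub(
        lambda m: m.group(1).capitalize() + " " + m.group(2).capitalize(),
        text,
    )

print(capitalize_chapter("<h2>chapter one</h2>"))  # <h2>Chapter One</h2>
```

Like the original, this is quick and dirty: it also hits "chapter" followed by any lowercase word in running text, so it is worth reviewing the matches first.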

Jellby 08-12-2012 04:40 AM

Quote:

Originally Posted by Doitsu (Post 2182180)
Is there any particular reason why you want to use Calibre to replace the text?

And is there any particular reason why you ask a Calibre question in the Sigil forum? ;)

Gunnerp245 08-12-2012 10:51 AM

@Doitsu/Jellby

I was reading through the regex sticky and posted my question before realizing which software forum I was in. I have been able to glean very helpful information. Though the other forum has a regex sticky it does not seem as detailed as this one.

I have moved my query here.

Greygor 08-23-2012 10:55 AM

I hate my first post to be a question rather than an answer but needs must when the devil drives.

I have an epub where speech quotes are missing from the start of the line

e.g.

Quote:

<span class="bold calibre4">The gets one," ff 11
So my first task was finding an expression that would find lines where this occurs.

This seems to work (I know there are cases where it fails, but I'm just finding and not auto-fixing)

Quote:

"\>[^"](\w*\W*)*"
However when I try it in Sigil it completely fails.

Is there something special about Sigil's regex that I'm overlooking?

Many thanks in advance

Doitsu 08-23-2012 01:11 PM

IMHO, the problem is \w*\W*, which matches a sequence of 0 or more word characters followed by 0 or more non-word characters. I.e., it will at most match one word plus a space or punctuation character. Try .*? instead:

Code:

"\>[^"](.*?)"

Greygor 08-23-2012 01:22 PM

Quote:

Originally Posted by Doitsu (Post 2195154)
IMHO, the problem is \w*\W*, which matches a sequence of 0 or more word characters followed by 0 or more non-word characters. I.e., it will at most match one word plus a space or punctuation character. Try .*? instead:

Code:

"\>[^"](.*?)"

Hi thanks for the response.

Outside of Sigil the regex that I was using worked fine which is what I'm finding odd.

Using
Quote:

"\>[^"](.*?)"
lets me find lots of
Quote:

"><span class="
but unfortunately that's not quite right, but at least it finds something in Sigil which mine doesn't :)

Need to think about this, I'm missing something really obvious :smack:
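
Why the suggested expression lands on "><span class=" can be checked outside Sigil. The sketch below uses Python's re module, assuming it behaves like PCRE for this particular pattern; the sample line and the alternative pattern >[^"<>]+" are illustrative and not taken from the thread.

```python
import re

line = '<p class="calibre1"><span class="bold calibre4">The gets one," ff 11</span></p>'

# Suggested pattern: a quote, '>', one non-quote character, then the shortest
# run of anything up to the next quote. Nothing stops it from starting at the
# quote that closes an attribute value.
suggested = re.search(r'"\>[^"](.*?)"', line)
print(suggested.group(0))    # "><span class="

# Illustrative alternative: text right after a tag that reaches a closing
# quote without containing a quote or angle bracket first.
alternative = re.search(r'>[^"<>]+"', line)
print(alternative.group(0))  # >The gets one,"
```

The alternative is for finding only: a line such as >He said, "hi" would also be reported, because the text before the first quote contains no quote either.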

DiapDealer 08-23-2012 01:50 PM

Quote:

Originally Posted by Greygor (Post 2195168)
Outside of Sigil the regex that I was using worked fine which is what I'm finding odd.

Not really that odd. There are several different regex engines that all have subtle differences. So the questions would be: what application are you using where your original regex does succeed? And what version of Sigil are you using where it doesn't succeed?

paulfiera 09-25-2012 10:01 AM

Strange issue
 
Most surely, I'm not understanding this the right way.

I'm cleaning up some epubs and have noticed that some of them have anchor tags with a class and an id but without any hyperlink. Some epubs have several hundred in between the text.

So I'm using this regex to find anchor links with nothing inside them

Quote:

<a class="(.*?)" id="(.*?)"></a>
The problem is that it also finds these tags with spans and even text inside.

I would like to be able to restrict the findings to only this situation.

Many thanks!

Doitsu 09-25-2012 10:41 AM

Quote:

Originally Posted by paulfiera (Post 2236440)
The problem is that it also finds these tags with spans and even text inside.

Your regex looks fine to me. Can you post some specific examples of unwanted html tags matched by your regex and the Sigil version that you're using?

DiapDealer 09-25-2012 10:47 AM

I'm not certain why that expression would match instances with spans or text inside the anchor tags. It shouldn't really.

You might try:
Code:

<a class="([^>]*?)" id="([^>]*?)"></a>
instead ... just to check.

But I can't get your expression to misbehave, really. It seems to do (for me anyway) what you've intended it to do. Can you give any examples of code it's matched that you don't think it should match?

theducks 09-25-2012 10:48 AM

Quote:

Originally Posted by paulfiera (Post 2236440)
Most surely, I'm not understanding this the right way.

I'm cleaning up some epubs and have noticed that some of them have anchor tags with a class and an id but without any hyperlink. Some epubs have several hundred in between the text.

So I'm using this regex to find anchor links with nothing inside them



The problem is that it also finds these tags with spans and even text inside.

I would like to be able to restrict the findings to only this situation.

Many thanks!

Your REGEX is fine as Doitsu said. It is also overly broad (second term) ;)
which is why it is matching </span></a>

If your id ends in numbers, use that to narrow the scope: (.+?\d+)"></a>

Jellby 09-25-2012 11:40 AM

Quote:

Originally Posted by DiapDealer (Post 2236521)
But I can't get your expression to misbehave, really. It seems to do (for me anyway) what you've intended it to do. Can you give any examples of code it's matched that you don't think it should match?

<a class="whatever" href="#here">this is a link</a> <a class="other" id="something"></a>

The whole red part (whatever" href="#here">this is a link</a> <a class="other) would be matched by the first (.*?), right?

paulfiera 09-25-2012 11:59 AM

Thanks, Doitsu and DiapDealer

This is from Clive Barker's Imajica

Using

Quote:

<a class="(.*?)" id="(.*?)"></a>
and clicking on Count All, it finds 18 matches.

Clicking on Find, the first match is this one:

Quote:

<a class="calibre16" href="../Text/Imajica_split_211.html#filepos2489718" id="filepos69564"><span class="calibre17">William</span></a>—and they had only argued once, but it had been a telling exchange. She’d accused him of always looking at other women; looking, looking, as though for the next conquest. Perhaps because he didn’t care for her too much, he’d replied honestly and told her she was right. He was stupid for her sex. Sickened in their absence, blissful in their company: love’s fool. She’d replied that while his obsession might be healthier than her husband’s—which was money and its manipulation—his behavior was still neurotic. Why this endless hunt? she’d asked him. He’d answered with some folderol about seeking the ideal woman, but he’d known the truth even as he was spinning her this tosh, and it was a bitter thing. Too bitter, in fact, to be put on his tongue. In essence, it came down to <a class="calibre22"></a>
Clicking on Find again, it matches this one:


Quote:

<a class="calibre16" href="../Text/Imajica_split_199.html#filepos2416389" id="filepos73127"><span class="calibre17">Gloriana</span></a>, one of his five cats, escaped in search of a mate. “Too slow, sweetie!” he told her. She yowled at him in complaint. “I keep her fat so she’s slow,” he said. “And I don’t feel so piggy myself.”<a class="calibre22"></a>
This is on Sigil 0.5.3

Strange.

paulfiera 09-25-2012 12:15 PM

Thanks everybody.

It seems that Diapdealer's regex...

Quote:

<a class="([^>]*?)" id="([^>]*?)"></a>
does not find anchor tags with text or other tags between the opening tag and the closing tag.

The only result found in the same book is:

Quote:

<a class="calibre16" href="../Text/Imajica_split_199.html#filepos2408474" id="filepos163573"></a>
but it still finds anchor tags with href inside.

Jellby 09-25-2012 12:38 PM

Yes, it does because it searches for "any character but >" inside the quotes, and that includes the closing quote and the href part.

You probably want something like this:

Code:

<a class="([^"]*?)" id="([^"]*?)"></a>
Anyway, an anchor with href and no text is pretty useless.
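
The difference between the three expressions in this exchange can be reproduced in a few lines. This is a sketch using Python's re module, on the assumption that it treats these patterns the same way Sigil's engine does; the second sample string uses a shortened, made-up href.

```python
import re

lazy_dot = re.compile(r'<a class="(.*?)" id="(.*?)"></a>')
not_gt = re.compile(r'<a class="([^>]*?)" id="([^>]*?)"></a>')
not_quote = re.compile(r'<a class="([^"]*?)" id="([^"]*?)"></a>')

linked = '<a class="whatever" href="#here">this is a link</a> <a class="other" id="something"></a>'
with_href = '<a class="calibre16" href="../Text/x.html#filepos1" id="filepos2"></a>'

# (.*?) is lazy, but it may still cross tag boundaries to reach '" id="'.
print(lazy_dot.search(linked).group(1))
# whatever" href="#here">this is a link</a> <a class="other

# [^>] stays inside one tag, yet still swallows the href attribute.
print(not_gt.search(with_href).group(1))
# calibre16" href="../Text/x.html#filepos1

# [^"] cannot leave the attribute value, so only class+id anchors match.
print(not_quote.search(with_href))        # None
print(not_quote.search(linked).group(0))  # <a class="other" id="something"></a>
```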

paulfiera 09-25-2012 12:54 PM

Thanks, Jellby

Quote:

Originally Posted by Jellby (Post 2236713)
Yes, it does because it searches for "any character but >" inside the quotes, and that includes the closing quote and the href part.

You probably want something like this:

Code:

<a class="([^"]*?)" id="([^"]*?)"></a>

That seems to be working alright

Quote:

Originally Posted by Jellby (Post 2236713)
Anyway, an anchor with href and no text is pretty useless.

Agree :)

DiapDealer 09-25-2012 03:52 PM

Quote:

Originally Posted by Jellby (Post 2236635)
<a class="whatever" href="#here">this is a link</a> <a class="other" id="something"></a>

The whole red part would be matched by the first (.*?), right?

Yep. I missed that. Quite obvious when you see it all spelled out in red & black. ;)

This discussion is a perfect example of why I've started avoiding (.*?) if at all possible. It'll always bite you in the ass if it can.

meme 09-25-2012 03:57 PM

Quote:

Originally Posted by DiapDealer (Post 2236905)
Yep. I missed that. Quite obvious when you see it all spelled out in red & black. ;)

This discussion is a perfect example of why I've started avoiding (.*?) if at all possible. It'll always bite you in the ass if it can.

Good thing there is now an option where you can enable or disable its use in the beta :)

meme 09-25-2012 04:00 PM

In case you haven't noticed, there is a new Search Editor in the 0.6.0 beta that allows you to save your searches (and to run them from a separate dialog if you want). You can even run a group of searches in order.

Some sample regexes are loaded if your list is empty (or if you import the examples/search_entries.ini file).

You can export and import entries. So it might be interesting if you post searches you might want to see in the default examples files, and also searches that others might want to import.

DiapDealer 09-25-2012 04:18 PM

Quote:

Originally Posted by meme (Post 2236915)
In case you haven't noticed, there is a new Search Editor in the 0.6.0 beta that allows you to save your searches (and to run them from a separate dialog if you want). You can even run a group of searches in order.

Some sample regexes are loaded if your list is empty (or if you import the examples/search_entries.ini file).

You can export and import entries. So it might be interesting if you post searches you might want to see in the default examples files, and also searches that others might want to import.

OK, this is me, swooning a bit.

I haven't had much time to play with the latest beta yet, but I can see I need to make the time. :)

Doitsu 09-25-2012 04:26 PM

Quote:

Originally Posted by meme (Post 2236915)
In case you haven't noticed, there is a new Search Editor in the 0.6.0 beta that allows you to save your searches (and to run them from a separate dialog if you want). You can even run a group of searches in order.

That's a cool feature that I actually missed. How about adding a Search Editor... button to the Find and Replace dialog?

meme 09-25-2012 05:21 PM

Quote:

Originally Posted by Doitsu (Post 2236954)
That's a cool feature that I actually missed. How about adding a Search Editor... button to the Find and Replace dialog?

Try the Tools Menu. All the fun stuff hides there :)

kiwidude 09-25-2012 06:09 PM

@meme - I am shocked you also forgot to suggest they try the right-click menu on the Find dropdown, to quickly recall a saved search or add the current one to the saved searches... :)

crutledge 09-25-2012 06:30 PM

Finding strings only contained in <p>....</p>
 
Some ebooks capitalize for emphasis and some capitalize all proper names.

The following expression easily finds all cap words in a file: (\w{Lu}+\w).
The problem is that it finds all caps, including those in headers and other places where caps are wanted.

I have been trying for some time to build a regex that will limit itself to those cap words between <p> tags, with no success.

Is there a way to do this?

Jellby 09-26-2012 04:56 AM

Doesn't this work?:

Code:

<p[> ].*(\w{Lu}+\w).*</p>

(it needs the dot not to match newlines, and it would only find one word per paragraph)

In similar cases, I often find it easier to mark someway the words I don't want to match by adding some otherwise unused character (¬ or | are good candidates), then it's easier to match what I do want to match, and I can remove the marking character easily at the end.

Doitsu 09-26-2012 04:58 AM

A quick and dirty solution would be:

Find:([[:upper:]]{2,})(.*?)</p>
Replace:<i>\L\1\E</i>\2</p>

This regular expression searches for uppercase words with at least two uppercase letters and will convert them to lower case italics. (For other case transformation examples see my other post).

Since this expression will only match one uppercase word per paragraph, you'll have to run it repeatedly if your paragraphs contain multiple uppercase words.
Theoretically, it might also miss some uppercase words or match more than one paragraph. I.e. don't use it with Replace All.

If this regular expression actually works for you, please do me a favor and upload a fewer books. ;)
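
Outside Sigil, the one-word-per-paragraph limitation can be avoided by handling each paragraph in a small script. The following is a sketch under stated assumptions, not a drop-in replacement for the expression above: it uses Python's re module, treats only ASCII A-Z as uppercase (Python's re has no [[:upper:]] class), assumes paragraphs are not nested, and the function name italicize_caps is made up.

```python
import re

PARAGRAPH = re.compile(r"<p\b[^>]*>.*?</p>", re.DOTALL)
TAG = re.compile(r"(<[^>]+>)")
CAPS_WORD = re.compile(r"\b[A-Z]{2,}\b")

def italicize_caps(html):
    """Turn ALL-CAPS words inside <p>...</p> into lower-case italics."""
    def fix_paragraph(match):
        # Split into tags and text so attribute values are never touched.
        pieces = TAG.split(match.group(0))
        for i, piece in enumerate(pieces):
            if not piece.startswith("<"):
                pieces[i] = CAPS_WORD.sub(
                    lambda m: "<i>" + m.group(0).lower() + "</i>", piece
                )
        return "".join(pieces)
    return PARAGRAPH.sub(fix_paragraph, html)

sample = "<h2>CHAPTER ONE</h2>\n<p>He was VERY, VERY late.</p>"
print(italicize_caps(sample))
# <h2>CHAPTER ONE</h2>
# <p>He was <i>very</i>, <i>very</i> late.</p>
```

Acronyms and Roman numerals are all-caps words too, so the result still needs a manual check.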

JMikeD 09-27-2012 07:51 PM

I have a number of older books that have been through the OCR process and ended up with paragraph breaks in the middle of sentences. In Open Office, I could get almost all of these fixed by using a regex:

Find: \p([a-z])
Replace: \1\2

I don't seem to be able to get a similar function to work in the Find and Replace of Sigil. The HTML code looks like:

Quote:

<p class="calibre"><span>bad policy to answer a</span></p>

<p class="calibre"><span>direct question. He kept shaking his head like a china figure.
I need to be able to glue sentences such as this back together. Any ideas?

Thanks.

Doitsu 09-27-2012 08:13 PM

I'm sure that there's a more elegant solution, but you could simply search for a paragraph ending in a lowercase letter or a punctuation sign followed by a paragraph starting with a lowercase letter and then join them with a space.

Code:

Find:([[:lower:]],*;*:*)</span></p>\n\n  <p class="calibre"><span>([[:lower:]])
Code:

Replace:\1 \2
(The regex assumes that Tidy is on and that there are two spaces before each <p>.)
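
The same join can be tried in a script before running it on the book. This sketch uses Python's re module and loosens the expression slightly: \s* replaces the exact "\n\n" plus two spaces, so it does not depend on Tidy's indentation, and [,;:]* replaces ,*;*:*. The function name join_paragraphs is made up for illustration.

```python
import re

# A paragraph ending in a lowercase letter (plus optional , ; :) followed by
# a paragraph that starts with a lowercase letter.
BROKEN = re.compile(
    r'([a-z][,;:]*)</span></p>\s*<p class="calibre"><span>([a-z])'
)

def join_paragraphs(html):
    """Glue the two halves of a broken sentence together with one space."""
    return BROKEN.sub(r"\1 \2", html)

sample = (
    '<p class="calibre"><span>bad policy to answer a</span></p>\n\n'
    '  <p class="calibre"><span>direct question.</span></p>'
)
print(join_paragraphs(sample))
# <p class="calibre"><span>bad policy to answer a direct question.</span></p>
```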

DiapDealer 09-27-2012 08:35 PM

Quote:

The HTML code looks like:

Code:

<p class="calibre"><span>bad policy to answer a</span></p>

<p class="calibre"><span>direct question. He kept shaking his head like a china figure.


Ugh. Those empty spans surrounding literally everything are always a pain in the ass. You'll almost surely need to get rid of them first. The problem is ... there can be nested spans (italics/bolds/etc) within them. And that makes it quite painful to regex them away (without funkifying your "real" formatting spans).

If I have the original text to proof against, I sometimes find it easier (and less frustrating) just to blast ALL the spans away. Every single one. And then redo any italic and/or other special formatting using the physical copy as a guide. It's drastic, yes, but sometimes it's less drastic than fixing the havoc that a regex run on nested spans can wreak.

In one fell swoop, all span tags (opening and closing) ... gone (when you replace it with nothing of course):
Code:

</?span[^>]*?>
It all depends on the complexity of the book's formatting, of course. I may not always opt for the "nuclear" span removal approach, but I've done it quite a few times.

Use with an appropriate level of trepidation, of course... ;)
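
To see what the "nuclear" expression does before trying it on a real book, it can be run on a scrap of markup. A minimal sketch with Python's re module; the sample string and the function name strip_spans are invented for illustration.

```python
import re

# Matches every opening and closing span tag, with or without attributes.
SPAN_TAG = re.compile(r"</?span[^>]*?>")

def strip_spans(html):
    """Remove all span tags, keeping the text they wrapped."""
    return SPAN_TAG.sub("", html)

sample = '<p class="calibre"><span>He <span class="italic">really</span> meant it.</span></p>'
print(strip_spans(sample))
# <p class="calibre">He really meant it.</p>
```

Note that the italic span disappears along with the empty one, which is exactly the formatting loss described above.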

JMikeD 09-28-2012 01:15 AM

Quote:

Originally Posted by DiapDealer (Post 2239590)

Use with an appropriate level of trepidation, of course... ;)

It's probably just as easy to export the entire thing to RTF, clean everything up in OpenOffice and use the ePub Export extension in OO. That gives pretty clean results.

Jellby 09-28-2012 05:01 AM

I'd first identify the spans that do something (search for "<span ", replace them with something more meaningful (<i>, <strong>...), or with some other temporary mark), then delete the remaining bogus spans.

WS64 09-29-2012 04:01 PM

I would remove ALL plain <span> tags (those with no attributes) and let Tidy remove the corresponding closing spans.

Then search for
</p>

<p class="calibre">([a-z])
and replace it with
_\1
(_ = blank)

Also search for
([a-zA-Z,])</p>

<p class="calibre">
and replace it with
\1_
(_ = blank)

mrmikel 10-23-2012 08:14 AM

Just a very simple expression for finding instances of period, followed by a space, by a lower case letter, caused by poor OCR.

\. ([a-z])

Not a candidate for auto search and replace because it matches abbreviations, too.
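
Since the expression is meant for finding rather than replacing, one way to use it in a script is to list each hit with some context and decide by eye. A small sketch with Python's re module; the function name suspects and the context width are arbitrary choices.

```python
import re

SUSPECT = re.compile(r"\. ([a-z])")

def suspects(text, context=15):
    """Yield a short excerpt around each '. x' hit for manual review."""
    for m in SUSPECT.finditer(text):
        start = max(m.start() - context, 0)
        yield text[start:m.end() + context]

sample = "He stopped. then he ran, i.e. quickly."
for excerpt in suspects(sample):
    print(excerpt)
```

The sample produces two hits: the genuine OCR error after "stopped" and the abbreviation "i.e.", which shows why an automatic replace is unsafe.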

Toxaris 10-24-2012 02:10 AM

Quote:

Originally Posted by WS64 (Post 2241622)
I would remove ALL <span> (without anything behind) and let Tidy remove the corresponding closing spans.

I would strongly recommend against doing that. If there are nested spans, Tidy doesn't always remove the correct closing span. That can make a real mess out of your book.

It is never a good idea to trust Tidy to make the right choice....

roger64 10-25-2012 01:58 AM

Now that Sigil has a nice Search Editor, I can add some more regex.
I would like to set up this one:

It's about superscript text:

The strings me, mes, lle, lles, er, e and o, when placed within a sup tag and followed by a normal space, should instead be followed by a &nbsp;

Say <sup>me</sup>(normal space) should be replaced by
<sup>me</sup>&nbsp;

(me and lle are the superscript endings used to abbreviate Madame, Mademoiselle...)

I hope I have been clear enough... :o

Perkin 10-25-2012 06:29 AM

Search (put a space at end after </sup>)
Code:

<sup>(me|mes|lle|lles|er|e|o)</sup>
Replace
Code:

<sup>\1</sup>&nbsp;
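
The search and replace pair above can be checked on a sample line before saving it in the Search Editor. A minimal sketch with Python's re module; the sample text and the function name fix_sup_spacing are made up.

```python
import re

# Note the trailing space after </sup>: only a normal space is replaced.
SUP_SPACE = re.compile(r"<sup>(me|mes|lle|lles|er|e|o)</sup> ")

def fix_sup_spacing(html):
    """Replace the space after a listed superscript with &nbsp;."""
    return SUP_SPACE.sub(r"<sup>\1</sup>&nbsp;", html)

print(fix_sup_spacing("M<sup>me</sup> Bovary et M<sup>lle</sup> Rose"))
# M<sup>me</sup>&nbsp;Bovary et M<sup>lle</sup>&nbsp;Rose
```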

roger64 10-25-2012 06:57 AM

@Perkin

Thanks a lot for your help. I did not know how to deal with the "false" words like me, mes...

:thanks::thumbsup:

Already in use.

roger64 12-23-2012 09:47 AM

To replace hyphen with non-breaking hyphen
 
Hi

I am trying to set up a regex for French.

We have some acronyms joined by hyphens (8208), like
J.-C, P.-D.G. (and the list can grow). Unfortunately they always get broken across lines at the hyphen, and it would be much better if they were not. That's why I would like to replace their hyphens with non-breaking hyphens (8209).
I do not know how to set up this regex. Ideally, I would like to be able to add a new word easily.

I think there must be something better than this.
I wrote only 8208 instead of the full &#...:

Search: (J.|P.)8208(C.|D.G)
Replace: \18209\2
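
One possible way to keep the list easy to extend is to hold the acronyms as plain strings and build the expression from them. This is a sketch in Python rather than a Sigil search, and it rests on assumptions: the hyphens are stored as the numeric entities &#8208; and &#8209;, and the names ACRONYMS and protect_hyphens are invented for illustration. Escaping each entry also makes the dots literal; in the search above an unescaped . matches any character.

```python
import re

HYPHEN = "&#8208;"     # U+2010 HYPHEN, assumed to be stored as an entity
NB_HYPHEN = "&#8209;"  # U+2011 NON-BREAKING HYPHEN

# Hypothetical, editable list: add one entry per acronym.
ACRONYMS = ["J.&#8208;C", "P.&#8208;D.G"]

# re.escape makes the dots literal; longer entries go first so that an
# acronym containing a shorter one is matched whole.
PATTERN = re.compile(
    "|".join(re.escape(a) for a in sorted(ACRONYMS, key=len, reverse=True))
)

def protect_hyphens(html):
    """Swap the hyphen for a non-breaking hyphen inside listed acronyms."""
    return PATTERN.sub(lambda m: m.group(0).replace(HYPHEN, NB_HYPHEN), html)

print(protect_hyphens("avant J.&#8208;C. et le P.&#8208;D.G."))
# avant J.&#8209;C. et le P.&#8209;D.G.
```

Hyphens in ordinary words are left alone because only the listed strings are matched.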


All times are GMT -4.