MobileRead Forums

Sigil: Regex examples (https://www.mobileread.com/forums/showthread.php?t=167971)

mmat1 04-08-2012 06:58 AM

Quote:

Originally Posted by roger64 (Post 2034166)
I sent the epub with one dropcap.

I've seen it, but in the meantime I tested this:

Code:

<p class="let">([A-Z])(\w{0,20}\s)((\w{0,20}\s){3})

<p class="let"><span class="let1 let2">\1</span><span class="smcpTypeV">\2\3</span>

It handles the first four words of a "let" paragraph (Perkin's suggestion is built in). Maybe give it a try.
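
For anyone who wants to try the expression outside Sigil, here is a minimal Python sketch. The find and replace strings are copied from the post above; the sample paragraph and the function name `wrap_dropcap` are made up for illustration, and Python's `re` engine is not the one Sigil uses, so treat it only as a rough check.

```python
import re

# Find/replace strings copied from the post above.
FIND = r'<p class="let">([A-Z])(\w{0,20}\s)((\w{0,20}\s){3})'
REPLACE = (r'<p class="let"><span class="let1 let2">\1</span>'
           r'<span class="smcpTypeV">\2\3</span>')


def wrap_dropcap(html: str) -> str:
    """Wrap the first letter, then the rest of the first four words, in spans."""
    return re.sub(FIND, REPLACE, html)


# Hypothetical sample paragraph, not taken from roger64's epub.
sample = '<p class="let">Once upon a time there was a king.</p>'
result = wrap_dropcap(sample)
print(result)
```

Note that the space after the fourth word is captured by the last group, so in this sketch it ends up inside the second span.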

roger64 04-08-2012 07:08 AM

1 Attachment(s)
@mmat No, getting strange results..
@Perkin No, same as above.

Sending the source odt as well

Perkin 04-08-2012 07:09 AM

I've found a difference between replacing once and replacing all, using both my and mmat1's replace strings: with a step-by-step replace the last space ends up inside the span, but with Replace All the last space ends up outside it.

Just posting a report in 0.5.3 release thread.

mmat1 04-08-2012 07:16 AM

Quote:

Originally Posted by Perkin (Post 2034178)
I've found a difference between replacing once and replacing all, using both my and mmat1's replace strings: with a step-by-step replace the last space ends up inside the span, but with Replace All the last space ends up outside it.

It's maybe just tidy ??

Perkin 04-08-2012 07:18 AM

Using your epub, concatenating the two let1/let2 styles and removing the now extraneous </span>, I get an odd result as well, so it's not the regex itself, it's the CSS :D

looking into it.

@mmat1, tidy is off, and even if on shouldn't alter that anyway.

mmat1 04-08-2012 07:25 AM

Quote:

Originally Posted by roger64 (Post 2034177)
@mmat No, getting strange results..

I'll look at your source.

@Perkin: You're right, Sigil shouldn't do this, but sometimes it does unexpected things...

Perkin 04-08-2012 07:29 AM

Until the css is sorted, you can just use mmat1's solution with(out) the combined let1/2 styles.

Code:

<p class="let">([A-Z])(\w{0,20}\s)((\w{0,20}\s){3})

<p class="let"><span class="let1"><span class="let2">\1</span></span><span class="smcpTypeV">\2\3</span>

That should do. At least it's only one step - which is what you asked for originally ;).

mmat1 04-08-2012 07:43 AM

Quote:

Originally Posted by Perkin (Post 2034191)
Until the css is sorted, you can just use mmat1's solution with(out) the combined let1/2 styles.

A last suggestion from my side: I noticed that it will not work if there are accented characters, so the search should look like this instead:

Code:

<p class="let">([A-Z])([^ ]{0,20}\s)(([^ ]{0,20}\s){3})
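
A small Python sketch comparing the two searches may help show the difference. One assumption to flag: Python's `re` treats accented letters as word characters by default, so the sample below relies on an apostrophe, a comma and a hyphen to make the `\w` version fail. The sample sentence is invented.

```python
import re

# Original search using \w, and the revised search using [^ ].
WORD_CLASS = r'<p class="let">([A-Z])(\w{0,20}\s)((\w{0,20}\s){3})'
NON_SPACE = r'<p class="let">([A-Z])([^ ]{0,20}\s)(([^ ]{0,20}\s){3})'

# Hypothetical French opening with an apostrophe, a comma and a hyphen.
sample = '<p class="let">L\'été, dit-on, arrive bientôt cette année.</p>'

word_match = re.search(WORD_CLASS, sample)
non_space_match = re.search(NON_SPACE, sample)

print(word_match)                 # \w cannot cross the apostrophe
print(non_space_match.groups())   # [^ ] accepts any non-space character
```

Because `[^ ]` excludes only the space character, it will also accept tags and punctuation, so it is worth checking a few real paragraphs before running Replace All.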

Perkin 04-08-2012 07:51 AM

And I've noticed that because the spans are combined, the measurements are now relative to the new font size, which is 4.6em, rather than to the default font size.

roger64 04-08-2012 08:04 AM

First thanks to the three of you for your quick and efficient help.

Quote:

Originally Posted by mmat1 (Post 2034203)
A last suggestion from my side: I noticed that it will not work if there are accented characters, so the search should look like this instead:

Code:

<p class="let">([A-Z])([^ ]{0,20}\s)(([^ ]{0,20}\s){3})

Thanks a lot, this time it's working!! And yes, I did not use \w because I needed to take care of accented characters, punctuation marks, apostrophes and so on, which can all be found in the first four words of French texts.

I will study it and check Perkin's measurements. :)

Do any of you have an idea how to run several regexes one after another automatically (not related to each other like these two)? Is there a tool that can do it?

Perkin 04-08-2012 08:10 AM

Right, create a new style in the css (or adapt your existing one)
Code:

.let3{
  display      : block;
  float        : left;
  margin-left  : 0.28em;
  margin-top  : -0.18em; /* essayer -0.20em pour deux lignes */
  margin-right : 0.0em;
  font-family  : 'Times New Roman';
  font-size    : 4.6em;  /* essayer 3.33em pour deux lignes */
  height      : 1em;
}

Edit: I don't know if you want to change the French comments :)

Then using
Code:

<p class="let">([A-Z])([^ ]{0,20}\s)(([^ ]{0,20}\s){3})
Code:

<p class="let"><span class="let3">\1</span><span class="smcpTypeV">\2\3</span>

Perkin 04-08-2012 08:33 AM

Quote:

Originally Posted by roger64 (Post 2034219)
Do any of you have an idea how to run several regexes one after another automatically (not related to each other like these two)? Is there a tool that can do it?

It's easier to do on the HTML before you get it into Sigil, but one tool that can work on multiple files as well as run multiple search-and-replace steps is PowerGREP.

It's not as complicated as it first looks once you get used to it. Once you define a sequence, it runs multiple search-and-replace steps over any set of files.

They also do a few other brilliant regex products (testing/building/explaining) and a texteditor (which is my preferred editor).

roger64 04-08-2012 09:46 AM

@Perkin

After successive refinements, this looks better indeed. I will try this tomorrow.

I go straight from odt to EPUB. I just tweak the EPUB a little after converting, but most of the work is done without touching directly any html file.

I am a Linux user, but I now see which kind of tool can do it. My idea was to add to our existing EPUB conversion program a kind of editable super macro (the user could insert or modify any regex inside it). But if the user needs PowerGREP to use it, that would defeat the purpose. On the other hand, if PowerGREP can help me prepare this kind of super macro, which could later be used without it as a kind of program, then yes, it would be worthwhile.
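
One way such an editable "super macro" could look is an ordered list of find/replace pairs run by a short script. The Python sketch below is only an illustration of that idea: the first rule is the dropcap regex from this thread, while the second rule (dropping empty paragraphs), the `RULES` list format and the `apply_rules` name are hypothetical.

```python
import re

# Each rule is (pattern, replacement); rules run in the order listed.
RULES = [
    # Dropcap rule from earlier in this thread.
    (r'<p class="let">([A-Z])([^ ]{0,20}\s)(([^ ]{0,20}\s){3})',
     r'<p class="let"><span class="let3">\1</span>'
     r'<span class="smcpTypeV">\2\3</span>'),
    # Hypothetical second rule: remove paragraphs containing only whitespace.
    (r'<p>\s*</p>', ''),
]


def apply_rules(text: str, rules=RULES) -> str:
    """Apply every (pattern, replacement) pair to text, one after another."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


sample = '<p class="let">Il était une fois un roi.</p><p> </p>'
cleaned = apply_rules(sample)
print(cleaned)
```

The user would only ever edit the `RULES` list; the order matters, because each rule sees the output of the one before it.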

mmat1 04-08-2012 09:56 AM

Quote:

Originally Posted by roger64 (Post 2034219)
Do any of you have an idea how to run several regexes one after another automatically (not related to each other like these two)? Is there a tool that can do it?

PowerGREP looks really powerful at first glance; at second glance I saw the price ... :)

So let me mention the command-line Unix tools sed and awk, which are available for Windows as well. With awk you can do nearly anything, but you'll have to build some skills first...

Perkin 04-08-2012 10:12 AM

Quote:

Originally Posted by mmat1 (Post 2034325)
PowerGREP looks really powerful at first glance; at second glance I saw the price ... :)

So let me mention the command-line Unix tools sed and awk, which are available for Windows as well. With awk you can do nearly anything, but you'll have to build some skills first...

I agree, need to learn more regex fu. ;) I haven't used PowerGrep for ages, and must have been demo, didn't realise it was that dear. My toolbar and smilies aren't working, had to add < b > tags myself, was working earlier though.

mmat1 04-08-2012 12:02 PM

Quote:

Originally Posted by Perkin (Post 2034332)
I agree, need to learn more regex fu. ;) I haven't used PowerGrep for ages, and must have been demo, didn't realise it was that dear. My toolbar and smilies aren't working, had to add < b > tags myself, was working earlier though.

Uhm, did I state correctly what I meant? "Building skills" referred to learning to use awk. That has nothing to do with regex; it's just the art of tricky script programming. With awk, you can take an HTML file as input and get a list of primes as output (each word replaced by a prime, in order).
And it was not a comment on your skills; if it was understood that way, I apologize.

Perkin 04-08-2012 12:56 PM

Quote:

Originally Posted by mmat1 (Post 2034450)
Uhm, did I state correctly what I meant? "Building skills" referred to learning to use awk. That has nothing to do with regex; it's just the art of tricky script programming. With awk, you can take an HTML file as input and get a list of primes as output (each word replaced by a prime, in order).
And it was not a comment on your skills; if it was understood that way, I apologize.

No need for apology, half my fault at not reading it correctly.
Not having used awk I just thought it was another grep style prog using regex - again my fault.
(My brain is only half working at the moment because of medication, at least that's what I'm blaming it on :) I think I used up today's lucidity quota earlier, on the actual regex problem and CSS adjustment.)

mmat1 04-08-2012 04:12 PM

Quote:

Originally Posted by Perkin (Post 2034533)
I just thought it was another grep style prog using regex - again my fault.

"Sed" is a grep-style regex tool, "awk" is much more, it's actually a script-programming environment.

I hope you'll be better soon.

roger64 04-09-2012 12:42 AM

I tried your solutions: both are working.

I would rather stay with mmat's, though, because with the second one I found it a little trickier to fine-tune the dropcap position. BTW, a good part of my limited knowledge of dropcaps comes from here.

Thanks again and take care.

NB: I'll have a look at "awk".

mncowboy 04-17-2012 10:39 AM

I have a few documents in MS Word that I'm going to convert to ebooks. The documents have a lot of endnotes, and as usual, Word puts out a lot of junk when saved as HTML. I am very new to regex and was wondering if I could get help. I want to search for:
Code:

<a href="#_edn1" name="_ednref1" title=""><span class="MsoEndnoteReference"><span class="MsoEndnoteReference"><b><span style="font-size:8.0pt;font-family:&quot;Times New Roman&quot;,&quot;serif&quot;;color:black">[1]</span></b></span></span></a>
and replace it with:
Code:

<a href="#_edn1" name="_ednref1" title=""><sup>[1]</sup></a>
The endnote numbers are from 1 to 125. Is there a single regex that can do this in Sigil?
Thanks in advance.

DiapDealer 04-17-2012 11:11 AM

Quote:

Originally Posted by mncowboy
Is there a single regex that can do this in Sigil?
Thanks in advance.

Find:
Code:

<a href="#_edn(\d+)" name="_ednref(\d+)" title=""><span class="MsoEndnoteReference"><span class="MsoEndnoteReference"><b><span style="font-size:8.0pt;font-family:&quot;Times New Roman&quot;,&quot;serif&quot;;color:black">\[(\d+)\]</span></b></span></span></a>
Replace:
Code:

<a href="#_edn\1" id="_ednref\2" title=""><sup>[\3]</sup></a>
Or if you know, absolutely, that the numbers will always be the same across each instance:
Find:
Code:

<a href="#_edn(\d+)" name="_ednref\d+" title=""><span class="MsoEndnoteReference"><span class="MsoEndnoteReference"><b><span style="font-size:8.0pt;font-family:&quot;Times New Roman&quot;,&quot;serif&quot;;color:black">\[\d+\]</span></b></span></span></a>
Replace:
Code:

<a href="#_edn\1" id="_ednref\1" title=""><sup>[\1]</sup></a>
NOTE: I replaced the "name" attribute with "id" because "name" is old and tired. ;)
EDIT: The above stuff is all based on the assumption that the <b>, <span>, and font-family/size stuff is identical in all of the original endnote code instances. You'd need to make judicious use of (.*?) if not.
(and I had a mistake in the first edition of this post that I corrected)
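
For the case where the <b>, <span> and style attributes are not identical everywhere, here is a Python sketch of a more tolerant search. It is not the expression posted above: the part that accepts any run of opening and closing <span>/<b> tags is an assumption about what Word may emit, and the helper names `word_endnote` and `clean_endnotes` are made up.

```python
import re

# Tolerant variant (an assumption, not the regex from the post above).
ENDNOTE = re.compile(
    r'<a href="#_edn(\d+)" name="_ednref\d+" title="">'
    r'(?:<(?:span|b)\b[^>]*>)+'   # any run of opening <span>/<b> tags
    r'\[\d+\]'                    # the visible [n] marker
    r'(?:</(?:span|b)>)+'         # the matching run of closing tags
    r'</a>'
)
REPLACEMENT = r'<a href="#_edn\1" id="_ednref\1" title=""><sup>[\1]</sup></a>'


def word_endnote(n: int) -> str:
    """Build a Word-style endnote reference like the one quoted above."""
    return (
        f'<a href="#_edn{n}" name="_ednref{n}" title="">'
        '<span class="MsoEndnoteReference"><span class="MsoEndnoteReference"><b>'
        '<span style="font-size:8.0pt;font-family:&quot;Times New Roman&quot;,'
        '&quot;serif&quot;;color:black">'
        f'[{n}]</span></b></span></span></a>'
    )


def clean_endnotes(html: str) -> str:
    """Replace every Word endnote reference with a plain <sup> link."""
    return ENDNOTE.sub(REPLACEMENT, html)


print(clean_endnotes(word_endnote(1)))
```

Like the second find/replace pair above, this reuses one captured number three times, so it assumes the href, name and visible number always agree.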

mncowboy 04-17-2012 11:43 AM

Thank you sir!!
Changing name= to id= is one of the first S&R I do on a document.

You would think that Word would have changed over by now.

GRiker 04-20-2012 09:09 AM

Specifying space character in replace field?
 
With the old regex engine, I could use '\x20' to specify a space in the replacement pattern, but that no longer works in the current version.

Other than using a literal space, how do I specify a space character in the replace field? (I don't want to use a literal space, because I often save my s/r patterns in a development notes file, and they're hard to see in plain text.)

Perkin 04-20-2012 01:37 PM

You could use

Code:

& #32;
(remove the space)

Edit: I think you might only be able to use that if the replacement is part of the text, not inside a tag.

GRiker 04-20-2012 01:53 PM

Quote:

Originally Posted by Perkin (Post 2050688)
You could use

Code:

& #32;
(remove the space)

Edit: I think you might only be able to use that if the replacement is part of the text, not inside a tag.

That would work, but I was looking for something that would insert an actual ASCII space in the text, rather than an entity (for readability).

After lots of experimenting, I discovered that I could use
Code:

\U \E
as the replacement term but that seems indirect and inelegant. But it works.

G

roger64 05-11-2012 12:40 PM

Hi

It's just a small question. To select letters intended to become dropcaps, I use this part of a Regex:
([A-Z])

However, I realize this does not select accented capitals that do exist in French (like É, À, Ô and so on). Of course, I can just suppress their accents. But if I wish to make a drop-cap out of an accented capital, what would be the code?

(.) is a catch-all. Do you have something better?

theducks 05-11-2012 02:02 PM

Quote:

Originally Posted by roger64 (Post 2077906)
Hi

It's just a small question. To select letters intended to become dropcaps, I use this part of a Regex:
([A-Z])

However, I realize this does not select accented capitals that do exist in French (like É, À, Ô and so on). Of course, I can just suppress their accents. But if I wish to make a drop-cap out of an accented capital, what would be the code?

(.) is a catch-all. Do you have something better?

([A-ZÉÀÔ])

The dash just means a range; characters listed without a dash mean "any one of these". You can use both together, as I have.

DiapDealer 05-11-2012 03:51 PM

Quote:

Originally Posted by roger64 (Post 2077906)
However, I realize this does not select accented capitals that do exist in French (like É, À, Ô and so on). Of course, I can just suppress their accents. But if I wish to make a drop-cap out of an accented capital, what would be the code?

(.) is a catch-all. Do you have something better?

Code:

\p{Lu}
Will catch all upper-case letters (including unicode characters), if that's what you're looking for. Add parentheses to make it a capture group if desired, of course.

roger64 05-12-2012 12:10 PM

@DiapDealer, theducks

Thanks very much for your replies. As this regex is intended to be used for French texts, I will use theducks' proposal. I just did not know one could add letters this way as I did not see any example of it.

DiapDealer 05-12-2012 12:47 PM

Quote:

Originally Posted by roger64 (Post 2079076)
@DiapDealer, theducks

Thanks very much for your replies. As this regex is intended to be used for French texts, I will use theducks' proposal. I just did not know one could add letters this way as I did not see any example of it.

Just so you know, it doesn't matter what language it is. If it's a valid uppercase letter (including Unicode characters with an acute, grave, breve, umlaut, or really any valid diacritic), (\p{Lu}) will capture it. But whatever you're comfortable with is the way to go. ;)
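
To compare the two approaches in a script, here is a short Python sketch. As far as I know Python's standard `re` module does not accept `\p{Lu}`, so the sketch checks the Unicode category with `unicodedata` instead; that substitution, the sample words and the function names are assumptions for illustration. It also assumes precomposed characters (É as one code point).

```python
import re
import unicodedata

# theducks' approach: list the accented capitals you expect.
EXPLICIT_CAP = re.compile(r"([A-ZÉÀÔ])")


def is_uppercase_letter(ch: str) -> bool:
    """True when the single character ch has Unicode category Lu."""
    return unicodedata.category(ch) == "Lu"


def leading_cap(text: str):
    """Return the first character if it is an uppercase letter, else None."""
    if text and is_uppercase_letter(text[0]):
        return text[0]
    return None


print(EXPLICIT_CAP.match("École") is not None)  # É is in the explicit list
print(EXPLICIT_CAP.match("Île") is not None)    # Î was not listed
print(leading_cap("Île"))                       # the category check needs no list
```

The explicit list is easy to read but silently skips any capital you forgot to add; the category check covers every uppercase letter without maintenance.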

roger64 05-25-2012 11:41 AM

@DiapDealer

Did not see your reply in time. It really needed your explanation. Yes, of course, this is also a very convenient solution. I've noted it. Thanks again.

paulfiera 06-05-2012 02:44 PM

Change Chapter text to Heading
 
How can I change in Sigil all the occurrences of "Chapter" like the following example:
Quote:

Chapter One
Where "One" can be "Two", "Three", and so on...

...or even "1", "2", "3",...


with

Quote:

<h1>Chapter [the text that goes after the word Chapter]</h1>
Many thanks!


Edit: Never mind, I think I found the solution in JeremyR's post. Many thanks, JeremyR.

meme 06-05-2012 02:59 PM

You don't say what the original Chapter One looks like in code view. Just the text isn't sufficient to make sure the find/replace is correct.

Assuming you have
Code:

<p>Chapter SOMETHING</p>
and want
Code:

<h1>Chapter SOMETHING</h1>
then
Code:

Find:    (?sU)<p>Chapter (.*)</p>
Replace: <h1>Chapter \1</h1>

might get you what you want.
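
Here is a rough Python sketch of the same idea for testing outside Sigil. Python has no ungreedy flag, so the lazy quantifier `.*?` stands in for the `U` in `(?sU)`. The second, looser pattern (any attributes on the <p>, any letter case) is my own assumption, not part of the suggestion above.

```python
import re

# Same idea as the find/replace above, with .*? replacing the U flag.
STRICT = re.compile(r'(?s)<p>Chapter (.*?)</p>')

# Hypothetical looser variant: allows attributes on <p> and any letter case.
LOOSE = re.compile(r'(?si)<p\b[^>]*>(chapter\s.*?)</p>')


def strict_headings(html: str) -> str:
    """Turn <p>Chapter X</p> into <h1>Chapter X</h1>."""
    return STRICT.sub(r'<h1>Chapter \1</h1>', html)


def loose_headings(html: str) -> str:
    """Same, but keep the original wording and ignore <p> attributes."""
    return LOOSE.sub(r'<h1>\1</h1>', html)


sample = '<p>Chapter One</p>\n<p>Some text.</p>\n<p>Chapter 2</p>'
print(strict_headings(sample))
print(loose_headings('<p class="calibre4">CHAPTER 1</p>'))
```

Without the lazy quantifier, a dot-matches-newline search would run from the first "Chapter" to the last </p> in the file, which is why the ungreedy setting matters here.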

paulfiera 06-05-2012 03:09 PM

Quote:

Originally Posted by meme (Post 2104368)
You don't say what the original Chapter One looks like in code view. Just the text isn't sufficient to make sure the find/replace is correct.

Assuming you have
Code:

<p>Chapter SOMETHING</p>
and want
Code:

<h1>Chapter SOMETHING</h1>
then
Code:

Find:    (?sU)<p>Chapter (.*)</p>
Replace: <h1>Chapter \1</h1>

might get you what you want.

Many thanks, meme. You are correct.

In Code View it is

Quote:

<p class="calibre4">CHAPTER 1</p>
Using JeremyR's code seems to do the trick. The Chapters are now

Quote:

<h3>CHAPTER 1</h3>
I omitted the horizontal line and used calibre to split the html at every h3.

Don't know if I'm doing this right though :)

I have more books with the same issue. I'll try with your code next time.

Many thanks.

roger64 06-19-2012 09:00 AM

Successive Find and Replace

I wish to clean an html text which suffers from recurrent mistakes from an OCR engine (Cuneiform).

When I come across one of the mistakes, I make a replacement and note it. After some pages, I have met most of the mistakes, and now I intend to build a regex combining as many as 15 successive simple search-and-replace operations like the following two.
A@ → à
B@ → ç
I do not know how to perform these 15 F&R within a single regex. Suppose I wanted to build it for the two above, what should I write?

Note: I already use UTF-8 for the whole text.

DiapDealer 06-19-2012 09:25 AM

Quote:

Originally Posted by roger64 (Post 2120333)
Successive Find and Replace

I wish to clean an html text which suffers from recurrent mistakes from an OCR engine (Cuneiform).

When I come across one of the mistakes, I make a replacement and note it. After some pages, I have met most of the mistakes, and now I intend to build a regex combining as many as 15 successive simple search-and-replace operations like the following two.
A@ → à
B@ → ç
I do not know how to perform these 15 F&R within a single regex. Suppose I wanted to build it for the two above, what should I write?

Note: I already use UTF-8 for the whole text.

I'm not sure what you're asking for is feasible. What you've described is something that would be more suited to an external program/algorithm (or a plugin) rather than one single Regular Expression. Finding all 15 with one expression wouldn't be the hard part... replacement based on "if/then" logic is where it would fall apart.

roger64 06-19-2012 11:07 AM

OK. Thanks for your answer. I will try to find another solution

Doitsu 06-19-2012 11:47 AM

You could create a simple sed script with one line for each character that you need to fix. E.g.

Code:

s/A@/à/g
s/B@/ç/g

Then simply save the lines as a utf8 text file (without BOM), e.g. fix.sed, and execute it with sed:

Code:

sed -f fix.sed -i *.html
(Note that this will overwrite the original files.)
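
If sed is not at hand, the same table of fixes can be applied from a short script. The Python sketch below does it in a single pass: one alternation finds any garbled sequence, and a lookup decides the replacement, which is the "if/then" step a plain find-and-replace field cannot do. Only the two pairs given in the question are included; the names `FIXES` and `fix_ocr` and the sample text are made up.

```python
import re

# OCR fixes from the question above; extend the table with the other pairs.
FIXES = {"A@": "à", "B@": "ç"}

# Longest keys first, so a longer garbled sequence wins over a shorter prefix.
PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(FIXES, key=len, reverse=True))
)


def fix_ocr(text: str) -> str:
    """Replace every garbled sequence with its corrected character."""
    return PATTERN.sub(lambda match: FIXES[match.group(0)], text)


sample = "VoilA@ un garB@on"
fixed = fix_ocr(sample)
print(fixed)
```

When reading and writing the HTML files, opening them with `encoding="utf-8"` keeps the text in UTF-8 and does not add a BOM.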

roger64 06-19-2012 12:07 PM

@Doitsu

Wow!! It's working very well! Thanks a lot!!
What does BOM mean?

DiapDealer 06-19-2012 12:09 PM

Sorry, I was only thinking in terms of the F&R regex feature of Sigil. :o


All times are GMT -4.