MobileRead Forums

MobileRead Forums (https://www.mobileread.com/forums/index.php)
-   Sigil (https://www.mobileread.com/forums/forumdisplay.php?f=203)
-   -   Regex examples (https://www.mobileread.com/forums/showthread.php?t=167971)

Doitsu 06-22-2020 12:32 PM

The following expression should do the trick:

Find:<span class="Cap">(.)</span><span class="SmallCap">(.*?)</span>
Replace:\1\L\2\E

Mister L 06-22-2020 02:02 PM

Quote:

Originally Posted by Doitsu (Post 4003348)
The following expression should do the trick:

Find:<span class="Cap">(.)</span><span class="SmallCap">(.*?)</span>
Replace:\1\L\2\E

Thanks, I use that one when there is only one span to remove, however it doesn't seem to work when there are several in the same phrase like in my example. I have checked "minimal match" but it still finds the first opening tag of the series and then the last closing tag, rather than each pair in succession.

Example:
<span class="Cap">F</span><span class="SmallCap">IRST WORD OF THE SENTENCE IS ALWAYS CAPITALISED,</span> <span class="Cap">O</span><span class="SmallCap">THER</span> <span class="Cap">W</span><span class="SmallCap">WORDS IN THE SENTENCE MAY OR MAY NOT BE CAPITALISED</span>

Desired result :
First word of the sentence is always capitalised, Other Words in the sentence may or may not be capitalised.

Actual result:
First word of the sentence is always capitalised,</span> <span class="cap">o</span><span class="smallcap">ther</span> <span class="cap">w</span><span class="smallcap">words in the sentence may or may not be capitalised

So I have lost some capital letters I want to preserve, and also for some reason even if I insert the cursor before the next opening tag of a complete pair, no other matches are found, and the code is broken.

I have not found any way to fix this other than doing it by hand, I really don't know if it's possible to do it with regex to be honest*.

Edit: *I should say, except by doing it in several passes, first
<span class="Cap">(.)</span><span class="SmallCap">(.*?)</span> <span class="Cap">(.)</span><span class="SmallCap">(.*?)</span> <span class="Cap">(.)</span><span class="SmallCap">(.*?)</span>

replace
\1\L\2\E \3\L\4\E \5\L\6\E

Then
<span class="Cap">(.)</span><span class="SmallCap">(.*?)</span> <span class="Cap">(.)</span><span class="SmallCap">(.*?)</span>

replace
\1\L\2\E \3\L\4\E

Then just one pair.

But I'm interested to know if there is a way to manage all the cases in just one pass, in case you don't know in advance how many sets of spans there might be.

Doitsu 06-22-2020 02:49 PM

Quote:

Originally Posted by Mister L (Post 4003375)
Example:
<span class="Cap">F</span><span class="SmallCap">IRST WORD OF THE SENTENCE IS ALWAYS CAPITALISED,</span> <span class="Cap">O</span><span class="SmallCap">THER</span> <span class="Cap">W</span><span class="SmallCap">WORDS IN THE SENTENCE MAY OR MAY NOT BE CAPITALISED</span>

Desired result :
First word of the sentence is always capitalised, Other Words in the sentence may or may not be capitalised.

Actual result:
First word of the sentence is always capitalised,</span> <span class="cap">o</span><span class="smallcap">ther</span> <span class="cap">w</span><span class="smallcap">words in the sentence may or may not be capitalised

That's not the result that I'm getting. With:

Find:<span class="Cap">(.)</span><span class="SmallCap">(.*?)</span>
Replace:\1\L\2\E


I'm getting:

First word of the sentence is always capitalised, Other Wwords in the sentence may or may not be capitalised

(None of the Regex options are checked.) You'll need to uncheck the Minimal Match option.
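
For anyone who wants to check this outside Sigil, the same find and replace can be approximated in Python. This is only a sketch: Python's `re` module has no `\L...\E` case operators, so a replacement function lowercases the second group instead, and the names `PAIR`, `fix_pair` and `sample` are made up for the illustration.

```python
import re

# The Find pattern from the post, unchanged. The quantifier .*? is lazy here
# because Python's re module has no counterpart to Sigil's Minimal Match option.
PAIR = re.compile(r'<span class="Cap">(.)</span><span class="SmallCap">(.*?)</span>')

def fix_pair(match):
    # Stand-in for the Replace string \1\L\2\E:
    # keep the capital as is, lowercase the small-caps text.
    return match.group(1) + match.group(2).lower()

sample = (
    '<span class="Cap">F</span><span class="SmallCap">IRST WORD OF THE SENTENCE IS ALWAYS CAPITALISED,</span> '
    '<span class="Cap">O</span><span class="SmallCap">THER</span> '
    '<span class="Cap">W</span><span class="SmallCap">WORDS IN THE SENTENCE MAY OR MAY NOT BE CAPITALISED</span>'
)

result = PAIR.sub(fix_pair, sample)
print(result)
```

Each pair is matched on its own, so the number of pairs in a phrase does not matter. The doubled "Ww" in the output comes from the sample text itself, which has a W in both spans.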

Mister L 06-22-2020 04:53 PM

Quote:

Originally Posted by Doitsu (Post 4003391)
That's not the result that I'm getting. With:

Find:<span class="Cap">(.)</span><span class="SmallCap">(.*?)</span>
Replace:\1\L\2\E


I'm getting:

First word of the sentence is always capitalised, Other Wwords in the sentence may or may not be capitalised

(None of the Regex options are checked.) You'll need to uncheck the Minimal Match option.

Oh, it works if I uncheck Minimal Match! That's the exact opposite of what I would have expected, to be honest... I would expect "minimal match" to make it select the smallest amount of text possible to match the pattern but apparently not. Do you know why it does not work that way, at least in this case?

Well thanks very much for that tip, in future I will experiment more with minimal match and see when it is helpful and when not.

Doitsu 06-22-2020 05:54 PM

Quote:

Originally Posted by Mister L (Post 4003426)
Do you know why it does not work that way, at least in this case?

AFAIK, if Minimal Match is selected, Sigil will prefix the search string with (?U).

From the PCRE documentation:

Quote:

PCRE_UNGREEDY

This option inverts the "greediness" of the quantifiers so that they
are not greedy by default, but become greedy if followed by "?". It is
not compatible with Perl. It can also be set by a (?U) option setting
within the pattern.

(If you remove the question mark from my regex and select Minimal Match, it works as expected.)
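
The effect of the inversion can be reproduced in Python, with one caveat: Python's `re` module does not accept `(?U)` as an ungreedy switch, so this sketch writes out by hand the pattern that the option effectively turns the search into. The variable and function names are illustrative only.

```python
import re

text = (
    '<span class="Cap">F</span><span class="SmallCap">IRST WORD OF THE SENTENCE IS ALWAYS CAPITALISED,</span> '
    '<span class="Cap">O</span><span class="SmallCap">THER</span> '
    '<span class="Cap">W</span><span class="SmallCap">WORDS IN THE SENTENCE MAY OR MAY NOT BE CAPITALISED</span>'
)

def lower_second(match):
    # Stand-in for the Replace string \1\L\2\E.
    return match.group(1) + match.group(2).lower()

# Minimal Match unchecked: .*? stays lazy, exactly as written.
as_written = r'<span class="Cap">(.)</span><span class="SmallCap">(.*?)</span>'
# Minimal Match checked: (?U) flips .*? to greedy, so it behaves like .* here.
as_if_ungreedy = r'<span class="Cap">(.)</span><span class="SmallCap">(.*)</span>'

lazy_count = len(re.findall(as_written, text))        # 3 separate pairs
greedy_count = len(re.findall(as_if_ungreedy, text))  # 1 match, first tag to last </span>
broken = re.sub(as_if_ungreedy, lower_second, text)

print(lazy_count, greedy_count)
print(broken)
```

The single greedy match swallows the inner tags into group 2, which is why they come out lowercased and still in place.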

Mister L 06-22-2020 08:34 PM

Quote:

Originally Posted by Doitsu (Post 4003460)
AFAIK, if Minimal Match is selected, Sigil will prefix the search string with (?U).

From the PCRE documentation:

(If you remove the question mark from my regex and select Minimal Match, it works as expected.)

Ok, very interesting, I did not realise it inverted whatever was already present.


My original question was about fixing chapter headings to make it possible to easily regenerate a TOC, as I recently had a file with chapter headings in this format. To get back to that, is it possible to keep the original text as is, but copy the modified text into a title attribute, bearing in mind there can be a variable number of sets of spans in the title?

I know how to do it with only one set, as I said, but except if I do it in multiple passes (3 sets then 2 sets then 1 set) I don't think this pattern works.

For instance, a heading (that may or may not have a class and may or may not have an ID), and the text is in fake small-caps. Some titles may have one or more capitalised words in the middle but not all of them. I want to add the title attribute in sentence case.

Find:
<h1 class="chapter" id="id01"><span class="Cap">F</span><span class="SmallCap">IRST WORD OF THE SENTENCE IS ALWAYS CAPITALISED</span></h1>

But also:
<h1 class="chapter" id="id01"><span class="Cap">F</span><span class="SmallCap">IRST WORD OF THE SENTENCE IS ALWAYS CAPITALISED, OTHER</span> <span class="Cap">W</span><span class="SmallCap">ORDS IN THE SENTENCE MAY OR MAY NOT BE CAPITALISED</span></h1>

And also:
<h1 class="chapter" id="id01"><span class="Cap">F</span><span class="SmallCap">IRST WORD OF THE SENTENCE IS ALWAYS CAPITALISED,</span> <span class="Cap">O</span><span class="SmallCap">THER</span> <span class="Cap">W</span><span class="SmallCap">ORDS IN THE SENTENCE MAY OR MAY NOT BE CAPITALISED</span></h1>

etc.

Replace:

<h1 class="chapter" id="id01" title="First word of the sentence is always capitalised, Other Words in the sentence may or may not be capitalised"><span class="Cap">F</span><span class="SmallCap">IRST WORD OF THE SENTENCE IS ALWAYS CAPITALISED,</span> <span class="Cap">O</span><span class="SmallCap">THER</span> <span class="Cap">W</span><span class="SmallCap">ORDS IN THE SENTENCE MAY OR MAY NOT BE CAPITALISED</span></h1>

(and other variations with zero or more capitalised words in the middle of the title)


I usually use some variation of this:
Search:
<h1 class="chapter" id="(.*)"><span class="Cap">(.)</span><span class="SmallCap">(.*)</span></h1>

Replace:
<h1 class="chapter" id="\1" title="\2\L\3\E"><span class="Cap">\2</span><span class="SmallCap">\3</span></h1>

It works if there is only one set of spans. If there is more than one set I have to multiply the search variables and the replace variables so there are the same number of sets of each, for iterations of the same search, or correct them by hand one by one as I go along.
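
One way to handle a variable number of span pairs in a single pass is to split the job in two: match the whole heading first, then collapse every pair inside it to build the title. The following is a rough sketch in Python, run outside Sigil's Find & Replace. It assumes the headings are `<h1>` elements, that their content is only plain text plus these span pairs, and that the text contains no double quotes. The names `SPAN_PAIR`, `HEADING`, `add_title` and `add_titles` are invented for the example.

```python
import re

# One Cap/SmallCap pair, as in the searches above.
SPAN_PAIR = re.compile(r'<span class="Cap">(.)</span><span class="SmallCap">(.*?)</span>')
# Any <h1> element: group 1 = its attributes (may be empty), group 2 = its content.
HEADING = re.compile(r'<h1\b([^>]*)>(.*?)</h1>', re.DOTALL)

def add_title(match):
    attrs, inner = match.group(1), match.group(2)
    if re.search(r'\btitle\s*=', attrs):
        return match.group(0)  # already has a title attribute: leave it alone
    # Collapse every pair, however many there are, to get the plain-text title.
    title = SPAN_PAIR.sub(lambda m: m.group(1) + m.group(2).lower(), inner)
    # Assumption: title is now plain text with no double quotes or leftover tags.
    return '<h1{} title="{}">{}</h1>'.format(attrs, title, inner)

def add_titles(html):
    # The original spans are kept; only the title attribute is added.
    return HEADING.sub(add_title, html)

one_pair = ('<h1 class="chapter" id="id01"><span class="Cap">F</span>'
            '<span class="SmallCap">IRST WORD OF THE SENTENCE IS ALWAYS CAPITALISED</span></h1>')
print(add_titles(one_pair))
```

Because the class and id are captured as one attribute string, headings with or without them go through the same code path.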

luciaisacat 07-25-2020 07:06 AM

Hi! I have chapters without titles and would like to add, to each one of them, their file names in the body text... is this regexable? Thanks!

:help:

L

Turtle91 07-25-2020 09:20 AM

Quote:

Originally Posted by luciaisacat (Post 4015849)
Hi! I have chapters without titles and would like to add, to each one of them, their file names in the body text... is this regexable? Thanks!

:help:

L

Yes, it is....probably....maybe..... :eek:

We have to have some kind of example to help you. Can you copy/paste a section of code that we can look at?

luciaisacat 07-25-2020 09:32 AM

Sure, thanks for helping....

The recurrent code for each chapter/file is the following :
<head>
[....]
</head>
<body>
<h2>Chapter</h2>
My idea would be to Find/Replace in regex mode: <h2>Chapter</h2> with <h2><filename></h2> .... do you know if there is a regex code for <filename>?

Thanks




Quote:

Originally Posted by Turtle91 (Post 4015867)
Yes, it is....probably....maybe..... :eek:

We have to have some kind of example to help you. Can you copy/paste a section of code that we can look at?


DiapDealer 07-25-2020 10:27 AM

Unfortunately, I don't think what you want is possible. Not with Sigil's built-in regex Search & Replace feature anyway. Search and Replace doesn't know what the filename is. In this instance, it's merely searching the content given to it. Which is the contents of the file. The regex engine is unaware that what it's searching for (and replacing) is even part of a file. It's immaterial.

You (or someone) may be able to construct a Sigil plugin that could accomplish this, but unless it's something you foresee happening a lot, developing a plugin to handle it would likely be more involved than just manually updating the chapter headers.

Sorry.

luciaisacat 07-25-2020 10:50 AM

OK... Thanks!

:thanks:

Quote:

Originally Posted by DiapDealer (Post 4015881)
Unfortunately, I don't think what you want is possible. Not with Sigil's built-in regex Search & Replace feature anyway. Search and Replace doesn't know what the filename is. In this instance, it's merely searching the content given to it. Which is the contents of the file. The regex engine is unaware that what it's searching for (and replacing) is even part of a file. It's immaterial.

You (or someone) may be able to construct a Sigil plugin that could accomplish this, but unless it's something you foresee happening a lot, developing a plugin to handle it would likely be more involved than just manually updating the chapter headers.

Sorry.


davidfor 07-25-2020 11:10 AM

You should be able to do it with the calibre editor. You can create a regex function, and one of its parameters is the name of the file where the match was found. A really dumb function to do this is:

Code:

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    return match.group(0) + '<h2>' + file_name + '</h2>'

With this, using the search term:

Code:

(<body>)

Gave me:

Code:

<body><h2>OEBPS/Text/Section0001.xhtml</h2>

Which is the full path for the file. The function is written in Python, so you can do whatever you want with it. But, you could use the above to get the file name into each file, and then use other searches-and-replaces to arrange things the way you want.
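
To try a function like this without opening the editor, it can be driven with plain `re.sub`. The sketch below is a rough local stand-in, not how calibre itself calls the function: `run_like_editor` is a hypothetical helper, and it passes `None` for the metadata, dictionaries, data and functions arguments, which the real editor fills in.

```python
import re

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # Same function as above: append a heading holding the file name.
    return match.group(0) + '<h2>' + file_name + '</h2>'

def run_like_editor(pattern, text, file_name):
    """Hypothetical test harness: call replace() once per match in one file."""
    count = 0

    def call(match):
        nonlocal count
        count += 1  # number = running count of matches, starting at 1
        return replace(match, count, file_name, None, None, None, None)

    return re.sub(pattern, call, text)

page = '<body>\n<h2>Chapter</h2>'
output = run_like_editor(r'(<body>)', page, 'OEBPS/Text/Section0001.xhtml')
print(output)
```

The file name reaches the function as an ordinary string, so any Python string handling can be applied to it before it is returned.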

And for the record, I have another function called "Number Chapter" which can be used to number chapters across the complete book in one go. I don't remember if it is a supplied function, or one mentioned in the forum that I copied.

Klecks 07-26-2020 04:00 AM

3 Attachment(s)
Hi luciaisacat,

I have played with the idea from davidfor and came up with the following solution for the calibre editor:

1. open your file in the calibre editor.
2. call search/replace, change to Mode: "regex-function", insert your search string and then click "create/edit"
3. you will see a basic template function.
4. insert a name for your new Function and replace the code with:
Code:

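import re
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # take file_name and strip the file path and extension:
    newName = file_name.rsplit('/', 1)[-1].rsplit('.', 1)[0]
    newName = re.sub(r'_', r' ', newName)    # replace _ with a space
    newName = re.sub(r' 0', r' ', newName)   # strip a leading zero after a space
    # replace the bracketed term (group 1) with newName, by position, so that
    # regex special characters in the matched text or the file name are harmless:
    whole = match.group()
    start, end = match.start(1) - match.start(), match.end(1) - match.start()
    return whole[:start] + newName + whole[end:]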

5. confirm with OK and now you can use that function. It will replace whatever is between the brackets in your search string with the file name (minus path and file extension)
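
The "brackets" in step 5 are the parentheses of the first capture group, so for the chapter headings asked about earlier the search string could be `<h2>(Chapter)</h2>`. Here is a small sketch of the two steps involved, runnable outside the editor. The helper names `pretty_name` and `fill_group` and the file name `Chapter_01.xhtml` are made up for the example.

```python
import re

def pretty_name(file_name):
    """Strip path and extension, then tidy the name."""
    base = file_name.rsplit('/', 1)[-1]   # drop the folder part, if any
    base = base.rsplit('.', 1)[0]         # drop the extension, if any
    base = base.replace('_', ' ')         # underscores become spaces
    return base.replace(' 0', ' ')        # drop one zero that directly follows a space

def fill_group(match, new_text):
    """Return the whole match with capture group 1 swapped for new_text."""
    whole = match.group()
    start = match.start(1) - match.start()
    end = match.end(1) - match.start()
    return whole[:start] + new_text + whole[end:]

html = '<body>\n<h2>Chapter</h2>'
file_name = 'OEBPS/Text/Chapter_01.xhtml'   # hypothetical file name
result = re.sub(r'<h2>(Chapter)</h2>',
                lambda m: fill_group(m, pretty_name(file_name)),
                html)
print(result)
```

Only one zero is removed per space, so a name such as `Chapter_001` would come out as "Chapter 01".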


Klecks

Leonatus 09-09-2020 04:27 AM

Remove Kobo spans
 
I just want to contribute a bit, as I found a way to remove Kobo spans:

Search:
Code:

<span class="koboSpan" id="kobo.[0-9]{1,2}([.][0-9]{1,2})?">([^<>]+)</span>

Replace:
Code:

\2

Might be useful to someone.
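
A quick way to see what this search does and does not catch is to run it in Python on a few strings. The markup below is invented for the test and may not match what a real Kobo file contains; the pattern itself is copied unchanged from the post.

```python
import re

KOBO_SPAN = re.compile(
    r'<span class="koboSpan" id="kobo.[0-9]{1,2}([.][0-9]{1,2})?">([^<>]+)</span>'
)

sample = ('<p><span class="koboSpan" id="kobo.1.1">Hello </span>'
          '<span class="koboSpan" id="kobo.1.2">world.</span></p>')
cleaned = KOBO_SPAN.sub(r'\2', sample)
print(cleaned)  # <p>Hello world.</p>

# Two cases the pattern, as written, leaves untouched:
long_id = '<span class="koboSpan" id="kobo.123.1">text</span>'            # more than two digits
nested = '<span class="koboSpan" id="kobo.2.1">some <i>more</i> text</span>'  # tag inside the span
print(KOBO_SPAN.sub(r'\2', long_id) == long_id)  # True
print(KOBO_SPAN.sub(r'\2', nested) == nested)    # True
```

If a book does contain ids with more digits, widening `{1,2}` to `+` would be one possible adjustment, though that is untested against real files.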

davidfor 09-09-2020 07:06 AM

Quote:

Originally Posted by Leonatus (Post 4032633)
I just want to contribute a bit, as I found a way to remove Kobo spans:

Search:
Code:

<span class="koboSpan" id="kobo.[0-9]{1,2}([.][0-9]{1,2})?">([^<>]+)</span>
Replace:
Code:

\2
Might be useful to someone.

For the record, the Modify ePub plugin includes a function to remove the Kobo spans. And, Diap's Editing Toolbag plugin can remove spans based on the class. I think that is the way I did it the last time I wanted to.


All times are GMT -4.