A quick and dirty solution would be:
Find: (chapter) ([[:lower:]]+)
Replace: \u\1 \u\2
This requires Sigil 0.5.3 (or higher). |
Quote:
AFAIK, Calibre uses the Python regular expression library, which doesn't support the \u operator. The expressions that I suggested will work in Sigil or any text editor with PCRE support. Is there any particular reason why you want to use Calibre to replace the text? |
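For anyone who does need this in Python's `re` module (which, as noted above, has no `\u` operator), a replacement function can stand in for it. This is only a minimal sketch: `capitalize_groups` and `fix_headings` are names made up here, `[a-z]` stands in for the POSIX class `[[:lower:]]` (which `re` does not support), and it has not been tried inside Calibre itself.

```python
import re

# [[:lower:]] is not available in Python's re, so [a-z] is used instead.
PATTERN = re.compile(r"(chapter) ([a-z]+)")


def capitalize_groups(match):
    # Rough equivalent of "\u\1 \u\2": upper-case the first letter of each group.
    first, second = match.group(1), match.group(2)
    return first[0].upper() + first[1:] + " " + second[0].upper() + second[1:]


def fix_headings(text):
    # re.sub accepts a function as the replacement argument.
    return PATTERN.sub(capitalize_groups, text)


print(fix_headings("<h2>chapter one</h2>"))
```

The pattern is case-sensitive, so headings that already start with "Chapter" are left alone.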
@Doitsu/Jellby
I was reading through the regex sticky and posted my question before realizing which software forum I was in. I have been able to glean very helpful information. Though the other forum has a regex sticky it does not seem as detailed as this one. I have moved my query here. |
I hate my first post to be a question rather than an answer but needs must when the devil drives.
I have an epub where speech quotes are missing from the start of the line e.g. Quote:
This seems to work (I know there are cases where it fails, but I'm just finding and not auto-fixing) Quote:
Is there something special about Sigil's regex that I'm overlooking? Many thanks in advance |
IMHO, the problem is \w*\W*, which matches a sequence of 0 or more word characters followed by 0 or more non-word characters. I.e., it will at most match one word plus a space or punctuation character. Try .*? instead:
Code:
"\>[^"](.*?)" |
Quote:
Outside of Sigil the regex that I was using worked fine which is what I'm finding odd. Using Quote:
Quote:
Need to think about this, I'm missing something really obvious :smack: |
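The difference between `\w*\W*` and `.*?` discussed above can be checked outside Sigil. The sketch below uses Python's `re`, which may not behave exactly like Sigil's PCRE engine. The `narrow` pattern is only a guess at what the original expression looked like (it is not shown in the thread), and the sample paragraphs are invented.

```python
import re

# Hypothetical reconstruction of the \w*\W* variant (the original is not shown).
narrow = re.compile(r'">[^"]\w*\W*"')
# The suggested variant: lazily match anything up to the next quote.
lazy = re.compile(r'">[^"](.*?)"')

one_word = '<p class="calibre">Yes," she said.</p>'
many_words = '<p class="calibre">Hello there," he said.</p>'
has_quote = '<p class="calibre">"Hello there," he said.</p>'

for sample in (one_word, many_words, has_quote):
    # First column: narrow pattern matched; second column: lazy pattern matched.
    print(bool(narrow.search(sample)), bool(lazy.search(sample)))
```

With these samples, the `\w*\W*` form only catches the one-word case, while `.*?` also catches speech that runs to several words before the closing quote. Neither matches the paragraph that already opens with a quote.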
Strange issue
Most surely, I'm not understanding this the right way.
I'm cleaning up some epubs and have noticed that some of them have anchor tags with a class and an id but without any hyperlink. Some epubs have several hundred in between the text. So I'm using this regex to find anchor links with nothing inside them Quote:
I would like to be able to restrict the findings to only this situation. Many thanks! |
I'm not certain why that expression would match instances with spans or text inside the anchor tags. It shouldn't really.
You might try: Code:
<a class="([^>]*?)" id="([^>]*?)"></a>
But I can't get your expression to misbehave, really. It seems to do (for me anyway) what you've intended it to do. Can you give any examples of code it's matched that you don't think it should match? |
Quote:
which is why it is matching </span></a>. If your id ends with numbers, use that to narrow the scope: (.+?\d+)"></a> |
Quote:
The whole red part would be matched by the first (.*?), right? |
Thanks, Doitsu and DiapDealer
This is from Clive Barker's Imajica Using Quote:
Clicking on Find, the first match is this one: Quote:
Quote:
Strange. |
Thanks, everybody.
It seems that Diapdealer's regex... Quote:
The only result found in the same book is: Quote:
|
Yes, it does because it searches for "any character but >" inside the quotes, and that includes the closing quote and the href part.
You probably want something like this: Code:
<a class="([^"]*?)" id="([^"]*?)"></a> |
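A small check of the point above, written with Python's `re` as a sketch only (Sigil uses PCRE, but these two character classes behave the same way in both). The sample anchors are invented for illustration.

```python
import re

# "Anything but >" can run past the closing quote of the id attribute.
loose = re.compile(r'<a class="([^>]*?)" id="([^>]*?)"></a>')
# "Anything but a quote" has to stop at the end of the attribute value.
strict = re.compile(r'<a class="([^"]*?)" id="([^"]*?)"></a>')

empty_anchor = '<a class="calibre3" id="page_12"></a>'
anchor_with_href = '<a class="calibre3" id="ch1" href="#toc"></a>'

print(loose.search(anchor_with_href).group(2))   # shows how far the id group ran
print(strict.search(anchor_with_href))           # no match expected
print(strict.search(empty_anchor).group(2))
```

Both patterns find the anchor that has only a class and an id. Only the `[^>]` version also accepts the one with an href, because its second group swallows the closing quote and the extra attribute.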
Thanks, Jellby
Quote:
Quote:
|
Quote:
This discussion is a perfect example of why I've started avoiding (.*?) if at all possible. It'll always bite you in the ass if it can. |
In case you haven't noticed, there is a new Search Editor in the 0.6.0 beta that allows you to save your searches (and to run them from a separate dialog if you want). You can even run a group of searches in order.
Some sample regexes are loaded if your list is empty (or if you import the examples/search_entries.ini file). You can export and import entries. So it might be interesting if you post searches you might want to see in the default examples files, and also searches that others might want to import. |
Quote:
I haven't had much time to play with the latest beta yet, but I can see I need to make the time. :) |
@meme - I am shocked you also forgot to suggest they try the right-click menu on the Find dropdown, to quickly recall a saved search or add the current one to the saved searches... :)
|
Finding strings only contained in <p>....</p>
Some ebooks capitalize for emphasis and some capitalize all proper names.
The following expression easily finds all-cap words in a file: (\w{Lu}+\w). The problem is that it finds all caps, including those in headers and other places where caps are wanted. I have been trying for some time to build a regex that will limit itself to those cap words between <p> tags, with no success. Is there a way to do this? |
Doesn't this work?:
Code:
<p[> ].*(\w{Lu}+\w)
(It needs the dot not to match newlines, and it would only find one word per paragraph.) In similar cases, I often find it easier to mark the words I don't want to match in some way, by adding some otherwise unused character (¬ or | are good candidates). Then it's easier to match what I do want to match, and I can remove the marking character easily at the end. |
A quick and dirty solution would be:
Find:([[:upper:]]{2,})(.*?)</p>
Replace:<i>\L\1\E</i>\2</p>
This regular expression searches for uppercase words with at least two uppercase letters and will convert them to lowercase italics. (For other case transformation examples see my other post.) Since this expression will only match one uppercase word per paragraph, you'll have to run it repeatedly if your paragraphs contain multiple uppercase words. Theoretically, it might also miss some uppercase words or match more than one paragraph, i.e. don't use it with Replace All. If this regular expression actually works for you, please do me a favor and upload a few books. ;) |
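Outside Sigil, the "only inside <p>...</p>" restriction is easier to get with two passes: first pick out each paragraph, then look for capitalized words inside it. The following is just a sketch in Python under a few assumptions: paragraphs are not nested, only ASCII capitals matter (`[A-Z]`), and the function names are made up for this example.

```python
import re

# Pass 1: one match per paragraph; "<p[ >]" avoids matching tags such as <pre>.
PARAGRAPH = re.compile(r"<p[ >].*?</p>", re.DOTALL)
# Pass 2: either a tag (left untouched) or a word of two or more capitals.
TAG_OR_CAPS = re.compile(r"(<[^>]+>)|\b[A-Z]{2,}\b")


def _italicize(match):
    if match.group(1):
        # A tag was matched: return it unchanged so attributes are never edited.
        return match.group(1)
    return "<i>" + match.group(0).lower() + "</i>"


def italicize_caps_in_paragraphs(html):
    # Apply the inner substitution only to the text of each paragraph.
    return PARAGRAPH.sub(lambda p: TAG_OR_CAPS.sub(_italicize, p.group(0)), html)


sample = '<h1>CHAPTER ONE</h1>\n<p class="first">He was VERY angry at JOHN.</p>'
print(italicize_caps_in_paragraphs(sample))
```

Unlike the single-expression version, this handles several capitalized words per paragraph in one run and leaves headers alone, but it is still worth reviewing the result rather than trusting it blindly.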
I have a number of older books that have been through the OCR process and ended up with paragraph breaks in the middle of sentences. In Open Office, I could get almost all of these fixed by using a regex:
Find: \p([a-z])
Replace: \1\2
I don't seem to be able to get a similar function to work in the Find and Replace of Sigil. The HTML code looks like: Quote:
Thanks. |
I'm sure that there's a more elegant solution, but you could simply search for a paragraph ending in a lowercase letter or a punctuation sign followed by a paragraph starting with a lowercase letter and then join them with a space.
Code:
Find:([[:lower:]],*;*:*)</span></p>\n\n <p class="calibre"><span>([[:lower:]])
Code:
Replace:\1 \2 |
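The same join can be tried on a copy of the file outside Sigil. This is a rough Python sketch, not a drop-in replacement: `[a-z]` stands in for `[[:lower:]]`, `\s*` is used instead of the exact `\n\n ` whitespace so it tolerates different indentation, and the markup is assumed to look like the calibre-style paragraphs in the expression above.

```python
import re

# A paragraph ending in a lowercase letter (plus optional , ; :) followed by
# a paragraph starting with a lowercase letter.
BROKEN = re.compile(
    r'([a-z][,;:]*)</span></p>\s*<p class="calibre"><span>([a-z])'
)


def join_broken_paragraphs(html):
    # Keep both captured characters and join them with a single space.
    return BROKEN.sub(r"\1 \2", html)


sample = (
    '<p class="calibre"><span>The cat sat on the</span></p>\n\n'
    '  <p class="calibre"><span>mat and purred.</span></p>'
)
print(join_broken_paragraphs(sample))
```

Paragraphs that end with a full stop are not touched, since the first group has to end directly before `</span></p>`.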
Quote:
If I have the original text to proof against, I sometimes find it easier (and less frustrating) just to blast ALL the spans away. Every single one. And then redo any italic and/or other special formatting using the physical copy as a guide. It's drastic, yes, but sometimes it's less drastic than fixing the havoc that a regex run on nested spans can wreak. In one fell swoop, all span tags (opening and closing) ... gone (when you replace it with nothing of course): Code:
</?span[^>]*?>
Use with an appropriate level of trepidation, of course... ;) |
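To see what this expression removes before running it on a real book, it can be tried on a scrap of markup first. A minimal Python sketch (the sample HTML and the `strip_spans` name are invented):

```python
import re

# Matches both opening and closing span tags, with or without attributes.
SPAN_TAG = re.compile(r"</?span[^>]*?>")


def strip_spans(html):
    # Replacing with nothing removes the tags but keeps the text between them.
    return SPAN_TAG.sub("", html)


sample = '<p><span class="x">Hello <span>big</span> world</span></p>'
print(strip_spans(sample))
```

Every span goes, including the ones that carried italics or other formatting, so the caveat above applies here too.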
I'd first identify the spans that do something (search for "<span ", replace them with something more meaningful (<i>, <strong>...), or with some other temporary mark), then delete the remaining bogus spans.
|
I would remove ALL <span> (without anything behind) and let Tidy remove the corresponding closing spans.
Then search for </p> <p class="calibre">([a-z]) and replace it with _\1 (_ = blank) Also search for ([a-zA-Z,])</p> <p class="calibre"> and replace it with \1_ (_ = blank) |
Just a very simple expression for finding instances of period, followed by a space, by a lower case letter, caused by poor OCR.
\. ([a-z]) Not a candidate for auto search and replace because it matches abbreviations, too. |
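Since this one is for finding rather than replacing, a small script that lists each hit with some surrounding text can help when reviewing a long file. A sketch in Python; `find_suspects` and its `context` parameter are made-up names, and the sample sentence is invented.

```python
import re

SUSPECT = re.compile(r"\. ([a-z])")


def find_suspects(text, context=10):
    # Return each match together with a few characters on either side.
    return [
        text[max(0, m.start() - context): m.end() + context]
        for m in SUSPECT.finditer(text)
    ]


sample = "He stopped. then he left. It was late, i.e. very late."
for snippet in find_suspects(sample):
    print(repr(snippet))
```

In the sample, the first hit is a genuine OCR-style error and the second is the abbreviation "i.e.", which shows why this should not be run as an automatic replace.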
Quote:
It is never a good idea to trust Tidy to make the right choice.... |
Now that Sigil has a nice Search Editor, I can add some more regex.
I would like to set up this one. It's about superscript text: when the text me, mes, lle, lles, er, e or o is placed within a sup tag and followed by a normal space, it should instead be followed by a non-breaking space. Say, <sup>me</sup>(normal space) should be replaced by <sup>me</sup>(non-breaking space). (me, lle are superscripts, short for M(adame), M(ademoiselle)...) I hope I have been clear enough... :o |
Search (put a space at end after </sup>)
Code:
<sup>(me|mes|lle|lles|er|e|o)</sup>
Replace
Code:
<sup>\1</sup> |
@Perkin
Thanks a lot for your help. I did not know how to deal with the "false" words like me, mes... :thanks::thumbsup: Already in use. |
To replace hyphen with non-breaking hyphen
Hi
I'm trying to set up a regex for the French language. We have some acronyms joined with hyphens (8208), like J.-C, P.-D.G. (and the list can grow). Unfortunately, they always get broken at the hyphen, and it would be much better if they were not. That's why I would like to replace their hyphens with non-breaking hyphens (8209). I do not know how to set up this regex. Ideally, I would like to be able to easily add one new word. I think there must be something better than this. I wrote only 8208 instead of the full &#...:
Search: (J.|P.)8208(C.|D.G)
Replace: \18209\2 |
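One way to make "just add one new word" easy is to keep the acronyms in a list and build the expression from it. The following is only a sketch in Python, not a Sigil search: it assumes the hyphens appear in the source as the numeric entities `&#8208;` and `&#8209;`, and the `ACRONYMS` list and `protect_hyphens` name are invented for illustration.

```python
import re

HYPHEN = "&#8208;"      # ordinary hyphen, as written in the source
NB_HYPHEN = "&#8209;"   # non-breaking hyphen

# (left part, right part) of each acronym; add new entries here.
ACRONYMS = [("J.", "C."), ("P.", "D.G.")]

# re.escape keeps the dots literal, so "J." does not match "Ja" and so on.
PATTERN = re.compile(
    "|".join(
        re.escape(left) + re.escape(HYPHEN) + re.escape(right)
        for left, right in ACRONYMS
    )
)


def protect_hyphens(text):
    # Swap the hyphen only inside the listed acronyms.
    return PATTERN.sub(lambda m: m.group(0).replace(HYPHEN, NB_HYPHEN), text)


print(protect_hyphens("avant J.&#8208;C. et le P.&#8208;D.G."))
```

Note that in the Search expression above the dots are unescaped, so each one matches any character. Escaping them (`J\.` and so on) would make the match stricter.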