MobileRead Forums

Sigil: Regex examples (https://www.mobileread.com/forums/showthread.php?t=167971)

Jellby 12-23-2012 10:43 AM

Are there instances of a hyphen after a period that you do not want to replace? If there aren't, you can just replace all ".-" with ".¬" (where I use ¬ to stand for the non-breaking hyphen), escaping the period if needed.

Doitsu 12-23-2012 10:46 AM

I'm sure that the Regex gurus will come up with a much more efficient Regex, but I'd simply search for a capital letter with a period followed by ‐ and another capital letter followed by a period:

Find: ([[:upper:]]\.)‐([[:upper:]]\.)
Replace: \1‑\2

This should work in Sigil and any other editor with PCRE support.
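
For anyone who would rather script this outside Sigil, here is a minimal Python sketch of the same idea. It assumes the hyphen between the initials is either the ASCII hyphen or U+2010, and it uses `[A-Z]` because Python's `re` module does not support the POSIX class `[[:upper:]]` (so accented capitals are not covered). The name `protect_initials` is made up for this example.

```python
import re

NB_HYPHEN = "\u2011"  # U+2011 NON-BREAKING HYPHEN

# Capital + period, a hyphen (ASCII "-" or U+2010), capital + period.
INITIALS = re.compile(r"([A-Z]\.)[-\u2010]([A-Z]\.)")


def protect_initials(text: str) -> str:
    """Swap the hyphen between two dotted initials for a non-breaking one."""
    return INITIALS.sub(lambda m: m.group(1) + NB_HYPHEN + m.group(2), text)


# repr() makes the otherwise invisible difference between the hyphens visible.
print(repr(protect_initials("en 52 av. J.-C., puis")))
```

A hyphen after a lowercase abbreviation is left alone, which is the point of requiring capitals on both sides.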

roger64 12-23-2012 11:46 PM

Hi

As with many things, I gather experience book after book. After preparing a history book, I realized that using a plain hyphen for J.-C. (70 occurrences in one book) was NOT a nice idea.

I have no idea how many words of this kind I may find, and I am really not sure that all occurrences of .- deserve the same treatment. That's why I first thought of adding them one by one.

But in fact, I realize there does not seem to be much risk in trying your solutions, so I will try them. Thanks for them. :)

And enjoy a Merry Christmas.

Jellby 12-24-2012 04:36 AM

Well, try searching for ".-" first and see which occurrences you find. With any luck you'll see they all want to be non-breaking, or you may see a pattern (like Doitsu's suggestion) and find some typos ;)

mzmm 01-09-2013 10:08 AM

found myself parsing messy html today, removing empty <p> tags, or <p> tags containing &nbsp;, or <p><i></i></p>, <p><b> </b></p> etc. so that i could space the paragraphs consistently in css, and, inspired by this thread, thought i'd share the snippet in case anyone has a use for it.

i realize it could probably be more concise, and i wouldn't just blindly replace all, but it seems to do the job. it removes <p> tags that may also contain <b>, <i>, <span>, have no content, or 1 or more spaces, or a <br>,<br/>,<br />.

Code:

<p[^>]*>((<\w+[^>/]*>)+)?(<br((\s)?/)?>|&nbsp;|\s*)((</\w+[^>]*>)+)?</p>
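
A quick way to see what this pattern catches before replacing anything is to list the matches. The sketch below does that in Python with the pattern copied unchanged; the function name `find_empty_paragraphs` and the sample markup are invented for illustration, and Sigil's PCRE engine may differ from Python's `re` in corner cases.

```python
import re

# mzmm's pattern, unchanged: a <p> holding only optional opening tags,
# then a <br>, an &nbsp; or whitespace, then optional closing tags.
EMPTY_P = re.compile(
    r"<p[^>]*>((<\w+[^>/]*>)+)?(<br((\s)?/)?>|&nbsp;|\s*)((</\w+[^>]*>)+)?</p>"
)


def find_empty_paragraphs(html: str) -> list:
    """Return every paragraph the pattern would remove, as matched text."""
    return [m.group(0) for m in EMPTY_P.finditer(html)]


sample = "\n".join([
    "<p>Keep me</p>",
    "<p>&nbsp;</p>",
    '<p class="x"><i></i></p>',
    "<p><b> </b></p>",
    "<p><br /></p>",
])

for hit in find_empty_paragraphs(sample):
    print(hit)
```

Listing first and replacing second fits the advice above not to replace all blindly.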

ReaderRabbit 01-13-2013 07:54 PM

Simple question about quotes
 
I have two very basic questions. First about finding straight quotes (") and replacing them with curved opening quotes [A-Z] then the different curved closing quotes using maybe [," or ." or ?" or !"].

And here's why. . . I want to eventually find paragraphs with broken quotes. Paragraphs that have a “ (opening quote) but not a ” (closing quote).

Does this make sense?

So this is a two part question.

Thanks so much . . . I know you brainyachs will have a solution. :thumbsup:

I am using Sigil v 0.6.2

theducks 01-13-2013 08:24 PM

Quote:

Originally Posted by ReaderRabbit (Post 2380624)
I have two very basic questions. First about finding straight quotes (") and replacing them with curved opening quotes [A-Z] then the different curved closing quotes using maybe [," or ." or ?" or !"].

And here's why. . . I want to eventually find paragraphs with broken quotes. Paragraphs that have a “ (opening quote) but not a ” (closing quote).

Does this make sense?

So this is a two part question.

Thanks so much . . . I know you brainyachs will have a solution. :thumbsup:

I am using Sigil v 0.6.2


Good luck :D
How do you tell where the missing quote belongs?
"That's nice," she said.
"I'll take it from here."

'Smarty' can handle most straight=>curly conversions, but I don't think it comes with a Ouija board plugin :rofl:

ReaderRabbit 01-13-2013 08:32 PM

There would be an opening quote in one paragraph and the closing quote in the next paragraph. I want to bring the two paragraphs together. I also prefer the curved quotes to the straight quotes in books.

theducks 01-13-2013 08:37 PM

Quote:

Originally Posted by ReaderRabbit (Post 2380645)
There would be an opening quote in one paragraph and the closing quote in the next paragraph. I want to bring the two paragraphs together. I also prefer the curved quotes to the straight quotes in books.

You need to fix the missing/join part. (There is a 'Saved Search: Join Paragraphs' that might help you step through fairly quickly. Make a backup :thumbsup:)
Smarty (used in Calibre, and possibly available elsewhere) can then do the curly conversion.

DiapDealer 01-13-2013 08:41 PM

This is simply something that regex doesn't lend itself very well to. Heuristic algorithms are better suited for this job (but can still fall short of being perfect).

ReaderRabbit 01-13-2013 08:59 PM

OK, I will check into making changes in Calibre tho I would not know how to distinguish between Opening and Closing quotes when straight quotes are used in the original text. I thought Regex would probably be good at that.

My thanks to theducks and DiapDealer

Doitsu 01-14-2013 04:10 AM

Quote:

Originally Posted by ReaderRabbit (Post 2380624)
I have two very basic questions. First about finding straight quotes (") and replacing them with curved opening quotes [A-Z] then the different curved closing quotes using maybe [," or ." or ?" or !"].

If the source text currently contains no curly quotation marks at all, you could use kiwidude's Modify ePub plugin. It has a Smarten Punctuation option that'll replace all straight quotes with curly quotes using a heuristic algorithm, which is pretty good, but not perfect.

Ahu Lee 01-15-2013 12:47 PM

I'm sorry I have gone through the thread, but couldn't find anything that would work for my needs. I've read the wiki article, tried in vain many different combinations, but I'm obviously doing something wrong.

I would appreciate a little help on this:
I need to make a simple match to grab this thing with whatever characters within the id name:
<h2 class="story_title" id="(whatever)">

Many many thanks!

----------Edited------------

I found one in the "Saved Searches" in Sigil, touched it up a bit and it did the job.
(?sU)<h2([^>]*>.*)

theducks 01-15-2013 09:20 PM

Quote:

Originally Posted by Ahu Lee (Post 2382813)
I'm sorry I have gone through the thread, but couldn't find anything that would work for my needs. I've read the wiki article, tried in vain many different combinations, but I'm obviously doing something wrong.

I would appreciate a little help on this:
I need to make a simple match to grab this thing with whatever characters within the id name:
<h2 class="story_title" id="(whatever)">

Many many thanks!

----------Edited------------

I found one in the "Saved Searches" in Sigil, touched it up a bit and it did the job.
(?sU)<h2([^>]*>.*)

You mean, like:
Code:

<h2 class="story_title" id="(.+?)">
which grabs whatever when enclosed with all the other stuff shown
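
As a rough illustration of what the parentheses capture, here is the same pattern run through Python's `re` module. The sample headings and the name `story_ids` are invented; the second call shows, under the same assumptions, how a stray trailing space pasted after the pattern is enough to make it match nothing.

```python
import re

PATTERN = r'<h2 class="story_title" id="(.+?)">'


def story_ids(html: str, pattern: str = PATTERN) -> list:
    """Return only the captured group, i.e. the text between the id quotes."""
    return re.findall(pattern, html)


sample = (
    '<h2 class="story_title" id="ch_01">One</h2>\n'
    '<h2 class="story_title" id="ch_02">Two</h2>'
)

print(story_ids(sample))                 # the captured id values
print(story_ids(sample, PATTERN + " "))  # same pattern plus a trailing space
```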

Ahu Lee 01-16-2013 02:58 PM

Quote:

Originally Posted by theducks (Post 2383367)
You mean, like:
Code:

<h2 class="story_title" id="(.+?)">
which grabs whatever when enclosed with all the other stuff shown

Well, yes, I had tried that as well, but it didn't (and still doesn't \ I've just tried it again just in case) match anything. No matches found.

Though, I'm not sure I understand this (not English-wise, but just what it's all about :)):
Quote:

...when enclosed with all the other stuff shown
Could you elaborate on that for the newbie like me? Perhaps that's where the problem is.

Thank you very much!

theducks 01-16-2013 10:09 PM

Quote:

Originally Posted by Ahu Lee (Post 2384300)
Well, yes, I had tried that as well, but it didn't (and still doesn't \ I've just tried it again just in case) match anything. No matches found.

Though, I'm not sure I understand this (not English-wise, but just what it's all about :)):


Could you elaborate on that for the newbie like me? Perhaps that's where the problem is.

Thank you very much!

If that does not match, then, maybe :
1) you are not in REGEX mode
2) you have accidentally included a Leading or trailing space in the search selection <<< I choose this one :D

( ) delineate the captured (stuff :D ) text

LightFromMoon 01-17-2013 08:43 AM

Could anybody help me in this case?:

There is a code from EPUB:
Quote:

you</p>

<p class="calibre2">need
I need to replace those tags throughout the whole document with a single space, so that I get just the string "you need", and so on.
I tried to use:
(*[a-z])</p>

 <p class="calibre2">([a-z]*) <-- did not work. It said that no match was found.

mzmm 01-17-2013 09:04 AM

this should work, but i'd still go through the file and replace them one at a time. it'll match a </p> not preceded by a non-alphanumeric character (like ?, ., !, etc.)

Code:

find:

(?<!\W)</p>\s+<p[^>]*>

rep:

 <---- this is a single space
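
For checking the behaviour outside Sigil, a small Python sketch of this find/replace pair follows. The pattern is copied as posted; the function name `join_broken_paragraphs` and the sample strings are assumptions made for the example. Note that it replaces every match at once, whereas the advice above is to step through them one at a time.

```python
import re

# A </p> whose preceding character is not a non-word character (so the
# paragraph ends in a letter, digit or underscore), then the next <p ...>.
BROKEN_PARA = re.compile(r"(?<!\W)</p>\s+<p[^>]*>")


def join_broken_paragraphs(html: str) -> str:
    """Replace each mid-sentence paragraph break with a single space."""
    return BROKEN_PARA.sub(" ", html)


print(join_broken_paragraphs('you</p>\n\n<p class="calibre2">need'))
print(join_broken_paragraphs('He stopped.</p>\n<p class="calibre2">Then'))
```

The second sample is left untouched because the paragraph ends in a period.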


LightFromMoon 01-17-2013 09:19 AM

Quote:

Originally Posted by mzmm (Post 2385191)
this should work, but i'd still go through the file and replace them one at a time. it'll match a </p> not preceded by a non-alphanumeric character (like ?, ., !, etc.)

Code:

find:

(?<!\W)</p>\s+<p[^>]*>

rep:

 <---- this is a single space



Excuse me, but it didn't work for me. No matches found ....

mzmm 01-17-2013 09:22 AM

Quote:

Originally Posted by LightFromMoon (Post 2385204)
Excuse me, but it didn't work for me. No matches found ....

what are you using to edit the html?

it matches in sigil, make sure you have regex mode turned on.

Ahu Lee 01-17-2013 10:43 AM

theducks,

it was #1 :smack:. What a shame! :D

Thank you!

MuskratBooks 01-26-2013 12:43 AM

Find & fix quote in split paragraphs
 
In Sigil this expression has been helpful:
(“[^”\r\n]*)</p>\s+<p class="calibre.">
Replace with (has a trailing space): \1

This identifies paragraphs where an opening smart quote is not matched with a closing smart quote and joins that paragraph with the next one. It's not foolproof, but it saves a lot of time.

I use calibre conversion to switch straight quotes to smart quotes. It's under "Look & Feel"; check "Smarten punctuation". Easier to fix its mistakes than to find and fix 'em all.

Good Luck!
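
Here is a hedged Python sketch of the expression above, for trying it on a sample before running it in Sigil. The pattern and replacement are copied as posted; the function name `join_split_quotes` and the sample markup are made up. The `calibre.` part matches exactly one character after "calibre", so a class such as `calibre12` would not match.

```python
import re

# An opening curly quote, then anything except a closing curly quote or a
# line break up to </p>, followed by the start of the next paragraph.
SPLIT_QUOTE = re.compile(r'(“[^”\r\n]*)</p>\s+<p class="calibre.">')


def join_split_quotes(html: str) -> str:
    """Join a paragraph with an unclosed “ to the paragraph after it."""
    return SPLIT_QUOTE.sub(r"\1 ", html)  # note the trailing space


broken = '<p class="calibre1">“I said</p>\n<p class="calibre1">nothing.”</p>'
intact = '<p class="calibre1">“Hi.”</p>\n<p class="calibre1">Next</p>'

print(join_split_quotes(broken))
print(join_split_quotes(intact))
```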

Perkin 01-26-2013 10:27 AM

Quote:

Originally Posted by MuskratBooks (Post 2397534)
In Sigil this expression has been helpful:
(“[^”\r\n]*)</p>\s+<p class="calibre.">
Replace with (has a trailing space): \1

This identifies paragraphs where an opening smart quote is not matched with a closing smart quote and joins that paragraph with the next one. It's not foolproof, but it saves a lot of time.

I use calibre conversion to switch straight quotes to smart quotes. It's under "Look & Feel"; check "Smarten punctuation". Easier to fix its mistakes than to find and fix 'em all.

Good Luck!

You have to be careful: quite a lot of books (especially older ones) have quotations that span several paragraphs, usually a long speech. There the closing quote is deliberately omitted at the end of a paragraph because the speech continues in the next one, which (usually) starts with an opening quote.


In calibre, you can use the 'Modify ePub' plugin, which can smarten punctuation without a full conversion.

ditke 02-01-2013 06:52 AM

I've read previous posts but my problem is either not covered in them or I simply missed it.

How can I, in the Replace field in Sigil, refer to part of the search regex given in the Find field? For example, the text
"theAmerican" should be changed to "the American".

The Search field is easy, "[a-z][A-Z]", but "[a-z] [A-Z]" does not work in the Replace field because Sigil inserts the replacement text literally instead of keeping the lowercase and uppercase letters, whatever they are.

I have almost no knowledge of regular expressions, please help me in this.

Doitsu 02-01-2013 07:26 AM

Quote:

Originally Posted by ditke (Post 2406224)
How can I, in the Replace field in Sigil, refer to part of the search regex given in the Find field? For example, the text
"theAmerican" should be changed to "the American".

Use round brackets:

Find: ([a-z])([A-Z])
Replace: \1 \2

For more information search for backreferences.
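
The same backreference idea, sketched in Python for anyone who wants to test it on a word list first. The function name `split_run_together` is invented for the example; the pattern and replacement are the ones given above.

```python
import re


def split_run_together(text: str) -> str:
    """Insert a space between a lowercase letter and the capital after it.

    Group 1 keeps the lowercase letter and group 2 keeps the capital, so
    both survive the replacement instead of being overwritten.
    """
    return re.sub(r"([a-z])([A-Z])", r"\1 \2", text)


print(split_run_together("theAmerican"))
print(split_run_together("McDonald"))  # legitimate mixed case is split too
```

As the second call suggests, names with internal capitals are also split, so stepping through the matches is safer than replacing all.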

ditke 02-01-2013 09:26 AM

It works perfectly, thank you so much, Doitsu!

Sabardeyn 02-02-2013 10:06 AM

I'm off topic here, as this is about a *.CBR and otherwise has nothing to do with Sigil, but the recent post reminded me I meant to inquire.

I've got a series of images with a page number added at the end, but the existing page numbering is a disaster. Is there a means of stripping any ending numbers (only), without removing numbers from other locations in the filename?

Ultimately I want my output to look like:
Terminator 2 -- ch14 pg023.jpg
With preceding zeros as placeholders to force proper viewing order.

I've tried: [0-9,3] to find the page numbers, but that removes all of the numbers in the example filename shown above.

If I try appending the $, then I get no matches. I know it has to be something I am doing wrong.

Adding page numbers back in is a straight %03d replacement which I've been doing as a second step after stripping the pages (it's a total renumber, nothing can be saved).

PS: My apologies if this message needs to be moved, but I wasn't sure where else it might be more relevant within the forums.

Perkin 02-02-2013 10:55 AM

I take it you're extracting, using a proper re-name util, then re-packing.
Search : (.*pg)\d{1,3}(\.jpg)
(stores everything up to and including 'pg', discards the following digits, stores the extension)
Replace: \1<whatever inserts counter>\2
so if ? is a number-counter character: \1???\2
(or, depending on the regex flavor, you may need $ instead of \ for group replacement).

Sabardeyn 02-02-2013 05:49 PM

Perkin,
Thanks for understanding what I'm doing despite me leaving the extract/fix/repack process unmentioned. :) I did however leave you with a bad example - that was the output, not the input.

The thing is, I cannot use "pg" as that is something I am adding. Basically the input files are variously named but with numbers at the end.

So a more realistic input page might be:
Terminator 2 23.jpg
that I want to rename into:
Terminator 2 -- ch14 pg023.jpg

There are other naming issues, but I've managed to handle them. Perhaps not optimally, but they get the job done. I just can't seem to isolate the numbers at the end of the filename and strip them. To the best of my knowledge my bulk renamer is using python flavored regex.

I haven't tried your code yet, but I will.

mzmm 02-02-2013 06:55 PM

--edit, not sure if this helps, unfamiliar with 'extract/fix/repack process'

i think you'd need to do a couple of passes to turn 1.jpg into 001.jpg, 11.jpg into 011.jpg, etc.

Code:

(.*?\s)(\d)(\.jpg)
\100\2\3

and then
(.*?\s)(\d{2})(\.jpg)
\10\2\3
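
Since the renamer was described as using Python-flavored regex, here is a sketch of how the two passes could be folded into one in plain Python, using a function as the replacement. Whether a given rename utility accepts a function replacement is unknown; the name `pad_page_number` and the default width of 3 are assumptions. Also worth knowing: in Python's `re`, a replacement such as `\100` is not read as group 1 followed by `00`, so the explicit form `\g<1>00` would be needed there.

```python
import re

# Digits immediately before the .jpg extension at the end of the name.
TRAILING_NUMBER = re.compile(r"(\d+)(\.jpg)$")


def pad_page_number(filename: str, width: int = 3) -> str:
    """Left-pad the trailing page number with zeros to a fixed width."""
    return TRAILING_NUMBER.sub(
        lambda m: m.group(1).zfill(width) + m.group(2), filename
    )


for name in ["Terminator 2 1.jpg", "Terminator 2 23.jpg", "Terminator 2 123.jpg"]:
    print(pad_page_number(name))
```

Anchoring on the extension plus `$` leaves the "2" in the title alone, because only the digits directly before `.jpg` can match.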


Perkin 02-03-2013 09:24 AM

@Sabardeyn, can you give a few more (differing?) example filenames - so we can see what's consistent or what isn't, with what you would like them mapped to.

What's the name of your batch rename app/script? I can then scan through its docs and try to see what the correct replacement would be.

ReaderRabbit 02-28-2013 12:55 PM

Joining Paragraphs when opening and closing quotes are not in same paragraph
 
Quote:

Originally Posted by MuskratBooks (Post 2397534)
In Sigil this expression has been helpful:
(“[^”\r\n]*)</p>\s+<p class="calibre.">
Replace with (has a trailing space): \1

This identifies paragraphs where an opening smart quote is not matched with a closing smart quote and joins that paragraph with the next one. It's not foolproof, but it saves a lot of time.

I use calibre conversion to switch straight quotes to smart quotes. It's under "Look & Feel"; check "Smarten punctuation". Easier to fix its mistakes than to find and fix 'em all.

Good Luck!

:thanks:
Thank you Muskrat. This answers my question from 1/13/13 perfectly! :)

Step 1: I go into Calibre and change straight quotes to curly quotes, then
Step 2: I open the book in Sigil and use your Regex suggestion and it works perfectly.

At first it didn't work then I checked to see if I accidentally copied the blank space after your find expression, and I had. I backed the blank space out and it worked ;)

I ♥ brainiacs!

Ripplinger 02-28-2013 02:26 PM

Quote:

Originally Posted by MuskratBooks (Post 2397534)
In Sigil this expression has been helpful:
(“[^”\r\n]*)</p>\s+<p class="calibre.">
Replace with (has a trailing space): \1

This identifies paragraphs where an opening smart quote is not matched with a closing smart quote and joins that paragraph with the next one. It's not foolproof, but it saves a lot of time.

I use calibre conversion to switch straight quotes to smart quotes. It's under "Look & Feel"; check "Smarten punctuation". Easier to fix its mistakes than to find and fix 'em all.

Good Luck!

This is fantastic and works great. I then swapped the double quotes for single quotes to find those within the string. It also found a couple of wrong-direction quotes for me. Edit: It also found a few instances of two single quotes where a double quote should have been.

Is there a way to find the reverse situation, to find paragraphs where there is an ending quote but was no starting quote at the beginning of the paragraph?

mzmm 03-04-2013 05:48 AM

Quote:

Originally Posted by Ripplinger (Post 2440097)
Is there a way to find the reverse situation, to find paragraphs where there is an ending quote but was no starting quote at the beginning of the paragraph?

think that's a bit more difficult. if there's no space between the <p> tags and the text you could use something like this

Code:

<p[^>]*>(?<!")(\w.+?")</p>
which finds a <p> followed by an alphanumeric character that is not preceded by a quotation mark. not the best solution, but for simple and consistent texts it could work.

Ripplinger 03-04-2013 07:50 AM

I couldn't get that to work at all and was about to give up and then realized you didn't use the curly smart quotes. Once I changed it to smart quotes, it would work somewhat, but it will also pick up any sentence or paragraph that doesn't immediately start with a quote. So it would pick up paragraphs like this:

Pamela shuddered. “We’ve been making ourselves polite to a murderess.”

And there's usually far too many of those types of sentences to want to read through over 500 of them to find the beginning quote buried further in.

ReaderRabbit 03-05-2013 12:09 PM

Sorry, this is such an obvious question and is probably answered somewhere but I didn't find it.

What would be the best way to find and eliminate page numbers such as:

He glanced 190</p>

<p class="calibre1">up at the big clock

Doitsu 03-05-2013 12:29 PM

Quote:

Originally Posted by ReaderRabbit (Post 2445159)
What would be the best way to find and eliminate page numbers such as:

He glanced 190</p>

<p class="calibre1">up at the big clock

Assuming that each page number is preceded by a space, the following quick & dirty regex should work:

Code:

\d+</p>\s+<p class=".*?">
(Replace with nothing.)
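
A small Python sketch of this replace, using the sample from the question. The pattern is copied as posted; the function name `drop_inline_page_numbers` is made up for illustration.

```python
import re

# Digits right before </p>, the whitespace between paragraphs, and the
# opening tag of the next paragraph.
PAGE_BREAK = re.compile(r'\d+</p>\s+<p class=".*?">')


def drop_inline_page_numbers(html: str) -> str:
    """Delete a trailing page number together with the paragraph break."""
    return PAGE_BREAK.sub("", html)


sample = 'He glanced 190</p>\n\n<p class="calibre1">up at the big clock'
print(drop_inline_page_numbers(sample))
```

The space in front of the number is what keeps the two words apart after the join. A paragraph that genuinely ends in a number (a year, say) would be caught as well, so the matches are worth reviewing.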

mzmm 03-05-2013 12:30 PM

Quote:

Originally Posted by Ripplinger (Post 2443940)
I couldn't get that to work at all and was about to give up and then realized you didn't use the curly smart quotes. Once I changed it to smart quotes, it would work somewhat, but it will also pick up any sentence or paragraph that doesn't immediately start with a quote. So it would pick up paragraphs like this:

Pamela shuddered. “We’ve been making ourselves polite to a murderess.”

And there's usually far too many of those types of sentences to want to read through over 500 of them to find the beginning quote buried further in.

try this? it'll probably still miss some (like if the closing quote butts up against a </span> instead of a </p> for example) so you'd probably want to scan the text afterwards but it might save you some copy/pasting.

Code:

find: (<p[^>]*>)(?:\s+)?([^“]+?”)(?:\s+)?(</p>)

replace: \1“\2\3
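
To see which paragraphs this touches, here is a Python sketch with the find and replace strings copied as posted. The function name `add_opening_quote` and the sample paragraphs are invented for the example.

```python
import re

# A <p ...> tag, then text containing no opening curly quote that ends in a
# closing curly quote right before </p>.
MISSING_OPEN = re.compile(r"(<p[^>]*>)(?:\s+)?([^“]+?”)(?:\s+)?(</p>)")


def add_opening_quote(html: str) -> str:
    """Insert “ at the start of a paragraph that only has a closing ”."""
    return MISSING_OPEN.sub(r"\1“\2\3", html)


missing = '<p class="calibre1">We’ve been polite to a murderess.”</p>'
fine = "<p>Pamela shuddered. “We’ve been making ourselves polite to a murderess.”</p>"

print(add_opening_quote(missing))
print(add_opening_quote(fine))
```

One caveat beyond those already mentioned: `[^“]+?` can run across a `</p>` into the next paragraph, so an unquoted paragraph followed by one ending in ” may be matched as a single span. Checking each hit is still advisable.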


ReaderRabbit 03-05-2013 12:59 PM

Quote:

Originally Posted by Doitsu (Post 2445172)
Assuming that each page number is preceded by a space, the following quick & dirty regex should work:

Code:

\d+</p>\s+<p class=".*?">
(Replace with nothing.)

Thanks so much! Works perfectly.

What about page numbers like this: <p class="calibre1">200</p>

I used to be able to find them with the 'Wildcard' search and replace. I am using version 0.6.2 of Sigil. Where has that feature gone?

I ♥ brainiacs

Doitsu 03-05-2013 01:18 PM

Quote:

Originally Posted by ReaderRabbit (Post 2445194)
What about page numbers like this: <p class="calibre1">200</p>

Use: <p class=".*?">\d+</p>
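
A Python sketch of the same removal follows, with one deliberate change that is my own assumption rather than part of the suggestion above: `[^"]*` is used in place of `.*?` so the class value cannot stretch past its closing quote. With `.*?`, two paragraphs sitting on the same line could be swallowed as one match. The function name `drop_page_number_paragraphs` is made up.

```python
import re

# A paragraph whose entire content is digits. [^"]* keeps the match inside
# the class attribute (a tightened variant of the posted pattern).
PAGE_PARA = re.compile(r'<p class="[^"]*">\d+</p>')


def drop_page_number_paragraphs(html: str) -> str:
    """Remove paragraphs that contain nothing but a page number."""
    return PAGE_PARA.sub("", html)


print(drop_page_number_paragraphs('<p class="calibre1">200</p>'))
print(drop_page_number_paragraphs('<p class="a">text</p><p class="b">200</p>'))
```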


All times are GMT -4.
