MobileRead Forums

MobileRead Forums (https://www.mobileread.com/forums/index.php)
-   Sigil (https://www.mobileread.com/forums/forumdisplay.php?f=203)
-   -   Regex examples (https://www.mobileread.com/forums/showthread.php?t=167971)

roger64 06-19-2012 12:27 PM

Quote:

Originally Posted by DiapDealer (Post 2120733)
Sorry, I was only thinking in terms of the F&R regex feature of Sigil. :o

No sorry, me too :)

Doitsu 06-19-2012 12:28 PM

Quote:

Originally Posted by roger64 (Post 2120729)
What means BOM?

BOM = byte order mark.

At least the Windows GNU sed port requires that both the .html files and the sed script be utf8 files without byte order marks. AFAIK, .html files created by Sigil are automatically saved without BOMs. I.e. you only have to make sure that the sed script doesn't have one either.
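
If you are unsure whether a sed script or an .html file starts with a BOM, you can check outside Sigil. Below is a minimal Python sketch; the helper name strip_utf8_bom is made up for illustration, and it assumes you have already read the file as bytes. The UTF-8 BOM is the three bytes EF BB BF, which Python exposes as codecs.BOM_UTF8.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark (EF BB BF)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A sed script saved "with BOM" by some Windows editors (made-up content).
script_with_bom = codecs.BOM_UTF8 + b"s/foo/bar/g\n"
print(strip_utf8_bom(script_with_bom))
```

Writing the returned bytes back to disk should give a file in the BOM-free form described above.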

Quote:

Originally Posted by DiapDealer (Post 2120733)
Sorry, I was only thinking in terms of the F&R regex feature of Sigil. :o

Every now and then you may want to widen your horizon. :D
But you are of course right, Sigil doesn't do sed.

That's when even rudimentary sed or Perl skills come in handy.

DiapDealer 06-19-2012 12:43 PM

Quote:

Originally Posted by Doitsu (Post 2120752)
Every now and then you may want to widen your horizon. :D

But I suffer from acute agoraphobia. :D

PeterT 06-19-2012 04:00 PM

Quote:

Originally Posted by roger64 (Post 2120729)
@Doitsu

Wow!! It's working very well! Thanks a lot!!
What means BOM?

Byte Order Mark

roger64 06-20-2012 05:53 AM

Thanks all for the lesson. :)

soulafein 06-22-2012 08:05 PM

Hi! I'm looking for an expression that erases "- " but not " - ".
(example: sim- ple, not: word - word).
Could somebody help me??

theducks 06-22-2012 08:37 PM

Quote:

Originally Posted by soulafein (Post 2124530)
Hi! I'm looking for an expression that erases "- " but not " - ".
(example: sim- ple, not: word - word).
Could somebody help me??

search: ([a-z])-([a-z])

replace: \1\2

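This matches only when the hyphen sits directly between two lowercase letters, BUT :eek: it also gets legitimate hyphenated words, and it will not match "sim- ple", where a space follows the hyphen.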

DiapDealer 06-22-2012 08:48 PM

Quote:

Originally Posted by soulafein (Post 2124530)
Hi! I'm looking for an expression that erases "- " but not " - ".
(example: sim- ple, not: word - word).
Could somebody help me??

There's no real way of knowing that only complete words are on either side of the hyphen, but strictly in keeping with what you asked...

Find: (?<!\s)-\s
Or: \w\K-\s
Replace: <empty/blank>

Please test first, and do keep in mind that there are many situations in normal written text where what you're looking for will (and should) occur. I certainly wouldn't suggest using "Replace all", but it may help you narrow down the occurrences enough that you can sign off on each and every replacement.
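
To try the idea on a sample string before touching a book, here is a rough Python sketch of the lookbehind variant. Python's re module does not support \K, so only the (?<!\s) form is shown, and the function name join_broken_words is invented for this example.

```python
import re

# A hyphen followed by whitespace, but NOT preceded by whitespace.
BROKEN_HYPHEN = re.compile(r"(?<!\s)-\s")


def join_broken_words(text: str) -> str:
    """Delete 'x- y' style breaks while leaving ' - ' dashes alone."""
    return BROKEN_HYPHEN.sub("", text)


sample = "A sim- ple test, word - word, and well-known stays."
print(join_broken_words(sample))
```

As warned above, this cannot tell a broken word from an intentional trailing hyphen (e.g. "pre- and post-war"), so reviewing each hit is still the safer route.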

goldilocks 06-22-2012 08:55 PM

Help! I am clueless about regex. I have a Word document I saved as HTML Filtered (sure didn't seem to filter much!). I imported it into Calibre and converted to ePub. Between MSO and Calibre I ended up with over 41,000 :( rows in the CSS. Every paragraph has its own class. Examples:
<p class="MsoNormal79"><span class="calibre14">
<p class="MsoNormal80"><span class="calibre20">
<p class="MsoNormal81"><span class="calibre20">
<p class="MsoNormal82"><span class="calibre17">

I want them all to say:
<p class="paragraphtext">

Can I put something in find to replace them all at once?:help:

Karen

DiapDealer 06-22-2012 10:07 PM

You could very well end up with a disaster if you're not careful. I would start with the paragraphs first as spans can get a bit hairy.

If you're absolutely sure that you want to change everything that has a class name of "MsoNormalXX" (X being numerals) to "paragraphtext", then:

Find: <p class="MsoNormal\d+">
Replace: <p class="paragraphtext">

Make sure you have good backups in case things don't turn out the way you've planned.
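
For anyone who wants to see what the expression does before running it on a real book, this is a small Python sketch of the same find/replace. The function name and the sample markup are only illustrative; Sigil itself does not run Python for F&R.

```python
import re

# Same pattern as the Find field: MsoNormal followed by one or more digits.
MSO_PARAGRAPH = re.compile(r'<p class="MsoNormal\d+">')


def normalise_paragraph_classes(html: str) -> str:
    """Rename every numbered MsoNormal paragraph class to 'paragraphtext'."""
    return MSO_PARAGRAPH.sub('<p class="paragraphtext">', html)


sample = (
    '<p class="MsoNormal79"><span class="calibre14">one</span></p>\n'
    '<p class="MsoNormal80"><span class="calibre20">two</span></p>\n'
)
print(normalise_paragraph_classes(sample))
```

Note that the spans are left untouched, and a plain class="MsoNormal" with no digits would not match because \d+ needs at least one digit.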

Toxaris 06-23-2012 03:19 AM

Don't use Calibre to clean up the filtered HTML. Either do it manually in Sigil or use a program/macro to do it.
Conversion to ePUB in Calibre will cause big changes in your styles. Furthermore, it is not necessary, since Sigil can import HTML without issues.

goldilocks 06-23-2012 09:49 AM

Quote:

Originally Posted by DiapDealer (Post 2124639)
You could very well end up with a disaster if you're not careful. I would start with the paragraphs first as spans can get a bit hairy.

If you're absolutely sure that you want to change everything that has a class name of "MsoNormalXX" (X being numerals) to "paragraphtext", then:

Find: <p class="MsoNormal\d+">
Replace: <p class="paragraphtext">

Make sure you have good backups in case things don't turn out the way you've planned.

Thanks DiapDealer, but it didn't work. I keep originals and backups separate from my "working" folder.

Karen

goldilocks 06-23-2012 10:14 AM

Quote:

Originally Posted by Toxaris (Post 2124768)
Don't use Calibre to clean up the filtered HTML. Either do it manually in Sigil or use a program/macro to do it.
Conversion to ePUB in Calibre will cause big changes in your styles. Furthermore, it is not necessary, since Sigil can import HTML without issues.

Toxaris, thanks for your suggestion. I did not use Calibre on the htm file but it really isn't much better. There is no style sheet and there are over 3000 expressions in the /*<![CDATA[*/ area. Every paragraph of text is filled with another paragraph of commands. Also it is one large file - I do know how to split it.

But, I'll keep working on it and eventually I will have a decent looking, if not perfect, eBook!

Karen

DiapDealer 06-23-2012 11:01 AM

Quote:

Thanks DiapDealer, but it didn't work. I keep originals and backups separate from my "working" folder.

I'm not sure what you mean by "didn't work." :blink:

It didn't do what it was intended to do?... or it didn't do what you wanted/expected it to do? There's a difference. ;)

It certainly should have done what I said it would do... if you had the ePub open in Sigil, in Code View (on an html file), with the F&R widget open (and in Regex mode) and set to "All HTML Files".

roger64 06-26-2012 07:10 AM

Suppressing <br /> tags only in "body text" style.

Could there be a way to destroy the soft hyphens only when they are included in a "body text" paragraph?

Rationale:

After using a new (and not perfect) OCR , I found that my recognized text was interspersed with a lot of <br /> tags (soft hyphens?). I usually insert the html file in OpenOffice and clean all formatting to begin with. Even this way, I realized that these resilient tags survived.

It is not that bad. Some poems or songs are thus nicely transcribed. On the other hand, I have to clean these tags for many standard paragraphs of text.

Sigil provides a simple way out. The user can either clean every one of them, good and bad, or selectively and patiently suppress the useless tags...

There could be a better one.

Give your songs or poems their own style, keep standard text in its "body text" class and then launch the following Regex...

DiapDealer 06-26-2012 10:50 AM

<br />'s are not soft-hyphens.... just to be clear. ;)

Quote:

Originally Posted by roger64
Give your songs or poems their own style, keep standard text in its "body text" class and then launch the following Regex...

Tricky... but—strictly speaking of Sigil (PCRE) here—then possibly:

If there's only one occurrence of the <br /> tag inside a paragraph, this expression should find it (only inside p tags of the class "body-text"):
Code:

<p class="body-text">(?!</p>).*\K<br[^>]*?/>
(If there's more than one occurrence of <br /> the above expression will only match the last one)

The following expression should match the first occurrence (if there's more than one) of a <br /> tag inside p tags of the class "body-text".
Code:

(?U)<p class="body-text">(?!</p>).*\K<br[^>]*?/>
Leaving the "Replace" field blank when replacing should then get rid of the <br /> tags.

It's certainly not ideal, but if you have multiple <br /> tags inside the targeted paragraph (class name "body-text"), you could conceivably run one or the other of these "Replace All" expressions multiple times until the search no longer matches anything. Still quicker than stepping through each occurrence (and will ignore all other p classes), though.
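
Outside Sigil, a script can clear every <br /> in a "body-text" paragraph in a single pass instead of repeated Replace All runs. Here is a hedged Python sketch of that idea. The function name is made up, and it assumes that paragraphs are not nested and that the class attribute is written exactly as class="body-text".

```python
import re

# One whole body-text paragraph; DOTALL lets the lazy .*? cross line breaks.
BODY_PARAGRAPH = re.compile(r'<p class="body-text">.*?</p>', re.DOTALL)
# <br>, <br/>, <br /> and variants with attributes (but not e.g. <bracket>).
BR_TAG = re.compile(r"<br\b[^>]*>")


def strip_breaks_in_body_text(html: str) -> str:
    """Remove <br> tags, but only inside <p class="body-text"> paragraphs."""

    def clean(match):
        # match.group(0) is one complete body-text paragraph.
        return BR_TAG.sub("", match.group(0))

    return BODY_PARAGRAPH.sub(clean, html)


sample = (
    '<p class="body-text">one <br />two<br/> three</p>\n'
    '<p class="poem">a<br />b</p>'
)
print(strip_breaks_in_body_text(sample))
```

The two-step approach (find the paragraph first, then clean inside it) sidesteps the greedy/lazy problem, because the inner substitution sees only one paragraph at a time.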

roger64 06-26-2012 11:22 AM

@DiapDealer

Thanks very much for your reply. I will put it soon to work.

Do you think it is possible to join your two commands with a kind of AND/OR link so that it would destroy the tags two by two or be happy with one?

Thanks for the vocabulary. I was not sure about it. Now I know.

DiapDealer 06-26-2012 11:32 AM

Quote:

Originally Posted by roger64 (Post 2127877)
Do you think it is possible to join your two commands with a kind of AND/OR link so that it would destroy the tags two by two or be happy with one?

I certainly wouldn't know of any way to easily combine them. It really boils down to the laziness/greediness aspects of the various regex repetition-control characters. I can't imagine it would take that many clicks of the "replace all" button to rid the "body-text" paragraphs of <br /> tags, but then again... I'm not looking at the afflicted code either. ;)

roger64 06-27-2012 03:27 AM

@DiapDealer

I am very pleased to report full success of your Regex ( I used the first one) which deleted successively in seven busy rounds: 53/22/7/5/2/2/2 occurrences of the <br /> tag. :thumbsup: :thanks:

This is only the tip of the iceberg, because in the odt I had already destroyed probably over one hundred by hand. I did not know then that I would use your regex.

For information, this is the styles break-up of the test EPUB (classes only):
Code:

class="Textbody" 1676
class="frameGraphics" 66
class="let" 64
class="let2" 64
class="let1" 64
class="Centrage" 62
class="smcpTypeV" 46
class="smcpTypeA" 16
class="smcpDroite" 16
class="Header" 8
class="smcpCentrage" 6
class="Italdroite" 4


DiapDealer 06-27-2012 03:15 PM

Quote:

Originally Posted by roger64 (Post 2128703)
I am very pleased to report full success of your Regex ( I used the first one) which deleted successively in seven busy rounds: 53/22/7/5/2/2/2 occurrences of the <br /> tag.

Cool! Glad it worked for you. I've stashed it away myself for tweaking in various ways. :)

mrjoeyman 07-03-2012 05:18 AM

reverse linking time consuming woes
 
<a href="../Text/notes.html#scrip1" id="backscrip1">This text is a link</a>

The above is some code in my file that I use to reverse link, or tag/anchor, whatever they call it. You click on a link in one file (in this case, clicking on the text "This text is a link" would take you to the "../Text/notes.html" file, where another link is designated as "scrip1", while the previous link "This text is a link" is designated as "backscrip1"). So they go back and forth. When there are hundreds of reverse links, it takes me a short time to list the main code, i.e. ...

<a href="../Text/scriptures.html#scrip1" id="backscrip1">This text is a link</a>
<a href="../Text/scriptures.html#scrip1" id="backscrip1">This text is a link</a>
<a href="../Text/scriptures.html#scrip1" id="backscrip1">This text is a link</a>
<a href="../Text/scriptures.html#scrip1" id="backscrip1">This text is a link</a>
<a href="../Text/scriptures.html#scrip1" id="backscrip1">This text is a link</a>

but now I have to go back and change the second occurrence of the linking code to "2" then "3" then "4", ie...

<a href="../Text/scriptures.html#scrip1" id="backscrip1">This text is a link</a>

<a href="../Text/scriptures.html#scrip2" id="backscrip2">This text is a link</a>

<a href="../Text/scriptures.html#scrip3" id="backscrip3">This text is a link</a>

<a href="../Text/scriptures.html#scrip4" id="backscrip4">This text is a link</a>

....you get the idea.

Is there a way to use the find and replace in such a way that it would search for this code and bump up the number for each occurrence, so I won't have to manually find each one and put in each number separately myself?

:thanks:

Doitsu 07-03-2012 06:48 AM

Quote:

Originally Posted by mrjoeyman (Post 2135364)
Is there a way to use the find and replace in such a way that it would search for this code and bump up the number for each occurrence, so I won't have to manually find each one and put in each number separately myself?

AFAIK, you cannot increment numbers using regular expressions. This kind of functionality can only be achieved with a scripting language.

mrjoeyman 07-03-2012 07:04 AM

I was afraid of that. I guess the best thing would be to save it as a template and insert the text, but that still entails manually inserting each occurrence. Is there a quicker way of doing such a task that I just am not aware of yet? Thanks for the consideration.

Jellby 07-03-2012 08:06 AM

I don't know about Sigil, but this is what I do in vim:

I use a special symbol (¬, |, ¦ are useful for this) where I want the consecutive numbers:

Code:

<a href="../Text/scriptures.html#scrip¬" id="backscrip¬">This text is a link</a>
Once I have all the links like that, I run this command in vim:

Code:

: let n=1 | g/¬/s/¬/\=n/g | let n+=1
which replaces all ¬ in a line with the number n, and n is incremented by one every time a line with ¬ is found.
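
If vim is not to hand, the same placeholder trick can be sketched in Python. This is only an illustration of the idea above, not something Sigil provides; the function name number_placeholders and its parameters are made up.

```python
def number_placeholders(text: str, marker: str = "¬", start: int = 1) -> str:
    """Replace every marker on a line with n, then bump n for the next marked line."""
    out = []
    n = start
    for line in text.splitlines(keepends=True):
        if marker in line:
            # All markers on the same line get the same number.
            line = line.replace(marker, str(n))
            n += 1
        out.append(line)
    return "".join(out)


links = (
    '<a href="../Text/scriptures.html#scrip¬" id="backscrip¬">This text is a link</a>\n'
    '<a href="../Text/scriptures.html#scrip¬" id="backscrip¬">This text is a link</a>\n'
)
print(number_placeholders(links))
```

Like the vim command, it counts per line, so both placeholders in one link receive the same number as long as each link sits on its own line.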

mrjoeyman 07-03-2012 11:56 PM

Omg are you serious? I will have to give it a go! So how would I go about getting the code into Sigil afterward? That is the only way I know to convert it into epub.

Jellby 07-04-2012 04:59 AM

An ePub is a zip, so just extract the file you want to modify, change it with vim (or your preferred editor), and zip it back.

mrjoeyman 07-04-2012 09:02 PM

Doh! I should have thunk of that! Thanks!! By the way I got to the end of my first tutorial with Vim. I can now say I performed my first "yank and put". Pretty neat editor. I have tried to make the ¬ character in Vim but it doesn't work as it does in this message. (alt-170). Can you tell me how to make it in Vim?

update: well it works as you said, sweet! I just copied the "¬" but I still would like to learn how to make it in Vim. Thanks.

Jellby 07-05-2012 05:26 AM

I use ¬ because I can easily input it with my keyboard layout (Spanish): AltGr+6. Use whatever symbol you can find in your keyboard that's not used elsewhere: #, ~, @...

signum 07-07-2012 01:05 AM

Quote:

Originally Posted by mrjoeyman (Post 2137262)
I have tried to make the ¬ character in Vim but it doesn't work as it does in this message. (alt-170). Can you tell me how to make it in Vim?

update: well it works as you said, sweet! I just copied the "¬" but I still would like to learn how to make it in Vim. Thanks.

Assuming you are in insert (or append) mode in vim:

<ctrl-v>uac<esc>

In human language, this means: hold down the "ctrl" key and press v, release both, type uac, then tap the "esc" key.

The "ctrl-v" says a multi-keystroke character follows, "u" means it is given as a Unicode code point in hex, "ac" is the hex code for the "not" symbol, and "escape" ends the sequence.

Having said all that, it is a lot easier to just use some other seldom-used character that appears on your keyboard, such as "@", instead of the "not" character.

mrjoeyman 07-07-2012 10:49 AM

Thanks, you are right, simple is better :)

Danger 08-06-2012 11:22 AM

First, thanks for everyone's help here. While I haven't posted for help, the answers to other people's problems have helped me as well when I had similar questions. However, I have a question that I don't see an answer to.

I am trying to remove a start and end div tag. These span an entire chapter.
Code:

<body>
  <div class="story" id="part-27">
...
  </div>
</body>

I've tried:
FIND
Code:

<div class="story" id="part-\d+">(.*?)</div>
&
Code:

<div class="story" id="part-\d+">(.*)</div>
and a few other variations but Sigil always returns a zero count. Just wondering what I am doing wrong. This isn't the first time I've run into this problem. Before I've just worked around it by working with much smaller bits but I'd like to know just what it is I am doing wrong because as far as I can tell that should work. Using Sigil 0.5.902

EDIT:
Ok, it seems that the regex was fine; it just doesn't work in 0.5.902 but does work in 0.5.3, which I don't like using much for finding/replacing because over half the time I get left with a literal \1 instead of the actual text. Which of course I have to UNDO, FIND, REPLACE for each. Easy enough when it's a large block of text, not so easy when it's a word or sentence, forcing me to do another FIND for any \1< instances. A REPLACE ALL is just a nightmare if you don't have a backup.

Pablo 08-06-2012 11:49 AM

Quote:

Originally Posted by Danger (Post 2175925)
Code:

<div class="story" id="part-\d+">(.*?)</div>

Try this:

Code:

<div class="story" id="part-[0-9]+">(.*?)</div>

Danger 08-06-2012 12:29 PM

Quote:

Originally Posted by Pablo (Post 2175947)
Try this:

Code:

<div class="story" id="part-[0-9]+">(.*?)</div>

Checked it on a backup (I'd already cleaned up the code on my working copy) but I still get "no matches found" in v0.5.902. But thanks, that was one variation I hadn't tried yet.

paulfiera 08-06-2012 12:51 PM

Help with regex and chapters
 
I have a book I'm fixing where the chapters are named:

Code:

<p class="calibre4">1</p>
<p class="calibre4">2</p>

...and so on.

How can I replace these occurrences with, for instance:
Code:

<h3>Chapter 1</h3>
<h3>Chapter 2</h3>

...and so on?

I've tried all the combinations I know of but can't seem to get it done.

Many thanks !
Paul

Doitsu 08-06-2012 01:55 PM

I suck at regular expressions, but this should work in Sigil 0.5.3:

Find: <p class="calibre4">(\d+)</p>
Replace: <h3>Chapter \1</h3>
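
To show how the \1 backreference carries the captured number across, here is the same find/replace as a small Python sketch. The function name is invented, and the pattern assumes the chapter number is the only content of the paragraph.

```python
import re

# (\d+) captures the chapter number so the replacement can reuse it as \1.
CHAPTER_NUMBER = re.compile(r'<p class="calibre4">(\d+)</p>')


def promote_chapter_numbers(html: str) -> str:
    """Turn bare numbered calibre4 paragraphs into 'Chapter N' headings."""
    return CHAPTER_NUMBER.sub(r"<h3>Chapter \1</h3>", html)


sample = '<p class="calibre4">1</p>\n<p class="calibre4">2</p>\n'
print(promote_chapter_numbers(sample))
```

A calibre4 paragraph that contains anything other than digits is left alone, which is worth checking if the same class is also used for ordinary text.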

Timur 08-06-2012 01:56 PM

@Danger: In most regex flavors the dot (.) does not match newline characters by default. Your case requires the dot to match newlines. In Sigil, either select Regex Dotall from the mode listbox (the beta version does not have that mode, iirc), or put (?s) in front of your find pattern. Example:

Code:

(?s)<div class="story" id="part-\d+">(.*)</div>
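
The effect of (?s) is easy to check outside Sigil. Here is a short Python sketch with a made-up sample chapter; Python's re module uses the same (?s) inline flag, and the names STORY_DIV and unwrap_story_div are only for illustration.

```python
import re

STORY_DIV = r'<div class="story" id="part-\d+">(.*)</div>'

chapter = (
    "<body>\n"
    '  <div class="story" id="part-27">\n'
    "<p>text</p>\n"
    "  </div>\n"
    "</body>"
)

# Without (?s) the dot stops at the first line break, so nothing matches.
print(re.search(STORY_DIV, chapter) is None)
# With (?s) the dot also matches newlines and the whole div is found.
print(re.search("(?s)" + STORY_DIV, chapter) is not None)


def unwrap_story_div(html: str) -> str:
    """Drop the wrapping story div but keep everything inside it."""
    return re.sub("(?s)" + STORY_DIV, r"\1", html)


print(unwrap_story_div(chapter))
```

Because (.*) is greedy, the match runs to the last </div> in the file, which is what you want when one div wraps the whole chapter.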

paulfiera 08-06-2012 01:59 PM

Quote:

Originally Posted by Doitsu (Post 2176057)
I suck at regular expressions, but this should work in Sigil 0.5.3:

Find: <p class="calibre4">(\d+)</p>
Replace: <h3>Chapter \1</h3>

Many thanks, Doitsu.

It definitely worked ! :thumbsup:

Danger 08-06-2012 02:04 PM

Quote:

Originally Posted by Timur (Post 2176059)
@Danger: In most regex flavors the dot (.) does not match newline characters by default. Your case requires the dot to match newlines. In Sigil, either select Regex Dotall from the mode listbox (the beta version does not have that mode, iirc), or put (?s) in front of your find pattern. Example:

Code:

(?s)<div class="story" id="part-\d+">(.*)</div>

Awesome, I knew it wasn't matching newlines but couldn't figure out how to get it to do so. Thank you Timur, that works great.

theducks 08-06-2012 10:52 PM

Quote:

Originally Posted by Danger (Post 2175925)
First, thanks for everyone's help here. While I haven't posted for help, the answers to other people's problems have helped me as well when I had similar questions. However, I have a question that I don't see an answer to.

I am trying to remove a start and end div tag. These span an entire chapter.
Code:

<body>
  <div class="story" id="part-27">
...
  </div>
</body>

I've tried:
FIND
Code:

<div class="story" id="part-\d+">(.*?)</div>
&
Code:

<div class="story" id="part-\d+">(.*)</div>
and a few other variations but Sigil always returns a zero count. Just wondering what I am doing wrong. This isn't the first time I've run into this problem. Before I've just worked around it by working with much smaller bits but I'd like to know just what it is I am doing wrong because as far as I can tell that should work. Using Sigil 0.5.902

EDIT:
Ok, it seems that the regex was fine; it just doesn't work in 0.5.902 but does work in 0.5.3, which I don't like using much for finding/replacing because over half the time I get left with a literal \1 instead of the actual text. Which of course I have to UNDO, FIND, REPLACE for each. Easy enough when it's a large block of text, not so easy when it's a word or sentence, forcing me to do another FIND for any \1< instances. A REPLACE ALL is just a nightmare if you don't have a backup.

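You need to tell it that the match spans multiple lines. The s flag is what lets the dot match newlines; the m flag only changes how ^ and $ behave, so it is harmless here: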

Code:

(?sm)<div class="story" id="part-\d+">(.*?)</div>

Gunnerp245 08-11-2012 09:46 AM

I would like to change the capitalization of a particular phrase across a book, e.g. chapter one to Chapter One. I can detect the instances using (\D+) (\D+) and know the replacement would be \1 \2, but not how to change the capitalization.
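
One way to do this with a script rather than in the F&R box is a replacement function that rewrites each match. The Python sketch below is only an illustration: the function name capitalise_phrase is made up, and it says nothing about what Sigil's own Replace field supports.

```python
import re


def capitalise_phrase(text: str, phrase: str) -> str:
    """Title-case every whole-word occurrence of phrase, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    # The lambda receives each match and returns its title-cased form.
    return pattern.sub(lambda m: m.group(0).title(), text)


sample = "As seen in chapter one, and again in CHAPTER ONE."
print(capitalise_phrase(sample, "chapter one"))
```

str.title() capitalises after every non-letter, so a phrase containing an apostrophe (e.g. "don't") would need a different helper.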


All times are GMT -4.