MobileRead Forums

Sigil: Regex examples (https://www.mobileread.com/forums/showthread.php?t=167971)

DiapDealer 06-19-2012 09:25 AM

Quote:

Originally Posted by roger64 (Post 2120333)
Successive Find and Replace

I wish to clean an HTML text which suffers from recurrent mistakes made by an OCR engine (Cuneiform).

When I come across one of the mistakes, I make a replacement and note it. After some pages, I have met most of the mistakes, and now I intend to build a regex that combines as many as 15 successive simple search-and-replace operations like the following two.
A@ → à
B@ → ç
I do not know how to perform these 15 F&R within a single regex. Suppose I would like to build it for the two above, what should I write?

Note: I already use utf8 for the whole text.

I'm not sure what you're asking for is feasible. What you've described is something that would be more suited to an external program/algorithm (or a plugin) rather than one single Regular Expression. Finding all 15 with one expression wouldn't be the hard part... replacement based on "if/then" logic is where it would fall apart.
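For what it's worth, the "if/then" part is exactly what a few lines of script can supply. Below is a minimal sketch in Python (not something Sigil's F&R can do): one alternation finds every mistake, and a lookup table decides the replacement. Only the two pairs from the question are real; the names `FIXES`, `PATTERN` and `fix_ocr` are made up for illustration.

```python
import re

# Mapping of OCR mistake -> correct character.
# Only these two pairs come from the question; the other 13 would be added the same way.
FIXES = {
    "A@": "à",
    "B@": "ç",
}

# One alternation that matches any key. re.escape protects keys that contain
# regex metacharacters; longer keys go first so they win over shorter prefixes.
PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(FIXES, key=len, reverse=True))
)

def fix_ocr(text):
    """Replace every known OCR mistake in a single pass over the text."""
    return PATTERN.sub(lambda match: FIXES[match.group(0)], text)

print(fix_ocr("voilA@ le garB@on"))  # voilà le garçon
```

The replacement function is the piece a plain Find/Replace field has no equivalent for: it looks at what was matched and picks the replacement accordingly.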

roger64 06-19-2012 11:07 AM

OK. Thanks for your answer. I will try to find another solution.

Doitsu 06-19-2012 11:47 AM

You could create a simple sed script with one line for each character that you need to fix. E.g.

Code:

s/A@/à/g
s/B@/ç/g

Then simply save the lines as a utf8 text file (without BOM), e.g. fix.sed, and execute it with sed:

Code:

sed -f fix.sed -i *.html
(Note that this will overwrite the original files.)
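If sed is not at hand, the same idea can be sketched in Python: apply each literal replacement in order, one per line of the sed script. This is only an illustration under assumptions; `FIXES`, `apply_fixes` and `fix_file` are hypothetical names, and unlike the command above it writes a `.bak` copy before overwriting.

```python
from pathlib import Path

# The same two substitutions as in fix.sed; extend the list as needed.
FIXES = [("A@", "à"), ("B@", "ç")]

def apply_fixes(text):
    """Apply each literal replacement in order, like successive s/.../.../g lines."""
    for wrong, right in FIXES:
        text = text.replace(wrong, right)
    return text

def fix_file(path):
    """Rewrite one UTF-8 file in place, keeping a .bak copy of the original."""
    raw = path.read_bytes()
    backup = path.with_name(path.name + ".bak")
    backup.write_bytes(raw)  # untouched copy of the original bytes
    fixed = apply_fixes(raw.decode("utf-8"))
    path.write_bytes(fixed.encode("utf-8"))
```

To process every .html file in the current folder, you would loop over `Path(".").glob("*.html")` and call `fix_file` on each one.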

roger64 06-19-2012 12:07 PM

@Doitsu

Wow!! It's working very well! Thanks a lot!!
What does BOM mean?

DiapDealer 06-19-2012 12:09 PM

Sorry, I was only thinking in terms of the F&R regex feature of Sigil. :o

roger64 06-19-2012 12:27 PM

Quote:

Originally Posted by DiapDealer (Post 2120733)
Sorry, I was only thinking in terms of the F&R regex feature of Sigil. :o

No sorry, me too :)

Doitsu 06-19-2012 12:28 PM

Quote:

Originally Posted by roger64 (Post 2120729)
What does BOM mean?

BOM = byte order mark.

At least the Windows GNU sed port requires that both the .html files and the sed script be utf8 files without byte order marks. AFAIK, .html files created by Sigil are automatically saved without BOMs. I.e. you only have to make sure that the sed script doesn't have one either.
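If you are not sure whether your editor added a BOM, it can be checked and removed with a few lines. A minimal sketch in Python, assuming the file is UTF-8; `strip_utf8_bom` is a hypothetical helper name. The UTF-8 BOM is the three bytes EF BB BF at the very start of the file.

```python
import codecs

def strip_utf8_bom(data):
    """Return the bytes without a leading UTF-8 byte order mark (EF BB BF)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# A sed script line saved with a BOM in front of it.
script_with_bom = codecs.BOM_UTF8 + "s/A@/à/g\n".encode("utf-8")
cleaned = strip_utf8_bom(script_with_bom)
print(cleaned.decode("utf-8"), end="")  # s/A@/à/g
```

Reading fix.sed with `read_bytes()`, passing it through this function and writing the result back would leave a BOM-free script.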

Quote:

Originally Posted by DiapDealer (Post 2120733)
Sorry, I was only thinking in terms of the F&R regex feature of Sigil. :o

Every now and then you may want to widen your horizon. :D
But you are of course right, Sigil doesn't do sed.

That's when even rudimentary sed or Perl skills come in handy.

DiapDealer 06-19-2012 12:43 PM

Quote:

Originally Posted by Doitsu (Post 2120752)
Every now and then you may want to widen your horizon. :D

But I suffer from acute agoraphobia. :D

PeterT 06-19-2012 04:00 PM

Quote:

Originally Posted by roger64 (Post 2120729)
@Doitsu

Wow!! It's working very well! Thanks a lot!!
What does BOM mean?

Byte Order Mark

roger64 06-20-2012 05:53 AM

Thanks all for the lesson. :)

soulafein 06-22-2012 08:05 PM

Hi! I'm looking for an expression that erases "- " but not " - ".
(example: sim- ple, not: word - word).
Could somebody help me??

theducks 06-22-2012 08:37 PM

Quote:

Originally Posted by soulafein (Post 2124530)
Hi! I'm looking for an expression that erases "- " but not " - ".
(example: sim- ple, not: word - word).
Could somebody help me??

search: ([a-z])-([a-z])

replace: \1\2

This matches only when the hyphen is surrounded by lowercase letters, BUT :eek: it also catches legitimate hyphenated words.
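To see what this pattern does and does not touch, here is a small sketch that runs it through Python's `re` module (an assumption for illustration; Sigil uses its own regex engine, but this simple pattern behaves the same way). The function name `join_hyphenated` is made up.

```python
import re

# The search expression from above: a hyphen directly between two lowercase letters.
HYPHEN_BETWEEN_LOWERCASE = re.compile(r"([a-z])-([a-z])")

def join_hyphenated(text):
    """Drop the hyphen and keep the two captured letters (replace: \\1\\2)."""
    return HYPHEN_BETWEEN_LOWERCASE.sub(r"\1\2", text)

print(join_hyphenated("sim-ple"))     # simple
print(join_hyphenated("well-known"))  # wellknown
print(join_hyphenated("sim- ple"))    # sim- ple
```

The last line shows that, as written, the pattern leaves a hyphen followed by a space alone, because the space is not a lowercase letter.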

DiapDealer 06-22-2012 08:48 PM

Quote:

Originally Posted by soulafein (Post 2124530)
Hi! I'm looking for an expression that erases "- " but not " - ".
(example: sim- ple, not: word - word).
Could somebody help me??

There's no real way of knowing that only complete words are on either side of the hyphen, but strictly in keeping with what you asked...

Find: (?<!\s)-\s
Or: \w\K-\s
Replace: <empty/blank>

Please test first, and do keep in mind that there are many situations in normal written text where what you're looking for will (and should) occur. I certainly wouldn't suggest using "Replace all", but it may help you narrow down the occurrences enough that you can sign off on each and every replacement.
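A quick way to test the first expression outside Sigil is a short Python sketch like the one below. Two assumptions: Python's built-in `re` module stands in for Sigil's engine, and since `re` does not support `\K`, only the lookbehind form is shown. The helper names are hypothetical. `find_candidates` lists each hit with a little context so that every replacement can be reviewed before anything is changed.

```python
import re

# A hyphen followed by whitespace, but NOT preceded by whitespace.
STRAY_HYPHEN = re.compile(r"(?<!\s)-\s")

def find_candidates(text, context=10):
    """Return each match with some surrounding text, for manual review."""
    return [
        text[max(0, m.start() - context):m.end() + context]
        for m in STRAY_HYPHEN.finditer(text)
    ]

def remove_stray_hyphens(text):
    """Replace every match with nothing (the 'Replace all' case)."""
    return STRAY_HYPHEN.sub("", text)

sample = "sim- ple, not: word - word"
print(find_candidates(sample))       # one candidate: the 'sim- ple' hit
print(remove_stray_hyphens(sample))  # simple, not: word - word
```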

goldilocks 06-22-2012 08:55 PM

Help! I am clueless about regex. I have a Word document I saved as HTML Filtered (sure didn't seem to filter much!). I imported it into Calibre and converted to ePub. Between MSO and Calibre I ended up with over 41,000 :( rows in the CSS. Every paragraph has its own class. Examples:
<p class="MsoNormal79"><span class="calibre14">
<p class="MsoNormal80"><span class="calibre20">
<p class="MsoNormal81"><span class="calibre20">
<p class="MsoNormal82"><span class="calibre17">

I want them all to say:
<p class="paragraphtext">

Can I put something in find to replace them all at once? :help:

Karen

DiapDealer 06-22-2012 10:07 PM

You could very well end up with a disaster if you're not careful. I would start with the paragraphs first as spans can get a bit hairy.

If you're absolutely sure that you want to change everything that has a class name of "MsoNormalXX" (X being numerals) to "paragraphtext", then:

Find: <p class="MsoNormal\d+">
Replace: <p class="paragraphtext">

Make sure you have good backups in case things don't turn out the way you've planned.
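As a dry run before touching the book, the same Find/Replace can be tried on a few sample lines. This is only a sketch in Python with made-up names (`MSO_PARAGRAPH`, `normalize_paragraph_classes`); it returns the number of replacements so you can compare it against what you expect.

```python
import re

# <p> tags whose class is "MsoNormal" followed by one or more digits.
MSO_PARAGRAPH = re.compile(r'<p class="MsoNormal\d+">')

def normalize_paragraph_classes(html):
    """Return (new_html, number_of_replacements)."""
    return MSO_PARAGRAPH.subn('<p class="paragraphtext">', html)

sample = (
    '<p class="MsoNormal79"><span class="calibre14">\n'
    '<p class="MsoNormal80"><span class="calibre20">\n'
)
fixed, count = normalize_paragraph_classes(sample)
print(count)  # 2
print(fixed)
```

Note that `\d+` needs at least one digit, so a paragraph with plain class="MsoNormal" is left alone, and the spans are not touched at all.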


All times are GMT -4.