MobileRead Forums

flameproof 02-20-2012 09:34 AM

Regex - replace only part of a string - how?
 
I often have problems with books that were auto-converted by Calibre. Here is one issue:

The text often has wrong line breaks.

Example:

Code:

  <p class="calibre2">This is just a sample</p>

  <p class="calibre2">text with no meaning.</p>

I can find it with the string:

Code:

[a-z]</p>

  <p class="calibre2">

But when I replace it, then (of course) the last letter goes missing. Without the [a-z] I would also catch normal end-of-sentence line breaks.

Is there a way?


DOH! I found it!


Search string: (\w+)</p>

<p class="calibre2">

Replace with: \1
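
For readers who want to try the capture-group idea outside Sigil, here is a minimal Python sketch of the same search and replace. It is an approximation, not Sigil's own behavior: Python's `re` module stands in for Sigil's regex engine, `\s*` replaces the literal line break between the tags, and the trailing space in the replacement is an assumption added so the two halves of the sentence do not run together. The name `join_broken_paragraphs` is hypothetical.

```python
import re

html = (
    '<p class="calibre2">This is just a sample</p>\n'
    '\n'
    '  <p class="calibre2">text with no meaning.</p>\n'
    '\n'
    '  <p class="calibre2">A new sentence.</p>'
)

# Capture the word before the break so it can be written back with \1.
# \s* stands in for the newline and indentation between the two tags.
pattern = re.compile(r'(\w+)</p>\s*<p class="calibre2">')


def join_broken_paragraphs(text: str) -> str:
    # Assumption: a space is appended after \1 so that the joined
    # sentence keeps a gap between "sample" and "text".
    return pattern.sub(r'\1 ', text)


result = join_broken_paragraphs(html)
print(result)
```

Because the paragraph ending in "meaning." has a period rather than a word character before `</p>`, it is left alone, which is the point of anchoring the match on `(\w+)`.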

flameproof 02-20-2012 10:21 AM

Please let me add one more common problem: wrong hyphens (probably from the PDF)

Search: (\w)-(\w)
Replace with: \1\2

How can I make it case sensitive? I'd like to correct 'ele-phant' but not 'John-Bob'.

PeterT 02-20-2012 10:29 AM

Quote:

Originally Posted by flameproof (Post 1972722)
Please let me add one more common problem: wrong hyphens (probably from the PDF)

Search: (\w)-(\w)
Replace with: \1\2

How can I make it case sensitive? I'd like to correct 'ele-phant' but not 'John-Bob'.

I think a brute force approach would be
([a-z]\w)-(\w)

flameproof 02-20-2012 10:37 AM

Quote:

Originally Posted by PeterT (Post 1972729)
I think a brute force approach would be
([a-z]\w)-(\w)

Thanks.

It seems '([a-z])-([a-z])' with 'Match Cases' ticked works too.
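
The three hyphen patterns from this exchange can be compared side by side with a small Python sketch. It assumes Python's `re` module as a stand-in for Sigil's engine; `re` is case sensitive by default, which roughly corresponds to having 'Match Cases' ticked. The sample sentence is made up for illustration.

```python
import re

text = "An ele-phant met John-Bob."

# Original search: joins every hyphenated pair, names included.
any_pair = re.sub(r'(\w)-(\w)', r'\1\2', text)

# Brute-force variant: only the character two places before the hyphen
# has to be lowercase, so "hn-B" inside "John-Bob" still matches.
brute_force = re.sub(r'([a-z]\w)-(\w)', r'\1\2', text)

# Lowercase required on both sides of the hyphen: "n-B" is skipped.
both_lower = re.sub(r'([a-z])-([a-z])', r'\1\2', text)

print(any_pair)
print(brute_force)
print(both_lower)
```

In this sketch only the last pattern repairs 'ele-phant' while leaving 'John-Bob' untouched, because it is the only one that constrains the character after the hyphen.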

Serpentine 02-21-2012 08:24 PM

Use the POSIX character classes instead; for your example: ([[:lower:]])-([[:lower:]])

List of POSIX character classes

WS64 02-22-2012 12:11 PM

[[:lower:]]?

I guess you mean [:lower:]
But honestly, I find [a-z] way easier to write, especially since I often have to add some German letters, like [a-zäöüß].

DiapDealer 02-22-2012 12:37 PM

Quote:

I guess you mean [:lower:]

Nope. [[:lower:]] is the correct usage.

Timur 02-22-2012 12:50 PM

I used to write [a-z] for lowercase letters too, but since I discovered that the Unicode properties flag (*UCP) works in Sigil 0.5, I simply use character classes with it to cover non-ASCII letters.
\w, \W, [:lower:], [:upper:], [:alpha:], [:alnum:] are all affected by (*UCP).

WS64 02-22-2012 02:36 PM

Quote:

Originally Posted by DiapDealer (Post 1975774)
Nope. [[:lower:]] is the correct usage.

I really had to check these...
You are (of course) right, I was wrong.
I never tried those since I never saw a reason to use them...

Serpentine 02-22-2012 08:50 PM

Quote:

Originally Posted by WS64 (Post 1975969)
I never tried those since I never saw a reason to use them...

They're great, since they cover your Unicode characters too; for example, you don't have to add ä, ß, etc. because they are already understood to be lowercase.

[a-zäöüß] will all be captured by [[:lower:]], as well as a lot more edge cases which you might not have thought of, saving you time and making edits more complete. The punct class is also especially useful, and very often overlooked.

The more you know!

WS64 02-23-2012 03:51 AM

Quote:

Originally Posted by Serpentine (Post 1976466)
They're great, since they cover your Unicode characters too; for example, you don't have to add ä, ß, etc. because they are already understood to be lowercase.

I just checked. [[:lower:]] does NOT find äöüß.

Timur 02-23-2012 05:43 AM

@WS64: add (*UCP) in front of your pattern like:

Code:

(*UCP)\b[[:lower:]]+
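
The ASCII versus Unicode difference discussed above can be illustrated with a rough Python analogy. This is not a translation of the Sigil pattern: Python's standard `re` module supports neither `(*UCP)` nor `[[:lower:]]`. As assumptions for this sketch, the `re.ASCII` flag plays the role of matching without `(*UCP)`, the default Unicode matching plays the role of matching with it, and `str.islower()` stands in for the lowercase class.

```python
import re
import unicodedata

# Normalize to composed form so each umlaut is a single character.
sample = unicodedata.normalize("NFC", "äöüß")

# ASCII-only matching, comparable to leaving out (*UCP):
ascii_range = re.findall(r'[a-z]', sample)              # explicit ASCII range
ascii_word = re.findall(r'\w', sample, flags=re.ASCII)  # \w limited to ASCII

# Unicode-aware matching, comparable to adding (*UCP):
unicode_word = re.findall(r'\w', sample)

# Stand-in for a Unicode-aware lowercase class (assumption, see above).
unicode_lower = [ch for ch in sample if ch.islower()]

print(ascii_range, ascii_word)
print(unicode_word, unicode_lower)
```

The ASCII-only searches find nothing in "äöüß", while the Unicode-aware ones find all four letters, which mirrors the observation that `[[:lower:]]` only picks up these letters once Unicode properties are switched on.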


All times are GMT -4.