regex newbie search end of string char problem

michaelbr · 10-10-2020, 10:30 AM

I have a text file with several paragraphs, and I'd like to search for paragraphs ending with *[a-zA-Z]</p>. Here is an example:
paragraph 1: .....

Code:

.’</p>

paragraph 2: .....

Code:

.</p>

paragraph 3: ......

Code:

</p>

The .... can be either letters or numbers. I'd like to find only paragraph 3. I tried this regex:

Code:

([^.]|[^.’])<\/p>$

but it's not working. Can someone please tell me the best way to search for this string?

theducks · 10-10-2020, 11:28 AM

I prefer to do my Joins individually by type. I also only use Replace ALL for these 2 (I have a number of others for special instances that I step thru and Skip false positives)
(The code was snipped from my saved Search file, so things shown are 'escaped'. They also take into consideration valid punctuation marks.)

Code:

74\Name=Cleanup/Joins/Join to upper
74\Find="([[:alpha:],][\"\x201d\xe2\x80\x9d]*)</p>\\s*<p\\b[^>]*>([A-Z\xe2\x80\x9c\"])"
74\Replace=\\1 \\2

75\Name=Cleanup/Joins/To Lower
75\Find="\\s*([a-z],*)</p>\\s+<p class=\"calibre1\">([a-z])"
75\Replace=\\1 \\2
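
For anyone who wants to try the "To Lower" join outside the editor, here is a minimal Python sketch. It assumes the saved-search file only doubles backslashes and backslash-escapes double quotes; that unescaping rule, the sample HTML, and the variable names are assumptions for illustration, not something taken from the saved file's documentation.

```python
import re

# The Find value exactly as it appears in the saved-search snippet.
saved = r'\\s*([a-z],*)</p>\\s+<p class=\"calibre1\">([a-z])'

# Assumed unescaping: \" becomes " and \\ becomes \.
pattern = saved.replace(r'\"', '"').replace('\\\\', '\\')

# Hypothetical sample: one sentence broken across two paragraphs.
html = (
    '<p class="calibre1">the line was broken</p>\n'
    '<p class="calibre1">in the middle.</p>'
)

# The saved Replace value \\1 \\2 unescapes to \1 \2 (group 1, space, group 2).
joined = re.sub(pattern, r"\1 \2", html)
print(pattern)
print(joined)
```

The pattern only fires when a lowercase letter (optionally followed by commas) ends one paragraph and a lowercase letter starts the next, which is why it is a reasonably safe candidate for Replace ALL.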

michaelbr · 10-10-2020, 12:19 PM

Quote:

Originally Posted by theducks

I prefer to do my Joins individually by type. I also only use Replace ALL for these 2 (I have a number of others for special instances that I step thru and Skip false positives)
(The code was snipped from my saved Search file, so things shown are 'escaped'. They also take into consideration valid punctuation marks.)

Code:

74\Name=Cleanup/Joins/Join to upper
74\Find="([[:alpha:],][\"\x201d\xe2\x80\x9d]*)</p>\\s*<p\\b[^>]*>([A-Z\xe2\x80\x9c\"])"
74\Replace=\\1 \\2

75\Name=Cleanup/Joins/To Lower
75\Find="\\s*([a-z],*)</p>\\s+<p class=\"calibre1\">([a-z])"
75\Replace=\\1 \\2

Thanks for the tips, it's solved.

Tex2002ans · 10-12-2020, 02:35 PM

Quote:

Originally Posted by michaelbr

I tried this regex

Code:

([^.]|[^.’])<\/p>$

, but it's not working, can someone please tell me what's the best way to search for this string?

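The . is a very special symbol in Regex. Outside of square brackets, it stands for "any character". If you want to look for an actual period there, you'll want to add a \ before it (inside square brackets, like [^.], it already means a plain period):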

. = any character
\. = a period
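
A quick way to see the difference is to test both forms in Python's `re` module. The sketch below also compares the pattern from the question with a single negated character class. That single-class rewrite is only an assumption about what was intended, and the sample endings are made up to mirror the three paragraphs in the question.

```python
import re

samples = ["abc", "a.c"]

# Unescaped dot: any single character, so both samples match.
dot_hits = [s for s in samples if re.fullmatch(r"a.c", s)]
# Escaped dot: only a literal period matches.
escaped_hits = [s for s in samples if re.fullmatch(r"a\.c", s)]

# Paragraph endings modelled on the question (letters or digits before </p>).
endings = [".’</p>", ".</p>", "a</p>", "7</p>"]

# The pattern from the question. The first alternative [^.] accepts ’,
# so the ending .’</p> still matches.
original = re.compile(r"([^.]|[^.’])</p>$")
# Assumed intent: one class that rejects both the period and the quote.
single_class = re.compile(r"[^.’]</p>$")

original_hits = [s for s in endings if original.search(s)]
fixed_hits = [s for s in endings if single_class.search(s)]

print(dot_hits, escaped_hits)
print(original_hits)
print(fixed_hits)
```

With these samples, the alternation lets the `.’</p>` ending through, while the single class keeps only the endings that stop on a letter or digit.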

Quote:

Originally Posted by michaelbr

I have a text file with several paragraphs, I'd like to search for paragraphs ending with *[a-zA-Z], [...]

Can you try to explain, in words, what's the issue you're trying to solve? And give a few more examples of before/after?

From what I can tell, I think you're trying to find paragraphs without a closing punctuation mark. (aka, paragraphs that end in a letter.)

Like if you're taking an OCRed book, and trying to combine broken lines together:

Code:

<p>This is a copied and</p>
<p>pasted paragraph from the</p>
<p>book.</p>
<p>And true paragraph 2.</p>

After:

Code:

<p>This is a copied and pasted paragraph from the book.</p>
<p>And true paragraph 2.</p>

* * *

Here are the 3 sets of Regex I personally use:

Note: DO NOT do a "Replace All". Replace most of these on a case-by-case basis. Also, make sure to save a backup copy of your file.

Regex #1 (Hyphens)

This searches for a hyphen at the end of a paragraph:

Search: -</p>\s+<p>
Replace: (LEAVE THIS COMPLETELY BLANK)

OR alternate:

Search: -</p>\s+<p>
Replace: -

Example:

Code:

<p>This example is where the pre-</p>
<p>split occurs.</p>
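
Here is a small Python sketch of both replacements on the example above. It assumes the search is meant to span the paragraph boundary (a hyphen, then `</p>`, whitespace, and a plain `<p>` with no attributes); the constant name `HYPHEN_SPLIT` is made up for illustration.

```python
import re

# Assumed form of the search: hyphen right before </p>, then the next plain <p>.
HYPHEN_SPLIT = re.compile(r"-</p>\s+<p>")

html = "<p>This example is where the pre-</p>\n<p>split occurs.</p>"

# Blank replacement: the hyphen was only a line-break hyphen, so drop it.
drop_hyphen = HYPHEN_SPLIT.sub("", html)
# Alternate replacement: the hyphen belongs to the word, so keep it.
keep_hyphen = HYPHEN_SPLIT.sub("-", html)

print(drop_hyphen)  # <p>This example is where the presplit occurs.</p>
print(keep_hyphen)  # <p>This example is where the pre-split occurs.</p>
```

Which of the two is right depends on the word ("pre-split" keeps its hyphen, while a word that was merely broken at a line end would not), which is the reason to step through these one at a time.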

Regex #2 (Not Closing Punctuation)

This searches for paragraphs that end in anything that's NOT a period, exclamation point, question mark, etc.:

Search: ([^>”\?\!\.])</p>\s+<p>
Replace: \1 <---- Make sure you put a space after.

Example:

Code:

<p>This is an example</p>
<p>sentence where the person,</p>
<p>places, and things occur.</p>

Note: You can easily add different "valid" punctuation endings as needed. Like a colon may or may not be:

In Fiction, colons likely occur within sentences.
In Non-Fiction, colons likely occur at the end of paragraphs.
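
The "add your own valid endings" idea can be sketched in Python like this. It assumes the search spans `</p>`, whitespace, and a plain `<p>`; the helper `build_join_pattern` and the sample text are hypothetical. It runs a replace-all only because the sample is tiny; on a real book the matches would still be reviewed one by one.

```python
import re

def build_join_pattern(closers="”?!."):
    """Match a paragraph break whose previous character is not a closer.

    '>' is always excluded so paragraphs ending in another tag are skipped.
    """
    excluded = re.escape(">" + closers)
    return re.compile(rf"([^{excluded}])</p>\s+<p>")

html = (
    "<p>This is an example</p>\n"
    "<p>sentence where the person,</p>\n"
    "<p>places, and things occur.</p>\n"
    "<p>The list follows:</p>\n"
    "<p>Next paragraph.</p>"
)

# Fiction-style: a colon is not a valid ending, so that break is joined too.
fiction = build_join_pattern().sub(r"\1 ", html)
# Non-fiction-style: a colon counts as a valid paragraph ending.
nonfiction = build_join_pattern("”?!.:").sub(r"\1 ", html)

print(fiction)
print(nonfiction)
```

Adding a character to `closers` is the equivalent of typing it inside the square brackets of the search.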

Regex #3 (Lowercase Start)

This searches for a lowercase letter at the very beginning of the paragraph:

Search: <p>[a-z]

I make sure to run this after #1 and #2 to catch any strays, then decide these on a case-by-case basis.

Example:

Code:

<p>The fishy “car dealership”</p>
<p>was called Mr. X’s Emporium.</p>
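
Since this one is a search without a replace, a Python sketch of it just lists the candidates for review. It assumes each paragraph sits on its own line and opens with a plain `<p>`; the third sample paragraph is made up to show a false positive.

```python
import re

# Assumed form of the search: a plain <p> followed directly by a lowercase letter.
LOWER_START = re.compile(r"<p>[a-z]")

html = (
    "<p>The fishy “car dealership”</p>\n"
    "<p>was called Mr. X’s Emporium.</p>\n"
    "<p>iPhones were sold there too.</p>"
)

# Collect (line number, text) pairs to look at by hand.
candidates = [
    (number, line)
    for number, line in enumerate(html.splitlines(), start=1)
    if LOWER_START.match(line)
]

for number, line in candidates:
    print(number, line)
```

Line 2 is a real broken paragraph, while line 3 legitimately starts with a lowercase letter, so the list is a set of things to check rather than things to join automatically.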

michaelbr · 10-13-2020, 02:30 PM

Quote:

Originally Posted by Tex2002ans

From what I can tell, I think you're trying to find paragraphs without a closing punctuation mark. (aka, paragraphs that end in a letter.)

Like if you're taking an OCRed book, and trying to combine broken lines together:

Code:

<p>This is a copied and</p>
<p>pasted paragraph from the</p>
<p>book.</p>
<p>And true paragraph 2.</p>

After:

Code:

<p>This is a copied and pasted paragraph from the book.</p>
<p>And true paragraph 2.</p>

Hi Tex2002ans, thanks so much for your detailed explanation, that's exactly what I'm trying to do, I used your solution Regex #2 (partially, searching for small letters at the end), but yours is much better, I'll use yours instead. Again thanks so much for sharing.

Tex2002ans · 10-13-2020, 03:48 PM

Quote:

Originally Posted by michaelbr

Hi Tex2002ans, thanks so much for your detailed explanation, that's exactly what I'm trying to do,

Glad to see I guessed correctly.

Quote:

Originally Posted by michaelbr

I used your solution Regex #2 (partially, searching for small letters at the end), but yours is much better, I'll use yours instead.

If you're looking for lowercase letters at the end, you could also use something like this:

Search: ([a-z])</p>\s+<p>
Replace: \1 <---- Make sure you put a space after.

Code:

<p>This is an example</p>
<p>sentence. But THIS LINE</p>
<p>won't match.</p>

but I think my Regexes are better. :P
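
To make the comparison concrete, here is a Python sketch that runs both approaches on the example above. It assumes both searches span `</p>`, whitespace, and a plain `<p>`; the constant names are made up for illustration.

```python
import re

# Lowercase-only variant: joins only when the paragraph ends in a-z.
LOWER_END = re.compile(r"([a-z])</p>\s+<p>")
# "Not closing punctuation" variant: joins unless the paragraph ends in > ” ? ! or .
NOT_CLOSED = re.compile(r"([^>”\?\!\.])</p>\s+<p>")

html = (
    "<p>This is an example</p>\n"
    "<p>sentence. But THIS LINE</p>\n"
    "<p>won't match.</p>"
)

lower_only = LOWER_END.sub(r"\1 ", html)
not_closed = NOT_CLOSED.sub(r"\1 ", html)

print(lower_only)
print(not_closed)
```

The lowercase-only search leaves the break after "THIS LINE" untouched because the paragraph ends in an uppercase letter, while the negated class catches it.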

michaelbr · 10-15-2020, 01:54 PM

Quote:

Originally Posted by Tex2002ans

Glad to see I guessed correctly.

If you're looking for lowercase letters at the end, you could also use something like this:

Search: ([a-z])</p>\s+<p>
Replace: \1 <---- Make sure you put a space after.

Code:

<p>This is an example</p>
<p>sentence. But THIS LINE</p>
<p>won't match.</p>

but I think my Regexes are better. :P

Yes, certainly, yours are much better, thanks for sharing.
