10-12-2020, 02:35 PM   #4
Tex2002ans
Quote:
Originally Posted by michaelbr
I tried this regex
Code:
([^.]|[^.’])<\/p>$
, but it's not working, can someone please tell me what's the best way to search for this string?
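The . is a very special symbol in Regex. Outside of square brackets, it stands for "any character" (inside a character class like [^.], it is already a literal period). If you want to look for an actual period outside of brackets, you'll want to add a \ before it: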

. = any character
\. = a period
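
To make the difference concrete, here is a minimal sketch using Python's re module (my choice for illustration; the search box in your editor may use a slightly different regex flavor). The sample string is made up:

```python
import re

# Made-up sample: one paragraph ends in a period, the other in a letter.
text = "end.</p> endX</p>"

# Unescaped dot: matches ANY single character in front of </p>.
any_char = re.findall(r".</p>", text)

# Escaped dot: matches only a literal period in front of </p>.
literal_period = re.findall(r"\.</p>", text)

# Inside square brackets the dot is already literal, so [^.] means
# "any character that is not a period".
not_period = re.findall(r"[^.]</p>", text)

print(any_char)
print(literal_period)
print(not_period)
```

The first search finds both endings, the second finds only the one with the period, and the third finds only the one without it.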

Quote:
Originally Posted by michaelbr
I have a text file with several paragraphs, I'd like to search for paragraphs ending with *[a-zA-Z]</p>, [...]
Can you try to explain, in words, what's the issue you're trying to solve? And give a few more examples of before/after?

From what I can tell, you're trying to find paragraphs without a closing punctuation mark (that is, paragraphs that end in a letter).

For example, when you take an OCRed book and need to merge broken lines back into full paragraphs:

Code:
<p>This is a copied and</p>
<p>pasted paragraph from the</p>
<p>book.</p>
<p>And true paragraph 2.</p>
After:

Code:
<p>This is a copied and pasted paragraph from the book.</p>
<p>And true paragraph 2.</p>
* * *

Here are the 3 sets of Regex I personally use:

Note: DO NOT do a "Replace All". Replace most of these on a case-by-case basis. Also, make sure to save a backup copy of your file.

Regex #1 (Hyphens)

This searches for a hyphen at the end of a paragraph:

Search: -</p>\s+<p>
Replace: (LEAVE THIS COMPLETELY BLANK)

OR alternate, if the hyphen belongs in the word:

Search: -</p>\s+<p>
Replace: -

Example:

Code:
<p>This example is where the pre-</p>
<p>split occurs.</p>
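
As a rough sketch of what the two replacements do to that example, here is the same search in Python's re module (an assumption for illustration; the variable names are mine). Note that sub() replaces every match at once, which is fine for a two-line demo but is exactly the "Replace All" to avoid on a real book:

```python
import re

html = "<p>This example is where the pre-</p>\n<p>split occurs.</p>"

# Regex #1: a hyphen right before </p>, then whitespace, then the next <p>.
HYPHEN_BREAK = re.compile(r"-</p>\s+<p>")

# Blank replacement: drops the hyphen and joins the two halves.
joined_no_hyphen = HYPHEN_BREAK.sub("", html)

# Alternate replacement: joins the halves but keeps the hyphen.
joined_keep_hyphen = HYPHEN_BREAK.sub("-", html)

print(joined_no_hyphen)
print(joined_keep_hyphen)
```

The blank replacement gives "presplit" and the alternate gives "pre-split", which is why each match needs a human decision.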
Regex #2 (Not Closing Punctuation)

This searches for paragraphs whose last character is NOT a period, exclamation point, question mark, closing quote, or > (the end of another tag):

Search: ([^>”\?\!\.])</p>\s+<p>
Replace: \1 (followed by a single space, so the joined words don't run together)

Example:

Code:
<p>This is an example</p>
<p>sentence where the person,</p>
<p>places, and things occur.</p>
Note: You can easily add other "valid" paragraph-ending punctuation inside the brackets as needed. Whether a colon counts as one depends on the book:

In Fiction, colons likely occur within sentences.
In Non-Fiction, colons likely occur at the end of paragraphs.
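
Here is a minimal Python sketch of Regex #2 with an adjustable list of "valid" endings. build_pattern is a hypothetical helper of mine, the colon sample is made up, and the replacement is assumed to be the captured character plus one space so the joined words stay separated:

```python
import re

def build_pattern(valid_endings=">”?!."):
    # Hypothetical helper: every character in valid_endings counts as a
    # legitimate paragraph ending and will NOT trigger a join.
    return re.compile(r"([^" + re.escape(valid_endings) + r"])</p>\s+<p>")

html = (
    "<p>This is an example</p>\n"
    "<p>sentence where the person,</p>\n"
    "<p>places, and things occur.</p>"
)

# Fiction-style settings: colons are not treated as valid endings.
fiction = build_pattern()
# Assumption: "\1 " puts the captured character back, followed by a space.
joined = fiction.sub(r"\1 ", html)
print(joined)

# Non-fiction-style settings: a colon is added as a valid ending.
nonfiction = build_pattern(">”?!.:")
colon_case = "<p>The steps are:</p>\n<p>First, open the file.</p>"
fiction_hit = fiction.search(colon_case)
nonfiction_hit = nonfiction.search(colon_case)
```

With the default set, the colon paragraph is flagged as a broken line; with the colon added, it is left alone.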

Regex #3 (Lowercase Start)

This searches for a lowercase letter at the very beginning of the paragraph:

Search: <p>[a-z]

I make sure to run this after #1 and #2 to catch any strays, then decide these on a case-by-case basis.

Example:

Code:
<p>The fishy “car dealership”</p>
<p>was called Mr. X’s Emporium.</p>
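
Since #3 is search-only, a sketch of it in Python (again just for illustration, with my own variable names and an extra sample paragraph) would list the suspects instead of replacing anything:

```python
import re

html = (
    "<p>The fishy “car dealership”</p>\n"
    "<p>was called Mr. X’s Emporium.</p>\n"
    "<p>And true paragraph 2.</p>"
)

# Regex #3: a paragraph whose first character is a lowercase letter.
LOWERCASE_START = re.compile(r"<p>[a-z]")

# Collect the suspicious paragraphs for manual review; nothing is changed.
candidates = [line for line in html.splitlines() if LOWERCASE_START.search(line)]

for line in candidates:
    print(line)
```

Regex #2 skips this break because the previous paragraph ends in ”, which is on its list of valid endings. The lowercase start is what gives it away.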

Last edited by Tex2002ans; 10-12-2020 at 02:42 PM.