MobileRead Forums

MobileRead Forums (https://www.mobileread.com/forums/index.php)
-   Sigil (https://www.mobileread.com/forums/forumdisplay.php?f=203)
-   -   regex newbie search end of string char problem (https://www.mobileread.com/forums/showthread.php?t=333897)

michaelbr 10-10-2020 11:30 AM

regex newbie search end of string char problem
 
I have a text file with several paragraphs. I'd like to search for paragraphs ending with *[a-zA-Z]</p>. Here is an example:
paragraph 1: .....
Code:

.’</p>
paragraph 2: .....
Code:

.</p>
paragraph 3: ......
Code:

</p>
The ..... can be either letters or numbers. I'd like to find only paragraph 3. I tried this regex:
Code:

([^.]|[^.’])<\/p>$
but it's not working. Can someone please tell me the best way to search for this string?

theducks 10-10-2020 12:28 PM

I prefer to do my Joins individually by type. I also only use Replace All for these 2 (I have a number of others for special instances that I step through, skipping false positives).
(The code was snipped from my saved Search file, so the things shown are 'escaped'. They also take into consideration valid punctuation marks.)
Code:

74\Name=Cleanup/Joins/Join to upper
74\Find="([[:alpha:],][\"\x201d\xe2\x80\x9d]*)</p>\\s*<p\\b[^>]*>([A-Z\xe2\x80\x9c\"])"
74\Replace=\\1 \\2

75\Name=Cleanup/Joins/To Lower
75\Find="\\s*([a-z],*)</p>\\s+<p class=\"calibre1\">([a-z])"
75\Replace=\\1 \\2


michaelbr 10-10-2020 01:19 PM

Thanks for the tips, it's solved.

Tex2002ans 10-12-2020 03:35 PM

Quote:

Originally Posted by michaelbr (Post 4045466)
I tried this regex
Code:

([^.]|[^.’])<\/p>$
, but it's not working, can someone please tell me what's the best way to search for this string?

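Outside of square brackets, the . is a very special symbol in Regex. It stands for "any character". If you want to look for an actual period, you'll want to add a \ before it (inside brackets, like [^.], it is already treated as a literal period):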

. = any character
\. = a period
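
To make the difference concrete, here is a small sketch in Python. Python's re module is an assumption here (Sigil uses its own regex engine), but these basics behave the same way. The sample paragraphs are made up to mirror the three endings from the first post. It also suggests why the original pattern matched too much: both alternatives accept ’, so the alternation acts like a plain [^.].

```python
import re

# Outside brackets, an unescaped dot matches any character (except a newline).
any_char = re.search(r"b.d", "bxd") is not None        # matches
# An escaped dot, or a dot inside brackets, only matches a real period.
literal_only = re.search(r"b\.d", "bxd") is not None   # no match
in_brackets = re.search(r"b[.]d", "bxd") is not None   # no match

# Hypothetical sample lines with the three endings from the first post.
samples = ["He left.’</p>", "He left.</p>", "He left</p>"]

# Pattern from the first post: [^.] already accepts ’, so paragraph 1 matches.
original = r"([^.]|[^.’])</p>$"
# A single class that excludes both characters keeps only paragraph 3.
fixed = r"[^.’]</p>$"

original_hits = [s for s in samples if re.search(original, s)]
fixed_hits = [s for s in samples if re.search(fixed, s)]

print(original_hits)
print(fixed_hits)
```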

Quote:

Originally Posted by michaelbr (Post 4045466)
I have a text file with several paragraphs, I'd like to search for paragraphs ending with *[a-zA-Z]</p>, [...]

Can you try to explain, in words, what's the issue you're trying to solve? And give a few more examples of before/after?

From what I can tell, I think you're trying to find paragraphs without a closing punctuation mark. (aka, paragraphs that end in a letter.)

Like if you're taking an OCRed book, and trying to combine broken lines together:

Code:

<p>This is a copied and</p>
<p>pasted paragraph from the</p>
<p>book.</p>
<p>And true paragraph 2.</p>

After:

Code:

<p>This is a copied and pasted paragraph from the book.</p>
<p>And true paragraph 2.</p>

* * *

Here are the 3 sets of Regex I personally use:

Note: DO NOT do a "Replace All". Replace most of these on a case-by-case basis. Also, make sure to save a backup copy of your file.

Regex #1 (Hyphens)

This searches for a hyphen at the end of a paragraph:

Search: -</p>\s+<p>
Replace: (LEAVE THIS COMPLETELY BLANK)

OR alternate:

Search: -</p>\s+<p>
Replace: -

Example:

Code:

<p>This example is where the pre-</p>
<p>split occurs.</p>
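
As a rough illustration of the two Replace choices, here is the same search run through Python's re.sub (an assumption for demonstration only; in Sigil you would step through the matches one at a time rather than replacing everything at once).

```python
import re

# The example from the post, with a line break between the two paragraphs.
broken = "<p>This example is where the pre-</p>\n<p>split occurs.</p>"

HYPHEN_BREAK = r"-</p>\s+<p>"

# Blank replacement: the hyphen is dropped and the word halves are glued together.
dropped = re.sub(HYPHEN_BREAK, "", broken)
# Alternate replacement: the hyphen is kept for genuinely hyphenated words.
kept = re.sub(HYPHEN_BREAK, "-", broken)

print(dropped)
print(kept)
```

Which of the two is right depends on the word, which is one reason to decide these case by case.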

Regex #2 (Not Closing Punctuation)

This searches for everything that's NOT a period, exclamation point, question mark, etc.:

Search: ([^>”\?\!\.])</p>\s+<p>
Replace: \1 <---- Make sure you put a space after.

Example:

Code:

<p>This is an example</p>
<p>sentence where the person,</p>
<p>places, and things occur.</p>
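
Here is a minimal sketch of Regex #2 applied to that example, using Python's re module as a stand-in for Sigil's search (an assumption). The extra last paragraph is added only to show that a real paragraph break after a period is left alone. Note the space after \1 in the replacement, which keeps the joined words apart.

```python
import re

text = (
    "<p>This is an example</p>\n"
    "<p>sentence where the person,</p>\n"
    "<p>places, and things occur.</p>\n"
    "<p>And true paragraph 2.</p>"
)

# Any character that is not >, ”, ?, ! or . right before the closing tag.
NO_CLOSING_PUNCT = r"([^>”\?\!\.])</p>\s+<p>"

# Put back the captured character, followed by a single space.
joined = re.sub(NO_CLOSING_PUNCT, r"\1 ", text)

print(joined)
```

re.sub replaces every match at once, which is fine for a demo but is exactly the "Replace All" the post warns against for a real book.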

Note: You can easily add different "valid" punctuation endings as needed. Like a colon may or may not be:

In Fiction, colons likely occur within sentences.
In Non-Fiction, colons likely occur at the end of paragraphs.

Regex #3 (Lowercase Start)

This searches for a lowercase letter at the very beginning of the paragraph:

Search: <p>[a-z]

I make sure to run this after #1 and #2 to catch any strays, then decide these on a case-by-case basis.

Example:

Code:

<p>The fishy “car dealership”</p>
<p>was called Mr. X’s Emporium.</p>
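
A short sketch of why this third pass is useful, again in Python as an assumed stand-in for Sigil's search: the first line ends in a closing quote, so Regex #2 skips it, but the lowercase start of the next line still flags the break for review.

```python
import re

text = (
    "<p>The fishy “car dealership”</p>\n"
    "<p>was called Mr. X’s Emporium.</p>"
)

NO_CLOSING_PUNCT = r"([^>”\?\!\.])</p>\s+<p>"   # Regex #2
LOWERCASE_START = r"<p>[a-z]"                    # Regex #3

# Regex #2 finds nothing, because ” counts as valid closing punctuation.
missed_by_regex2 = re.search(NO_CLOSING_PUNCT, text) is None

# Regex #3 lists the paragraphs that begin with a lowercase letter.
candidates = [line for line in text.splitlines() if re.match(LOWERCASE_START, line)]

print(missed_by_regex2)
print(candidates)
```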


michaelbr 10-13-2020 03:30 PM

Hi Tex2002ans, thanks so much for your detailed explanation. That's exactly what I'm trying to do. I had been using something like your Regex #2 (partially, searching only for small letters at the end), but yours is much better, so I'll use yours instead. Again, thanks so much for sharing.

Tex2002ans 10-13-2020 04:48 PM

Quote:

Originally Posted by michaelbr (Post 4046616)
Hi Tex2002ans, thanks so much for your detailed explanation, that's exactly what I'm trying to do,

Glad to see I guessed correctly.

Quote:

Originally Posted by michaelbr (Post 4046616)
I used your solution Regex #2 (partially, searching for small letters at the end), but yours is much better, I'll use yours instead.

If you're looking for lowercase letters at the end, you could also use something like this:

Search: ([a-z])</p>\s+<p>
Replace: \1 <---- Make sure you put a space after.

Code:

<p>This is an example</p>
<p>sentence. But THIS LINE</p>
<p>won't match.</p>
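
For comparison, a quick sketch of this lowercase-only variant in Python (assumed here purely for illustration). Only the first break is joined; the line ending in capitals is left untouched.

```python
import re

text = (
    "<p>This is an example</p>\n"
    "<p>sentence. But THIS LINE</p>\n"
    "<p>won't match.</p>"
)

# Only a lowercase letter directly before </p> counts as a broken line.
LOWER_END = r"([a-z])</p>\s+<p>"

# Keep the captured letter and add the space after it.
joined = re.sub(LOWER_END, r"\1 ", text)

print(joined)
```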

but I think my Regexes are better. :P

michaelbr 10-15-2020 02:54 PM

Yes, certainly, yours are much better, thanks for sharing.


All times are GMT -4.
