Old 10-10-2020, 10:30 AM   #1
michaelbr
Connoisseur
Posts: 77
Karma: 10
Join Date: Aug 2010
Location: Murcia/Spain
Device: Android 12
regex newbie search end of string char problem

I have a text file with several paragraphs, and I'd like to search for paragraphs ending with *[a-zA-Z]</p>. Here is an example:
paragraph 1: .....
Code:
.’</p>
paragraph 2: .....
Code:
.</p>
paragraph 3: ......
Code:
</p>
The .... can be either letters or numbers. I'd like to find only paragraph 3. I tried this regex:
Code:
([^.]|[^.’])<\/p>$
but it's not working. Can someone please tell me the best way to search for this string?
Old 10-10-2020, 11:28 AM   #2
theducks
Well trained by Cats
Posts: 29,798
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
I prefer to do my Joins individually by type. I also only use Replace ALL for these 2 (I have a number of others for special instances that I step through, skipping false positives).
(The code was snipped from my saved Search file, so things shown are 'escaped'. They also take into consideration valid punctuation marks.)
Code:
74\Name=Cleanup/Joins/Join to upper
74\Find="([[:alpha:],][\"\x201d\xe2\x80\x9d]*)</p>\\s*<p\\b[^>]*>([A-Z\xe2\x80\x9c\"])"
74\Replace=\\1 \\2

75\Name=Cleanup/Joins/To Lower
75\Find="\\s*([a-z],*)</p>\\s+<p class=\"calibre1\">([a-z])"
75\Replace=\\1 \\2
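For anyone reading the saved-search snippet outside of Sigil, here is a rough sketch of entry 75 with the file's extra escaping removed, run through Python's `re` module. Python is an assumption here (Sigil's own engine is PCRE), and the sample text and the name `join_to_lower` are made up for illustration.

```python
import re

# Entry 75 ("To Lower") with the saved-search escaping removed:
# \\s becomes \s and \" becomes a plain double quote.
FIND = r'\s*([a-z],*)</p>\s+<p class="calibre1">([a-z])'
REPLACE = r'\1 \2'


def join_to_lower(html: str) -> str:
    """Join a paragraph ending in a lowercase letter (plus optional commas)
    to the next paragraph when that one starts with a lowercase letter."""
    return re.sub(FIND, REPLACE, html)


# Hypothetical sample text, not taken from a real book.
sample = (
    '<p class="calibre1">This line was broken</p>\n'
    '<p class="calibre1">in the middle.</p>'
)
print(join_to_lower(sample))
```

Note that this pattern only fires when the second paragraph carries exactly `class="calibre1"`, so it would need adjusting for books with other class names.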
Old 10-10-2020, 12:19 PM   #3
michaelbr
Thanks for the tips, it's solved.
Old 10-12-2020, 02:35 PM   #4
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by michaelbr View Post
I tried this regex
Code:
([^.]|[^.’])<\/p>$
but it's not working. Can someone please tell me the best way to search for this string?
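The . is a very special symbol in Regex. Outside of square brackets, it stands for "any character". If you want to look for an actual period there, you'll want to add a \ before it:

. = any character
\. = a period

Inside square brackets, like your [^.], the . is already a plain period.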
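A quick way to see what the original pattern does is to try it on the three paragraph endings from the first post. The sketch below uses Python's `re` module as a stand-in for Sigil's PCRE engine (an assumption), and the word "said" is filler text. The `narrowed` pattern is only one possible fix, not necessarily the one anybody in this thread ended up using.

```python
import re

# Endings of paragraphs 1, 2 and 3 from the first post ("said" is filler).
endings = ["said.’</p>", "said.</p>", "said</p>"]

# Outside brackets, an unescaped . matches any character, \. only a period.
assert re.search(r".</p>$", "said</p>")
assert not re.search(r"\.</p>$", "said</p>")

# The pattern from the first post: [^.] on its own already accepts ’,
# so the second alternative never narrows anything.
original = re.compile(r"([^.]|[^.’])<\/p>$")

# Putting both characters in ONE negated class rejects paragraphs 1 and 2.
narrowed = re.compile(r"[^.’]</p>$")

original_hits = [e for e in endings if original.search(e)]
narrowed_hits = [e for e in endings if narrowed.search(e)]
print(original_hits)
print(narrowed_hits)
```

With these inputs the original pattern finds paragraphs 1 and 3, while the single negated class finds only paragraph 3.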

Quote:
Originally Posted by michaelbr View Post
I have a text file with several paragraphs, I'd like to search for paragraphs ending with *[a-zA-Z]</p>, [...]
Can you try to explain, in words, what's the issue you're trying to solve? And give a few more examples of before/after?

From what I can tell, I think you're trying to find paragraphs without a closing punctuation mark. (aka, paragraphs that end in a letter.)

Like if you're taking an OCRed book, and trying to combine broken lines together:

Code:
<p>This is a copied and</p>
<p>pasted paragraph from the</p>
<p>book.</p>
<p>And true paragraph 2.</p>
After:

Code:
<p>This is a copied and pasted paragraph from the book.</p>
<p>And true paragraph 2.</p>
* * *

Here are the 3 sets of Regex I personally use:

Note: DO NOT do a "Replace All". Replace most of these on a case-by-case basis. Also, make sure to save a backup copy of your file.

Regex #1 (Hyphens)

This searches for a hyphen at the end of a paragraph:

Search: -</p>\s+<p>
Replace: (LEAVE THIS COMPLETELY BLANK)

OR alternate:

Search: -</p>\s+<p>
Replace: -

Example:

Code:
<p>This example is where the pre-</p>
<p>split occurs.</p>
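As a rough illustration of Regex #1 outside of Sigil, here is the same search in Python's `re` module (an assumption; Sigil uses PCRE). The function name `join_hyphen_breaks` and the `keep_hyphen` switch are invented for this sketch. It replaces every match at once, which the note above warns against for a real book, so treat it as a demo on a small sample only.

```python
import re

# Regex #1: a hyphen right before a paragraph break.
HYPHEN_BREAK = re.compile(r"-</p>\s+<p>")


def join_hyphen_breaks(html: str, keep_hyphen: bool = False) -> str:
    """Join paragraphs split at a hyphen.

    keep_hyphen=False mirrors the blank replacement,
    keep_hyphen=True mirrors the alternate replacement of "-".
    """
    return HYPHEN_BREAK.sub("-" if keep_hyphen else "", html)


sample = "<p>This example is where the pre-</p>\n<p>split occurs.</p>"
print(join_hyphen_breaks(sample))
print(join_hyphen_breaks(sample, keep_hyphen=True))
```

Which replacement is right depends on the word: "pre-split" keeps its hyphen, while a word that was only hyphenated for the line break should lose it.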
Regex #2 (Not Closing Punctuation)

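This searches for paragraphs whose last character is NOT a closing punctuation mark (period, exclamation point, question mark, closing quote) or the > of another closing tag:

Search: ([^>”\?\!\.])</p>\s+<p>
Replace: \1 (followed by a space, so the joined words don't run together)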

Example:

Code:
<p>This is an example</p>
<p>sentence where the person,</p>
<p>places, and things occur.</p>
Note: You can easily add different "valid" punctuation endings as needed. For example, whether a colon counts as a valid ending depends on the book:

In Fiction, colons likely occur within sentences.
In Non-Fiction, colons likely occur at the end of paragraphs.
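To make the "add endings as needed" idea concrete, here is a small sketch that builds Regex #2 from a string of accepted closing characters. It uses Python's `re` module rather than Sigil's PCRE engine (an assumption), and the names `build_join_pattern`, `join_broken_paragraphs` and `closing` are hypothetical helpers for illustration.

```python
import re


def build_join_pattern(closing: str = ">”?!."):
    """Compile Regex #2: a paragraph break whose preceding character
    is NOT one of the characters in `closing`."""
    return re.compile(r"([^" + re.escape(closing) + r"])</p>\s+<p>")


def join_broken_paragraphs(html: str, closing: str = ">”?!.") -> str:
    """Replace each match with the captured character plus a space."""
    return build_join_pattern(closing).sub(r"\1 ", html)


sample = (
    "<p>This is an example</p>\n"
    "<p>sentence where the person,</p>\n"
    "<p>places, and things occur.</p>"
)
print(join_broken_paragraphs(sample))

# Hypothetical non-fiction case: treat a colon as a valid paragraph ending.
colon_sample = "<p>He listed three things:</p>\n<p>Apples.</p>"
print(join_broken_paragraphs(colon_sample, closing=">”?!.:"))
```

With the default set the colon paragraph would be joined to the next one. Adding `:` to `closing` leaves it alone.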

Regex #3 (Lowercase Start)

This searches for a lowercase letter at the very beginning of the paragraph:

Search: <p>[a-z]

I make sure to run this after #1 and #2 to catch any strays, then decide these on a case-by-case basis.

Example:

Code:
<p>The fishy “car dealership”</p>
<p>was called Mr. X’s Emporium.</p>
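Since Regex #3 is a search with no replacement, a scripted version would only list candidates for manual review. The sketch below does that with Python's `re` module (an assumption; the thread itself is about Sigil). The helper name `find_lowercase_starts` is made up, and it assumes one paragraph per line.

```python
import re

# Regex #3: a paragraph that starts with a lowercase letter.
LOWERCASE_START = re.compile(r"<p>[a-z]")


def find_lowercase_starts(html: str):
    """Return (line number, line) pairs for every line containing a
    paragraph that starts with a lowercase letter."""
    hits = []
    for number, line in enumerate(html.splitlines(), start=1):
        if LOWERCASE_START.search(line):
            hits.append((number, line))
    return hits


sample = (
    "<p>The fishy “car dealership”</p>\n"
    "<p>was called Mr. X’s Emporium.</p>"
)
for number, line in find_lowercase_starts(sample):
    print(number, line)
```

Here the first paragraph ends in a closing quote, so Regex #2 skips it, and only the lowercase start of the second paragraph reveals the break.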

Last edited by Tex2002ans; 10-12-2020 at 02:42 PM.
Old 10-13-2020, 02:30 PM   #5
michaelbr
Hi Tex2002ans, thanks so much for your detailed explanation. That's exactly what I'm trying to do. I used your Regex #2 solution (partially, searching for small letters at the end), but yours is much better, so I'll use yours instead. Thanks again for sharing.
Old 10-13-2020, 03:48 PM   #6
Tex2002ans
Wizard
Tex2002ans ought to be getting tired of karma fortunes by now.Tex2002ans ought to be getting tired of karma fortunes by now.Tex2002ans ought to be getting tired of karma fortunes by now.Tex2002ans ought to be getting tired of karma fortunes by now.Tex2002ans ought to be getting tired of karma fortunes by now.Tex2002ans ought to be getting tired of karma fortunes by now.Tex2002ans ought to be getting tired of karma fortunes by now.Tex2002ans ought to be getting tired of karma fortunes by now.Tex2002ans ought to be getting tired of karma fortunes by now.Tex2002ans ought to be getting tired of karma fortunes by now.Tex2002ans ought to be getting tired of karma fortunes by now.
 
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by michaelbr View Post
Hi Tex2002ans, thanks so much for your detailed explanation, that's exactly what I'm trying to do,
Glad to see I guessed correctly.

Quote:
Originally Posted by michaelbr View Post
I used your solution Regex #2 (partially, searching for small letters at the end), but yours is much better, I'll use yours instead.
If you're looking for lowercase letters at the end, you could also use something like this:

Search: ([a-z])</p>\s+<p>
Replace: \1 <---- Make sure you put a space after.

Code:
<p>This is an example</p>
<p>sentence. But THIS LINE</p>
<p>won't match.</p>
but I think my Regexes are better. :P
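For comparison, here is the lowercase-only variant tried on the example above, using Python's `re` module as a stand-in for Sigil (an assumption). It is just a sketch to show which breaks are joined and which are left alone.

```python
import re

# Lowercase-only variant: join only when the paragraph ends in a-z.
LOWERCASE_END = re.compile(r"([a-z])</p>\s+<p>")

sample = (
    "<p>This is an example</p>\n"
    "<p>sentence. But THIS LINE</p>\n"
    "<p>won't match.</p>"
)

# The replacement is the captured letter followed by a space.
joined = LOWERCASE_END.sub(r"\1 ", sample)
print(joined)
```

The first break is joined, but the one after "THIS LINE" is missed because it ends in an uppercase letter. That is the gap the negated punctuation class in Regex #2 covers.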
Old 10-15-2020, 01:54 PM   #7
michaelbr
Yes, certainly, yours are much better, thanks for sharing.