MobileRead Forums - View Single Post - removing the paragraphs tags if paragraph starts with lower case

lomkiri · 11-02-2024, 02:48 PM

I realized that when several consecutive paragraphs all begin with a lowercase letter, my regex catches only every other one. The match consumes the closing </p>, so the pointer stops after it. The next paragraph's match would need that same </p> as its starting point, so the regex skips it and only finds the one after, leaving one paragraph unchanged. Several passes would then be needed to catch every paragraph in the sequence (not a big deal, but inelegant).

This is easily solved by not capturing the final </p> and using a positive lookahead (for </p>) instead: the pointer stops before the </p>, so the regex is ready to match the next paragraph if it is a candidate.
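To see the pointer behaviour described here, a small Python sketch may help. It rests on two assumptions. First, the consuming pattern is only a guess at what the earlier regex looked like (closing </p> inside the capture group). Second, `[a-z]` is an ASCII stand-in for `\p{Ll}`, because the standard `re` module does not support Unicode property classes (the third-party `regex` module does). The sample HTML is made up.

```python
import re

# ASCII stand-in: stdlib `re` has no \p{Ll}, so [a-z] is used instead.
# CONSUMING is an assumed reconstruction of the earlier regex: it swallows
# the closing </p>, so the next match cannot start from it.
CONSUMING = re.compile(r"</p>\s*<p[^>]*>([a-z].*?</p>)")
# LOOKAHEAD only peeks at the closing </p>, leaving it for the next match.
LOOKAHEAD = re.compile(r"</p>\s*<p[^>]*>([a-z].*?)(?=</p>)")

# Hypothetical sample: three consecutive paragraphs starting in lowercase.
html = (
    "<p>First part</p>\n"
    "<p>second part</p>\n"
    "<p>third part</p>\n"
    "<p>fourth part</p>"
)

consumed = CONSUMING.findall(html)
looked = LOOKAHEAD.findall(html)

print(consumed)  # the "third part" paragraph is skipped
print(looked)    # every lowercase paragraph is found in one pass
```

With the consuming form, the match for "second part" eats the </p> that the match for "third part" would need, so only every other paragraph is found.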

With this regex, all paragraphs are targeted in the first pass:

Code:

</p>\s*<p[^>]*>(\p{Ll}.*?)(?=</p>)

Or, if we also want to target paragraphs starting with <space><lowercase>:

Code:

</p>\s*<p[^>]*>(\s?\p{Ll}.*?)(?=</p>)

The replacement is still the same: \x20\1
(\x20 is a space)
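For anyone who wants to try the search and replace outside the editor, here is a minimal Python sketch of the first regex. It is an illustration under assumptions, not the exact editor behaviour: `[a-z]` stands in for `\p{Ll}` (the standard `re` module lacks Unicode property classes), the replacement is written as `r" \1"` because `re.sub` does not accept `\x20` in a replacement template, and the function name and sample HTML are hypothetical.

```python
import re

# ASCII stand-in for \p{Ll}; the lookahead leaves the closing </p> in place.
PATTERN = re.compile(r"</p>\s*<p[^>]*>([a-z].*?)(?=</p>)")


def merge_lowercase_paragraphs(html: str) -> str:
    """Join each paragraph that starts in lowercase onto the previous one."""
    # r" \1" plays the role of the post's "\x20\1": a space, then group 1.
    return PATTERN.sub(r" \1", html)


# Hypothetical sample: one sentence wrongly split across three paragraphs.
sample = (
    "<p>The sentence was</p>\n"
    '<p class="calibre1">split here</p>\n'
    "<p>and again here.</p>\n"
    "<p>A new paragraph.</p>"
)

merged = merge_lowercase_paragraphs(sample)
print(merged)
```

Both lowercase paragraphs are merged in a single call, and the paragraph starting with an uppercase letter is left alone. Note that `.` does not cross line breaks here, so a paragraph spanning several lines would need the dot-all option.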

Last edited by lomkiri; 11-02-2024 at 05:26 PM.