Thread: a little help
08-11-2022, 08:52 PM   #12
Tex2002ans
Quote:
Originally Posted by DNSB
In the third step, to convert the multiple paragraphs to a single paragraph, I used a simple regex:
Good steps. I agree.
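
For anyone following along, here is a rough sketch of what that kind of "merge" step can look like. This is not DNSB's regex (it isn't reproduced in this post) and not my exact method either. It's a minimal Python illustration, where the sample text and the name `merge_broken_paragraphs` are made up for the example, and it assumes the only tags in play are plain `<p>` tags:

```python
import re

# Hypothetical OCR output: one sentence split across two <p> elements.
broken = (
    "<p>The quick brown fox jumped over</p>\n"
    "<p>the lazy dog.</p>\n"
    "<p>A new paragraph starts here.</p>"
)

# Heuristic: a paragraph that ends in a lowercase letter (or a comma),
# followed by a paragraph that starts with a lowercase letter,
# is probably one paragraph that got split by mistake.
BROKEN_BREAK = re.compile(r"([a-z,])</p>\s*<p>([a-z])")


def merge_broken_paragraphs(html: str) -> str:
    """Join paragraphs that were split mid-sentence (ASCII letters only)."""
    # Keep the two captured characters and put a single space between them.
    return BROKEN_BREAK.sub(r"\1 \2", html)


print(merge_broken_paragraphs(broken))
```

The first two paragraphs get joined, and the third is left alone because the one before it ends in a period. It's only a heuristic, so always eyeball the matches before hitting Replace All.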

Last year, I wrote this post describing my 3-step "merge 'broken' paragraphs" method:

I deal with a lot of OCR from PDFs, so fixing up all that mess is very common.

I also listed a ton of other advanced tricks + common errors to look out for.

Quote:
Originally Posted by Sarmat89
You should use \p{Ll} instead of [a-z], though.
Sometimes it's better to use more "human-readable" examples instead of "correct, but extremely cryptic" regex... especially when teaching complete noobs.
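
That said, the difference is real once you leave plain English. A small sketch of what `[a-z]` misses, under a few assumptions: Python's built-in `re` module does not support `\p{Ll}`, so this stand-in checks the Unicode "Ll" (lowercase letter) category through `unicodedata` instead. The names `ASCII_ONLY`, `BOUNDARY`, `is_lowercase_letter`, and `merge_unicode` are made up for the example.

```python
import re
import unicodedata

# ASCII-only version: readable, but blind to accented letters.
ASCII_ONLY = re.compile(r"([a-z])</p>\s*<p>([a-z])")

# Stand-in for \p{Ll}: grab any non-space character on each side of the
# paragraph break, then check its Unicode category in Python.
BOUNDARY = re.compile(r"(\S)</p>\s*<p>(\S)")


def is_lowercase_letter(ch: str) -> bool:
    """True if the character's Unicode general category is Ll."""
    return unicodedata.category(ch) == "Ll"


def merge_unicode(html: str) -> str:
    """Join split paragraphs when both sides of the break are lowercase letters."""
    def repl(match: re.Match) -> str:
        before, after = match.group(1), match.group(2)
        if is_lowercase_letter(before) and is_lowercase_letter(after):
            return f"{before} {after}"
        return match.group(0)  # leave every other break untouched
    return BOUNDARY.sub(repl, html)


# \u00e9 is a precomposed "e with acute accent".
sample = "<p>un caf\u00e9</p>\n<p>\u00e9l\u00e9gant.</p>"

print(ASCII_ONLY.sub(r"\1 \2", sample))  # unchanged: [a-z] does not match the accented e
print(merge_unicode(sample))             # the two paragraphs are joined
```

So for English-only books, `[a-z]` is easier to read and good enough. For anything with accented characters, the Unicode version is the safer choice.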

I remember when I first learned about regex, the tutorials consistently used complicated "email verification" examples... where you have no idea what sort of voodoo made it work.

It wasn't until years later, when actually working on the OCR stuff, that I figured out the true power of regex by building up from the very basic building blocks.

- - -

PS. If you want even more regex tips, I usually color-code and give step-by-step breakdowns of my examples. See:

Last edited by Tex2002ans; 08-11-2022 at 09:04 PM.