Quote:
Originally Posted by DNSB
In the third step, to convert the multiple paragraphs to a single paragraph, I used a simple regex:
|
Good steps. I agree.
Last year, I wrote this post describing my 3-step "merge 'broken' paragraphs" method:
I deal with a lot of OCR from PDFs, so cleaning up that kind of mess is routine for me.
I also listed a ton of other advanced tricks + common errors to look out for.
Quote:
Originally Posted by Sarmat89
You should use \p{Ll} instead of [a-z], though.
|
Sometimes it's better to use more "human-readable" examples instead of "correct, but extremely cryptic" regex... especially when teaching complete noobs.
I remember when I first learned about regex, the tutorials consistently used complicated "email verification" examples... where you have no idea what sort of voodoo made them work.
It wasn't until years later, when actually working on the OCR stuff, that I figured out the true power of regex by building up from the very basic building blocks.
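For anyone curious what the [a-z] vs \p{Ll} difference looks like in practice, here is a rough sketch in Python. To be clear, this is NOT DNSB's regex or my 3-step method, just a made-up "join a line break if the next line starts with a lowercase letter" rule, and the sample text and function names are invented for the demo. One catch: Python's built-in re module does not understand \p{Ll} (the third-party regex module does), so the second function fakes it by checking the Unicode category by hand.

```python
import re
import unicodedata

# Made-up sample: three OCR-style broken lines plus a genuine new paragraph.
# "\u00e9" is the accented letter "é", written as an escape to keep it unambiguous.
SAMPLE = "The caf\u00e9 was\nopen, but the\n\u00e9tage was closed.\nNext paragraph."


def join_ascii(text):
    # Human-readable version: [a-z] only knows the 26 unaccented lowercase letters,
    # so a line starting with "é" is left unjoined.
    return re.sub(r"\n([a-z])", r" \1", text)


def join_unicode(text):
    # Stand-in for \n(\p{Ll}): match any word character after the line break,
    # then join only if its Unicode category is "Ll" (Letter, lowercase).
    def fix(match):
        ch = match.group(1)
        sep = " " if unicodedata.category(ch) == "Ll" else "\n"
        return sep + ch

    return re.sub(r"\n(\w)", fix, text)


print(repr(join_ascii(SAMPLE)))
print(repr(join_unicode(SAMPLE)))
```

The [a-z] version leaves the break before "étage" alone, while the Unicode-aware one joins it. Both keep the break before "Next paragraph." because that line starts with a capital.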
- - -
PS. If you want even more regex tips, I usually color-code and give step-by-step breakdowns of my examples. See: