Regex question -- All two consecutive words

BeckyEbook · 02-09-2018, 01:27 PM

I have a text:

Code:

<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>

My code:

Code:

\b\w+\b[,;.\s]*\b\w+\b

gives the results:

Code:

Lorem ipsum
dolor sit
amet, consectetur
adipiscing elit

I need:

Code:

Lorem ipsum
ipsum dolor
dolor sit
sit amet
amet, consectetur
consectetur adipiscing
adipiscing elit

Is it possible at all?

KevinH · 02-09-2018, 03:29 PM

See this short discussion of re's lookahead feature:

https://stackoverflow.com/questions/...-with-a-regexp

BeckyEbook · 02-09-2018, 03:53 PM

Thank you. It works! Word of the day: overlapping.

Adoby · 02-09-2018, 04:32 PM

Is this the answer?

(?=(\b\w+\b[,;.\s]*\b\w+\b))

Or possibly...

(?=(\b\w+\b)[,;.\s]*(\b\w+\b))

To get each word in a separate group.

(?= ... ) is the magic dust... Positive lookahead. New for me...
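
For anyone following along in Python, here is a minimal sketch of the lookahead version, using the sentence from the first post with the tags stripped. A lookahead consumes no characters, so the engine retries at the next position after every hit, and that is what lets the pairs overlap. The names PAIR, text and pairs are only for illustration.

```python
import re

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."

# The whole pattern sits inside a lookahead, so each match is zero-width
# and the pair itself is read from capture group 1.
PAIR = re.compile(r"(?=(\b\w+\b[,;.\s]*\b\w+\b))")

pairs = PAIR.findall(text)
for pair in pairs:
    print(pair)
```

Each word except the last one starts exactly one pair, so the seven-pair list requested in the first post comes out.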

Tex2002ans · 02-09-2018, 07:17 PM

Just wondering what the use-case is?

Are you trying to pull out all n-grams?

BeckyEbook · 02-10-2018, 07:12 AM

Quote:

Originally Posted by Tex2002ans

Just wondering what the use-case is?

Are you trying to pull out all n-grams?

It's just an idea that's on my mind.

Code:

I bought a new smart phone.
I have a very smart phone.

I check every pair of neighboring words, and if the two words joined together exist in the dictionary, I display the result.

In this case:

Code:

smart + phone = smartphone

(The first sentence should read "smartphone"; in the second, the words are correctly written separately.)
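
A rough sketch of that check, assuming a plain Python set stands in for the real spelling dictionary. DICTIONARY and joined_candidates are made-up names, not part of any library.

```python
import re

# Hypothetical stand-in for a real spelling dictionary.
DICTIONARY = {"smartphone", "notebook"}

# Overlapping pairs of words separated by whitespace only;
# each word lands in its own group.
PAIR = re.compile(r"(?=\b(\w+)\s+(\w+)\b)")


def joined_candidates(text, dictionary=DICTIONARY):
    """Return (first, second, joined) for every adjacent pair of words
    whose concatenation is found in the dictionary."""
    hits = []
    for first, second in PAIR.findall(text):
        joined = (first + second).lower()
        if joined in dictionary:
            hits.append((first, second, joined))
    return hits


print(joined_candidates("I bought a new smart phone."))
print(joined_candidates("I have a very smart phone."))
```

Both sentences produce the same hit, which is why the result still needs a person to look at the context.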

Of course, EVERYTHING depends on the context, and I have not managed to capture that context correctly.

My dream is to get a result close to:

Code:

(.{0,10})(?=(\b\w+\b[,;.\s]*\b\w+\b))(.{0,10})

The expected result:

Code:

ght a new smart phone.
ve a very smart phone.

(0-10 characters of "context" around words).

I can then jump to the first sentence and manually join the words.

Doitsu · 02-10-2018, 07:36 AM

Quote:

Originally Posted by BeckyEbook

It's just an idea that's on my mind.

You might want to check out NLTK, in particular the collocations module.

BeckyEbook · 02-10-2018, 09:33 AM

Thank you. A lot to read, but maybe something can be used.

I want to use it for the Polish language. The grammar of the Polish language is characterized by a high degree of inflection, and it has a relatively free word order, so I would like to get the results for a manual check.

I think it's a good idea for a validating plugin.

Initial tests are unfortunately not very encouraging - there are a lot of words in Polish which mean something different when they are written together, and different when they are written separately.

Tex2002ans · 02-10-2018, 08:20 PM

Quote:

Originally Posted by Doitsu

You might want to check out NLTK, in particular the collocations module.

Quote:

Originally Posted by BeckyEbook

I want to use it for the Polish language. The grammar of the Polish language is characterized by a high degree of inflection, and it has a relatively free word order, so I would like to get the results for a manual check.

I think it's a good idea for a validating plugin.

Initial tests are unfortunately not very encouraging - there are a lot of words in Polish which mean something different when they are written together, and different when they are written separately.

Hmmmm.... well you may want to read a lot more about n-grams and Polish. Perhaps you can find something useful:

N-Gram Collection from a Large-Scale Corpus of Polish Internet

Extended N-gram Model for Analysis of Polish Texts

LanguageTool also works by working on a list of rules:

https://languagetool.org/

I see Polish is on the list of supported languages, and they have quite a few rules. Sadly, no n-gram support for Polish (yet):

http://wiki.languagetool.org/finding...ng-n-gram-data

Quote:

Originally Posted by BeckyEbook

Of course, EVERYTHING depends on the context, and I have not managed to capture that context correctly.

My dream is to get a result close to:

Code:

(.{0,10})(?=(\b\w+\b[,;.\s]*\b\w+\b))(.{0,10})

The expected result:

Code:

ght a new smart phone.
ve a very smart phone.

(0-10 characters of "context" around words).

Probably better to grab a handful of words to the left and right. X amount of characters would run into an issue if you had a very large word: "supercalifragilisticexpialidocious smart phone".
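
One way to do that, as a sketch only: tokenize with \w+, then slice the original string from the start of the n-th word on the left to the end of the n-th word on the right. pairs_with_word_context and its n parameter are hypothetical names. This simple tokenizer also pairs words across any punctuation, which the pattern earlier in the thread does not.

```python
import re


def pairs_with_word_context(text, n=2):
    """Return (pair, context) tuples, where context spans up to n words
    on each side of the pair, cut out of the original string."""
    words = list(re.finditer(r"\w+", text))
    results = []
    for i in range(len(words) - 1):
        first, second = words[i], words[i + 1]
        # Clamp the window so it never runs off either end of the text.
        left = words[max(0, i - n)]
        right = words[min(len(words) - 1, i + 1 + n)]
        pair = text[first.start():second.end()]
        context = text[left.start():right.end()]
        results.append((pair, context))
    return results


sample = "I have a supercalifragilisticexpialidocious smart phone."
for pair, context in pairs_with_word_context(sample, n=1):
    print(pair, "--->", context)
```

With a word-based window, the long word is kept whole instead of eating the entire character budget.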

Side Note: Also, most of what I recall seeing works on plain text only... I can't say I ever ran across one that also takes into account typical HTML formatting (italics, superscripts, etc., etc.).

Quote:

Originally Posted by BeckyEbook

I can then jump to the first sentence and manually join the words.

Manual checking is definitely a good idea.

I ran into a similar issue with hyphenated/non-hyphenated words.

When you OCR a book, it has no idea if the hyphen at the end of a line is an actual hard hyphen or a soft hyphen, so the OCR has to guess.

So in the same book, you may have:

non-hyphenated + nonhyphenated

If both forms exist, my "hyphenation program" lets me know. Then I can take a closer look.
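
Not the actual program, just a minimal sketch of the idea under simple assumptions: plain-text input, ASCII hyphens only, and case ignored. mixed_hyphenation is a hypothetical name.

```python
import re


def mixed_hyphenation(text):
    """Return sorted (hyphenated, joined) tuples for every word that
    appears in the text both with and without its hyphens."""
    # Treat runs of word characters joined by single hyphens as one token.
    forms = {w.lower() for w in re.findall(r"\w+(?:-\w+)*", text)}
    return sorted(
        (w, w.replace("-", ""))
        for w in forms
        if "-" in w and w.replace("-", "") in forms
    )


sample = "A non-hyphenated word and a nonhyphenated word. To-day is fine."
print(mixed_hyphenation(sample))
```

It only reports the conflicts; deciding which form is right is left to the reader.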

I wouldn't trust a fully automatic replacement because there can be a lot of nuances:

It could be a proper noun (an actual person named John-son, or a last name Johnson).
It could be an exact quotation.
Could be a perfectly valid old-timey way of spelling it (to-morrow, to-day, co-operation).
Could be a speck of dust from the scan.
Could be an actual typo.
[...]

Same sort of issue with these two- or three-word errors you are looking for. You could have a "non hyphenated" word that could be caught using that method.

BeckyEbook · 02-11-2018, 04:55 AM

Thank you very much for the new sources, but the complexity of this subject (automatic processing) is well beyond my skills.

I want to stick to correctly listing potential errors (by spell-checking the joined words).

Quote:

Originally Posted by Tex2002ans

Probably better to grab a handful of words to the left and right. X amount of characters would run into an issue if you had a very large word: "supercalifragilisticexpialidocious smart phone".

This should not be a problem. If I come across such a long word, I can jump there and check it manually. Although there are longer words in Polish than in English, 10-12 characters from the front and back should be enough to quickly check the context of a potential error.

Quote:

Originally Posted by Tex2002ans

Side Note: Also, most of what I recall seeing works on plain text only... I can't say I ever ran across one that also takes into account typical HTML formatting (italics, superscripts, etc., etc.).

I've already noticed that and strip the tags.

Quote:

Originally Posted by Tex2002ans

Manual checking is definitely a good idea.

Quote:

Originally Posted by Tex2002ans

So in the same book, you may have:

non-hyphenated + nonhyphenated

If both forms exist, my "hyphenation program" lets me know. Then I can take a closer look.

I wouldn't trust a fully automatic replacement because there can be a lot of nuances:

It could be a proper noun (an actual person named John-son, or a last name Johnson).
It could be an exact quotation.
Could be a perfectly valid old-timey way of spelling it (to-morrow, to-day, co-operation).
Could be a speck of dust from the scan.
Could be an actual typo.
[...]

Exactly the point! In Polish, this list is unfortunately longer, among other things, because we have a relatively free word order.

It sounds very interesting, but with errors like these ("smart phone" is just a trivial example) I have to check manually. The problem is getting the results.

My current version:

Code:

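            # Needs: import re; from bs4 import BeautifulSoup as BS
            # (bk is the object Sigil passes to the plugin)
            soup = BS(xhtml, 'lxml')
            for e in soup.find_all('br'):   # Replace <br> with a space (get_text() would otherwise drop it)
                e.replace_with(' ')
            for data in soup.find_all('p'):
                stuff = data.get_text()
                # Group 1: up to 10 characters of left context; group 2: the word pair
                piece = re.findall(r'(.{0,10})(?=(\b\w+\b\s+\b\w+\b))', stuff)
                for words in piece:
                    word1 = words[0]
                    word2 = re.sub(r'\s', '', words[1])   # the pair joined into one word
                    kontekst = word1 + words[1]           # was word[1], an undefined name
                    if words[1] != word2:
                        res = bk.hspell.check(word2)
                        if res == 1:
                            print(words[1], " ---> ", kontekst)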

It is not a perfect solution, but treat it as a basis for modification.
What I need is to improve the code so that it captures context on both sides, and unfortunately I cannot get that to work.
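
Two things seem to get in the way of a pattern shaped like (.{0,10})(?=(...))(.{0,10}). The lookahead consumes nothing, so a trailing (.{0,10}) starts at the beginning of the pair rather than after it. The leading (.{0,10}) does consume text, so the scan can jump over pairs that begin inside those characters. A possible workaround, sketched here with hypothetical names (pairs_with_context, width), is to match only the lookahead and cut the context out of the string by position:

```python
import re

# Zero-width match; the pair itself is in group 1.
PAIR = re.compile(r"(?=(\b\w+\b\s+\b\w+\b))")


def pairs_with_context(text, width=10):
    """Return (pair, context) tuples with up to `width` characters of
    context on each side of every overlapping word pair."""
    results = []
    for m in PAIR.finditer(text):
        start, end = m.start(1), m.end(1)   # span of the pair inside the lookahead
        left = text[max(0, start - width):start]
        right = text[end:end + width]
        results.append((m.group(1), left + m.group(1) + right))
    return results


for sentence in ("I bought a new smart phone.", "I have a very smart phone."):
    pair, context = pairs_with_context(sentence)[-1]
    print(pair, " ---> ", context)
```

Unlike . in a regex, slicing also picks up newlines, so the context may need its whitespace normalized before printing.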

Thank you very much for your interest in my problem.
