|  02-09-2018, 01:27 PM | #1 | 
| Guru            Posts: 899 Karma: 3501166 Join Date: Jan 2017 Location: Poland Device: Various | 
				
				Regex question -- All two consecutive words
			 
			
			I have a text:
			Code: <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
			My regex:
			Code: \b\w+\b[,;.\s]*\b\w+\b
			It matches only non-overlapping pairs:
			Code: Lorem ipsum dolor sit amet, consectetur adipiscing elit
			I would like to get every pair of consecutive words:
			Code: Lorem ipsum ipsum dolor dolor sit sit amet amet, consectetur consectetur adipiscing adipiscing elit | 
|   |   | 
|  02-09-2018, 03:29 PM | #2 | 
| Sigil Developer            Posts: 9,070 Karma: 6361556 Join Date: Nov 2009 Device: many | 
			
			See this short discussion of re's lookahead feature https://stackoverflow.com/questions/...-with-a-regexp | 
|   |   | 
|  02-09-2018, 03:53 PM | #3 | 
| Guru            Posts: 899 Karma: 3501166 Join Date: Jan 2017 Location: Poland Device: Various | 
			
			Thank you. It works! Word of the day: overlapping.
		 | 
|   |   | 
|  02-09-2018, 04:32 PM | #4 | 
| Handy Elephant            Posts: 1,737 Karma: 26785684 Join Date: Dec 2009 Location: Southern Sweden, far out in the quiet woods Device: Samsung Galaxy Tab S8 Ultra | 
			
			Is this the answer?
			(?=(\b\w+\b[,;.\s]*\b\w+\b))
			Or possibly, to get each word in a separate group:
			(?=(\b\w+\b)[,;.\s]*(\b\w+\b))
			(?= ... ) is the magic dust: positive lookahead. New for me... | 
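For readers following along, here is a minimal Python sketch of the difference between the original pattern and the two lookahead variants suggested above. It assumes Python's standard `re` module and uses the sample sentence from the first post without the `<p>` tags.

```python
import re

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."

# Plain pattern: each match consumes both words, so pairs cannot overlap.
consuming = re.findall(r"\b\w+\b[,;.\s]*\b\w+\b", text)

# Lookahead pattern: the match itself is empty, so the scan advances one
# character at a time and every word can start a new pair.
overlapping = re.findall(r"(?=(\b\w+\b[,;.\s]*\b\w+\b))", text)

# Two capture groups: findall returns (first_word, second_word) tuples.
split_pairs = re.findall(r"(?=(\b\w+\b)[,;.\s]*(\b\w+\b))", text)

print(consuming)
print(overlapping)
print(split_pairs)
```

The captured text lives in a group inside the lookahead, which is why `findall` still returns something even though each match is zero-width.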
|   |   | 
|  02-09-2018, 07:17 PM | #5 | 
| Wizard            Posts: 2,306 Karma: 13057279 Join Date: Jul 2012 Device: Kobo Forma, Nook | 
			
			Just wondering what the use-case is? Are you trying to pull out all n-grams? | 
|   |   | 
|  02-10-2018, 07:12 AM | #6 | |
| Guru            Posts: 899 Karma: 3501166 Join Date: Jan 2017 Location: Poland Device: Various | Quote: 
 Code: I bought a new smart phone.
       I have a very smart phone.
 In this case:
 Code: smart + phone = smartphone
 Of course, EVERYTHING depends on the context, and I cannot manage to "catch" this context correctly. My dream is to get a result close to:
 Code: (.{0,10})(?=(\b\w+\b[,;.\s]*\b\w+\b))(.{0,10})
 Code: ght a new smart phone.
       ve a very smart phone.
 I can then jump to the first sentence and manually join the words. | |
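The "smart + phone = smartphone" idea can be sketched roughly as follows. This is only an illustration: `KNOWN_WORDS` and `joinable_pairs` are hypothetical names, and the tiny word set stands in for a real spelling dictionary.

```python
import re

# Hypothetical stand-in for a real spelling dictionary.
KNOWN_WORDS = {"smartphone"}

def joinable_pairs(text, known_words=KNOWN_WORDS):
    """Return (first, second, joined) for adjacent words whose joined form is known."""
    hits = []
    # Lookahead keeps the pairs overlapping; \s+ only joins words split by whitespace.
    for first, second in re.findall(r"(?=(\b\w+\b)\s+(\b\w+\b))", text):
        joined = first + second
        if joined.lower() in known_words:
            hits.append((first, second, joined))
    return hits

sample = "I bought a new smart phone. I have a very smart phone."
print(joinable_pairs(sample))
```

Both sentences are reported, since the check cannot tell which "smart phone" is really a "smartphone"; that decision stays manual, as described above.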
|   |   | 
|  02-10-2018, 07:36 AM | #7 | 
| Grand Sorcerer            Posts: 5,762 Karma: 24088559 Join Date: Dec 2010 Device: Kindle PW2 | 
			
			You might want to check out NLTK, in particular the collocations module.
		 | 
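As a rough idea of what collocation tools work with, the sketch below counts adjacent word pairs (bigrams) using only the Python standard library. It does not use NLTK, and it is much cruder than NLTK's collocations module, which ranks pairs with statistical association measures rather than raw counts. The function name `bigram_counts` is made up for this example.

```python
import re
from collections import Counter

def bigram_counts(text):
    """Count adjacent word pairs (bigrams), ignoring case and punctuation."""
    words = re.findall(r"\w+", text.lower())
    # zip pairs each word with its successor; note that pairs may span
    # sentence boundaries because punctuation is discarded.
    return Counter(zip(words, words[1:]))

sample = "I bought a new smart phone. I have a very smart phone."
counts = bigram_counts(sample)
print(counts.most_common(3))
```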
|   |   | 
|  02-10-2018, 09:33 AM | #8 | 
| Guru            Posts: 899 Karma: 3501166 Join Date: Jan 2017 Location: Poland Device: Various | 
			
			Thank you. A lot to read, but maybe some of it can be used. I want to use it for Polish. Polish grammar is highly inflected and has a relatively free word order, so I would like to get the results for a manual check. I think it's a good idea for a validating plugin. Initial tests are unfortunately not very encouraging: many Polish words mean one thing when written together and something else when written separately. | 
|   |   | 
|  02-10-2018, 08:20 PM | #9 | ||||
| Wizard            Posts: 2,306 Karma: 13057279 Join Date: Jul 2012 Device: Kobo Forma, Nook | Quote: 
  Quote: 
 N-Gram Collection from a Large-Scale Corpus of Polish Internet
 Extended N-gram Model for Analysis of Polish Texts
 LanguageTool also works from a list of rules: https://languagetool.org/
 I see Polish is on the list of supported languages, and they have quite a few rules. Sadly, no n-gram support for Polish (yet): http://wiki.languagetool.org/finding...ng-n-gram-data
 Quote: 
 Side Note: Most of what I recall seeing works on plain text only... I can't say I ever ran across one that also takes typical HTML formatting into account (italics, superscripts, etc.).
 Quote: 
 I ran into a similar issue with hyphenated/non-hyphenated words. When you OCR a book, the OCR has no idea whether a hyphen at the end of a line is an actual hard hyphen or a soft hyphen, so it has to guess. So in the same book, you may have: non-hyphenated + nonhyphenated
 If both forms exist, my "hyphenation program" lets me know. Then I can take a closer look. I wouldn't trust a fully automatic replacement because there can be a lot of nuances.
 Same sort of issue with the two- or three-word errors you are looking for. You could have a "non hyphenated" word that could be caught using that method.   | ||||
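The "hyphenation program" itself is not shown in the thread, so the following is only a guess at how such a check could work: collect every word, then report the hyphenated ones whose hyphen-free spelling also occurs in the same text. The function name `hyphenation_conflicts` is hypothetical.

```python
import re

def hyphenation_conflicts(text):
    """Return sorted (hyphenated, joined) pairs where both spellings occur in text."""
    # Words with optional internal hyphens, e.g. "non-hyphenated".
    words = {w.lower() for w in re.findall(r"\w+(?:-\w+)*", text)}
    conflicts = []
    for word in words:
        joined = word.replace("-", "")
        if "-" in word and joined in words:
            conflicts.append((word, joined))
    return sorted(conflicts)

sample = "A non-hyphenated word here, a nonhyphenated word there, and a well-known one."
print(hyphenation_conflicts(sample))
```

It only reports candidates; as noted above, deciding which form is right is left to a human.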
|   |   | 
|  02-11-2018, 04:55 AM | #10 | |||
| Guru            Posts: 899 Karma: 3501166 Join Date: Jan 2017 Location: Poland Device: Various | 
			
			Thank you very much for the new sources, but the complexity of this subject (automatic processing) is well beyond my skills. I want to stick to correctly listing potential errors (by checking the spelling of the joined words). Quote: 
 Quote: 
  Quote: 
 It sounds very interesting, but with errors of this kind (the smart phone example is just a trivial one) I have to check manually. The problem is getting the results. My current version:
 Code:
            import re
            from bs4 import BeautifulSoup as BS

            soup = BS(xhtml, 'lxml')
            for e in soup.find_all('br'):   # replace <br> with a space (get_text() would otherwise drop it)
                e.replace_with(' ')
            for data in soup.find_all('p'):
                stuff = data.get_text()
                # group 1: up to 10 characters of left context; group 2: two consecutive words
                piece = re.findall(r'(.{0,10})(?=(\b\w+\b\s+\b\w+\b))', stuff)
                for words in piece:
                    word1 = words[0]
                    word2 = re.sub(r'\s', '', words[1])   # the two words joined together
                    kontekst = word1 + words[1]           # was word[1], an undefined name
                    if words[1] != word2:
                        res = bk.hspell.check(word2)      # Sigil plugin spell checker
                        if res == 1:                      # 1 is treated here as "spelled correctly"
                            print(words[1], " ---> ", kontekst)
 What I still need is to improve this code so that it shows context on both sides, and unfortunately I have not managed to get that working. Thank you very much for your interest in my problem. Last edited by BeckyEbook; 02-11-2018 at 05:24 AM. | |||
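One possible way to get context on both sides, offered as a sketch and not a tested plugin change: leave the context out of the regex entirely and slice it from the string using the position of the captured pair. In the pattern above, the leading `(.{0,10})` consumes characters, so matches no longer overlap and some pairs are skipped; Python's `re` also does not allow a variable-width lookbehind such as `(?<=.{0,10})`. The names `PAIR` and `pairs_with_context` are made up for this example.

```python
import re

# Zero-width match; the pair itself is captured in group 1.
PAIR = re.compile(r"(?=(\b\w+\b\s+\b\w+\b))")

def pairs_with_context(text, width=10):
    """Yield (pair, context), with up to `width` characters on each side of the pair."""
    for m in PAIR.finditer(text):
        start, end = m.span(1)  # position of the captured pair in text
        yield m.group(1), text[max(0, start - width):end + width]

for pair, context in pairs_with_context("I bought a new smart phone."):
    print(pair, " ---> ", context)
```

For the two sample sentences this gives "ght a new smart phone." and "ve a very smart phone." as the context of "smart phone", matching the result asked for earlier in the thread.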
|   |   | 