02-09-2018, 01:27 PM | #1 |
Guru
Posts: 692
Karma: 2180740
Join Date: Jan 2017
Location: Poland
Device: Misc
|
Regex question -- All two consecutive words
I have a text:
Code:
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
My regex:
Code:
\b\w+\b[,;.\s]*\b\w+\b
It finds:
Code:
Lorem ipsum
dolor sit
amet, consectetur
adipiscing elit
I want:
Code:
Lorem ipsum
ipsum dolor
dolor sit
sit amet
amet, consectetur
consectetur adipiscing
adipiscing elit |
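A minimal sketch of why the plain pattern skips every other pair, assuming Python's `re` module is the engine in use. The variable names are illustrative only.

```python
import re

text = "<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>"
pair = r"\b\w+\b[,;.\s]*\b\w+\b"

# Each match consumes both of its words, so the next search resumes
# after the second word and that word can never start a new pair.
consumed = re.findall(pair, text)
print(consumed)
# ['Lorem ipsum', 'dolor sit', 'amet, consectetur', 'adipiscing elit']
```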
02-09-2018, 03:29 PM | #2 |
Sigil Developer
Posts: 7,636
Karma: 5433388
Join Date: Nov 2009
Device: many
|
See this short discussion of re's lookahead feature
https://stackoverflow.com/questions/...-with-a-regexp |
02-09-2018, 03:53 PM | #3 |
Guru
Posts: 692
Karma: 2180740
Join Date: Jan 2017
Location: Poland
Device: Misc
|
Thank you. It works! Word of the day: overlapping.
|
02-09-2018, 04:32 PM | #4 |
Handy Elephant
Posts: 1,736
Karma: 26785668
Join Date: Dec 2009
Location: Southern Sweden, far out in the quiet woods
Device: Thinkpad E595, Ubuntu Mate, Huawei Mediapad 5, Bouye Likebook Plus
|
Is this the answer?
(?=(\b\w+\b[,;.\s]*\b\w+\b))
Or possibly...
(?=(\b\w+\b)[,;.\s]*(\b\w+\b))
To get each word in a separate group.
(?= ... ) is the magic dust... Positive lookahead. New for me... |
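Both lookahead variants can be tried quickly in Python. This is only a sketch, assuming the `re` module and the sample paragraph from the first post; `whole` and `split` are names made up for the example.

```python
import re

text = "<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>"

# A lookahead is zero-width: the pair is captured but not consumed,
# so the scan moves on one position at a time and pairs may overlap.
whole = re.findall(r"(?=(\b\w+\b[,;.\s]*\b\w+\b))", text)

# With two capture groups, findall returns one (first, second) tuple per pair.
split = re.findall(r"(?=(\b\w+\b)[,;.\s]*(\b\w+\b))", text)

for pair in whole:
    print(pair)
```

The leading `\b` inside the lookahead is what keeps a match from starting in the middle of a word.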
02-09-2018, 07:17 PM | #5 |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Just wondering what the use-case is?
Are you trying to pull out all n-grams? |
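For reference, pulling out all n-grams can be sketched with the standard library alone. The helper name `ngrams` and the sample sentence are assumptions for illustration, not anything posted by the thread starter.

```python
import re
from collections import Counter

def ngrams(text, n=2):
    """Return every run of n consecutive words as a tuple."""
    words = re.findall(r"\w+", text)
    # zip over n shifted copies of the word list
    return list(zip(*(words[i:] for i in range(n))))

sample = "I bought a new smart phone. I have a very smart phone."
bigram_counts = Counter(ngrams(sample.lower(), 2))
print(bigram_counts.most_common(1))
```

Note that this simple version ignores punctuation, so it also pairs the last word of one sentence with the first word of the next.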
02-10-2018, 07:12 AM | #6 | |
Guru
Posts: 692
Karma: 2180740
Join Date: Jan 2017
Location: Poland
Device: Misc
|
Quote:
Code:
I bought a new smart phone.
I have a very smart phone.
In this case:
Code:
smart + phone = smartphone
Of course, EVERYTHING depends on the context, and I have not managed to capture that context correctly. My dream is to get a result close to:
Code:
(.{0,10})(?=(\b\w+\b[,;.\s]*\b\w+\b))(.{0,10})
Code:
ght a new smart phone.
ve a very smart phone.
I can then jump to the first sentence and manually join the words. |
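A small experiment, assuming Python's `re`, suggests why this pattern does not give the hoped-for result. The leading `(.{0,10})` consumes text, so pairs inside the consumed stretch are skipped, and the trailing `(.{0,10})` starts at the same position as the zero-width lookahead, so it repeats the pair instead of capturing what follows it.

```python
import re

text = "I bought a new smart phone. I have a very smart phone."
pattern = r"(.{0,10})(?=(\b\w+\b[,;.\s]*\b\w+\b))(.{0,10})"

matches = re.findall(pattern, text)
pairs = [pair for _before, pair, _after in matches]

for before, pair, after in matches:
    # 'after' begins where 'pair' begins, not where it ends
    print(repr(before), repr(pair), repr(after))
```

On this sample the first "smart phone" is never reported, because it lies inside text that an earlier match already consumed.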
|
02-10-2018, 07:36 AM | #7 |
Grand Sorcerer
Posts: 5,583
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
|
You might want to check out NLTK, in particular the collocations module.
|
02-10-2018, 09:33 AM | #8 |
Guru
Posts: 692
Karma: 2180740
Join Date: Jan 2017
Location: Poland
Device: Misc
|
Thank you. A lot to read, but maybe something can be used.
I want to use it for the Polish language. The grammar of the Polish language is characterized by a high degree of inflection, and it has a relatively free word order, so I would like to get the results for a manual check. I think it's a good idea for a validating plugin. Initial tests are unfortunately not very encouraging - there are a lot of words in Polish which mean something different when they are written together, and different when they are written separately. |
02-10-2018, 08:20 PM | #9 | ||||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Quote:
N-Gram Collection from a Large-Scale Corpus of Polish Internet
Extended N-gram Model for Analysis of Polish Texts
LanguageTool also works from a list of rules: https://languagetool.org/
I see Polish is on the list of supported languages, and they have quite a few rules. Sadly, no n-gram support for Polish (yet): http://wiki.languagetool.org/finding...ng-n-gram-data
Quote:
Side Note: Also, most of what I recall seeing works on plain text only... I can't say I ever ran across one that also takes into account typical HTML formatting (italics, superscripts, etc.).
Quote:
I ran into a similar issue with hyphenated/non-hyphenated words. When you OCR a book, the software has no idea whether the hyphen at the end of a line is an actual hard hyphen or a soft hyphen, so it has to guess. So in the same book, you may have:
non-hyphenated + nonhyphenated
If both forms exist, my "hyphenation program" lets me know. Then I can take a closer look. I wouldn't trust a fully automatic replacement because there can be a lot of nuances.
The same sort of issue applies to the two- or three-word errors you are looking for. You could have a "non hyphenated" word that could be caught using that method. |
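The check described above (flag a word when both its hyphenated and joined forms occur in the same book) might look roughly like this. It is only a sketch with a hypothetical helper `hyphen_conflicts`; it is not the actual "hyphenation program" mentioned in the post.

```python
import re

def hyphen_conflicts(text):
    """Return hyphenated words whose joined form also occurs in the text."""
    # Words with optional internal hyphens, lower-cased for comparison.
    tokens = {t.lower() for t in re.findall(r"\w+(?:-\w+)*", text)}
    return sorted(t for t in tokens if "-" in t and t.replace("-", "") in tokens)

sample = "A non-hyphenated word here, a nonhyphenated word there, and a well-known one."
conflicts = hyphen_conflicts(sample)
print(conflicts)  # ['non-hyphenated']
```

The output is a list for manual review, in line with the advice not to trust a fully automatic replacement.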
||||
02-11-2018, 04:55 AM | #10 | |||
Guru
Posts: 692
Karma: 2180740
Join Date: Jan 2017
Location: Poland
Device: Misc
|
Thank you very much for the new sources, but the complexity of this subject (automatic processing) is well beyond my skills.
I want to stick to correctly listing potential errors (by checking the spelling of the joined words).
Quote:
Quote:
Quote:
It sounds very interesting, but for errors like these ("smart phone" is just a trivial example) I have to check manually. The problem is getting the results. My current version:
Code:
import re
from bs4 import BeautifulSoup as BS

# xhtml (the page markup) and bk (the plugin's book container) are assumed
# to be supplied by the surrounding plugin code.
soup = BS(xhtml, 'lxml')
for e in soup.find_all('br'):  # change BR to a space (get_text() drops it by default)
    e.replace_with(' ')
for data in soup.find_all('p'):
    stuff = data.get_text()
    # group 1: up to 10 characters of left context; group 2: the two-word pair
    piece = re.findall(r'(.{0,10})(?=(\b\w+\b\s+\b\w+\b))', stuff)
    for words in piece:
        word1 = words[0]
        word2 = re.sub(r'\s', '', words[1])  # the pair with whitespace removed
        kontekst = word1 + words[1]
        if words[1] != word2:
            res = bk.hspell.check(word2)
            if res == 1:  # the joined form is accepted by the spell checker
                print(words[1], " ---> ", kontekst)
What I still need is to improve the code so that it gives context on both sides, and unfortunately I cannot get that to work. Thank you very much for your interest in my problem.
Last edited by BeckyEbook; 02-11-2018 at 05:24 AM. |
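One possible way to get context on both sides is to keep the regex as a pure lookahead and cut the context out of the string using the match positions. The sketch below assumes plain text has already been extracted; `joinable_pairs` and `is_known_word` are hypothetical names, and the small `known` set merely stands in for a real spell checker such as the `bk.hspell.check` call above.

```python
import re

# Pure lookahead: nothing is consumed, so overlapping pairs are all visited.
PAIR = re.compile(r"(?=(\b\w+\b)\s+(\b\w+\b))")

def joinable_pairs(text, is_known_word, width=10):
    """Yield (pair, joined, context) where the joined form is a known word."""
    for m in PAIR.finditer(text):
        joined = m.group(1) + m.group(2)
        if is_known_word(joined):
            start, end = m.start(1), m.end(2)
            # Slice 'width' characters on each side of the pair.
            context = text[max(0, start - width):end + width]
            yield text[start:end], joined, context

# Hypothetical stand-in dictionary for the example.
known = {"smartphone"}
text = "I bought a new smart phone. I have a very smart phone."
hits = list(joinable_pairs(text, lambda w: w.lower() in known))

for pair, joined, context in hits:
    print(pair, " ---> ", context)
```

Because the context is taken by slicing rather than by extra regex groups, it never interferes with finding the next pair.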
|||