MobileRead Forums - View Single Post - Regex question -- All two consecutive words

Tex2002ans · 02-10-2018, 08:20 PM

Quote:

Originally Posted by Doitsu

You might want to check out NLTK, in particular the collocations module.

Quote:

Originally Posted by BeckyEbook

I want to use it for the Polish language. The grammar of the Polish language is characterized by a high degree of inflection, and it has a relatively free word order, so I would like to get the results for a manual check.

I think it's a good idea for a validating plugin.

Initial tests are unfortunately not very encouraging - there are a lot of words in Polish which mean something different when they are written together, and different when they are written separately.

Hmmmm... well, you may want to read a lot more about n-grams and Polish. Perhaps you can find something useful in these:

N-Gram Collection from a Large-Scale Corpus of Polish Internet

Extended N-gram Model for Analysis of Polish Texts

LanguageTool also works from a list of rules:

https://languagetool.org/

I see Polish is on the list of supported languages, and they have quite a few rules for it. Sadly, no n-gram support for Polish (yet):

http://wiki.languagetool.org/finding...ng-n-gram-data

Quote:

Originally Posted by BeckyEbook

Of course, EVERYTHING depends on the context and this context does not manage to "catch" correctly.

My dream is to get a result close to:

Code:

(.{0,10})(?=(\b\w+\b[,;.\s]*\b\w+\b))(.{0,10})

The expected result:

Code:

ght a new smart phone.
ve a very smart phone.

(0-10 characters of "context" around words).

Probably better to grab a handful of words to the left and right. A fixed number of characters runs into trouble when a very long word sits next to the pair: "supercalifragilisticexpialidocious smart phone".
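
A minimal Python sketch of the word-based idea, in case it helps. The function name pairs_with_context and the context_words parameter are made up for illustration, and the sketch assumes plain text where punctuation can simply be dropped. In Python 3, \w is Unicode-aware for str patterns, so Polish letters are matched as word characters:

```python
import re

WORD = re.compile(r"\w+")  # Unicode-aware in Python 3, so "ż", "ł", etc. count as word characters


def pairs_with_context(text, context_words=2):
    """Yield (left, pair, right) for every two consecutive words in text.

    Context is measured in words, not characters, so one very long
    neighbouring word cannot crowd out the rest of the context.
    Punctuation is discarded (an assumption of this sketch).
    """
    words = WORD.findall(text)
    for i in range(len(words) - 1):
        left = words[max(0, i - context_words):i]
        right = words[i + 2:i + 2 + context_words]
        yield " ".join(left), " ".join(words[i:i + 2]), " ".join(right)


for left, pair, right in pairs_with_context("I bought a new smart phone."):
    print(f"{left!r:>12} | {pair!r} | {right!r}")
```

Each pair comes out as a separate item, so the list can be fed into whatever manual-check view you prefer.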

Side Note: Also, most of the tools I recall seeing work on plain text only... I can't say I ever ran across one that also takes typical HTML formatting into account (italics, superscripts, etc.).

Quote:

Originally Posted by BeckyEbook

I can then jump to the first sentence and manually join the words.

Manual checking is definitely a good idea.

I ran into a similar issue with hyphenated/non-hyphenated words.

When you OCR a book, the software has no idea whether the hyphen at the end of a line is an actual hard hyphen or a soft hyphen, so it has to guess.

So in the same book, you may have:

non-hyphenated + nonhyphenated

If both forms exist, my "hyphenation program" lets me know. Then I can take a closer look.
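
A rough sketch of that kind of check (not the actual program mentioned above, just an illustration under simple assumptions: plain text, case-insensitive matching, and a hypothetical function name hyphenation_conflicts):

```python
import re
from collections import Counter


def hyphenation_conflicts(text):
    """Report words that occur both hyphenated and non-hyphenated.

    Returns a sorted list of tuples:
    (hyphenated_form, its_count, joined_form, its_count).
    """
    # Treat "non-hyphenated" as one token; plain words are tokens too.
    counts = Counter(re.findall(r"\w+(?:-\w+)*", text.lower()))
    report = []
    for token in sorted(counts):
        joined = token.replace("-", "")
        if "-" in token and joined in counts:
            report.append((token, counts[token], joined, counts[joined]))
    return report


sample = "A non-hyphenated word and a nonhyphenated word. To-day only."
for row in hyphenation_conflicts(sample):
    print(row)
```

It only flags the conflict; deciding which form is right is still left to a human.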

I wouldn't trust a fully automatic replacement because there can be a lot of nuances:

It could be a proper noun (an actual person named John-son, or a last name Johnson).
It could be an exact quotation.
Could be a perfectly valid old-timey way of spelling it (to-morrow, to-day, co-operation).
Could be a speck of dust from the scan.
Could be an actual typo.
[...]

Same sort of issue with the two- or three-word errors you are looking for. You could have a "non hyphenated" word (split by a space instead of a hyphen) that could be caught using the same method.
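
The same "both forms exist" check can be sketched for space-separated pairs. Again, this is only an illustration: split_joined_conflicts is a hypothetical name, the text is assumed to be plain text, and pairs are formed across sentence boundaries too, so expect some noise to weed out by hand:

```python
import re
from collections import Counter


def split_joined_conflicts(text):
    """Report two-word sequences whose joined form also occurs in the text.

    Returns a sorted list of tuples:
    (split_form, its_count, joined_form, its_count).
    """
    words = re.findall(r"\w+", text.lower())
    vocabulary = Counter(words)
    pairs = Counter(zip(words, words[1:]))  # every two consecutive words
    report = []
    for (first, second), n in sorted(pairs.items()):
        joined = first + second
        if joined in vocabulary:
            report.append((f"{first} {second}", n, joined, vocabulary[joined]))
    return report


sample = "She bought a smart phone. His smartphone is old."
for row in split_joined_conflicts(sample):
    print(row)
```

For Polish this will only catch pairs where the joined spelling actually appears somewhere in the same book; comparing against a word list instead of the book's own vocabulary would be a separate design choice.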