02-10-2018, 08:20 PM   #9
Tex2002ans
Quote:
Originally Posted by Doitsu View Post
You might want to check out NLTK, in particular the collocations module.
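For anyone who wants to poke at it, here's a bare-bones sketch of the collocations module in action (assumes nltk is installed, with the 'punkt' tokenizer data downloaded, and "book.txt" is just a stand-in for your own text):
Code:
# Bare-bones sketch of NLTK's collocations module.
# Assumes: pip install nltk + nltk.download('punkt'),
# and "book.txt" as a stand-in for your own text.
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

book_text = open("book.txt", encoding="utf-8").read()
words = nltk.word_tokenize(book_text)

finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)  # ignore pairs seen fewer than 3 times

# Rank word pairs by pointwise mutual information (PMI)
bigram_measures = BigramAssocMeasures()
print(finder.nbest(bigram_measures.pmi, 20))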


Quote:
Originally Posted by BeckyEbook View Post
I want to use it for the Polish language. The grammar of the Polish language is characterized by a high degree of inflection, and it has a relatively free word order, so I would like to get the results for a manual check.

I think it's a good idea for a validating plugin.

Initial tests are unfortunately not very encouraging - there are a lot of words in Polish which mean something different when they are written together, and different when they are written separately.
Hmmmm.... well you may want to read a lot more about n-grams and Polish. Perhaps you can find something useful:

N-Gram Collection from a Large-Scale Corpus of Polish Internet

Extended N-gram Model for Analysis of Polish Texts

LanguageTool also works from a list of rules:

https://languagetool.org/

I see Polish is on the list of supported languages, and they have quite a few rules. Sadly, no n-gram support for Polish (yet):

http://wiki.languagetool.org/finding...ng-n-gram-data
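
If you wanted to script it, there are third-party Python wrappers around LanguageTool; a rough sketch using language_tool_python (that wrapper is my assumption here, it is not part of LanguageTool itself):
Code:
# Rough sketch via the third-party language_tool_python wrapper
# (an assumption on my part -- pip install language-tool-python;
# it is not part of LanguageTool itself).
import language_tool_python

tool = language_tool_python.LanguageTool('pl-PL')  # Polish rule set
matches = tool.check('Przykładowy tekst do sprawdzenia.')
for m in matches:
    print(m.ruleId, '->', m.message)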

Quote:
Originally Posted by BeckyEbook View Post
Of course, EVERYTHING depends on the context, and this context does not get "caught" correctly.

My dream is to get a result close to:
Code:
(.{0,10})(?=(\b\w+\b[,;.\s]*\b\w+\b))(.{0,10})
The expected result:
Code:
ght a new smart phone.
ve a very smart phone.
(0-10 characters of "context" around words).
Probably better to grab a handful of words to the left and right. A fixed number of characters would run into an issue if you had a very large word: "supercalifragilisticexpialidocious smart phone".
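
A rough sketch of what I mean (the whitespace tokenizing here is a placeholder; real code would want something smarter, and "book.txt" stands in for your own text):
Code:
import re

# Rough sketch: context measured in WORDS instead of characters,
# so very long words can't blow up the context window.
def show_pair(text, first, second, n=3):
    """Print each occurrence of the pair `first second` with
    up to n words of context on either side."""
    tokens = text.split()
    clean = lambda w: re.sub(r"[^\w-]", "", w).lower()
    for i in range(len(tokens) - 1):
        if clean(tokens[i]) == first and clean(tokens[i + 1]) == second:
            left = " ".join(tokens[max(0, i - n):i])
            right = " ".join(tokens[i + 2:i + 2 + n])
            print(f"...{left} [{tokens[i]} {tokens[i + 1]}] {right}...")

book_text = open("book.txt", encoding="utf-8").read()
show_pair(book_text, "smart", "phone")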

Side Note: Also, most of what I recall seeing works on plain text only... I can't say I've ever run across one that also takes typical HTML formatting into account (italics, superscripts, etc.).

Quote:
Originally Posted by BeckyEbook View Post
I can then jump to the first sentence and manually join the words.
Manual checking is definitely a good idea.

I ran into a similar issue with hyphenated/non-hyphenated words.

When you OCR a book, it has no idea if the hyphen at the end of a line is an actual hard hyphen or a soft hyphen, so the OCR has to guess.

So in the same book, you may have:

non-hyphenated + nonhyphenated

If both forms exist, my "hyphenation program" lets me know. Then I can take a closer look.
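
The core idea boils down to something like this (a bare-bones sketch, not my actual program; "book.txt" is a stand-in for your own text):
Code:
import re
from collections import Counter

# Bare-bones sketch: flag words that appear BOTH hyphenated and
# joined in the same text ("non-hyphenated" + "nonhyphenated").
text = open("book.txt", encoding="utf-8").read().lower()
counts = Counter(re.findall(r"\w+(?:-\w+)+|\w+", text))

for word, n in counts.items():
    if "-" in word:
        joined = word.replace("-", "")
        if joined in counts:
            print(f"{word} ({n}x) <-> {joined} ({counts[joined]}x)")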

I wouldn't trust a fully automatic replacement because there can be a lot of nuances:
  • It could be a proper noun (an actual person named John-son, or a last name Johnson).
  • It could be an exact quotation.
  • Could be a perfectly valid old-timey way of spelling it (to-morrow, to-day, co-operation).
  • Could be a speck of dust from the scan.
  • Could be an actual typo.
  • [...]

Same sort of issue with these two- or three-word errors you are looking for. You could have a "non hyphenated" word pair that would be caught using the same method.