Quote:
Originally Posted by Doitsu
|
Quote:
Originally Posted by BeckyEbook
I want to use it for the Polish language. The grammar of the Polish language is characterized by a high degree of inflection, and it has a relatively free word order, so I would like to get the results for a manual check.
I think it's a good idea for a validating plugin.
Initial tests are unfortunately not very encouraging - there are a lot of words in Polish which mean something different when they are written together, and different when they are written separately.
|
Hmmmm... well, you may want to read a lot more about n-grams and Polish. Perhaps you can find something useful here:
N-Gram Collection from a Large-Scale Corpus of Polish Internet
Extended N-gram Model for Analysis of Polish Texts
LanguageTool also works from a list of rules:
https://languagetool.org/
I see Polish is on the list of supported languages, and they have quite a few rules for it. Sadly, no n-gram support for Polish (yet):
http://wiki.languagetool.org/finding...ng-n-gram-data
Quote:
Originally Posted by BeckyEbook
Of course, EVERYTHING depends on the context, and the search does not manage to "catch" this context correctly.
My dream is to get a result close to:
Code:
(.{0,10})(?=(\b\w+\b[,;.\s]*\b\w+\b))(.{0,10})
The expected result:
Code:
ght a new smart phone.
ve a very smart phone.
(0-10 characters of "context" around words).
|
Probably better to grab a handful of words to the left and right. A fixed number of characters would run into an issue if you had a very long word: "supercalifragilisticexpialidocious smart phone".
Side Note: Also, most of the tools I recall seeing work on plain text only... I can't say I ever ran across one that also takes into account typical HTML formatting (italics, superscripts, etc.).
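To illustrate the "handful of words" idea, here is a rough Python sketch. The function name word_context and the n_words parameter are made up for this example, and it assumes plain text that can be split on whitespace (so no HTML tags, as per the side note):

```python
def word_context(text, phrase, n_words=3):
    """Return each occurrence of `phrase` with up to n_words words on each side."""
    tokens = text.split()
    target = [w.lower() for w in phrase.split()]
    size = len(target)
    results = []
    for i in range(len(tokens) - size + 1):
        # Compare with surrounding punctuation stripped, ignoring case.
        window = [t.strip(".,;:!?\"'()").lower() for t in tokens[i:i + size]]
        if window == target:
            left = tokens[max(0, i - n_words):i]
            right = tokens[i + size:i + size + n_words]
            results.append(" ".join(left + tokens[i:i + size] + right))
    return results


sample = "I bought a new smart phone. You have a very smart phone."
for line in word_context(sample, "smart phone", n_words=2):
    print(line)
```

Because the window is counted in words, a very long neighbour such as "supercalifragilisticexpialidocious" is returned whole instead of being cut off after 10 characters. Note that this simple version happily crosses sentence boundaries.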
Quote:
Originally Posted by BeckyEbook
I can then jump to the first sentence and manually join the words.
|
Manual checking is definitely a good idea.
I ran into a similar issue with hyphenated/non-hyphenated words.
When you OCR a book, the OCR software has no idea whether a hyphen at the end of a line is an actual hard hyphen or a soft hyphen, so it has to guess.
So in the same book, you may have:
non-hyphenated + nonhyphenated
If both forms exist, my "hyphenation program" lets me know. Then I can take a closer look.
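The check itself is simple. This is not my actual program, just a minimal Python sketch of the idea, assuming plain text as input; the function name mixed_hyphenation is invented for the example:

```python
import re
from collections import Counter


def mixed_hyphenation(text):
    """Report words that occur both with and without an internal hyphen."""
    # Count every word, keeping internal hyphens (e.g. "non-hyphenated").
    words = Counter(w.lower() for w in re.findall(r"\w+(?:-\w+)*", text))
    report = {}
    for word, count in words.items():
        if "-" in word:
            joined = word.replace("-", "")
            if joined in words:
                # hyphenated form -> (its count, joined form, joined count)
                report[word] = (count, joined, words[joined])
    return report


sample = "A non-hyphenated word and a nonhyphenated word. To-morrow is fine."
print(mixed_hyphenation(sample))
```

It only reports the pairs and their counts. Deciding which form is correct is still left to a human.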
I wouldn't trust a fully automatic replacement because there can be a lot of nuances:
- It could be a proper noun (an actual person named John-son, or a last name Johnson).
- It could be an exact quotation.
- Could be a perfectly valid old-timey way of spelling it (to-morrow, to-day, co-operation).
- Could be a speck of dust from the scan.
- Could be an actual typo.
- [...]
Same sort of issue with these two- or three-word errors you are looking for. A "non hyphenated" word, split in two by a space, could be caught using the same method: check whether the joined form also appears in the book.
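A rough Python sketch of how that could look for words split by a space. This is an assumption about one possible approach, not an existing plugin. The function name split_word_candidates is hypothetical, and it only flags a pair if the joined form appears somewhere else in the same text:

```python
import re


def split_word_candidates(text):
    """Find adjacent word pairs whose joined form also occurs as a single word."""
    # \w is Unicode-aware for str patterns, so accented letters are kept.
    words = re.findall(r"\w+", text.lower())
    vocabulary = set(words)
    candidates = set()
    for first, second in zip(words, words[1:]):
        if first + second in vocabulary:
            candidates.add((first, second))
    return sorted(candidates)


sample = "The nonhyphenated form and the non hyphenated form."
print(split_word_candidates(sample))
```

The output is just a list of candidates for a manual check. It ignores punctuation between the two words, and in a heavily inflected language it will miss cases where the joined word only shows up in a different inflected form.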