Thank you very much for the new sources, but this subject (fully automatic processing) is well beyond my skills.
I would rather stick to producing a correct list of potential errors (by spell-checking the joined words).
Quote:
Originally Posted by Tex2002ans
Probably better to grab a handful of words to the left and right. X amount of characters would run into an issue if you had a very large word: "supercalifragilisticexpialidocious smart phone".
|
This should not be a problem. If I come across such a long word, I can jump there and check it manually. Although Polish has longer words than English, 10-12 characters before and after should be enough to quickly check the context of a potential error.
Quote:
Originally Posted by Tex2002ans
Side Note: Also, most of what I recall seeing only works on Plain Text only... I can't say I ever ran across one that also takes into account typical HTML formatting (italics, superscripts, etc., etc.).
|
I've already noticed that, so I strip the tags first.
Quote:
Originally Posted by Tex2002ans
Manual checking is definitely a good idea.
|
Quote:
Originally Posted by Tex2002ans
So in the same book, you may have:
non-hyphenated + nonhyphenated
If both forms exist, my "hyphenation program" lets me know. Then I can take a closer look.
I wouldn't trust a fully automatic replacement because there can be a lot of nuances:
- It could be a proper noun (an actual person named John-son, or a last name Johnson).
- It could be an exact quotation.
- Could be a perfectly valid old-timey way of spelling it (to-morrow, to-day, co-operation).
- Could be a speck of dust from the scan.
- Could be an actual typo.
- [...]
|
Exactly my point! In Polish this list is unfortunately even longer, partly because we have a relatively free word order.
It sounds very interesting, but with errors like these (a plain "smart phone" is the trivial case) I have to check manually anyway. The problem is getting the list of results.
My current version:
Code:
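import re
from bs4 import BeautifulSoup as BS

# xhtml and bk come from the surrounding script (not shown here)
soup = BS(xhtml, 'lxml')
for e in soup.find_all('br'):  # replace <br> with a space (otherwise the neighbouring words get glued together)
    e.replace_with(' ')
for data in soup.find_all('p'):
    stuff = data.get_text()
    # group 1: up to 10 characters of leading context; group 2: two adjacent words
    piece = re.findall(r'(.{0,10})(?=(\b\w+\b\s+\b\w+\b))', stuff)
    for words in piece:
        word1 = words[0]
        word2 = re.sub(r'\s', '', words[1])  # the two words joined together
        kontekst = word1 + words[1]  # was word[1], which raised a NameError
        if words[1] != word2:
            res = bk.hspell.check(word2)
            if res == 1:  # the joined form is accepted by the spell checker
                print(words[1], " ---> ", kontekst)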
It is not a perfect solution, but treat it as a basis for modification.
What it still needs is context on both sides of the word pair, and unfortunately I have not managed to get that working.
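One possible way to get context on both sides is to work with match positions instead of regex groups, and then slice the text around each pair. This is only a rough sketch under assumptions: the function name joined_word_candidates and the is_known_word callable are made up for illustration (in the script above the callable would wrap bk.hspell.check), and the text is assumed to be already stripped of tags.

```python
import re

def joined_word_candidates(text, is_known_word, width=12):
    """Yield (pair, joined, context) for adjacent words whose joined form is a known word.

    is_known_word is a hypothetical callable standing in for the spell checker;
    width is the number of context characters kept on each side.
    """
    tokens = list(re.finditer(r'\w+', text))
    for left, right in zip(tokens, tokens[1:]):
        gap = text[left.end():right.start()]
        if not gap.isspace():
            continue  # punctuation or a hyphen between the words: skip
        joined = left.group() + right.group()
        if is_known_word(joined):
            start = max(0, left.start() - width)
            end = min(len(text), right.end() + width)
            pair = text[left.start():right.end()]
            # brackets mark the suspicious pair inside its surrounding context
            context = text[start:left.start()] + '[' + pair + ']' + text[right.end():end]
            yield pair, joined, context

if __name__ == '__main__':
    # Toy dictionary used only for this demonstration.
    known = {'smartphone', 'notebook'}
    sample = 'I bought a smart phone yesterday and a note book too.'
    for pair, joined, context in joined_word_candidates(sample, lambda w: w.lower() in known):
        print(pair, ' ---> ', context)
```

Because every pair of neighbouring words is tested, overlapping pairs ("a b", "b c") are all reported, and a very long word simply makes the printed line longer instead of eating the context.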
Thank you very much for your interest in my problem.