Old 02-11-2018, 04:55 AM   #10
BeckyEbook
Guru
Posts: 853
Karma: 3341026
Join Date: Jan 2017
Location: Poland
Device: Various
Thank you very much for the new sources, but this subject (fully automatic processing) is well beyond my skills.

I would rather stick to producing a correct list of potential errors (by spell-checking the joined words).

Quote:
Originally Posted by Tex2002ans View Post
Probably better to grab a handful of words to the left and right. X amount of characters would run into an issue if you had a very large word: "supercalifragilisticexpialidocious smart phone".
This should not be a problem. If I come across such a long word, I can jump there and check it manually. Although Polish has longer words than English, 10-12 characters before and after should be enough to quickly check the context of a potential error.

Quote:
Originally Posted by Tex2002ans View Post
Side Note: Also, most of what I recall seeing only works on Plain Text only... I can't say I ever ran across one that also takes into account typical HTML formatting (italics, superscripts, etc., etc.).
I've already noticed that, so I strip the tags first.

Quote:
Originally Posted by Tex2002ans View Post
Manual checking is definitely a good idea.


Quote:
Originally Posted by Tex2002ans View Post
So in the same book, you may have:

non-hyphenated + nonhyphenated

If both forms exist, my "hyphenation program" lets me know. Then I can take a closer look.

I wouldn't trust a fully automatic replacement because there can be a lot of nuances:
  • It could be a proper noun (an actual person named John-son, or a last name Johnson).
  • It could be an exact quotation.
  • Could be a perfectly valid old-timey way of spelling it (to-morrow, to-day, co-operation).
  • Could be a speck of dust from the scan.
  • Could be an actual typo.
  • [...]
Exactly the point! In Polish this list is unfortunately even longer, partly because we have a relatively free word order.


It sounds very interesting, but for errors like these (a plain "smart phone" is the trivial case) I have to check manually anyway. The problem is getting the results in the first place.

My current version:
Code:
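            # Assumes `re` is imported and `BS` is a BeautifulSoup class (e.g. from bs4 or Sigil's sigil_bs4)
            soup = BS(xhtml, 'lxml')
            for e in soup.find_all('br'):   # Replace <br> with a space (get_text() would otherwise drop it)
                e.replace_with(' ')
            for data in soup.find_all('p'):
                stuff = data.get_text()
                # Up to 10 characters of left context, then a look-ahead for two adjacent words
                piece = re.findall(r'(.{0,10})(?=(\b\w+\b\s+\b\w+\b))', stuff)
                for words in piece:
                    word1 = words[0]                       # left context
                    word2 = re.sub(r'\s', '', words[1])    # the two words joined together
                    kontekst = word1 + words[1]            # was word[1], which raised a NameError
                    # res == 1 is taken here to mean the joined form passes the spell check
                    res = bk.hspell.check(word2)
                    if res == 1:
                        print(words[1], " ---> ", kontekst)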
It is not a perfect solution, but treat it as a basis for modification.
What I still need is to improve the code so that it shows context on both sides, and unfortunately I cannot get that to work.
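
One possible way to get context on both sides, offered only as a rough sketch: instead of capturing the left context inside the regex, match each word pair with re.finditer and slice the surrounding text by the match positions. The names find_split_candidates, is_word and width are made up for this illustration, and is_word is only a stand-in for whatever spell check is available (in the plugin it might wrap bk.hspell.check, which is not used here so the example can run on its own). It works on plain text, so the tags are assumed to be stripped already.

```python
import re

# Each match consumes only the first word; the whitespace and the second
# word sit in a look-ahead, so overlapping pairs ("a b", "b c") are all seen.
PAIR = re.compile(r'\b(\w+)(?=(\s+)(\w+)\b)')


def find_split_candidates(text, is_word, width=12):
    """Return (joined, left, pair, right) for every adjacent word pair
    whose joined form is accepted by is_word (a hypothetical callback)."""
    hits = []
    for m in PAIR.finditer(text):
        start, end = m.start(1), m.end(3)      # span of "word1 word2"
        joined = m.group(1) + m.group(3)       # "word1word2"
        if is_word(joined):
            left = text[max(0, start - width):start]   # context before the pair
            pair = text[start:end]
            right = text[end:end + width]               # context after the pair
            hits.append((joined, left, pair, right))
    return hits


if __name__ == "__main__":
    known = {"smartphone"}  # stand-in dictionary, for the demo only
    sample = "I bought a new smart phone yesterday at the mall."
    for joined, left, pair, right in find_split_candidates(
            sample, lambda w: w.lower() in known):
        print(pair, " ---> ", left + "[" + pair + "]" + right)
```

Slicing by position keeps the context width independent of the regex, so 10-12 characters on each side is just the width argument. Because only the first word of a pair is consumed, pairs that start inside an earlier match's context are still examined.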

Thank you very much for your interest in my problem.

Last edited by BeckyEbook; 02-11-2018 at 05:24 AM.