MobileRead Forums - View Single Post - Regex question -- All two consecutive words

BeckyEbook · 02-11-2018, 04:55 AM

Thank you very much for the new sources, but the complexity of this subject (automatic processing) is well beyond my skills.

I would rather stick to producing a correct list of potential errors (by checking the spelling of the joined words).

Quote:

Originally Posted by Tex2002ans

Probably better to grab a handful of words to the left and right. X amount of characters would run into an issue if you had a very large word: "supercalifragilisticexpialidocious smart phone".

This should not be a problem. If I come across such a long word, I can jump there and check it manually. Although Polish has longer words than English, 10-12 characters before and after should be enough to quickly check the context of a potential error.

Quote:

Originally Posted by Tex2002ans

Side Note: Also, most of what I recall seeing only works on Plain Text only... I can't say I ever ran across one that also takes into account typical HTML formatting (italics, superscripts, etc., etc.).

I've already noticed that, so I strip the tags first.

Quote:

Originally Posted by Tex2002ans

Manual checking is definitely a good idea.

Quote:

Originally Posted by Tex2002ans

So in the same book, you may have:

non-hyphenated + nonhyphenated

If both forms exist, my "hyphenation program" lets me know. Then I can take a closer look.

I wouldn't trust a fully automatic replacement because there can be a lot of nuances:

It could be a proper noun (an actual person named John-son, or a last name Johnson).
It could be an exact quotation.
Could be a perfectly valid old-timey way of spelling it (to-morrow, to-day, co-operation).
Could be a speck of dust from the scan.
Could be an actual typo.
[...]

Exactly the point! In Polish this list is unfortunately even longer, partly because we have relatively free word order.

It sounds very interesting, but with errors like these ("smart phone" is just a trivial example) I have to check manually. The problem is getting the results.

My current version:

Code:

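            # Needs at module level: import re; from bs4 import BeautifulSoup as BS
            # xhtml and bk are defined by the surrounding plugin code (not shown).
            soup = BS(xhtml, 'lxml')
            for e in soup.find_all('br'):   # change BR to a space (it is removed by default)
                e.replace_with(' ')
            for data in soup.find_all('p'):
                stuff = data.get_text()
                # group 1: up to 10 characters of left context; group 2: two consecutive words
                piece = re.findall(r'(.{0,10})(?=(\b\w+\b\s+\b\w+\b))', stuff)
                for words in piece:
                    word1 = words[0]
                    word2 = re.sub(r'\s', '', words[1])   # the two words joined together
                    kontekst = word1 + words[1]
                    # report the pair when the joined form passes the spell check
                    if bk.hspell.check(word2) == 1:
                        print(words[1], " ---> ", kontekst)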

It is not a perfect solution, but treat it as a basis for modification.
What I am after is an improvement that gives context on both sides of the match, and unfortunately I cannot get that to work.
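
One possible direction, as a rough sketch only: instead of capturing the left context inside the regex, find each word pair with re.finditer and cut the context out of the text by position, which gives both sides for free. The names joined_candidates, is_word and width are made up for this illustration. is_word stands in for the spell check; in the code above it would presumably be something like lambda w: bk.hspell.check(w) == 1, assuming that 1 means "spelled correctly" as the posted code implies.

```python
import re

# Zero-width match at the start of every word; the lookahead captures the
# pair without consuming it, so overlapping pairs ("a b", "b c") are all seen.
PAIR = re.compile(r'\b(?=(\w+)(\s+)(\w+)\b)')


def joined_candidates(text, is_word, width=12):
    """Yield (pair, joined, context) for adjacent words whose
    concatenation is accepted by is_word (a hypothetical spell-check hook)."""
    for m in PAIR.finditer(text):
        joined = m.group(1) + m.group(3)
        if not is_word(joined):
            continue
        start, end = m.start(1), m.end(3)
        pair = text[start:end]
        left = text[max(0, start - width):start]   # up to `width` chars before
        right = text[end:end + width]              # up to `width` chars after
        yield pair, joined, left + '[' + pair + ']' + right


if __name__ == '__main__':
    sample = "He bought a smart phone yesterday and to day is fine"
    known = {"smartphone", "today"}   # toy stand-in for a real dictionary
    for pair, joined, context in joined_candidates(sample, known.__contains__):
        print(pair, " ---> ", context)
```

Because the context is sliced from the string rather than matched, a very long neighbouring word is simply cut off at width characters, and \w in Python 3 also covers Polish letters.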

Thank you very much for your interest in my problem.