Old 02-09-2018, 01:27 PM   #1
BeckyEbook
Guru
Posts: 692
Karma: 2180740
Join Date: Jan 2017
Location: Poland
Device: Misc
Regex question -- All two consecutive words

I have a text:
Code:
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
My code:
Code:
\b\w+\b[,;.\s]*\b\w+\b
gives the results:
Code:
Lorem ipsum
dolor sit
amet, consectetur
adipiscing elit
I need:
Code:
Lorem ipsum
ipsum dolor
dolor sit
sit amet
amet, consectetur
consectetur adipiscing
adipiscing elit
Is it possible at all?
Old 02-09-2018, 03:29 PM   #2
KevinH
Sigil Developer
Posts: 7,636
Karma: 5433388
Join Date: Nov 2009
Device: many
See this short discussion of re's lookahead feature

https://stackoverflow.com/questions/...-with-a-regexp
Old 02-09-2018, 03:53 PM   #3
BeckyEbook
Guru
Thank you. It works! Word of the day: overlapping.
Old 02-09-2018, 04:32 PM   #4
Adoby
Handy Elephant
Posts: 1,736
Karma: 26785668
Join Date: Dec 2009
Location: Southern Sweden, far out in the quiet woods
Device: Thinkpad E595, Ubuntu Mate, Huawei Mediapad 5, Bouye Likebook Plus
Is this the answer?

(?=(\b\w+\b[,;.\s]*\b\w+\b))

Or possibly...

(?=(\b\w+\b)[,;.\s]*(\b\w+\b))

To get each word in a separate group.

(?= ... ) is the magic dust... Positive lookahead. New for me...
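A minimal Python sketch of the difference, using the sample paragraph from the first post. `re.findall` returns the contents of the capturing group, and because the lookahead itself consumes no characters, the next search resumes one position later and can reuse the second word of the previous pair. The variable names are only illustrative.

```python
import re

text = "<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>"

# Plain pattern: each match consumes both words, so pairs cannot overlap.
consuming = re.findall(r"\b\w+\b[,;.\s]*\b\w+\b", text)

# Same pattern wrapped in a positive lookahead with a capturing group:
# the match itself is empty, so every word can start a new pair.
overlapping = re.findall(r"(?=(\b\w+\b[,;.\s]*\b\w+\b))", text)

print(consuming)    # four non-overlapping pairs
print(overlapping)  # seven overlapping pairs
```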
Old 02-09-2018, 07:17 PM   #5
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Just wondering what the use-case is?

Are you trying to pull out all n-grams?
Old 02-10-2018, 07:12 AM   #6
BeckyEbook
Guru
Quote:
Originally Posted by Tex2002ans View Post
Just wondering what the use-case is?

Are you trying to pull out all n-grams?
It's just an idea that's on my mind.

Code:
I bought a new smart phone.
I have a very smart phone.
I check every pair of neighboring words, and if the two words joined together exist in the dictionary, I display the result.

In this case:
Code:
smart + phone = smartphone
(In the first sentence it should be "smartphone"; in the second it is fine written separately.)

Of course, EVERYTHING depends on the context, and I cannot manage to "catch" that context correctly.

My dream is to get a result close to:
Code:
(.{0,10})(?=(\b\w+\b[,;.\s]*\b\w+\b))(.{0,10})
The expected result:
Code:
ght a new smart phone.
ve a very smart phone.
(0-10 characters of "context" around words).

I can then jump to the first sentence and manually join the words.
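A note on why the pattern above cannot give that result: a lookahead is zero-width, so the trailing `(.{0,10})` starts at the first word of the pair, not after it. One possible way around this, sketched below under the assumption that the text is already plain (tags stripped), is to match only the lookahead and slice the context out of the string by position. The names `pairs_with_context` and `width` are hypothetical.

```python
import re

PAIR = re.compile(r"(?=(\b\w+\b[,;.\s]*\b\w+\b))")

def pairs_with_context(text, width=10):
    """Yield (pair, snippet) for every overlapping two-word pair in text."""
    for m in PAIR.finditer(text):
        start, end = m.span(1)           # span of the captured pair
        left = max(0, start - width)     # clamp at the start of the text
        yield m.group(1), text[left:end + width]

for sentence in ("I bought a new smart phone.", "I have a very smart phone."):
    for pair, snippet in pairs_with_context(sentence):
        if pair == "smart phone":
            print(snippet)
# prints:
# ght a new smart phone.
# ve a very smart phone.
```

Slicing past the end of a string is safe in Python, so only the left edge needs clamping.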
Old 02-10-2018, 07:36 AM   #7
Doitsu
Grand Sorcerer
Posts: 5,583
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
Quote:
Originally Posted by BeckyEbook View Post
It's just an idea that's on my mind.
You might want to check out NLTK, in particular the collocations module.
Old 02-10-2018, 09:33 AM   #8
BeckyEbook
Guru
Thank you. A lot to read, but maybe something can be used.

I want to use it for the Polish language. The grammar of the Polish language is characterized by a high degree of inflection, and it has a relatively free word order, so I would like to get the results for a manual check.

I think it's a good idea for a validating plugin.

Initial tests are unfortunately not very encouraging - there are a lot of words in Polish which mean something different when they are written together, and different when they are written separately.
Old 02-10-2018, 08:20 PM   #9
Tex2002ans
Wizard
Quote:
Originally Posted by Doitsu View Post
You might want to check out NLTK, in particular the collocations module.


Quote:
Originally Posted by BeckyEbook View Post
I want to use it for the Polish language. The grammar of the Polish language is characterized by a high degree of inflection, and it has a relatively free word order, so I would like to get the results for a manual check.

I think it's a good idea for a validating plugin.

Initial tests are unfortunately not very encouraging - there are a lot of words in Polish which mean something different when they are written together, and different when they are written separately.
Hmmmm.... well you may want to read a lot more about n-grams and Polish. Perhaps you can find something useful:

N-Gram Collection from a Large-Scale Corpus of Polish Internet

Extended N-gram Model for Analysis of Polish Texts

LanguageTool also works by working on a list of rules:

https://languagetool.org/

I see Polish is on the list of supported languages, and they have quite a few rules. Sadly, no n-gram support for Polish (yet):

http://wiki.languagetool.org/finding...ng-n-gram-data

Quote:
Originally Posted by BeckyEbook View Post
Of course, EVERYTHING depends on the context, and I cannot manage to "catch" that context correctly.

My dream is to get a result close to:
Code:
(.{0,10})(?=(\b\w+\b[,;.\s]*\b\w+\b))(.{0,10})
The expected result:
Code:
ght a new smart phone.
ve a very smart phone.
(0-10 characters of "context" around words).
Probably better to grab a handful of words to the left and right. X amount of characters would run into an issue if you had a very large word: "supercalifragilisticexpialidocious smart phone".
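A rough sketch of word-based context, assuming plain text and a simple `\w+` tokenizer (which drops punctuation and will also pair words across sentence boundaries). The names `word_context` and `n` are made up for the example.

```python
import re

def word_context(text, n=2):
    """Yield (pair, context) with up to n words on each side of the pair."""
    words = re.findall(r"\w+", text)
    for i in range(len(words) - 1):
        left = words[max(0, i - n):i]
        right = words[i + 2:i + 2 + n]
        yield (words[i], words[i + 1]), " ".join(left + words[i:i + 2] + right)

sample = "A supercalifragilisticexpialidocious smart phone rang."
for pair, context in word_context(sample):
    if pair == ("smart", "phone"):
        print(context)
# prints: A supercalifragilisticexpialidocious smart phone rang
```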

Side Note: Also, most of what I recall seeing works on plain text only... I can't say I ever ran across one that also takes into account typical HTML formatting (italics, superscripts, etc., etc.).

Quote:
Originally Posted by BeckyEbook View Post
I can then jump to the first sentence and manually join the words.
Manual checking is definitely a good idea.

I ran into a similar issue with hyphenated/non-hyphenated words.

When you OCR a book, it has no idea if the hyphen at the end of a line is an actual hard hyphen or a soft hyphen, so the OCR has to guess.

So in the same book, you may have:

non-hyphenated + nonhyphenated

If both forms exist, my "hyphenation program" lets me know. Then I can take a closer look.
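The program itself is not shown in this thread, so the following is only a guess at the idea: collect every word, then report the hyphenated ones whose joined spelling also occurs in the same text. The function name `both_forms` is hypothetical.

```python
import re

def both_forms(text):
    """Return hyphenated words whose joined spelling also occurs in text."""
    # Treat runs like "non-hyphenated" as a single token.
    words = set(w.lower() for w in re.findall(r"\w+(?:-\w+)*", text))
    return sorted(w for w in words if "-" in w and w.replace("-", "") in words)

sample = "He wrote to-day that the non-hyphenated word is nonhyphenated."
print(both_forms(sample))   # ['non-hyphenated']
```

It only flags candidates; "to-day" is left alone here because "today" never appears in the sample.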

I wouldn't trust a fully automatic replacement because there can be a lot of nuances:
  • It could be a proper noun (an actual person named John-son, or a last name Johnson).
  • It could be an exact quotation.
  • Could be a perfectly valid old-timey way of spelling it (to-morrow, to-day, co-operation).
  • Could be a speck of dust from the scan.
  • Could be an actual typo.
  • [...]

Same sort of issue with the two- or three-word errors you are looking for. You could have a "non hyphenated" word that could be caught using that method.
Old 02-11-2018, 04:55 AM   #10
BeckyEbook
Guru
Thank you very much for the new sources, but the complexity of this subject (automatic processing) is well beyond my skills.

I want to stick to producing a correct list of potential errors (by checking the spelling of the joined words).

Quote:
Originally Posted by Tex2002ans View Post
Probably better to grab a handful of words to the left and right. X amount of characters would run into an issue if you had a very large word: "supercalifragilisticexpialidocious smart phone".
This should not be a problem. If I come across such a long word, I can jump there and check it manually. Although there are longer words in Polish than in English, 10-12 characters before and after should be enough to quickly check the context of a potential error.

Quote:
Originally Posted by Tex2002ans View Post
Side Note: Also, most of what I recall seeing works on plain text only... I can't say I ever ran across one that also takes into account typical HTML formatting (italics, superscripts, etc., etc.).
I've already noticed that and strip the tags.

Quote:
Originally Posted by Tex2002ans View Post
Manual checking is definitely a good idea.


Quote:
Originally Posted by Tex2002ans View Post
So in the same book, you may have:

non-hyphenated + nonhyphenated

If both forms exist, my "hyphenation program" lets me know. Then I can take a closer look.

I wouldn't trust a fully automatic replacement because there can be a lot of nuances:
  • It could be a proper noun (an actual person named John-son, or a last name Johnson).
  • It could be an exact quotation.
  • Could be a perfectly valid old-timey way of spelling it (to-morrow, to-day, co-operation).
  • Could be a speck of dust from the scan.
  • Could be an actual typo.
  • [...]
Exactly the point! In Polish, this list is unfortunately longer, among other things, because we have a relatively free word order.


It sounds very interesting, but with errors of this kind ("smart phone" is just a trivial example) I have to check manually. The problem is getting the results.

My current version:
Code:
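import re
from bs4 import BeautifulSoup as BS

# `xhtml` (the file's markup) and `bk` (the Sigil plugin container)
# come from the surrounding plugin code.
soup = BS(xhtml, 'lxml')
for e in soup.find_all('br'):   # Replace <br> with a space (get_text() would otherwise fuse the words)
    e.replace_with(' ')
for data in soup.find_all('p'):
    stuff = data.get_text()
    # Group 1: up to 10 characters of left context; group 2: the two-word pair
    piece = re.findall(r'(.{0,10})(?=(\b\w+\b\s+\b\w+\b))', stuff)
    for words in piece:
        word1 = words[0]
        word2 = re.sub(r'\s', '', words[1])   # the pair joined into one word
        kontekst = word1 + words[1]           # left context + the pair
        if words[1] != word2:
            res = bk.hspell.check(word2)
            if res == 1:                      # 1 is taken here to mean the joined word passed the spell check
                print(words[1], " ---> ", kontekst)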
It is not a perfect solution, but treat it as a basis for modification.
What I still need is to extend the code so that it gives context on both sides, and unfortunately I cannot get that to work.
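One thing to watch in the pattern above: the leading `(.{0,10})` consumes text, so the next search resumes after it, and pairs that fall inside the consumed context are never reported. The sketch below compares it with a lookahead-only pattern, which visits every adjacent pair. It is only an illustration: `KNOWN`, `is_known` and `joined_candidates` are hypothetical names, and the small set stands in for `bk.hspell.check` so the example runs outside Sigil.

```python
import re

CONSUMING = re.compile(r"(.{0,10})(?=(\b\w+\b\s+\b\w+\b))")
LOOKAHEAD = re.compile(r"(?=(\b\w+\b\s+\b\w+\b))")

KNOWN = {"smartphone"}   # stand-in for the spell-check dictionary

def is_known(word):
    """Hypothetical replacement for bk.hspell.check(word) == 1."""
    return word.lower() in KNOWN

def joined_candidates(text):
    """Return the adjacent word pairs whose joined form is a known word."""
    hits = []
    for m in LOOKAHEAD.finditer(text):
        pair = m.group(1)
        if is_known(re.sub(r"\s", "", pair)):
            hits.append(pair)
    return hits

text = "I bought a new smart phone."
print([pair for _, pair in CONSUMING.findall(text)])   # some pairs are skipped
print(LOOKAHEAD.findall(text))                          # every adjacent pair
print(joined_candidates(text))                          # ['smart phone']
```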

Thank you very much for your interest in my problem.

Last edited by BeckyEbook; 02-11-2018 at 05:24 AM.