Old 01-28-2017, 06:00 AM   #1
scratch
Junior Member
scratch began at the beginning.
 
Posts: 3
Karma: 10
Join Date: Jan 2017
Location: Austria
Device: none.
RegEx-Function and hyphenation problem

Hello everybody.

First, a really big thank-you to the community and Kovid. You have given me so many useful tips and so much advice about calibre! I have been visiting the forum for years and have learned a lot. In fact, almost every problem I ever ran into had already been asked and solved by other members. But now I have a problem that I simply cannot figure out, because I don't really understand Python (and yes, I have been trying for some weeks).

Here's the problem. After a scan and OCR there are sometimes many words that are wrongly split. In German (which makes up most of my paper-book library), for example, 'Maschi ne' instead of 'Maschine'.
It would be great to have a Python function that does this:

If two words separated by a space are _not_ found in the dictionary, join them, but only if the joined word _is_ in the dictionary.

Example in German:
'betrach ten' should become 'betrachten', but
'Josef Bankl' should not be converted to 'JosefBankl'.

In this case the editor's built-in Python function is not really helpful, because it joins every pair of words whose combined form is in the dictionary, for example making 'nachdem' from 'nach dem'. But these have different meanings and should remain untouched.

So hopefully I'm not the only one with this problem, and thinking about it is not a big waste of your time.
And sorry for my poor English.
Any advice would be great.
Sincerely, Steve
Old 01-28-2017, 07:26 AM   #2
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 45,600
Karma: 28548974
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
It would basically work like this example, https://manual.calibre-ebook.com/fun...phenated-words except that you have to change it to look for words separated by spaces instead of hyphens.
Old 01-28-2017, 08:15 AM   #3
scratch
Thank you for your quick answer.
I tried this already. What I changed was

(\w+)\s*-\s*(\w+)
to simply this line
(\w+)\s(\w+)

But unfortunately it does not work. Maybe that's because I am completely Python-blind.

There should also be a line that checks that neither of the neighbouring words is in the dictionary before joining them, which I cannot see in the example (or it's there and I do not understand it).
Thanks anyway

...and now I noticed something else.
In this sentence
<p>Solange Menschen auf die Welt kommen</p>
(\w+)\s(\w+)
finds words 1+2, then 3+4, and then 5+6,
so an error between words 2 and 3 is ignored.
Like this:
<p>Solange Men schen auf die Welt kommen</p>
I understand why, but I have not been able to find out how to avoid this.
Again, any further advice is welcome.
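
The pairing behaviour described here can be reproduced outside calibre with Python's standard `re` module. The sketch below is only an illustration: the variable names `consumed` and `overlapping` are made up for this example, and the lookahead pattern is one possible way to see every neighbouring pair, not something taken from the calibre manual. Because each plain match consumes both words, the next search starts after the second word; wrapping the second word in a lookahead leaves it unconsumed, so it can also start the next pair.

```python
import re

text = 'Solange Men schen auf die Welt kommen'

# Plain pattern: both words are consumed, so matches never overlap
# and the pair ('Men', 'schen') is never tested.
consumed = re.findall(r'(\w+)\s(\w+)', text)

# Lookahead variant: only the first word and the space are consumed;
# the second word is captured but remains available for the next match.
overlapping = re.findall(r'(\w+)\s(?=(\w+))', text)

print(consumed)
print(overlapping)
```

This only shows which pairs get examined. For the actual joining, a lookahead makes the replacement awkward (the second fragment is not part of the match), so walking over the words of a whole text run in a function is probably the simpler route.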

Last edited by scratch; 01-28-2017 at 09:30 AM.
Old 01-28-2017, 12:22 PM   #4
kovidgoyal
I don't have the time to write the function for you, but it would go something like this:
Code:
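def join_split_words(text, dictionaries):
    # text: a run of plain text; dictionaries: calibre's spell-check object
    words = text.split()
    i = 0
    while i < len(words) - 1:
        w1, w2 = words[i:i+2]
        # Join only when neither fragment is a known word but the pair is
        if (not dictionaries.recognized(w1) and not dictionaries.recognized(w2)
                and dictionaries.recognized(w1 + w2)):
            words[i] = w1 + w2
            del words[i+1]  # drop the merged fragment so no double space is left
        i += 1
    return ' '.join(words)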
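
For anyone who wants to try the rule from this thread outside calibre, here is a minimal self-contained sketch. Several things in it are assumptions rather than anything confirmed in the thread: `merge_split_words` and `KNOWN` are hypothetical names, the tiny word set stands in for a real German dictionary, and the text is tokenised with a regex instead of `split()` so that punctuation and spacing survive unchanged. Inside calibre, the `recognized` callable would be `dictionaries.recognized`, and `replace` uses the function-mode signature given in the calibre manual; the search expression is assumed to match a run of text between tags, such as `>[^<>]+<`.

```python
import re

WORD = re.compile(r'\w+')

def merge_split_words(text, recognized):
    """Join 'frag1 frag2' only if neither fragment is a known word
    but the joined form is. `recognized` is a callable word -> bool."""
    # Alternating word / non-word tokens, so nothing is lost on rejoin.
    tokens = re.findall(r'\w+|\W+', text)
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        # Candidate pair: word, pure whitespace, word.
        if (i + 2 < len(tokens) and WORD.fullmatch(tok)
                and tokens[i + 1].isspace() and WORD.fullmatch(tokens[i + 2])):
            nxt = tokens[i + 2]
            if (not recognized(tok) and not recognized(nxt)
                    and recognized(tok + nxt)):
                out.append(tok + nxt)
                i += 3  # skip the whitespace and the second fragment
                continue
        out.append(tok)
        i += 1
    return ''.join(out)

# Hypothetical stand-in for a real dictionary, for testing only.
KNOWN = {'wir', 'die', 'nach', 'dem', 'nachdem', 'betrachten',
         'Maschine', 'Solange', 'Menschen'}

# Entry point in the shape calibre's function mode expects (assumed usage).
def replace(match, number, file_name, metadata, dictionaries, data,
            functions, *args, **kwargs):
    return merge_split_words(match.group(), dictionaries.recognized)

sample = 'wir betrach ten die Maschi ne, nach dem Josef Bankl'
print(merge_split_words(sample, KNOWN.__contains__))
# → wir betrachten die Maschine, nach dem Josef Bankl
```

With this stand-in dictionary, 'nach dem' is left alone because both halves are recognized, and 'Josef Bankl' is left alone because 'JosefBankl' is not. Because a successful join skips ahead past both fragments while a failed test advances by only one token, an error between words 2 and 3 is still examined.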
Old 01-28-2017, 12:44 PM   #5
scratch
Thank you for your kind advice.
This gives me some hints to think about for the next days.
And BTW
Thank you for calibre which is simply the best!!