11-14-2018, 06:22 AM   #1
Ruskie_it
Fanatic
Posts: 536
Karma: 1000000
Join Date: Dec 2011
Location: Rome, Italy
Device: Kindle PW5, Kindle PW4, Kindle 4 NT
Remove dashes within words

Hello, I am working on a document that was probably scanned from paper, and many words have a dash (the hyphen sign) between their letters. For example, in this Italian text:

Il gat-to saltò giù dal letto e si incam-minò verso la porta, do-ve lo stavo aspettan-do.

I would like to remove them in a semi-automatic way: a regex search that highlights each candidate so that, if it is not a false positive, I can fix it manually by clicking Replace.
However, I can't seem to make it work.
I found the sticky thread with saved searches, and a couple of them should do exactly this, but they don't seem to work for me.
For example, senhal wrote this one in 2015:

Code:
      "case_sensitive": false,
      "dot_all": false,
      "find": "(?s)([a-zàáèéìíòóùú])- *([a-zàáèéìíòóùú])(?![^<>]*>)(?!.*<body[^>]*>)",
      "mode": "regex",
      "name": "FIX: words with dash inside [del]",
      "replace": "\\1\\2"
It correctly identifies the oddly dashed words, but when I click "Replace", a literal \1\2 replaces the matched text, which is wrong.
For example, this is what I get when I apply the search and replace above to the following words:

disprezzar-lo ---> disprezza\1\2o
na-va ---> n\1\2e

Etc. etc.
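
To reproduce the symptom outside the editor, here is a minimal Python sketch. It assumes the editor handles backreferences the way Python's re module does, and it assumes (I have not confirmed this) that the doubled backslashes, which are only JSON escaping in the saved-search file, ended up in the Replace field as typed. The helper name join_dashed is made up for this illustration.

```python
import re

# Find pattern copied from the saved search; IGNORECASE mirrors "case_sensitive": false.
PATTERN = re.compile(
    r"(?s)([a-zàáèéìíòóùú])- *([a-zàáèéìíòóùú])(?![^<>]*>)(?!.*<body[^>]*>)",
    re.IGNORECASE,
)


def join_dashed(text, replacement):
    """Hypothetical helper: apply the saved search with a given replace string."""
    return PATTERN.sub(replacement, text)


sample = "Il gat-to saltò giù dal letto e si incam-minò verso la porta, do-ve lo stavo aspettan-do."

# Single backslashes: \1 and \2 are backreferences to the letters around the dash.
joined = join_dashed("disprezzar-lo", r"\1\2")
fixed_sample = join_dashed(sample, r"\1\2")

# Doubled backslashes: each \\ is an escaped literal backslash, so the text
# "\1\2" is inserted verbatim instead of the captured letters.
literal = join_dashed("disprezzar-lo", r"\\1\\2")

print(joined)
print(fixed_sample)
print(literal)
```

If that assumption holds, the second call gives the same kind of broken output I see in the editor, while the first one joins the word correctly. Note that a blanket replace would also join legitimately hyphenated words, which is why I want to review each match by hand.
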
Can anyone help?

Thank you so much
R.