MobileRead Forums - View Single Post

Ruskie_it · 11-14-2018, 06:22 AM

Hello, I am dealing with a document that was probably scanned from paper, and many of its words have a hyphen in the middle. For example, in this Italian text:

Il gat-to saltò giù dal letto e si incam-minò verso la porta, do-ve lo stavo aspettan-do.

I would like to remove them in a semi-automatic way: a regex search that highlights each candidate so that, if it is not a false positive, I can fix it by clicking Replace.
However, I can't get it to work.
I found the sticky post with saved searches, and a couple of them should do exactly this, but they don't work for me.
For example, senhal posted this one in 2015:

Code:

"case_sensitive": false,
"dot_all": false,
"find": "(?s)([a-zàáèéìíòóùú])- *([a-zàáèéìíòóùú])(?![^<>]*>)(?!.*<body[^>]*>)",
"mode": "regex",
"name": "FIX: words with dash inside [del]",
"replace": "\\1\\2"

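To make the find pattern easier to check outside the editor, here is a minimal sketch in Python. It assumes the editor's regex engine behaves like Python's `re` module for this pattern, which the thread does not confirm. The `tagged` string is a made-up example, not taken from the book.

```python
import re

# Find pattern copied from the saved search above.
# re.IGNORECASE mirrors "case_sensitive": false.
FIND = re.compile(
    r"(?s)([a-zàáèéìíòóùú])- *([a-zàáèéìíòóùú])(?![^<>]*>)(?!.*<body[^>]*>)",
    re.IGNORECASE,
)

sample = ("Il gat-to saltò giù dal letto e si incam-minò verso la porta, "
          "do-ve lo stavo aspettan-do.")
hits = [m.group(0) for m in FIND.finditer(sample)]
print(hits)  # ['t-t', 'm-m', 'o-v', 'n-d']

# Hypothetical markup: the hyphen inside the tag's attribute is skipped
# by the first lookahead, the one in the text is still found.
tagged = '<p class="my-class">gat-to</p>'
tag_hits = [m.group(0) for m in FIND.finditer(tagged)]
print(tag_hits)  # ['t-t']
```

Each match covers only the letter before the hyphen, the hyphen itself, and the letter after it, so the replacement has to put the two captured letters back.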
It correctly finds the wrongly hyphenated words, but when I click "Replace", the matched text is replaced with a literal \1\2 instead of the two captured letters.
For example, this is what I get when I apply the search and replace above to the following words:

disprezzar-lo ---> disprezza\1\2o
na-va ---> n\1\2e
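One possible explanation, offered as an assumption and not as a confirmed diagnosis: the doubled backslashes in "replace" are JSON escaping, so the value is meant to be \1\2 once the file is imported. If \\1\\2 is typed into the Replace box as is, a regex engine reads each \\ as a literal backslash. The sketch below reproduces that symptom with Python's `re` module, again assuming the editor handles backslashes the same way. The lookaheads are left out to keep it short.

```python
import re

# Simplified find pattern (lookaheads omitted for brevity).
FIND = re.compile(r"([a-zàáèéìíòóùú])- *([a-zàáèéìíòóùú])", re.IGNORECASE)

word = "disprezzar-lo"

# Replacement text as it appears in the JSON, with doubled backslashes:
# each \\ becomes one literal backslash, so no group is substituted.
escaped = FIND.sub(r"\\1\\2", word)
print(escaped)  # disprezza\1\2o

# Replacement text with single backslashes: real backreferences.
plain = FIND.sub(r"\1\2", word)
print(plain)  # disprezzarlo
```

If this is the cause, entering \1\2 with single backslashes in the Replace box, or importing the saved search from the JSON file instead of pasting its fields by hand, should give the expected result.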

Etc. etc.
Can anyone help?

Thank you so much
R.

Thread: Remove dashes within words (post #1)