03-10-2023, 10:51 AM | #1 |
Enthusiast
Posts: 28
Karma: 25920
Join Date: Oct 2020
Device: Kobo Aura H2O (mark 5)
|
How to fix missing spaces between dictionary words
I have a book in which many thousands of spaces are missing between words, for instance: hehad, thereis, ina, withme, hecame, theworld, ... In the standard editor (or perhaps in some plug-in), is there a way to use a regex function that can (1) find words that are not in the standard dictionary but that consist of two words that are in the dictionary, and (2) propose a change? I have read the documentation on Function mode for Search & replace in the Editor, but I don't immediately see how this could be done. I'm sure someone must have had this problem before... TIA for any tips. Last edited by DVdm; 03-10-2023 at 10:55 AM. Reason: (grammar) |
03-10-2023, 11:06 AM | #2 |
Well trained by Cats
Posts: 29,809
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
The editor finds misspelled words and proposes changes. (tick show only....)
One of those is to split into 2 words (not 100%, but 2 valid words). Note: this does not find <tagged> run-togethers. |
03-10-2023, 11:34 AM | #3 | |
Enthusiast
Posts: 28
Karma: 25920
Join Date: Oct 2020
Device: Kobo Aura H2O (mark 5)
|
Quote:
I'd like to make all the changes in one blow, and then look for any induced mistakes. |
|
03-10-2023, 12:50 PM | #4 |
Enthusiast
Posts: 28
Karma: 25920
Join Date: Oct 2020
Device: Kobo Aura H2O (mark 5)
|
Thanks.
I replied, but my message doesn't turn up. |
03-10-2023, 02:03 PM | #5 |
Wizard
Posts: 1,095
Karma: 4911876
Join Date: Sep 2021
Location: Australia
Device: Kobo Libra 2
|
I have found this issue in quite a few books. Sometimes it looks like a previous editor has simply deleted all the hyphens in the book.
I use the "Check Spelling" function and simply scroll down the list of all misspelled words and fix them. Alt-F7 or Tools►Check spelling to access it. It's a lot easier than trying to check every page of the book. |
03-10-2023, 02:18 PM | #6 |
Wizard
Posts: 1,071
Karma: 412718
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
|
There's a RegEx function called 'SplitWords' that tries to divide words using the dictionary:
https://www.mobileread.com/forums/sh...ht=split+words (post #9, by the master).
I've never used it, so let me know how it goes.
Code:
import regex
from calibre import replace_entities, prepare_string_for_xml

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):

    def fix_word(m):
        word = m.group()
        # Leave words the dictionary already knows untouched
        if dictionaries.recognized(word):
            return word
        # Try each split point; keep the first one that yields two known words
        # (xrange is a Python 2 name)
        for i in xrange(1, len(word) - 1):
            a, b = word[:i], word[i:]
            if dictionaries.recognized(a) and dictionaries.recognized(b):
                return a + ' ' + b
        return word

    # Decode entities, fix the words, then re-escape the text for XML
    text = replace_entities(match.group(1))
    text = regex.sub(r'\b\w+\b', fix_word, text, flags=regex.VERSION1)
    text = prepare_string_for_xml(text)
    return '>' + text + '<'
|
03-10-2023, 06:04 PM | #7 | |
Enthusiast
Posts: 28
Karma: 25920
Join Date: Oct 2020
Device: Kobo Aura H2O (mark 5)
|
Quote:
Code:
return '>' + text + '<'
Code:
return text
Code:
\b(\w+)\b(?![^<>{}]*[>}])
Code:
NameError: name 'xrange' is not defined
Last edited by DVdm; 03-10-2023 at 06:08 PM. Reason: added used search string |
|
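For readers wondering what the search expression in the post above selects, here is a minimal sketch using Python's standard `re` module. It assumes the editor's regex engine treats this pattern the same way. The helper name `words_outside_tags` and the sample strings are made up for illustration. The negative lookahead rejects any word that is followed, with no `<`, `>`, `{` or `}` in between, by a closing `>` or `}`; in practice that means words inside a tag or inside a CSS block are skipped.

```python
import re

# The search expression quoted in the post above
SEARCH = re.compile(r'\b(\w+)\b(?![^<>{}]*[>}])')

def words_outside_tags(markup):
    """Hypothetical helper: list the words the expression would match."""
    return SEARCH.findall(markup)

# Tag names and attribute values are skipped; only the text content is matched
print(words_outside_tags('<p class="intro">hehad</p>'))
print(words_outside_tags('hehad <b>withme</b>'))
```

Because each match is a single word, a function paired with this expression receives just that word and can return plain text without re-adding the surrounding `>` and `<`.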
03-10-2023, 08:42 PM | #8 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Change xrange to range in that function.
|
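The splitting logic with that change applied can be tried outside the editor. This is only a sketch under stated assumptions: `KNOWN_WORDS` and `recognized` are hypothetical stand-ins for calibre's `dictionaries.recognized`, and the standard `re` module replaces `regex` so the snippet runs anywhere.

```python
import re

# Hypothetical stand-in for calibre's dictionary lookup (illustration only)
KNOWN_WORDS = {"he", "had", "there", "is", "in", "a",
               "with", "me", "came", "the", "world"}

def recognized(word):
    return word.lower() in KNOWN_WORDS

def fix_word(match):
    word = match.group()
    if recognized(word):
        return word
    # range replaces Python 2's xrange; the bounds are those of the posted function
    for i in range(1, len(word) - 1):
        a, b = word[:i], word[i:]
        if recognized(a) and recognized(b):
            return a + ' ' + b
    return word

def split_run_together(text):
    """Split every unknown word that is made of two known words."""
    return re.sub(r'\b\w+\b', fix_word, text)

print(split_run_together("hehad thereis withme theworld"))
```

With these bounds the loop never tries a split whose second part is a single letter, so a word like `ina` is left alone; widening the range to `len(word)` would cover that case, at the cost of more false splits.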
03-11-2023, 05:45 AM | #9 |
Enthusiast
Posts: 28
Karma: 25920
Join Date: Oct 2020
Device: Kobo Aura H2O (mark 5)
|
|
03-11-2023, 05:26 PM | #10 |
Enthusiast
Posts: 28
Karma: 25920
Join Date: Oct 2020
Device: Kobo Aura H2O (mark 5)
|
Using the regex function, it took me a few hours to fix the book, and then I realised something.
This was a book that I had found as a pdf, which I had converted to epub with Calibre. In an edit session I noticed that there were a bunch of paragraphs with some kind of hardcoded linefeeds. With a general search and replace, I replaced all linefeeds with nothing, effectively deleting them, and then beautified all files. Stupid. I should have replaced the linefeeds with spaces. So I retrieved the original pdf from my backups, converted it to epub, replaced all linefeeds with a space, and beautified all files. Ready. Silly me! Last edited by DVdm; 03-12-2023 at 05:20 AM. Reason: spelling |
|
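The difference between the two replacements described above is easy to see on a small sample. This is a sketch with made-up text; `join_lines` is a hypothetical helper, and it also swallows spaces or tabs around the linefeed so that no double spaces are left behind.

```python
import re

sample = "he had\ncome into \nthe world"

# Replacing linefeeds with nothing glues the neighbouring words together
glued = sample.replace("\n", "")

def join_lines(text):
    """Hypothetical helper: replace each linefeed (plus adjacent spaces/tabs) with one space."""
    return re.sub(r'[ \t]*\n[ \t]*', ' ', text)

print(glued)
print(join_lines(sample))
```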