Old 03-10-2023, 10:51 AM   #1
DVdm
Enthusiast
Posts: 28
Karma: 25920
Join Date: Oct 2020
Device: Kobo Aura H2O (mark 5)
How to fix missing spaces between dictionary words


I have a book in which many thousands of spaces are missing between words, for instance: hehad, thereis, ina, withme, hecame, theworld, ...
In the standard editor (or perhaps in some plug-in), is there a way to use a regex function that can (1) find words that are not in the standard dictionary but consist of two words that are, and (2) propose a change?
I have read the documentation on Function mode for Search & replace in the Editor, but I don't immediately see how this could be done. I'm sure someone must have had this problem before...
TIA for any tips.

Last edited by DVdm; 03-10-2023 at 10:55 AM. Reason: (grammar)
Old 03-10-2023, 11:06 AM   #2
theducks
Well trained by Cats
Posts: 29,809
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
The editor finds misspelled words and proposes changes. (tick show only....)

One of those proposals is to split the word into 2 words (not 100% reliable, but 2 valid words).
Note: This does not find <tagged> run-togethers.
Old 03-10-2023, 11:34 AM   #3
DVdm
Yes, thanks, I had found that, but even this will take many hours. I would like the editor to go ahead and make all the corrections, provided there are indeed two words. For instance, for bethe its choices are Bethe, bathe, Lethe, be the, be-the, ..., where only the 4th item is the one I need: two separate existing words.
I'd like to make all the changes in one blow, and then look for any induced mistakes.
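The selection rule described here, accepting a suggestion only when it is the original word with a single space inserted, can be sketched in plain Python. The function name `pick_two_word_split` and the hard-coded suggestion list are illustrative assumptions; this is not how the calibre spell checker is implemented or exposed.

```python
def pick_two_word_split(word, suggestions):
    """Return the suggestion that is `word` with one space inserted, or None."""
    for candidate in suggestions:
        parts = candidate.split(' ')
        # Exactly two non-empty parts that join back to the original word
        if len(parts) == 2 and all(parts) and ''.join(parts) == word:
            return candidate
    return None

# Suggestion list copied from the post above
choices = ['Bethe', 'bathe', 'Lethe', 'be the', 'be-the']
print(pick_two_word_split('bethe', choices))  # be the
```

A rule like this only filters what the spell checker already suggests; whether both halves are the intended words still needs a look afterwards.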
Old 03-10-2023, 12:50 PM   #4
DVdm
Thanks.
I replied, but my message doesn't turn up.
Old 03-10-2023, 02:03 PM   #5
Karellen
Wizard
Posts: 1,095
Karma: 4911876
Join Date: Sep 2021
Location: Australia
Device: Kobo Libra 2
I have found this issue in quite a few books. Sometimes it looks like a previous editor has simply deleted all the hyphens in the book.

I use the "Check Spelling" function and simply scroll down the list of all misspelled words and fix them.
Press Alt-F7 or choose Tools►Check spelling to access it.
It's a lot easier than trying to check every page of the book.
Old 03-10-2023, 02:18 PM   #6
phossler
Wizard
Posts: 1,071
Karma: 412718
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
There's a regex function called 'SplitWords' that tries to split run-together words using the dictionary.

https://www.mobileread.com/forums/sh...ht=split+words
Post #9 by the master

I've never used it, so let me know how it goes.


Code:
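import regex
from calibre import replace_entities, prepare_string_for_xml

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    def fix_word(m):
        word = m.group()
        # Leave words the spelling dictionaries already accept untouched
        if dictionaries.recognized(word):
            return word
        # Try each split point; keep the first one that gives two known words
        # (xrange exists only in Python 2, see the replies below)
        for i in xrange(1, len(word) - 1):
            a, b = word[:i], word[i:]
            if dictionaries.recognized(a) and dictionaries.recognized(b):
                return a + ' ' + b
        return word
    # Decode entities so words are matched as plain text
    text = replace_entities(match.group(1))
    text = regex.sub(r'\b\w+\b', fix_word, text, flags=regex.VERSION1)
    # Re-escape the result before putting it back between the tags
    text = prepare_string_for_xml(text)
    return '>' + text + '<'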
Old 03-10-2023, 06:04 PM   #7
DVdm
Of course I have to replace
Code:
return '>' + text + '<'
with
Code:
return text
but, using the search string
Code:
\b(\w+)\b(?![^<>{}]*[>}])
it only works for words that are in the dictionary, i.e. they are left unchanged. But when it encounters a word that is not in the dictionary, I get an error:
Code:
NameError: name 'xrange' is not defined
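
To see what that search string selects, here is a small sketch using Python's standard `re` module outside calibre. The sample markup and the helper name `words_outside_tags` are made up for illustration. The negative lookahead rejects any word that is followed, before the next bracket, by `>` or `}`, which in practice means words inside tags or CSS braces.

```python
import re

# Search string from the post above
WORD_OUTSIDE_TAGS = re.compile(r'\b(\w+)\b(?![^<>{}]*[>}])')

def words_outside_tags(markup):
    """List the words the pattern matches, i.e. those not inside <...> or {...}."""
    return WORD_OUTSIDE_TAGS.findall(markup)

sample = '<p class="x">hehad a <b>book</b></p>'
print(words_outside_tags(sample))  # ['hehad', 'a', 'book']
```

Because the tags are never part of a match, the function can return the fixed text directly instead of wrapping it in `>` and `<`.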

Last edited by DVdm; 03-10-2023 at 06:08 PM. Reason: added used search string
Old 03-10-2023, 08:42 PM   #8
kovidgoyal
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Change xrange to range in that function.
Old 03-11-2023, 05:45 AM   #9
DVdm
Yes!
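
For reference, the splitting loop with that change applied can be tried outside calibre. This is a minimal sketch under assumptions: `KNOWN_WORDS` and `recognized` are stand-ins for calibre's `dictionaries.recognized`, and the loop here runs to `len(word)` rather than `len(word) - 1` so that a one-letter second word (as in ina) is also tried, which differs from the posted function. `xrange` is a Python 2 built-in; in Python 3 `range` takes its place.

```python
import re

# Tiny stand-in word list; in calibre, dictionaries.recognized() does this check
KNOWN_WORDS = {'he', 'had', 'there', 'is', 'in', 'a', 'with', 'me', 'the', 'world'}

def recognized(word):
    return word.lower() in KNOWN_WORDS

def fix_word(m):
    word = m.group()
    if recognized(word):
        return word
    # range (not xrange) works on Python 3
    for i in range(1, len(word)):
        a, b = word[:i], word[i:]
        if recognized(a) and recognized(b):
            return a + ' ' + b
    return word

def split_run_togethers(text):
    return re.sub(r'\b\w+\b', fix_word, text)

print(split_run_togethers('hehad theworld ina xyzzy'))
# he had the world in a xyzzy
```

The first split that yields two known words wins, so the result still needs a pass for induced mistakes.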
Old 03-11-2023, 05:26 PM   #10
DVdm
Using the regex function, it took me a few hours to fix the book, and then I realised something.
This was a book that I had found as a PDF and converted to EPUB with calibre. In an edit session I noticed a bunch of paragraphs with some kind of hardcoded linefeeds. With a general search and replace, I replaced all linefeeds with nothing, effectively deleting them, and then beautified all files.
Stupid. I should have replaced the linefeeds with spaces.
So I retrieved the original PDF from my backups, converted it to EPUB, replaced all linefeeds with a space, and beautified all files. Done.
Silly me!
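
The difference between the two replacements is easy to show on a line-wrapped paragraph. This is a sketch with a made-up sample string; the pattern `\s*\n\s*` is one possible choice and not necessarily the search used in the post.

```python
import re

wrapped = 'When he\nhad seen the\nworld, he came home.'

# Deleting the linefeeds glues neighbouring words together
glued = wrapped.replace('\n', '')

# Replacing each linefeed (and any whitespace around it) with one space keeps words apart
spaced = re.sub(r'\s*\n\s*', ' ', wrapped)

print(glued)   # When hehad seen theworld, he came home.
print(spaced)  # When he had seen the world, he came home.
```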

Last edited by DVdm; 03-12-2023 at 05:20 AM. Reason: spelling