MobileRead Forums - View Single Post

enuddleyarbl · 01-25-2023, 10:06 AM

I was getting tired of the Calibre Editor's spellchecker being stymied by artificially split words (IOW, the publisher added various tags for things like "id=" right in the middle of them). For instance:

Code:

ele<span epub:type="pagebreak" id="page_330" title="330"></span>phant

or, the more common, simpler variety:

Code:

ele<a id="page_330"></a>phant

The find string I'm currently using to find these artificial breaks is:

Code:

\w<[^/].+?></.+?>\w

\w matches any word character (equivalent to [a-zA-Z0-9_])
< matches the character < literally (index 60 decimal, 3C hex, 74 octal; case sensitive)
[^/] matches a single character not present in the list:
    / matches the character / literally (index 47 decimal, 2F hex, 57 octal; case sensitive)
. matches any character (except for line terminators)
+? matches the previous token between one and unlimited times, as few times as possible, expanding as needed (lazy)
></ matches the characters ></ literally (case sensitive)
. matches any character (except for line terminators)
+? matches the previous token between one and unlimited times, as few times as possible, expanding as needed (lazy)
> matches the character > literally (index 62 decimal, 3E hex, 76 octal; case sensitive)
\w matches any word character (equivalent to [a-zA-Z0-9_])
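
For anyone who wants to try the find string outside the editor, here is a minimal sketch using Python's standard re module. That is an assumption on my part: Calibre's editor runs its own regex engine, so edge cases may behave differently. The helper name find_splits is made up for illustration.

```python
import re

# The find string from the post, unchanged.
FIND = re.compile(r'\w<[^/].+?></.+?>\w')

def find_splits(text):
    """Return every artificial break the pattern finds in text."""
    # The pattern has no capture groups, so findall returns whole matches.
    return FIND.findall(text)

print(find_splits('ele<a id="page_330"></a>phant'))
# ['e<a id="page_330"></a>p']
print(find_splits('an <i>intact</i> word'))
# []
```

The second call comes back empty because no word character sits directly against an opening tag there, so ordinary inline markup is left alone.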

That seems to be working. But is there some way to manage an automatic replacement? I'd assume that instead of looking for a word character snubbed up to the start of a tag (i.e., "<") and another stuck to the ending tag (i.e., </...>), I'd look for a "word" touching those tags. But my regex isn't good enough. Any suggestions?

SUMMARIZING EDIT: From the material below, I've come up with a way of moving those tags from the middle of the word to the end. I've created two saved searches (one for non-self-terminating tags and one for self-terminating ones). I run both since, for some reason, some of my books use both methods:

Non-Self-Terminating Tags:

Code:

FIND: (\b\w+?)(<\w.+?></\w*?>)(\w+?\b)
REPLACE: \1\3\2

Self-Terminating Tags:

Code:

FIND: (\b\w+?)(<[^/]+?/>)(\w+?\b)
REPLACE: \1\3\2
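
As a rough check of what the two saved searches do, here is a sketch that applies both in sequence with Python's standard re module. Using re rather than Calibre's own engine is an assumption, and the names PAIRED, SELF_CLOSED and rejoin_split_words are hypothetical, chosen only for this example.

```python
import re

# The two FIND strings from the saved searches, unchanged.
PAIRED = re.compile(r'(\b\w+?)(<\w.+?></\w*?>)(\w+?\b)')
SELF_CLOSED = re.compile(r'(\b\w+?)(<[^/]+?/>)(\w+?\b)')

def rejoin_split_words(text):
    """Move a tag from the middle of a word to the end of that word."""
    # \1 = first half of the word, \2 = the tag(s), \3 = second half.
    text = PAIRED.sub(r'\1\3\2', text)
    return SELF_CLOSED.sub(r'\1\3\2', text)

print(rejoin_split_words('ele<a id="page_330"></a>phant'))
# elephant<a id="page_330"></a>
print(rejoin_split_words('ele<a id="page_330"/>phant'))
# elephant<a id="page_330"/>
```

One thing to watch: [^/] also matches ">", so the self-terminating search can in principle stretch across more than one tag. Stepping through the matches before using Replace All seems prudent.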

Thread: Words Split w/ "id=" Stuff
Last edited by enuddleyarbl; 02-16-2023 at 08:54 PM. Reason: Summarizing results