I was getting tired of the Calibre Editor's spellchecker being stymied by artificially split words (IOW, the publisher added various tags for things like "id=" right in the middle of them). For instance:
Code:
ele<span epub:type="pagebreak" id="page_330" title="330"></span>phant
or the more common, simpler variety:
Code:
ele<a id="page_330"></a>phant
The find string I'm currently using to find these artificial breaks is:
Code:
\w<[^/].+?></.+?>\w
- \w matches any word character (letters, digits, and underscore; equivalent to [a-zA-Z0-9_] only in ASCII mode)
- < matches the character < with index 60₁₀ (3C₁₆ or 74₈) literally (case sensitive)
- Match a single character not present in the list below [^/]
- / matches the character / with index 47₁₀ (2F₁₆ or 57₈) literally (case sensitive)
- . matches any character (except for line terminators)
- +? matches the previous token between one and unlimited times, as few times as possible, expanding as needed (lazy)
- ></ matches the characters ></ literally (case sensitive)
- . matches any character (except for line terminators)
- +? matches the previous token between one and unlimited times, as few times as possible, expanding as needed (lazy)
- > matches the character > with index 62₁₀ (3E₁₆ or 76₈) literally (case sensitive)
- \w matches any word character (letters, digits, and underscore; equivalent to [a-zA-Z0-9_] only in ASCII mode)
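For anyone who wants to try the find string outside the editor, here is a minimal Python sketch. It assumes the standard `re` module behaves the same as Calibre's regex engine for this pattern; the helper name `find_split_words` and the sample list are made up for illustration.

```python
import re

# The find string from above, unchanged.
SPLIT_WORD = re.compile(r'\w<[^/].+?></.+?>\w')

# Two split words from the examples above, plus an ordinary inline tag
# that should NOT be reported.
samples = [
    'ele<span epub:type="pagebreak" id="page_330" title="330"></span>phant',
    'ele<a id="page_330"></a>phant',
    'an <i>intact</i> word',
]

def find_split_words(text):
    """Return every substring of text that the find string matches."""
    return [m.group(0) for m in SPLIT_WORD.finditer(text)]

for sample in samples:
    print(find_split_words(sample))
```

The first two samples each give one match, running from the letter before the tag to the letter after it. The third gives none, because the only "<" that touches a word character is followed by "/".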
That seems to be working. But is there some way to manage an automatic replacement? I'd assume that instead of looking for a word character snubbed up to the start of a tag (i.e., "<") and also one stuck to the ending tag (i.e., "</...>"), I'd look for a "word" touching those tags. But my regex isn't good enough. Any suggestions?
SUMMARIZING EDIT: From the material below, I've come up with a way of moving those tags from the middle of the word to the end. I've created two saved searches (one for non-self-terminating tags and one for self-terminating ones). I run both since, for some reason, some of my books use both methods:
Non-Self-Terminating Tags:
Code:
FIND: (\b\w+?)(<\w.+?></\w*?>)(\w+?\b)
REPLACE: \1\3\2
Self-Terminating Tags:
Code:
FIND: (\b\w+?)(<[^/]+?/>)(\w+?\b)
REPLACE: \1\3\2
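The two saved searches can be checked outside the editor with a short Python sketch. It assumes the standard `re` module treats these patterns and the \1\3\2 replacement the same way Calibre does; the function name `move_tags_to_word_end` is made up for illustration.

```python
import re

# The two FIND strings from the saved searches, unchanged.
PAIRED = re.compile(r'(\b\w+?)(<\w.+?></\w*?>)(\w+?\b)')
SELF_CLOSING = re.compile(r'(\b\w+?)(<[^/]+?/>)(\w+?\b)')

# Same REPLACE string for both: first half, second half, then the tag.
REPLACEMENT = r'\1\3\2'

def move_tags_to_word_end(text):
    """Run both searches in turn, moving a mid-word tag to the end of the word."""
    text = PAIRED.sub(REPLACEMENT, text)
    return SELF_CLOSING.sub(REPLACEMENT, text)

print(move_tags_to_word_end('The ele<a id="page_330"></a>phant sat.'))
print(move_tags_to_word_end('ele<a id="page_330"/>phant'))
```

The first call prints `The elephant<a id="page_330"></a> sat.` and the second prints `elephant<a id="page_330"/>`. One caution: `[^/]+?` will happily run across more than one tag (for example, `word<b>bold<br/>next` matches as a whole), so it seems safer to step through the matches than to hit Replace All blindly.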