01-25-2023, 10:06 AM   #1
enuddleyarbl
Guru
Posts: 784
Karma: 1538394
Join Date: Sep 2013
Device: Kobo Sage
Words Split w/ "id=" Stuff

I was getting tired of the Calibre Editor's spellchecker being stymied by artificially split words (IOW, the publisher added various tags for things like "id=" right in the middle of them). For instance:
Code:
ele<span epub:type="pagebreak" id="page_330" title="330"></span>phant
or, the more common, simpler variety:
Code:
ele<a id="page_330"></a>phant
The find string I'm currently using to find these artificial breaks is:
Code:
\w<[^/].+?></.+?>\w
  • \w matches any word character (equivalent to [a-zA-Z0-9_] for ASCII text; Python-based engines such as Calibre's also match Unicode letters and digits)
  • < matches the character < (code point 60 decimal, 3C hex, 74 octal) literally (case sensitive)
  • [^/] matches a single character that is not / (code point 47 decimal, 2F hex, 57 octal), so a closing tag such as </a> cannot start the match
  • . matches any character (except for line terminators)
  • +? matches the previous token between one and unlimited times, as few times as possible, expanding as needed (lazy)
  • ></ matches the characters ></ literally (case sensitive)
  • . matches any character (except for line terminators)
  • +? matches the previous token between one and unlimited times, as few times as possible, expanding as needed (lazy)
  • > matches the character > (code point 62 decimal, 3E hex, 76 octal) literally (case sensitive)
  • \w matches any word character (same as the first \w)
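As a quick check outside the editor, the find string can be tried against the two sample snippets with Python's standard re module. This is only a sketch: it assumes re behaves the same as Calibre's engine for this pattern, and the last two sample strings are made up for illustration (they are not from any real book).

```python
import re

# The find string from the post, unchanged.
FIND = re.compile(r"\w<[^/].+?></.+?>\w")

samples = [
    # The two examples from the post.
    'ele<span epub:type="pagebreak" id="page_330" title="330"></span>phant',
    'ele<a id="page_330"></a>phant',
    # Hypothetical extras: a self-closing tag (contains no "></" sequence)
    # and a tag that sits between words rather than inside one.
    'ele<a id="page_330"/>phant',
    'elephant <a id="page_330"></a> walks',
]

# True where the pattern finds an artificial break somewhere in the string.
matches = [bool(FIND.search(s)) for s in samples]
print(matches)
```

The third sample shows why a separate search is needed for self-terminating tags: this pattern insists on an opening tag immediately followed by a closing tag.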
That seems to be working. But is there some way to manage an automatic replacement? I'd assume that instead of looking for a word character snugged up against the start of a tag (i.e., "<") and another one stuck to the end of the closing tag (i.e., </...>), I'd look for a "word" touching those tags. But my regex isn't good enough. Any suggestions?

SUMMARIZING EDIT: From the material below, I've come up with a way of moving those tags from the middle of the word to the end. I've created two saved searches (one for non-self-terminating tags and one for self-terminating ones). I run both since, for some reason, some of my books use both methods:

Non-Self-Terminating Tags:
Code:
FIND: (\b\w+?)(<\w.+?></\w*?>)(\w+?\b)
REPLACE: \1\3\2
Self-Terminating Tags:
Code:
FIND: (\b\w+?)(<[^/]+?/>)(\w+?\b)
REPLACE: \1\3\2
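
For anyone who wants to try the two saved searches before running them on a book, here is a minimal sketch in plain Python. The patterns and the replacement are copied from above; the function name move_tags_to_word_end and the test strings are my own, and I'm assuming the standard re module treats these patterns the same way Calibre's editor does.

```python
import re

# The two FIND strings and the shared REPLACE string from the saved searches.
NON_SELF_TERMINATING = re.compile(r"(\b\w+?)(<\w.+?></\w*?>)(\w+?\b)")
SELF_TERMINATING = re.compile(r"(\b\w+?)(<[^/]+?/>)(\w+?\b)")
REPLACE = r"\1\3\2"


def move_tags_to_word_end(html):
    """Run both searches in turn, moving a mid-word tag to the end of the word.

    Hypothetical helper name; it simply mirrors running both saved searches.
    """
    html = NON_SELF_TERMINATING.sub(REPLACE, html)
    return SELF_TERMINATING.sub(REPLACE, html)


fixed_pair = move_tags_to_word_end('The ele<a id="page_330"></a>phant walked.')
fixed_self = move_tags_to_word_end('ele<a id="page_330"/>phant')
print(fixed_pair)
print(fixed_self)
```

Group 1 is the front half of the word, group 2 is the tag (or tag pair), and group 3 is the back half, so \1\3\2 rejoins the word and leaves the tag right after it. The id is kept, which matters if anything in the book links to it.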

Last edited by enuddleyarbl; 02-16-2023 at 08:54 PM. Reason: Summarizing results