Quote:
Originally Posted by enuddleyarbl
OP: I was getting tired of the Calibre Editor's spellchecker being stymied by artificially split words (IOW, the publisher added various tags for things like "id=" right in the middle of them). For instance:
Code:
ele<span epub:type="pagebreak" id="page_330" title="330"></span>phant
or, the more common, simpler variety:
Code:
ele<a id="page_330"></a>phant
This code is bad practice anyway.
If a page break lands in the middle of a word, the publisher should've put the page marker AFTER the word.
See
Daisy.org: "Page Navigation":
Quote:
Where do I put the page break if a word is hyphenated across a page?
Place the page marker after the word. Do not retain the print hyphenation and insert the number in the middle of the word.
Anyway, remember to KISS (Keep It Simple, Stupid)!
My Solution
I'd tackle it using:
Find #1: (<span epub:type="pagebreak" [^>]+></span>)([\w”\?!\.]+)
Find #2: (<a id="page_\d+"></a>)([\w”\?!\.]+)
Replace: \2\1
This would convert your examples into:
Code:
elephant<span epub:type="pagebreak" id="page_330" title="330"></span>
elephant<a id="page_330"></a>
You can also tweak that regex + list of punctuation as needed.
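If you'd rather test the two searches outside the Editor first, here's a minimal Python sketch. It assumes Python's built-in re module behaves the same as Calibre's regex engine for these simple patterns, and the function name move_page_markers is just something I made up for illustration:

```python
import re

# The two Find patterns from above, written as raw strings.
SPAN_BREAK = r'(<span epub:type="pagebreak" [^>]+></span>)([\w”\?!\.]+)'
ANCHOR_BREAK = r'(<a id="page_\d+"></a>)([\w”\?!\.]+)'


def move_page_markers(html: str) -> str:
    """Move each page marker to just after the word fragment that follows it.

    Hypothetical helper: it simply runs Find #1 and Find #2 with the
    same Replace (\\2\\1) one after the other.
    """
    for pattern in (SPAN_BREAK, ANCHOR_BREAK):
        html = re.sub(pattern, r'\2\1', html)
    return html


# The two examples from the original post.
samples = [
    'ele<span epub:type="pagebreak" id="page_330" title="330"></span>phant',
    'ele<a id="page_330"></a>phant',
]

for sample in samples:
    print(move_page_markers(sample))
```

Both samples should come out with the tag sitting after "elephant". A marker that's followed by a space (so it's already between words) doesn't match and gets left alone.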
What's the Regex Doing?
Well, the 1st half is saying:
- <span epub:type="pagebreak" = "Hey! Look for any spans with the pagebreak!"
- [^>]+> = "then keep on grabbing everything in the span until you reach the closing bracket."
(Similar with the <a> page number version.)
What's the 2nd half doing?
- \w = "Look for ANY LETTER." (Technically \w also matches digits and the underscore.)
- ” = "Look for any RIGHT QUOTATION MARK"
- \? = "Look for any QUESTION MARK"
- ! = "Look for any EXCLAMATION POINT"
- \. = "Look for any PERIOD"
- + = "Keep grabbing as many of these letters/punctuation as you can."
The Replace is saying:
- \2 = "You know all those letters/punctuation we captured? Yep. Put it first."
- \1 = "You know all those page <span>s or <a> we captured? Yep. Put it after."
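To see the two capture groups for yourself, here's a small Python sketch of Find #2. The sample text (page_12, the trailing "Next") is made up for illustration, and I'm again assuming Python's re module matches what Calibre does with this pattern:

```python
import re

# Find #2 from above: group 1 is the page marker, group 2 is the
# run of letters/punctuation that immediately follows it.
pattern = re.compile(r'(<a id="page_\d+"></a>)([\w”\?!\.]+)')

# Hypothetical sample: a split word followed by a period and a right quote.
text = 'ele<a id="page_12"></a>phant.” Next'

match = pattern.search(text)
print(match.group(1))  # what \1 holds: the page marker
print(match.group(2))  # what \2 holds: the rest of the word plus punctuation

# Replace \2\1 swaps the order: word fragment first, marker second.
print(pattern.sub(r'\2\1', text))
```

Group 2 stops at the space, so the marker ends up after the period and closing quote but before the next word.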