Old 01-25-2023, 05:53 PM   #9
Tex2002ans
Quote:
Originally Posted by enuddleyarbl
OP: I was getting tired of the Calibre Editor's spellchecker being stymied by artificially split words (IOW, the publisher added various tags for things like "id=" right in the middle of them). For instance:

Code:
ele<span epub:type="pagebreak" id="page_330" title="330"></span>phant
or, the more common, simpler variety:

Code:
ele<a id="page_330"></a>phant
This code is bad practice anyway.

If a page break lands in the middle of a word, the publisher should have put the page marker AFTER the word.

See Daisy.org: "Page Navigation":

Quote:
Where do I put the page break if a word is hyphenated across a page?

Place the page marker after the word. Do not retain the print hyphenation and insert the number in the middle of the word.
Anyway, remember to KISS (Keep It Simple, Stupid)!

My Solution

I'd tackle it using:

Find #1: (<span epub:type="pagebreak" [^>]+></span>)([\w”\?!\.]+)

Find #2: (<a id="page_\d+"></a>)([\w”\?!\.]+)

Replace: \2\1

This would convert your examples into:

Code:
elephant<span epub:type="pagebreak" id="page_330" title="330"></span>

elephant<a id="page_330"></a>
You can also tweak that regex + list of punctuation as needed.
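
If you'd rather test this outside the editor first, here is a minimal Python sketch of the same two Find/Replace passes. It assumes Python's standard re module behaves like the editor's regex engine for these patterns; the names SPAN_BREAK, ANCHOR_BREAK and move_page_markers are made up for illustration.

```python
import re

# The two Find patterns from the post, unchanged.
# SPAN_BREAK / ANCHOR_BREAK are illustrative names, not anything from calibre.
SPAN_BREAK = re.compile(r'(<span epub:type="pagebreak" [^>]+></span>)([\w”\?!\.]+)')
ANCHOR_BREAK = re.compile(r'(<a id="page_\d+"></a>)([\w”\?!\.]+)')


def move_page_markers(html: str) -> str:
    """Move each page marker to just after the word fragment that follows it."""
    html = SPAN_BREAK.sub(r'\2\1', html)    # Find #1 + Replace
    html = ANCHOR_BREAK.sub(r'\2\1', html)  # Find #2 + Replace
    return html


sample_span = 'ele<span epub:type="pagebreak" id="page_330" title="330"></span>phant'
sample_anchor = 'ele<a id="page_330"></a>phant'

print(move_page_markers(sample_span))
print(move_page_markers(sample_anchor))
```

Running it on the two examples should print the same two lines shown in the Code block above.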

What's the Regex Doing?

Well, the 1st half is saying:
  • <span epub:type="pagebreak" = "Hey! Look for any spans with the pagebreak!"
  • [^>]+> = "then keep on grabbing everything in the span until you reach the closing bracket."

(Similar with the <a> page number version.)

What's the 2nd half doing?
  • \w = "Look for ANY LETTER (digits and underscores count too)."
  • ” = "Look for any RIGHT QUOTATION MARK"
  • \? = "Look for any QUESTION MARK"
  • ! = "Look for any EXCLAMATION POINT"
  • \. = "Look for any PERIOD"
  • + = "Keep grabbing as many of these letters/punctuation as you can."
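
To see where that 2nd half stops grabbing, here is a small Python sketch. It is only an illustration under the assumption that Python's re module treats this character class the same way the editor does; grab_tail is a hypothetical helper name.

```python
import re

# The character class from the 2nd capture group, on its own.
TAIL = re.compile(r'[\w”\?!\.]+')


def grab_tail(text: str) -> str:
    """Return the run of letters/punctuation at the very start of text."""
    match = TAIL.match(text)
    return match.group(0) if match else ''


print(grab_tail('phant in the room'))  # stops at the space
print(grab_tail('phant!” she said'))   # keeps the ! and the right quote
print(grab_tail('phant, maybe'))       # stops at the comma (not in the list)
print(grab_tail('1st_try'))            # \w also covers digits and underscores
```

If you need the comma (or anything else) to travel with the word, that is the list of punctuation to tweak.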

The Replace is saying:
  • \2 = "You know all those letters/punctuation we captured? Yep. Put it first."
  • \1 = "You know all those page <span>s or <a>s we captured? Yep. Put it after."
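
The swap can also be spelled out by hand, which makes it easier to see what each group holds. A rough Python sketch, assuming the standard re module; swap is a hypothetical helper, and subn is used only to report how many markers were moved.

```python
import re

# Find #2 from the post: group 1 = page marker, group 2 = rest of the word.
PATTERN = re.compile(r'(<a id="page_\d+"></a>)([\w”\?!\.]+)')


def swap(match: re.Match) -> str:
    """Same effect as the \\2\\1 replacement: word fragment first, marker after."""
    marker, tail = match.group(1), match.group(2)
    return tail + marker


text = 'ele<a id="page_330"></a>phant and gi<a id="page_331"></a>raffe.'
fixed, count = PATTERN.subn(swap, text)

print(fixed)
print(count)
```

The count is a quick sanity check before doing a Replace All across a whole book.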

Last edited by Tex2002ans; 01-25-2023 at 06:05 PM.