MobileRead Forums - View Single Post

Tex2002ans · 01-25-2023, 05:53 PM

Quote:

Originally Posted by enuddleyarbl

OP: I was getting tired of the Calibre Editor's spellchecker being stymied by artificially split words (IOW, the publisher added various tags for things like "id=" right in the middle of them). For instance:

Code:

ele<span epub:type="pagebreak" id="page_330" title="330"></span>phant

or, the more common, simpler variety:

Code:

ele<a id="page_330"></a>phant

This code is bad practice anyway.

If a page break lands in the middle of a word, the publisher should've put the page number AFTER the word.

See Daisy.org: "Page Navigation":

Quote:

Where do I put the page break if a word is hyphenated across a page?

Place the page marker after the word. Do not retain the print hyphenation and insert the number in the middle of the word.

Anyway, remember to KISS (Keep It Simple, Stupid)!

My Solution

I'd tackle it using:

Find #1: (<span epub:type="pagebreak" [^>]+></span>)([\w”\?!\.]+)

Find #2: (<a id="page_\d+"></a>)([\w”\?!\.]+)

Replace: \2\1

This would convert your examples into:

Code:

elephant<span epub:type="pagebreak" id="page_330" title="330"></span>

elephant<a id="page_330"></a>

You can also tweak that regex + list of punctuation as needed.
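If you'd rather test the idea outside the Editor first, here is a minimal Python sketch of the same two searches. The function name move_page_markers and the two constant names are just made up for illustration; the patterns and the \2\1 replacement are the ones above.

```python
import re

# Find #1 and Find #2 from above, compiled as Python regexes.
SPAN_BREAK = re.compile(r'(<span epub:type="pagebreak" [^>]+></span>)([\w”\?!\.]+)')
ANCHOR_BREAK = re.compile(r'(<a id="page_\d+"></a>)([\w”\?!\.]+)')


def move_page_markers(html: str) -> str:
    """Move each page marker to just after the word fragment that follows it."""
    # Replace: \2\1 (word fragment first, page marker second).
    html = SPAN_BREAK.sub(r'\2\1', html)
    html = ANCHOR_BREAK.sub(r'\2\1', html)
    return html


span_example = 'ele<span epub:type="pagebreak" id="page_330" title="330"></span>phant'
anchor_example = 'ele<a id="page_330"></a>phant'

print(move_page_markers(span_example))
print(move_page_markers(anchor_example))
```

Run it on a copy of the book first, since a marker that sits at the very start of a word gets moved to the end of that word too.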

What's the Regex Doing?

Well, the 1st half is saying:

<span epub:type="pagebreak" = "Hey! Look for any spans with the pagebreak!"
[^>]+> = "then keep on grabbing everything in the span until you reach the closing bracket."
</span> = "and make sure the span closes right away, with nothing inside it."

(Similar with the <a> page number version.)

What's the 2nd half doing?

\w = "Look for ANY LETTER (digits and underscores count too)."
” = "Look for any RIGHT QUOTATION MARK"
\? = "Look for any QUESTION MARK"
! = "Look for any EXCLAMATION POINT"
\. = "Look for any PERIOD"
+ = "Keep grabbing as many of these letters/punctuation as you can."
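To see where that 2nd half stops grabbing, here is a small Python sketch that runs only the character class against a few tails. The helper name grab_tail and the sample strings are made up for illustration.

```python
import re

# Only the 2nd half of the Find: letters plus the listed punctuation.
TAIL = re.compile(r'[\w”\?!\.]+')


def grab_tail(text: str) -> str:
    """Return the run of word characters/punctuation at the start of text."""
    m = TAIL.match(text)
    return m.group(0) if m else ''


print(grab_tail('phant. The next'))   # the space ends the run
print(grab_tail('phant?” she said'))  # question mark and right quote are both in the list
print(grab_tail('phant, and'))        # the comma is NOT in the list, so the run ends before it
print(grab_tail('phant_2 more'))      # \w accepts digits and underscores as well
```

If your book has split words followed by commas, semicolons, or apostrophes, those are the characters to add to the list.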

The Replace is saying:

\2 = "You know all those letters/punctuation we captured? Yep. Put it first."
\1 = "You know all those page <span>s or <a>s we captured? Yep. Put it after."
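To make the two groups concrete, here is a rough Python sketch that pulls them out of the simpler <a> example and then swaps them. The variable names are made up for illustration; Match.expand is standard Python and fills in \1 and \2 the same way a Replace field does.

```python
import re

pattern = re.compile(r'(<a id="page_\d+"></a>)([\w”\?!\.]+)')
m = pattern.search('ele<a id="page_330"></a>phant')

marker = m.group(1)          # group 1: the page anchor
tail = m.group(2)            # group 2: the rest of the split word
swapped = m.expand(r'\2\1')  # same template as the Replace field

print(marker)
print(tail)
print(swapped)
```

Note that the "ele" in front is never part of the match. It stays where it is, so the swapped text lands right behind it and the word comes out whole.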