#1 |
Fanatic
Posts: 583
Karma: 710178
Join Date: Sep 2013
Device: Kobo Forma
|
Words Split w/ "id=" Stuff
I was getting tired of the Calibre Editor's spellchecker being stymied by artificially split words (IOW, the publisher added various tags for things like "id=" right in the middle of them). For instance:
Code:
ele<span epub:type="pagebreak" id="page_330" title="330"></span>phant
Code:
ele<a id="page_330"></a>phant
Code:
\w<[^/].+?></.+?>\w
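For anyone who wants to try that last search outside the editor, here is a minimal sketch in Python. It assumes Python's `re` module treats this pattern the same way the Calibre editor's regex mode does, and the function name `find_split_words` is made up for illustration.

```python
import re

# Pattern quoted in the post: a word character, an opening tag, the closing
# tag that immediately follows it, then another word character.
SPLIT_WORD = re.compile(r'\w<[^/].+?></.+?>\w')

def find_split_words(html: str) -> list[str]:
    """Return every substring where an empty tag pair sits inside a word."""
    return [m.group(0) for m in SPLIT_WORD.finditer(html)]

sample = 'An ele<a id="page_330"></a>phant sat down.'
hits = find_split_words(sample)
```

Each hit only includes the single letter on either side of the tag pair, which is enough to locate the spot but not enough to rebuild the word.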
SUMMARIZING EDIT: From the material below, I've come up with a way of moving those tags from the middle of the word to the end. I've created two saved searches (one for non-self-terminating tags and one for self-terminating ones). I run both since, for some reason, some of my books use both methods:
Non-Self-Terminating Tags:
Code:
FIND: (\b\w+?)(<[^/]+?></.?>)(\w+?\b)
REPLACE: \1\3\2
Self-Terminating Tags:
Code:
FIND: (\b\w+?)(<[^/]+?/>)(\w+?\b)
REPLACE: \1\3\2
Last edited by enuddleyarbl; 01-26-2023 at 02:04 PM. Reason: Summarizing results |
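The two saved searches can be sketched in Python like this. It's a rough approximation that assumes `re.sub` behaves like the Calibre editor's regex mode for these patterns. The names `rejoin` and `NON_SELF_CLOSING_ANY` are hypothetical, and the `[^>]*` variant is not from the thread. Note that the closing-tag part `</.?>`, as it appears here, only matches tag names of at most one character (such as `</a>`), so the `<span ...></span>` example goes through unchanged unless the variant is used.

```python
import re

# Saved searches as they appear in the post.
NON_SELF_CLOSING = r'(\b\w+?)(<[^/]+?></.?>)(\w+?\b)'
SELF_CLOSING = r'(\b\w+?)(<[^/]+?/>)(\w+?\b)'

# Hypothetical variant (not from the thread): accept any closing-tag name.
NON_SELF_CLOSING_ANY = r'(\b\w+?)(<[^/]+?></[^>]*>)(\w+?\b)'

def rejoin(html: str, patterns=(NON_SELF_CLOSING, SELF_CLOSING)) -> str:
    """Run each search in turn, moving the tag to the end of the word."""
    for pattern in patterns:
        html = re.sub(pattern, r'\1\3\2', html)
    return html

anchor = 'ele<a id="page_330"></a>phant'
span = 'ele<span epub:type="pagebreak" id="page_330" title="330"></span>phant'

fixed_anchor = rejoin(anchor)  # tag moves to the end of the word
untouched_span = rejoin(span)  # "</.?>" cannot match "</span>"
fixed_span = rejoin(span, (NON_SELF_CLOSING_ANY, SELF_CLOSING))
```

Swapping the replacement to `\2\1\3` would put the tag in front of the rejoined word instead of after it.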
#2 |
Resident Curmudgeon
Posts: 69,152
Karma: 114842697
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
I use Diap's Editing Toolbag to remove page numbers inside the HTML. It's very easy to use. It's an editor plugin for Calibre.
|
#3 | |
Fanatic
Posts: 583
Karma: 710178
Join Date: Sep 2013
Device: Kobo Forma
|
Quote:
|
#4 |
Fanatic
Posts: 583
Karma: 710178
Join Date: Sep 2013
Device: Kobo Forma
|
Nope. Moving those tags from the middle of the word is harder than REmoving them. To move them, it looks like I'm back to needing to select the whole front and rear word fragments outside the tags. And, I haven't been able to do that yet.
EDIT: Let me stick some trials in here until I figure something out. First, the OR ("|") is giving me issues with the replacement strings. So, I'm just going to work with the non-self-terminated tags. Second, it looks like I can grab some form of the front/rear word fragments with
Code:
\b
Code:
SEARCH: (\b\w+?)(<[^/]+?></.?>)(\w+?\b)
REPLACE: \1\3\2
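A likely reason the OR causes trouble with the replacement strings (an assumption on my part, the post doesn't spell it out) is group numbering: when two three-group searches are joined with "|", the second alternative fills groups 4 to 6 and leaves 1 to 3 empty, so `\1\3\2` has nothing to work with. The sketch below shows that in Python's `re`, plus one possible workaround that keeps a single set of three groups by putting the alternation inside the tag group. `COMBINED`, `SINGLE`, and `rejoin` are made-up names, and this is not the route the thread ends up taking.

```python
import re

# Hypothetical combined search: the two alternatives joined with "|".
COMBINED = (r'(\b\w+?)(<[^/]+?></.?>)(\w+?\b)'
            r'|(\b\w+?)(<[^/]+?/>)(\w+?\b)')

# A self-terminated tag is matched by the second alternative only.
m = re.search(COMBINED, 'ele<a id="page_330"/>phant')
groups = m.groups()  # groups 1-3 are None, groups 4-6 hold the pieces

# Possible workaround (an assumption): one set of three groups, with the
# alternation placed inside the tag group.
SINGLE = r'(\b\w+?)(<[^/]+?></.?>|<[^/]+?/>)(\w+?\b)'

def rejoin(html: str) -> str:
    """Move either kind of tag to the end of the word it interrupts."""
    return re.sub(SINGLE, r'\1\3\2', html)
```

Running two separate searches avoids the numbering question entirely, at the cost of a second pass.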
Last edited by enuddleyarbl; 01-25-2023 at 06:04 PM. |
#5 | |
A Hairy Wizard
Posts: 2,679
Karma: 16000001
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 11/X/6/iPad 1,2 & Air/Surface Pro/Kindle PW
|
Quote:
Code:
SEARCH: (\w)(<[^/].+?></.+?>)(\w)|(\w)(<[^/].+?/>)(\w)
REPLACE: \2\1\3 -or- \1\3\2
Last edited by Turtle91; 01-25-2023 at 01:42 PM. |
|
#6 | |
Fanatic
Posts: 583
Karma: 710178
Join Date: Sep 2013
Device: Kobo Forma
|
Quote:
Code:
SEARCH: (\b\w+?)(<[^/]+?></.?>)(\w+?\b)
REPLACE: \1\3\2
EDIT: To the best of my knowledge, the above search should set the first replacement group as starting from the nearest word boundary and running to the starting "<" of the interrupting tags. The second group should be everything from there that's in a <blah></somethingelse> pair. The third group should start from there and run to the next word boundary. The replacement of \1\3\2 sticks the first and last bits of the word together and then appends the tag set afterward.
EDIT 2: I had a spurious plus ("+") in the search string for the closing tag. That made it look for at least one character after the "/" and, if it didn't find one inside the tag, it happily continued looking until it either found one somewhere else or ran out of paragraph. I think I've fixed it (again). Sorry.
Last edited by enuddleyarbl; 01-25-2023 at 05:58 PM. |
|
#7 |
Wizard
Posts: 4,543
Karma: 79828108
Join Date: Apr 2011
Device: pb360
|
This is just an idea, I have no idea whether it is actually easier to implement.
Have you tried moving the initial word fragment to after the tag? |
#8 |
Fanatic
Posts: 583
Karma: 710178
Join Date: Sep 2013
Device: Kobo Forma
|
I'm going to go the easy route and not bother putting an OR inside the search. I'll just have two different searches for this. The first will be what I did, above, for non-self-terminated tags. This is the search string for the self-terminated tags:
Code:
(\b\w+?)(<[^/]+?/>)(\w+?\b)
Code:
\1\3\2
Code:
\2\1\3 |
#9 | ||
Wizard
Posts: 2,196
Karma: 11695105
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
If a page break lands in the middle of a word, they should've been shoving the page numbers AFTER. See Daisy.org: "Page Navigation":
Quote:
My Solution
I'd tackle it using:
Find #1: (<span epub:type="pagebreak" [^>]+></span>)([\w”\?!\.]+)
Find #2: (<a id="page_\d+"></a>)([\w”\?!\.]+)
Replace: \2\1
This would convert your examples into:
Code:
elephant<span epub:type="pagebreak" id="page_330" title="330"></span>
elephant<a id="page_330"></a>
What's the Regex Doing?
Well, the 1st half is saying:
(Similar with the <a> page number version.) What's the 2nd half doing?
The Replace is saying:
Last edited by Tex2002ans; 01-25-2023 at 07:05 PM. |
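To check the two Finds and the Replace from this post outside the editor, here is a small Python sketch. It assumes `re.sub` is a fair stand-in for the editor's regex mode, and the function name `move_marker_after_word` is made up for illustration.

```python
import re

# Find #1 and Find #2 as quoted in the post.
FIND_SPAN = r'(<span epub:type="pagebreak" [^>]+></span>)([\w”\?!\.]+)'
FIND_ANCHOR = r'(<a id="page_\d+"></a>)([\w”\?!\.]+)'

def move_marker_after_word(html: str) -> str:
    """Apply Replace \\2\\1: swap the page marker with the text run after it."""
    for pattern in (FIND_SPAN, FIND_ANCHOR):
        html = re.sub(pattern, r'\2\1', html)
    return html

span = 'ele<span epub:type="pagebreak" id="page_330" title="330"></span>phant'
anchor = 'ele<a id="page_330"></a>phant. Next'

fixed_span = move_marker_after_word(span)
fixed_anchor = move_marker_after_word(anchor)
```

Because the second group also accepts closing quotes and sentence punctuation, the marker lands after a trailing "." rather than between the word and its period.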
||
#10 |
Fanatic
Posts: 583
Karma: 710178
Join Date: Sep 2013
Device: Kobo Forma
|
I'm assuming that the publisher just ran some automated "stick a page id in here somewhere" program that looked at a print book and, at the start of each dead-tree page, stuck a tag into the ebook. Probably 1) no human ever saw it, 2) when they made the ebook there might not have been any standards, and 3) no one ever looks back at the horrible stuff they did in the dark ages to make it better.
Of course, on the glass-half-full side of things, if they finagled those page id locations to be in the next space, then when someone referred to a bit of text by page number, an ebook user might not be able to find it. Although, occasionally being half a word off shouldn't be too onerous.
EDIT: From that "Page Navigation" link you provided, inline page markers are supposed to look something like: "<span role="doc-pagebreak" id="pg24" aria-label="24"/>". Yet, I don't think I've ever seen anything like it. Ninety-nine percent of the time, it'll be the old <a id="page_330"></a> method, which that document specifically says bad things about. Occasionally, I'll see something like what's done in the current book I'm editing ("<span epub:type="pagebreak" id="page_330" title="330"></span>"), which seems to be making some kind of effort. At some point (probably about where I've finished re-formatting all the books in my library
Last edited by enuddleyarbl; 01-25-2023 at 07:46 PM. |
#11 | |||
Wizard
Posts: 2,196
Karma: 11695105
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Code:
There was an ele-
-------PAGE 2-------
phant in the zoo.
Code:
There was an ele<a page="page_2"></a>phant in the zoo.
Quote:
For all the latest "Real Page Numbers" (RPNs) stuff, see my post here:
where I link to many of the previous topics. You'll also want to type this into your favorite search engine:
Code:
RPNs Tex2002ans site:mobileread.com
page numbers Tex2002ans site:mobileread.com
- - -
For a working sample of EPUB3 page numbers, see Doitu's sample book. And his fantastic Sigil plugin:
- - -
Quote:
The simple <a> was the EPUB2 method. The <span> + epub:type="pagebreak" is the EPUB3 method.
Last edited by Tex2002ans; 01-25-2023 at 08:18 PM. |
|||
#12 | |
Resident Curmudgeon
Posts: 69,152
Karma: 114842697
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Quote:
|
|
#13 |
the rook, bossing Never.
Posts: 6,654
Karma: 54000003
Join Date: Jun 2017
Location: Ireland
Device: Both Kinds: epub based makes and Kindle
|
What he says ^^^^
If I happen to be fixing formatting I also delete all that junk. Pretty quick using global regex or the delete/edit tag tool. |