MobileRead Forums > E-Book Software > Sigil
Old 07-27-2014, 02:55 AM   #1
AlanHK
s&r for paired tags

I've got a book file full of code like:

Code:
<p class="calibre2"><span class="none2">blah blah blah</span></p>

Is there a way I can remove these spans (the "none2" ones) to get

Code:
<p class="calibre2">blah blah blah</p>
without messing up any other spans?

I can remove the opening tag with a simple s&r, but that leaves an orphaned </span>, and I can't just delete every </span> without screwing up other spans.


I could change "<span class="none2">" to "<span>" and neuter them, but I really hate to leave junk code in the file.


-- PS, I know what regexes are, and have written some simple ones, but parsing HTML is a bit hairy.

Last edited by AlanHK; 07-27-2014 at 03:23 AM.
Old 07-27-2014, 04:18 AM   #2
Doitsu
If the spans are not nested the following simple regex should do the trick:

Find:<span class="none2">(.*?)</span>
Replace:\1
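A minimal Python sketch of this find/replace, assuming Python's re module treats the pattern the same way Sigil's regex engine does. The sample strings are made up for illustration. It shows the flat case working and the nested case going wrong, because the lazy (.*?) stops at the first </span> it meets:

```python
import re

# Find pattern from the post above; \1 is the replacement.
PATTERN = re.compile(r'<span class="none2">(.*?)</span>')

# Hypothetical sample lines: one flat span, one span with another span inside.
flat = '<p class="calibre2"><span class="none2">blah blah blah</span></p>'
nested = '<p><span class="none2">a <span class="sc">b</span> c</span></p>'

flat_result = PATTERN.sub(r'\1', flat)
nested_result = PATTERN.sub(r'\1', nested)

print(flat_result)    # <p class="calibre2">blah blah blah</p>
print(nested_result)  # <p>a <span class="sc">b c</span></p>
```

In the nested case the inner span ends up wrapping " c" as well, which is the kind of silent damage the "not nested" condition is warning about.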
Old 07-27-2014, 06:01 AM   #3
Tex2002ans
Quote:
Originally Posted by Doitsu View Post
If the spans are not nested the following simple regex should do the trick:
This needs to be stressed. Quite often span tags ARE nested, and you might accidentally cause a lot of damage if you just do a large "Replace All".

(I have done it many times, and didn't notice until later, when I was doing a few cleaning passes and wondering "why the heck is this entire paragraph in smallcaps?")

Always save versions of your EPUBs when doing larger edits like this.

For nested tags, you really just need something that can actually PARSE HTML, and not just Regex.
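As a rough sketch of what "something that can PARSE HTML" might look like, here is a small Python script built on the standard library's html.parser. The class name SpanUnwrapper and the function unwrap_spans are made up for this example, and it assumes the class attribute is exactly "none2" (no multi-class values). It keeps a stack of open spans so each </span> is paired with the right opening tag:

```python
from html.parser import HTMLParser


class SpanUnwrapper(HTMLParser):
    """Re-emit markup as-is, minus <span class="..."> wrappers of one class."""

    def __init__(self, target_class):
        super().__init__(convert_charrefs=False)
        self.target_class = target_class
        self.out = []
        self.span_stack = []  # one bool per open span: True means "dropped"

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            drop = dict(attrs).get("class") == self.target_class
            self.span_stack.append(drop)
            if drop:
                return  # swallow the opening tag
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Pop the matching span; swallow its close tag if it was dropped.
        if tag == "span" and self.span_stack and self.span_stack.pop():
            return
        self.out.append(f"</{tag}>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def unwrap_spans(markup, target_class="none2"):
    parser = SpanUnwrapper(target_class)
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


sample = '<p class="calibre2"><span class="none2">a <span class="sc">b</span> c</span></p>'
cleaned = unwrap_spans(sample)
print(cleaned)  # <p class="calibre2">a <span class="sc">b</span> c</p>
```

This is only a sketch: it does not handle CDATA sections, and you would still want to run it on a copy of the EPUB's files rather than the originals.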
Old 07-27-2014, 06:33 AM   #4
cybmole
you could probably regex out only the ones that are adjacent to the P tags
find
<p class="calibre2"><span class="none2">(.*?)</span></p>
replace
<p class="calibre2">\1</p>

but take a backup first: this code will go wrong if you have nested spans!
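A quick Python check of the paragraph-anchored pattern, assuming Python's re behaves like Sigil's engine here. The three sample lines are invented. In this sketch the anchored version happens to cope with a span nested inside the wrapper, but it mispairs tags when the paragraph merely starts with a none2 span and ends with a different one:

```python
import re

# Find/replace from the post above, anchored on the surrounding <p> tags.
WRAPPED = re.compile(r'<p class="calibre2"><span class="none2">(.*?)</span></p>')
REPLACEMENT = r'<p class="calibre2">\1</p>'

# Hypothetical test lines.
lines = [
    '<p class="calibre2"><span class="none2">plain text</span></p>',
    '<p class="calibre2"><span class="none2">a <span class="sc">b</span> c</span></p>',
    '<p class="calibre2"><span class="none2">a</span> b <span class="sc">c</span></p>',
]

results = [WRAPPED.sub(REPLACEMENT, line) for line in lines]
for line in results:
    print(line)
# <p class="calibre2">plain text</p>
# <p class="calibre2">a <span class="sc">b</span> c</p>
# <p class="calibre2">a</span> b <span class="sc">c</p>   (broken markup)
```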
Old 07-27-2014, 07:50 AM   #5
AlanHK
Quote:
Originally Posted by Doitsu View Post
If the spans are not nested the following simple regex should do the trick:

Find:<span class="none2">(.*?)</span>
Replace:\1

That should work, thanks.


Quote:
Originally Posted by Tex2002ans View Post
For nested tags, you really just need something that can actually PARSE HTML, and not just Regex.
Well, Sigil can parse HTML. It highlights the tag pairs, for instance. I was hoping there were some options hidden away that I could use to do this. Too bad it doesn't give users more HTML-aware s&r than generic regex.

Is there any HTML code editor that does stuff like this?
Old 07-27-2014, 08:30 AM   #6
cybmole
If you want an HTML editor, try the freeware Notepad++, but don't expect it to understand ebook structure.
The calibre editor is your other go-to solution, as it is in ongoing development and you can post enhancement requests.

NB: the spans may look ugly but they are mostly harmless; the book will render OK if you just leave them be!
Old 07-27-2014, 09:41 AM   #7
eschwartz
Find:
Code:
<span class="none2">((?:(?!<span).)*?)</span>
Replace:
Code:
\1
Using a negative lookahead, we check at each position that a nested <span does NOT start there, then match any one character, and repeat.

It still matches when other tags are nested inside, as long as the only span involved is the outer one. But you can be more specific if you want, by changing the lookahead.

http://regular-expressions.info/completelines.html
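The same pattern can be spelled out piece by piece with Python's re.VERBOSE flag. This is only an illustration, assuming Python's re handles the lookahead the way Sigil's engine does; the sample strings are made up. (In verbose mode the literal space in the opening tag has to be escaped.)

```python
import re

TEMPERED = re.compile(r'''
    <span\ class="none2">   # the opening tag to remove
    (                       # group 1: the text to keep
      (?:                   #   non-capturing group, applied per character:
        (?!<span)           #     fail if another <span starts right here
        .                   #     otherwise consume one character
      )*?                   #   repeat, as few times as possible
    )
    </span>                 # the first closing tag after that text
''', re.VERBOSE)

# Hypothetical samples: flat span, span holding <i>, span holding a span.
flat = '<p><span class="none2">blah</span></p>'
italic = '<p><span class="none2">a <i>b</i> c</span></p>'
nested = '<p><span class="none2">a <span class="sc">b</span> c</span></p>'

flat_out = TEMPERED.sub(r'\1', flat)
italic_out = TEMPERED.sub(r'\1', italic)
nested_out = TEMPERED.sub(r'\1', nested)

print(flat_out)    # <p>blah</p>
print(italic_out)  # <p>a <i>b</i> c</p>
print(nested_out)  # unchanged: the inner <span blocks the match
```

The nested paragraph is left alone rather than mangled, so it can be found and fixed separately.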

Last edited by eschwartz; 07-27-2014 at 09:50 AM.
Old 07-27-2014, 11:56 AM   #8
theducks
Nested spans are a pain, and if you START out with the code in post #2 you will have a disaster, because that is only safe with a simple span like the one you show (IMHO an unnecessary one, except that it simplifies the conversion process).
Code:
<p class="calibre2 none2">blah blah blah</p>
should work the same.
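One possible way to do that merge in bulk, sketched in Python. It borrows the negative lookahead from post #7 so that only paragraphs wrapped entirely by a single none2 span are touched; the sample lines are invented, and whether the merged classes really render the same depends on the book's CSS:

```python
import re

# Only match when the none2 span wraps the whole paragraph and holds no other span.
MERGE = re.compile(
    r'<p class="calibre2"><span class="none2">((?:(?!<span).)*?)</span></p>'
)

# Hypothetical samples.
simple = '<p class="calibre2"><span class="none2">blah blah blah</span></p>'
mixed = '<p class="calibre2"><span class="none2">a</span> b <span class="sc">c</span></p>'

merged_simple = MERGE.sub(r'<p class="calibre2 none2">\1</p>', simple)
merged_mixed = MERGE.sub(r'<p class="calibre2 none2">\1</p>', mixed)

print(merged_simple)  # <p class="calibre2 none2">blah blah blah</p>
print(merged_mixed)   # unchanged
```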

I am going to give a try to eschwartz's REGEX
Old 07-27-2014, 12:33 PM   #9
AlanHK
Quote:
Originally Posted by cybmole View Post
If you want an HTML editor, try the freeware Notepad++, but don't expect it to understand ebook structure.
That's what I do want. I'm surprised after all these years there isn't something that does. (HTML, not just ebooks).

I use Ultraedit for my text. But I was hoping for more than a text editor that highlights.


Quote:
Originally Posted by theducks View Post
I am going to give a try to eschwartz's REGEX
Googled it, "No results found for "eschwartz's REGEX" "

??
Old 07-27-2014, 01:46 PM   #10
mrmikel
If you wait long enough and use the calibre editor, he will have a scripting function that should be able to do multiple tests for different cases. The rest of us know from bitter experience with Sigil, since it has no global undo, how hard it is to consider all cases with regex (regular expressions). You can use the simple version proposed earlier, but the only way to do it safely is to use it one find at a time, so you can see when it vacuums up more text than you intended.

Look up 2-3 posts above and the regex that eschwartz proposed is there showing find and replace.
Old 07-27-2014, 02:16 PM   #11
AlanHK
Quote:
Originally Posted by mrmikel View Post
If you wait long enough and use the calibre editor, he will have a scripting function that should be able to do multiple tests for different cases. The rest of us know from bitter experience with Sigil, since it has no global undo, how hard it is to consider all cases with regex (regular expressions).
It was just my impression that Sigil is (was?) more the code-editing tool, while calibre is more GUI.

Quote:
Originally Posted by mrmikel View Post
You can use the simple version proposed earlier, but the only way to do it safely is to use it one find at a time, so you can see when it vacuums up more text than you intended.
Since it's almost every paragraph in a book, that's a few thousand cases. One at a time isn't an option.

Anyway, I worked it out by first finding and fixing the spans I wanted to keep (as it happens, one) and then could delete the rest with a clear conscience.
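For anyone who wants to hunt down the exceptions before a "Replace All", a dry run along these lines might help. This is a hypothetical Python sketch, not what was actually done here: the function name risky_matches and the sample text are made up. It lists every match of the simple pattern from post #2 whose captured text opens another span, i.e. the cases that a blanket replace would mangle:

```python
import re

# Simple pattern from post #2.
SIMPLE = re.compile(r'<span class="none2">(.*?)</span>')


def risky_matches(markup):
    """Return matches of the simple pattern that contain another opening span."""
    return [m.group(0) for m in SIMPLE.finditer(markup) if "<span" in m.group(1)]


# Hypothetical two-paragraph sample.
book = (
    '<p><span class="none2">ok</span></p>\n'
    '<p><span class="none2">a <span class="sc">b</span> c</span></p>'
)

suspects = risky_matches(book)
print(len(suspects))  # 1
print(suspects[0])    # <span class="none2">a <span class="sc">b</span>
```

If the list comes back empty, the simple replace should be safe for that file; otherwise the listed spots can be fixed by hand first.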


Quote:
Originally Posted by mrmikel View Post
Look up 2-3 posts above and the regex that eschwartz proposed is there showing find and replace.
Duh. I thought it was some kind of software. I didn't register the names next to posts.
Old 07-27-2014, 02:45 PM   #12
PeterT
You might also check out the forked version of the ePub Clean plugin for calibre, which has some support for removing SPANs.
Old 07-27-2014, 03:14 PM   #13
signum
If nested spans are a possibility, I like to use a search pattern similar to post #2, except I replace the stuff inside the parentheses with ([^<]*). This says to match any string of characters up to, but not including, a less than sign. If the immediately following characters are not </span>, the entire pattern fails and no replacement is done. Otherwise, the replacement stays the same. In my experience, this leaves only a handful of paragraphs to be dealt with in another way, often by hand.
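A small Python sketch of this stricter variant, assuming Python's re matches it the same way as Sigil. The sample strings are invented. Because [^<]* refuses to cross any tag at all, spans containing even harmless inline tags such as <i> are left for a later pass:

```python
import re

# Post #2's pattern with the capture group swapped for ([^<]*).
NO_TAGS = re.compile(r'<span class="none2">([^<]*)</span>')

# Hypothetical samples: plain text, inline <i>, nested span.
plain = '<p><span class="none2">blah</span></p>'
italic = '<p><span class="none2">a <i>b</i> c</span></p>'
nested = '<p><span class="none2">a <span class="sc">b</span> c</span></p>'

plain_out = NO_TAGS.sub(r'\1', plain)
italic_out = NO_TAGS.sub(r'\1', italic)
nested_out = NO_TAGS.sub(r'\1', nested)

print(plain_out)   # <p>blah</p>
print(italic_out)  # unchanged
print(nested_out)  # unchanged
```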
Old 07-27-2014, 08:05 PM   #14
eschwartz
Quote:
Originally Posted by signum View Post
If nested spans are a possibility, I like to use a search pattern similar to post #2, except I replace the stuff inside the parentheses with ([^<]*). This says to match any string of characters up to, but not including, a less than sign. If the immediately following characters are not </span>, the entire pattern fails and no replacement is done. Otherwise, the replacement stays the same. In my experience, this leaves only a handful of paragraphs to be dealt with in another way, often by hand.
That is why I prefer using a negative lookahead -- it catches that too.
Old 07-28-2014, 09:26 AM   #15
phossler
@eschwartz--

Quote:
<span class="none2">((?:(?!<span).)*?)</span>
1. Can you explain how the negative look ahead works, including breaking down the pieces of the RE?

2. Many times when I'm cleaning an epub, removing unneeded 'class=" ... " ' in <span class="..."> I'll eventually end up with a lot of <span>.....</span> constructs. It appears that your RE is better than the more simplistic RE I was using to just remove them

Thanks
All times are GMT -4.