Old 09-29-2012, 01:16 PM   #1
MizSuz
Connoisseur
Regex Solution to hidden href search?

I'm considering seppuku and there is a little voice in the back of my head whispering "someone on that forum probably has the answer and could offer it in their sleep." I'm hoping that whisper in my head is right.

As I've mentioned in other places I am currently working on a large project importing very old epubs into calibre. As I bring them in I am trying to clean them up as much as my skill set will allow. I find that my skills are expanding with nearly every book! Anyway, many of these books probably started life as scanned images put through PDF OCR, converted a bazillion times using every free conversion software known to man, and have acquired code garbage that is becoming one of my own personal demons.

A lot of these books have gone through Word on their way to epub. In searching the forums I am seeing a lot of options for cleaning up the MsoNormal that I'm looking forward to trying. That's not my issue here.

At some point these books had images with "Top," "Back" and "Next" buttons that were links to previous and next chapters or up to the main TOC. I've seen this in LIT files before. Now, however, there are no buttons, but some of the links, which are invisible in WYSIWYG view, are still active (or are trying to be). They point to non-existent files on the C: drive of someone presumably long dead. Because each one is for a different numbered chapter, image, etc., there is no one universal search string. They each differ, if only by a couple of letters or numbers.

This is an example of what I am faced with at the beginning of every chapter:

Code:
<span class="sgc-5"><a class="calibre8" href="c:/DOCUME~1/VALUED~1/My%20Documents/My%20Library/Mary%20Balogh%20-%20Duty%202%20-%20Lord%20Carew%27s%20Bride.html%23chapter_2%23chapter_2"><span class="calibre12"><img alt="Next" border="0" class="calibre13" src="../Images/image001.gif" /></span>
The classes change frequently, too. The only consistent thing I see in these piles of steaming... is the "c:/DOCUME~1". I'd like to build a search that gets rid of that entire string wherever "c:/DOCUME~1" appears, but I'm not sure how to write one that says: find "c:/DOCUME~1", then delete everything between the enclosing span tags. Any other solution that works would be fine, too.

Did I just make any sense at all or am I shopping for ritual knives tomorrow?


Any suggestions?

Old 09-29-2012, 02:25 PM   #2
WS64
Try to search for
<span[^>]*?><a[^>]*?c:/DOCUME.*?.gif" /></span>
and replace it with nothing.
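
For anyone who wants to check a pattern like this outside Sigil, here is a minimal sketch in Python. It assumes Python's `re` module treats this particular pattern the same way Sigil's regex engine does (the pattern uses no engine-specific syntax). `SNIPPET` is the sample line from post #1, and the variable names are only illustrative.

```python
import re

# Sample line from post #1 (note that it has no closing </a>).
SNIPPET = (
    '<span class="sgc-5"><a class="calibre8" '
    'href="c:/DOCUME~1/VALUED~1/My%20Documents/My%20Library/'
    'Mary%20Balogh%20-%20Duty%202%20-%20Lord%20Carew%27s%20Bride.html'
    '%23chapter_2%23chapter_2"><span class="calibre12">'
    '<img alt="Next" border="0" class="calibre13" '
    'src="../Images/image001.gif" /></span>'
)

# The pattern from this post, copied as posted.
PATTERN = re.compile(r'<span[^>]*?><a[^>]*?c:/DOCUME.*?.gif" /></span>')

match = PATTERN.search(SNIPPET)
cleaned = PATTERN.sub("", SNIPPET)  # "replace it with nothing"

print(match is not None)  # True: the pattern finds the sample line
print(repr(cleaned))      # '': the whole sample line is removed
```

If this prints a match while Sigil reports "No matches found", the pattern itself is probably not the problem, and the search settings are worth a second look.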

Old 09-29-2012, 03:01 PM   #3
MizSuz
Connoisseur
Quote:
Originally Posted by WS64
Try to search for
<span[^>]*?><a[^>]*?c:/DOCUME.*?.gif" /></span>
and replace it with nothing.
No matches found. :/

Thanks for trying.

Old 09-29-2012, 03:04 PM   #4
WS64
I copied your code snippet into Sigil, and it was found by my regex.

Old 09-29-2012, 03:10 PM   #5
theducks
Well trained by Cats
Quote:
Originally Posted by MizSuz
No matches found. :/

Thanks for trying.
Did you use Regex search mode?
I tend to escape special characters on general principle, but in many cases it is not required.
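
A small sketch of what escaping changes, written in Python's `re` as a stand-in for Sigil's engine (an assumption; the test strings are made up for illustration). An unescaped `.` matches any single character, so a pattern usually still finds what you want, but it can also find things you did not intend.

```python
import re

# An unescaped "." matches any single character.
loose = re.compile(r'.gif"')
# An escaped "\." matches only a literal full stop.
strict = re.compile(r'\.gif"')

print(bool(loose.search('image001.gif"')))   # True
print(bool(loose.search('image001Xgif"')))   # True (probably not intended)
print(bool(strict.search('image001Xgif"')))  # False

# re.escape() escapes whatever needs escaping in a literal string,
# so the result matches that string and nothing else.
literal = re.compile(re.escape("c:/DOCUME~1"))
print(bool(literal.search('href="c:/DOCUME~1/x"')))  # True
```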

Old 09-29-2012, 03:16 PM   #6
MizSuz
Connoisseur
Quote:
Originally Posted by theducks
Did you use Regex search mode?
I tend to escape special characters on general principle, but in many cases it is not required.

I used the find and replace that comes up at the bottom of the screen when you press Ctrl+F.

Is there another?

Old 09-29-2012, 03:16 PM   #7
Serpentine
Evangelist
I'd suggest not killing too many tags in a single go. While it can work well, mistakes are often pretty costly. Rather, I'd use a few simpler expressions to weed out the unwanted elements, then finally do a cleaning pass for empty spans (or anything else). For example:

Anchors that seem to refer to a local filesystem (DOS style). Note that your example did not have a closing </a> tag.
Code:
<a\b[^<>]*?[[:alpha:]]:/[^<>]*?>
images which refer to next/prev
Code:
<img[^<>]*?alt="(next|prev)"[^<>]*/>
empty tags (leaves tags containing nbsp, to be safe)
Code:
<(\w+)\b[^<>]*>\s*</\1>
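
The three passes above can be sketched in Python as follows. This is an approximation, not Sigil's behaviour: Python's `re` has no POSIX `[[:alpha:]]` class, so `[A-Za-z]` is swapped in, and `re.IGNORECASE` is added as an assumption because the sample line uses alt="Next" with a capital N. `SAMPLE` is the line from post #1, and `clean` is a hypothetical helper name.

```python
import re

# Sample line from post #1 (the posted fragment has no closing </a>).
SAMPLE = (
    '<span class="sgc-5"><a class="calibre8" '
    'href="c:/DOCUME~1/VALUED~1/My%20Documents/My%20Library/'
    'Mary%20Balogh%20-%20Duty%202%20-%20Lord%20Carew%27s%20Bride.html'
    '%23chapter_2%23chapter_2"><span class="calibre12">'
    '<img alt="Next" border="0" class="calibre13" '
    'src="../Images/image001.gif" /></span>'
)

# Pass 1: opening anchors containing a letter followed by ":/".
# [A-Za-z] stands in for [[:alpha:]], which Python's re does not support.
LOCAL_ANCHOR = re.compile(r'<a\b[^<>]*?[A-Za-z]:/[^<>]*?>')
# Pass 2: next/prev images. IGNORECASE is an added assumption.
NAV_IMAGE = re.compile(r'<img[^<>]*?alt="(next|prev)"[^<>]*/>', re.IGNORECASE)
# Pass 3: elements that contain nothing but whitespace.
EMPTY_TAG = re.compile(r'<(\w+)\b[^<>]*>\s*</\1>')


def clean(html):
    """Apply the three passes in order and return the cleaned markup."""
    html = LOCAL_ANCHOR.sub("", html)
    html = NAV_IMAGE.sub("", html)
    # Removing an inner empty tag can leave its parent empty, so repeat
    # the third pass until nothing changes.
    previous = None
    while previous != html:
        previous = html
        html = EMPTY_TAG.sub("", html)
    return html


print(clean(SAMPLE))  # <span class="sgc-5">
```

The leftover opening span is there only because the posted fragment stops before its closing tag. Two caveats with these patterns as written: the first pass removes only the opening anchor tag, so a matching </a> would be left behind in a complete file, and its letter-plus-":/" test also matches the "p:/" in an http:// link. Both are reasons to try this on a copy first.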
Added bonus:
The next one is better than the simple anchor example above. When you load an epub into Sigil, all of the text files end up in the Text directory, and links that refer to them use relative paths, like href="../Text/Blah.xhtml". This pattern looks for any href or src that does not start with .. (one level up), so it also catches references to external content (sites and such, hello watermarks). It will find whole tags as well as the stuff inside them, so be careful and grep first.
Code:
<(\w+)\b[^<>]*?(href|src)="(?!\.\.)[^"]*?"[^<>]*?(/>|.*?(?!<\1)</\1>)
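
A minimal way to "grep first" with this pattern, sketched in Python on the assumption that its `re` module handles the lookahead and backreference the same way Sigil does. The entries in `LINES` are hypothetical test lines loosely modelled on the samples in this thread, not taken from a real book.

```python
import re

# The "bonus" pattern, copied as posted.
NON_RELATIVE = re.compile(
    r'<(\w+)\b[^<>]*?(href|src)="(?!\.\.)[^"]*?"[^<>]*?(/>|.*?(?!<\1)</\1>)'
)

# Hypothetical test lines.
LINES = [
    '<a href="../Text/chapter2.xhtml">Next</a>',
    '<img alt="Next" src="../Images/image001.gif" />',
    '<a href="c:/DOCUME~1/VALUED~1/book.html">Top</a>',
    '<a href="http://example.com/">a watermark link</a>',
]

# List what would be hit before replacing anything.
for line in LINES:
    hit = NON_RELATIVE.search(line)
    print("MATCH" if hit else "keep ", line)

hits = [line for line in LINES if NON_RELATIVE.search(line)]
```

The first two lines are kept because their paths start with "..", while the local-drive link and the external link are both matched in full, element content included. Anything else that does not start with ".." would be matched as well, which is why looking at the hits before replacing is worthwhile.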

Old 09-29-2012, 03:19 PM   #8
MizSuz
Connoisseur
Quote:
Originally Posted by WS64
I copied your code snippet into Sigil, and it was found by my regex.
I'm sure it was operator error (on my part). I edited it out manually and I'm on to the next book. Here it is again, essentially the same thing but as you can see it's not EXACTLY the same. It exists for the same reason, it is trying to do the same things, it appears in the same place in the book, but the code is slightly different. This is a good example of what I'm up against. Each book has this stuff at the beginning of every chapter but it's never exactly the same code in each book nor is it exactly the same line from chapter to chapter.

Code:
<p class="MsoNormal4"><span class="calibre1"><a class="calibre8" href="c:/DOCUME~1/VALUED~1/My%20Documents/My%20Library/Mary%20Balogh%20-%20Stapleton%202%20-%20A%20Precious%20Jewel.html%23contents%23contents"><span class="calibre14"><img alt="Top" border="0" class="calibre15" src="../Images/image002.gif" /></span></a><a class="calibre8" href="c:/DOCUME~1/VALUED~1/My%20Documents/My%20Library/Mary%20Balogh%20-%20Stapleton%202%20-%20A%20Precious%20Jewel.html%23chapter_2%23chapter_2"><span class="calibre14"><img alt="Next" border="0" class="calibre16" src="../Images/image003.gif" /></span></a></span></p>

Old 09-29-2012, 03:27 PM   #9
MizSuz
Connoisseur
Quote:
Originally Posted by Serpentine
I'd suggest not killing too many tags in a single go. While it can work well, mistakes are often pretty costly. Rather, I'd use a few simpler expressions to weed out the unwanted elements, then finally do a cleaning pass for empty spans (or anything else).
Thank you. I'm getting no returns on any of those searches and I'm starting to think that it's got to be something I'm doing or missing. Obviously I need to spend more time brushing up on regex and getting to know the software better before I trouble folks.

Frustration, your middle name is Suz.

Thanks for your trouble, folks. I have bookmarked this link and will continue to refer back to it in the hopes that when I figure out what I'm doing wrong I will be able to make good use of these nuggets you've given me.

Old 09-29-2012, 03:44 PM   #10
Serpentine
Evangelist
Haha, just make sure your search option is set to Regex and not Normal, and that you're searching the current file/s.

I've switched and forgotten a number of times.
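
Why the mode matters can be shown with a rough analogy in Python. This is not how Sigil is implemented, and `LINE` is a shortened, made-up example. In a plain-text search the pattern is looked up character for character, so a regex typed into Normal mode will almost never be found.

```python
import re

PATTERN_TEXT = r'<a[^>]*?c:/DOCUME'
LINE = '<a class="calibre8" href="c:/DOCUME~1/VALUED~1/book.html">'

# Analogue of "Normal" mode: the search text is treated as literal characters.
found_literal = PATTERN_TEXT in LINE
# Analogue of "Regex" mode: the same text is interpreted as a pattern.
found_regex = re.search(PATTERN_TEXT, LINE) is not None

print(found_literal, found_regex)  # False True
```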

Old 09-29-2012, 03:50 PM   #11
kiwidude
calibre/Sigil Developer
@Serpentine - there is a little secret trick in the 0.5.905 beta that not many will know about. If you Ctrl+click on the Find/Replace/Replace All/Count buttons, it will force the scope to be the current file without changing the dropdown. So I permanently leave my dropdown set to all files, and on the rare occasion I want to reduce the scope (like replacing within a stylesheet) I just Ctrl+click on the buttons. That way I don't accidentally forget to restore the scope dropdown afterwards...

Old 09-29-2012, 04:29 PM   #12
DiapDealer
Grand Sorcerer
Quote:
Originally Posted by kiwidude
@Serpentine - there is a little secret trick in the 0.5.905 beta that not many will know about. If you Ctrl+click on the Find/Replace/Replace All/Count buttons, it will force the scope to be the current file without changing the dropdown. So I permanently leave my dropdown set to all files, and on the rare occasion I want to reduce the scope (like replacing within a stylesheet) I just Ctrl+click on the buttons. That way I don't accidentally forget to restore the scope dropdown afterwards...
That is just too handy.
You guys have really outdone yourselves.

Old 09-29-2012, 04:40 PM   #13
WS64
Quote:
Originally Posted by MizSuz
Here it is again, essentially the same thing but as you can see it's not EXACTLY the same. It exists for the same reason, it is trying to do the same things, it appears in the same place in the book, but the code is slightly different. This is a good example of what I'm up against. Each book has this stuff at the beginning of every chapter but it's never exactly the same code in each book nor is it exactly the same line from chapter to chapter.

Code:
<p class="MsoNormal4"><span class="calibre1"><a class="calibre8" href="c:/DOCUME~1/VALUED~1/My%20Documents/My%20Library/Mary%20Balogh%20-%20Stapleton%202%20-%20A%20Precious%20Jewel.html%23contents%23contents"><span class="calibre14"><img alt="Top" border="0" class="calibre15" src="../Images/image002.gif" /></span></a><a class="calibre8" href="c:/DOCUME~1/VALUED~1/My%20Documents/My%20Library/Mary%20Balogh%20-%20Stapleton%202%20-%20A%20Precious%20Jewel.html%23chapter_2%23chapter_2"><span class="calibre14"><img alt="Next" border="0" class="calibre16" src="../Images/image003.gif" /></span></a></span></p>
And again my regex finds it...
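
Finding a match and getting a clean result from Replace All are not quite the same thing, as a quick sketch in Python suggests. This assumes Python's `re` behaves like Sigil's engine for this pattern; `SAMPLE_2` is the line quoted above.

```python
import re

# The second sample line, from post #8.
SAMPLE_2 = (
    '<p class="MsoNormal4"><span class="calibre1">'
    '<a class="calibre8" href="c:/DOCUME~1/VALUED~1/My%20Documents/My%20Library/'
    'Mary%20Balogh%20-%20Stapleton%202%20-%20A%20Precious%20Jewel.html'
    '%23contents%23contents"><span class="calibre14">'
    '<img alt="Top" border="0" class="calibre15" src="../Images/image002.gif" />'
    '</span></a>'
    '<a class="calibre8" href="c:/DOCUME~1/VALUED~1/My%20Documents/My%20Library/'
    'Mary%20Balogh%20-%20Stapleton%202%20-%20A%20Precious%20Jewel.html'
    '%23chapter_2%23chapter_2"><span class="calibre14">'
    '<img alt="Next" border="0" class="calibre16" src="../Images/image003.gif" />'
    '</span></a></span></p>'
)

# The pattern from post #2, copied as posted.
PATTERN = re.compile(r'<span[^>]*?><a[^>]*?c:/DOCUME.*?.gif" /></span>')

matches = PATTERN.findall(SAMPLE_2)
cleaned = PATTERN.sub("", SAMPLE_2)

print(len(matches))                                  # 1
print(cleaned.count("<a "), cleaned.count("</a>"))   # 1 2
print("c:/DOCUME" in cleaned)                        # True
```

On this line the pattern matches once, starting at the outer span and stopping at the first inner </span>. Replacing that match with nothing leaves a stray </a> and the second "c:/DOCUME" link behind, so the smaller, separate passes suggested in post #7 may be the safer route for lines shaped like this one.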

Old 09-29-2012, 06:30 PM   #14
MizSuz
Connoisseur
Quote:
Originally Posted by Serpentine
Haha, just make sure your search option is set to Regex and not Normal, and that you're searching the current file/s.

I've switched and forgotten a number of times.

I can not believe that I didn't know how to switch modes in the search. You guys kept saying to make sure and I kept pilfering through the menus and editors thinking "how the heck do I know if I'm in regex or not? Isn't it all dependent on the search string?"

kiwidude posted about that really kewl ctrl+click feature and I had a lightbulb moment about the drop downs actually in the find/replace box. There it is, on the left where it has always said "normal."

*sigh* I've manually edited all the instances out of the piece I'm working on right now so I can't try it right away, but I have several set aside and marked "format" so I'm sure I'll be able to get to it this evening and try out your wonderful suggestions.

Thank you, not only for your time but also for your patience.

Old 09-29-2012, 07:13 PM   #15
theducks
Well trained by Cats
Quote:
Originally Posted by MizSuz
I used the find and replace that comes up at the bottom of the screen when you press Ctrl+F.

Is there another?
No, but it has a pull-down that picks which mode it runs in:
Normal, Regex or Spell Check (that last one is tricky; read the instructions carefully. It got me at first).