#1 |
Connoisseur
Posts: 63
Karma: 10
Join Date: Jul 2011
Device: Sony Touch, Nook Simple Touch, Kobo Aura, Android w/CoolReader
|
Regex Solution to hidden href search?
I'm considering seppuku and there is a little voice in the back of my head whispering "someone on that forum probably has the answer and could offer it in their sleep." I'm hoping that whisper in my head is right.
As I've mentioned in other places, I am currently working on a large project importing very old epubs into calibre. As I bring them in I am trying to clean them up as much as my skill set will allow. I find that my skills are expanding with nearly every book!
Anyway, many of these books probably started life as scanned images put through PDF OCR, were converted a bazillion times using every free conversion software known to man, and have acquired code garbage that is becoming one of my own personal demons. A lot of these books have gone through Word on their way to epub. In searching the forums I am seeing a lot of options for cleaning up the MsoNormal that I'm looking forward to trying. That's not my issue here.
At some point these books had images with "Top," "Back" and "Next" buttons that were links to previous and next chapters or up to the main TOC. I've seen this in LIT files before. Now, however, there are no buttons, but some of the links, which are invisible in WYSIWYG, are still active (or are trying to be) and point to non-existent files on the C: drive of someone presumably long dead. Because each one is for a different numbered chapter, image, etc., there is no one universal search string. They are unique, if only by a couple of letters or numbers. This is an example of what I am faced with at the beginning of every chapter:
Code:
<span class="sgc-5"><a class="calibre8" href="c:/DOCUME~1/VALUED~1/My%20Documents/My%20Library/Mary%20Balogh%20-%20Duty%202%20-%20Lord%20Carew%27s%20Bride.html%23chapter_2%23chapter_2"><span class="calibre12"><img alt="Next" border="0" class="calibre13" src="../Images/image001.gif" /></span>
The one thing they all have in common is the "c:/DOCUME~1". I'd like to build a search that would get rid of that entire string in every instance that contains "c:/DOCUME~1", but I'm not sure how to write a search for "find "c:/DOCUME~1" and then delete everything between the span tags" or whatever other solution would work.
Did I just make any sense at all or am I shopping for ritual knives tomorrow? Any suggestions?
#2 |
♫
Posts: 661
Karma: 506380
Join Date: Aug 2010
Location: Germany
Device: Kobo Aura / PB Lux 2 / Bookeen Frontlight / Kobo Mini / Nook Color
|
Try searching for
<span[^>]*?><a[^>]*?c:/DOCUME.*?\.gif" /></span>
and replacing it with nothing.
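For anyone who wants to try the expression outside Sigil first, here is a minimal sketch in Python. It assumes Python's `re` module behaves like Sigil's regex engine for this particular pattern, and the sample string is a shortened, made-up variant of the snippet from the first post (the `<p>Chapter 2</p>` text and the short href are illustrative only). The pattern is the one suggested above, written with an escaped dot (`\.gif`) so that it only matches a literal period.

```python
import re

# The suggested pattern; \.gif matches a literal ".gif".
PATTERN = re.compile(r'<span[^>]*?><a[^>]*?c:/DOCUME.*?\.gif" /></span>')

# Hypothetical sample modeled on the snippet in the first post
# (href shortened, leading paragraph invented for illustration).
sample = (
    '<p>Chapter 2</p>'
    '<span class="sgc-5"><a class="calibre8" '
    'href="c:/DOCUME~1/VALUED~1/My%20Documents/chapter_2.html">'
    '<span class="calibre12"><img alt="Next" border="0" class="calibre13" '
    'src="../Images/image001.gif" /></span>'
)

# Replace every match with nothing, as suggested.
cleaned = PATTERN.sub('', sample)
print(cleaned)
```

Note that the match ends at the first `</span>` after the image, so any closing tags that follow it in a real file stay behind and may need a second pass.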
#3 |
Connoisseur
Posts: 63
Karma: 10
Join Date: Jul 2011
Device: Sony Touch, Nook Simple Touch, Kobo Aura, Android w/CoolReader
|
|
#4 |
♫
Posts: 661
Karma: 506380
Join Date: Aug 2010
Location: Germany
Device: Kobo Aura / PB Lux 2 / Bookeen Frontlight / Kobo Mini / Nook Color
|
I copied your code snippet into Sigil, and it was found by my regex.
|
#5 |
Well trained by Cats
Posts: 30,889
Karma: 59840450
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
|
#6 |
Connoisseur
Posts: 63
Karma: 10
Join Date: Jul 2011
Device: Sony Touch, Nook Simple Touch, Kobo Aura, Android w/CoolReader
|
|
#7 |
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
|
I'd suggest not killing too many tags in a single go. While it can work well, mistakes are often pretty costly. I'd rather use a few simpler expressions to weed out the unwanted elements, then finish with a cleaning pass for empty spans (or anything else). For example:
Anchors that seem to refer to a local filesystem (DOS style). Note that your example did not have a closing a tag?
Code:
<a\b[^<>]*?[[:alpha:]]:/[^<>]*?>
Images whose alt text is "next" or "prev":
Code:
<img[^<>]*?alt="(next|prev)"[^<>]*/>
Empty elements (the cleaning pass):
Code:
<(\w+)\b[^<>]*>\s*</\1>
The next one is better than the simple anchor example above. When you load an epub into Sigil, all of the text ends up in the Text directory, and links that refer to it use relative paths, like href="../Text/Blah.xhtml". This looks for anything which does not start with .. (one level up), so it also catches references to external content (sites and such, hello watermarks). It will find whole tags as well as the stuff inside them, so be careful and grep first.
Code:
<(\w+)\b[^<>]*?(href|src)="(?!\.\.)[^"]*?"[^<>]*?(/>|.*?(?!<\1)</\1>)
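The "grep first, then clean in small steps" idea can be sketched in Python. This is an illustration under stated assumptions, not a drop-in equivalent of the Sigil expressions: Python's `re` has no POSIX classes, so `[A-Za-z]` stands in for `[[:alpha:]]`; the anchor pattern is a variation that requires the drive letter directly after `href="` (so `http://` links are left alone) and removes the whole `<a>...</a>` element; the function names and the sample markup are hypothetical.

```python
import re

# "Grep first": collect href/src values that do not start with ".." .
NON_RELATIVE = re.compile(r'\b(?:href|src)="(?!\.\.)([^"]*)"')

# Variation on the anchor pattern: drive letter right after href=",
# whole element removed. [A-Za-z] replaces [[:alpha:]] (assumption).
LOCAL_ANCHOR = re.compile(
    r'<a\b[^<>]*?href="[A-Za-z]:/[^<>]*?>.*?</a>', re.DOTALL
)

# Cleaning pass for elements that contain nothing but whitespace.
EMPTY_ELEMENT = re.compile(r'<(\w+)\b[^<>]*>\s*</\1>')


def find_non_relative_refs(html: str) -> list[str]:
    """List href/src targets that do not start with '..' (review before deleting)."""
    return NON_RELATIVE.findall(html)


def strip_local_links(html: str) -> str:
    """Remove anchors pointing at a local drive, then prune empty elements."""
    html = LOCAL_ANCHOR.sub('', html)
    # Removing an empty child can leave its parent empty, so repeat until stable.
    while True:
        pruned = EMPTY_ELEMENT.sub('', html)
        if pruned == html:
            return html
        html = pruned


# Hypothetical sample modeled on the markup posted in this thread.
sample = (
    '<p class="MsoNormal4"><span class="calibre1">'
    '<a class="calibre8" href="c:/DOCUME~1/VALUED~1/contents.html">'
    '<span class="calibre14"><img alt="Top" border="0" class="calibre15" '
    'src="../Images/image002.gif" /></span></a>'
    '<a class="calibre8" href="c:/DOCUME~1/VALUED~1/chapter_2.html">'
    '<span class="calibre14"><img alt="Next" border="0" class="calibre16" '
    'src="../Images/image003.gif" /></span></a>'
    '</span></p>'
    '<p class="calibre2">It was a dark night. '
    '<a href="../Text/notes.xhtml">Note</a></p>'
)

print(find_non_relative_refs(sample))
print(strip_local_links(sample))
```

Listing the matches before deleting anything makes it easier to spot a pattern that is greedier than intended. The empty-element pass also removes the now-empty paragraph here, which may or may not be what you want in a given book.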
#8 | |
Connoisseur
Posts: 63
Karma: 10
Join Date: Jul 2011
Device: Sony Touch, Nook Simple Touch, Kobo Aura, Android w/CoolReader
|
Quote:
Code:
<p class="MsoNormal4"><span class="calibre1"><a class="calibre8" href="c:/DOCUME~1/VALUED~1/My%20Documents/My%20Library/Mary%20Balogh%20-%20Stapleton%202%20-%20A%20Precious%20Jewel.html%23contents%23contents"><span class="calibre14"><img alt="Top" border="0" class="calibre15" src="../Images/image002.gif" /></span></a><a class="calibre8" href="c:/DOCUME~1/VALUED~1/My%20Documents/My%20Library/Mary%20Balogh%20-%20Stapleton%202%20-%20A%20Precious%20Jewel.html%23chapter_2%23chapter_2"><span class="calibre14"><img alt="Next" border="0" class="calibre16" src="../Images/image003.gif" /></span></a></span></p> |
|
#9 | |
Connoisseur
Posts: 63
Karma: 10
Join Date: Jul 2011
Device: Sony Touch, Nook Simple Touch, Kobo Aura, Android w/CoolReader
|
Quote:
Frustration, your middle name is Suz. Thanks for your trouble, folks. I have bookmarked this link and will continue to refer back to it in the hopes that when I figure out what I'm doing wrong I will be able to make good use of these nuggets you've given me. |
|
#10 |
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
|
Haha, just make sure your search mode is set to Regex and not Normal, and that you're searching the current file(s).
I've switched and forgotten a number of times.
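Why the mode matters can be shown with a small Python analogy. This is only an illustration of literal versus pattern matching, not a description of how Sigil implements its search modes; the pattern and the sample text are hypothetical.

```python
import re

# Hypothetical search string and one line of markup to search in.
pattern = r'<a\b[^<>]*?href="[A-Za-z]:/[^<>]*?>'
text = '<a class="calibre8" href="c:/DOCUME~1/VALUED~1/contents.html">'

# Like a "Normal" search: the string is compared character by character,
# so the backslashes and brackets would have to appear literally in the text.
literal_hit = pattern in text

# Like a "Regex" search: the same string is interpreted as a pattern.
regex_hit = re.search(pattern, text) is not None

print(literal_hit, regex_hit)  # False True
```

The same search string finds nothing as plain text and matches as a pattern, which is consistent with a correct expression reporting "no matches" while the mode is still set to Normal.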
#11 |
Calibre Plugins Developer
Posts: 4,720
Karma: 2197770
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
@Serpentine - there is a little secret trick in the 0.5.905 beta that not many will know about. If you *ctrl* click on the Find/Replace/Replace All/Count buttons, it forces the scope to be the current file without changing the dropdown. So I permanently leave my dropdown set to all files, and on the rare occasion I want to reduce the scope (like replacing within a stylesheet) I just ctrl+click on the buttons. That way I don't accidentally forget to restore the scope dropdown afterwards...
|
#12 | |
Grand Sorcerer
Posts: 28,341
Karma: 203719646
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
You guys have really outdone yourselves.
|
#13 | |
♫
Posts: 661
Karma: 506380
Join Date: Aug 2010
Location: Germany
Device: Kobo Aura / PB Lux 2 / Bookeen Frontlight / Kobo Mini / Nook Color
|
Quote:
|
|
#14 | |
Connoisseur
Posts: 63
Karma: 10
Join Date: Jul 2011
Device: Sony Touch, Nook Simple Touch, Kobo Aura, Android w/CoolReader
|
Quote:
I cannot believe that I didn't know how to switch modes in the search. You guys kept saying to make sure and I kept rummaging through the menus and editors thinking "how the heck do I know if I'm in regex or not? Isn't it all dependent on the search string?" kiwidude posted about that really kewl ctrl+click feature and I had a lightbulb moment about the drop-downs right in the find/replace box. There it is, on the left, where it has always said "Normal." *sigh*
I've manually edited all the instances out of the piece I'm working on right now so I can't try it right away, but I have several set aside and marked "format", so I'm sure I'll be able to get to it this evening and try out your wonderful suggestions. Thank you, not only for your time but also for your patience.
|
#15 | |
Well trained by Cats
Posts: 30,889
Karma: 59840450
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
Normal, Regex or Spell Check (this one is tricky; read the instructions carefully. It got me at first.)
|