Old 09-29-2012, 01:16 PM   #1
MizSuz began at the beginning.
Posts: 41
Karma: 10
Join Date: Jul 2011
Device: Sony Touch, Nook Simple Touch, Kobo Aura, Android w/CoolReader
Regex Solution to hidden href search?

I'm considering seppuku and there is a little voice in the back of my head whispering "someone on that forum probably has the answer and could offer it in their sleep." I'm hoping that whisper in my head is right.

As I've mentioned in other places I am currently working on a large project importing very old epubs into calibre. As I bring them in I am trying to clean them up as much as my skill set will allow. I find that my skills are expanding with nearly every book! Anyway, many of these books probably started life as scanned images put through PDF OCR, converted a bazillion times using every free conversion software known to man, and have acquired code garbage that is becoming one of my own personal demons.

A lot of these books have gone through Word on their way to epub. In searching the forums I am seeing a lot of options for cleaning up the MsoNormal that I'm looking forward to trying. That's not my issue here.

At some point these books had images with "Top," "Back" and "Next" buttons that were links to the previous and next chapters or up to the main TOC. I've seen this in LIT files before. Now, however, there are no buttons, but some of the links, which are invisible in WYSIWYG view, are still active (or are trying to be). They point to non-existent files on the C: drive of someone presumably long dead. Because each one is for a different numbered chapter, image, etc., there is no one universal search. They are unique, if only by a couple of letters or numbers.

This is an example of what I am faced with at the beginning of every chapter:

<span class="sgc-5"><a class="calibre8" href="c:/DOCUME~1/VALUED~1/My%20Documents/My%20Library/Mary%20Balogh%20-%20Duty%202%20-%20Lord%20Carew%27s%20Bride.html%23chapter_2%23chapter_2"><span class="calibre12"><img alt="Next" border="0" class="calibre13" src="../Images/image001.gif" /></span>
The classes change frequently, too. The only consistent thing I see in these piles of steaming... is the "c:/DOCUME~1" at the start of the href. I'd like to build a search that would get rid of that entire string in every instance that contains "c:/DOCUME~1", but I'm not sure how to write a search that says "find c:/DOCUME~1 and then delete everything between the span tags," or whatever other solution would work.
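
For illustration only, here is a rough Python sketch of the kind of pattern that search might use. It assumes each dead link is closed by a `</a>` tag (the sample above is cut off before that point, so this is a guess) and that no other link is nested inside it. The function name `strip_dead_links` and the shortened `example.html` path are made up for the example.

```python
import re

# Match an <a> tag whose href starts with the dead local path, plus
# everything up to the first closing </a> after it.
# Assumption: every such link really is closed with </a>.
DEAD_LINK = re.compile(
    r'<a\b[^>]*\bhref="c:/DOCUME~1[^"]*"[^>]*>.*?</a>',
    re.IGNORECASE | re.DOTALL,
)


def strip_dead_links(html: str) -> str:
    """Return html with every link pointing into c:/DOCUME~1 removed."""
    return DEAD_LINK.sub("", html)


# Shortened, hypothetical version of the markup shown above.
sample = (
    '<p><span class="sgc-5"><a class="calibre8" '
    'href="c:/DOCUME~1/VALUED~1/My%20Documents/example.html%23chapter_2">'
    '<span class="calibre12"><img alt="Next" border="0" class="calibre13" '
    'src="../Images/image001.gif" /></span></a></span>Chapter text</p>'
)

cleaned = strip_dead_links(sample)
print(cleaned)
```

Because the pattern keys only on the `href` prefix, the changing class names and chapter numbers do not matter. It leaves the now-empty outer span behind, which would need its own cleanup pass. The pattern itself, rather than the Python around it, is the part that may carry over to an editor's regex search box, though how `.` treats line breaks differs between tools, so it is worth trying on a copy of the book first.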

Did I just make any sense at all or am I shopping for ritual knives tomorrow?

Any suggestions?