MobileRead Forums - View Single Post

eschwartz · 04-02-2014, 12:58 PM

Quote:

Originally Posted by PeterT

I've spent some time this week looking at things, and unfortunately, removing the spans cannot be done in the same manner as the other cleanups done by this plugin. Many of the cleanups are done by running regular expressions against the raw (x)html files that make up the book. Unfortunately, spans CAN be nested (and in fact were, in some of the kEpubs from Kobo that I was playing with).

The challenge is that while it is trivial to handle a non-nested span, with something along the lines of

Code:

<span.+?id="kobo[\d.]+.*?>(.*?)<\/span>

it breaks down dramatically when there is an internal span.

The test book with nested spans that I was using contained the following markup:

Code:

<p class="indent">
<span id="kobo.114.1">I don’t go in for ‘lawn maintenance’, though — all that weeding and feeding.</span>
<span id="kobo.114.2"> I prefer my ‘weeds’: the clover, which keeps the grass naturally green with its nitrogen-fixing nodules; the daisy, opening and closing each day (its name comes from the Old English <em>daeges <span class="ent1">ē</span>age</em>, meaning ‘the day’s eye’); the little blue-purple <em>Prunella</em>, known as ‘self-heal’, used to treat sore throats, mouth <a id="page_184"></a>ulcers and open wounds — and still used in modern herbal medicine as an astringent for external or internal wounds.</span>
<span id="kobo.114.3"> As Vita Sackville-West said, ‘A weed is only a plant in the wrong place.’ To which we should add: ‘or one for which we haven’t yet discovered the use’.</span>
</p>

Note the nested span in the kobo.114.2 span.

Multiple passes?
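
To make that a bit more concrete, here is a rough sketch of what "multiple passes" could look like in Python. This is not the plugin's code, just an illustration under a few assumptions: the function name strip_kobo_spans and the placeholder tag keepspan are made up for the example, the book is assumed never to contain a real keepspan tag, and attribute values are assumed not to contain ">". Each pass only touches innermost spans (ones with no further opening span inside them), unwraps the kobo ones, and temporarily renames the others so the next pass can see through them.

```python
import re

# An innermost span: its body contains no further "<span" opening tag,
# so the first "</span>" after it really is its own closing tag.
INNERMOST_SPAN = re.compile(
    r'<span\b([^>]*)>((?:(?!<span\b).)*?)</span>', re.DOTALL)
KOBO_ID = re.compile(r'\bid="kobo\.[\d.]+"')


def strip_kobo_spans(html):
    """Unwrap kobo spans, innermost first, leaving other spans in place."""

    def unwrap_or_protect(match):
        attrs, body = match.group(1), match.group(2)
        if KOBO_ID.search(attrs):
            return body  # drop the kobo wrapper, keep its contents
        # Hide a non-kobo span behind a placeholder tag (hypothetical name)
        # so that later passes no longer treat it as a nested span.
        return '<keepspan' + attrs + '>' + body + '</keepspan>'

    # Keep making passes until one of them changes nothing.
    while True:
        html, count = INNERMOST_SPAN.subn(unwrap_or_protect, html)
        if count == 0:
            break

    # Turn the protected spans back into ordinary spans.
    return html.replace('<keepspan', '<span').replace('</keepspan>', '</span>')
```

On markup shaped like the kobo.114.2 example, the first pass would protect the class="ent1" span and unwrap the un-nested kobo spans, and the second pass would unwrap the outer kobo span. A proper HTML parser would probably be more robust than any regex here, but this stays close to how the other cleanups are described as working.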