Old 04-02-2012, 03:52 PM   #13
mmat1
Berti
Quote:
Originally Posted by sebito
OMG

You make it sound easy ...

I am familiar with Sigil, but I do not see that functionality at this time. I have tried it on my epub; in fact, both HTML files I sent you are part of the full epub. Did you use regular expressions in Sigil? Did you apply them to the HTML?
OK, this is the general strategy:
1. I noticed that none of the href values contains a filename; that must be corrected first. So I merged the 2 files and prepended "../Text/015.html" to every "#a\d+?".

2. I split the merged file into two again, and Sigil corrects the filenames automatically. Some of your links point to an anchor within the same file. Only links which now point to notes.html are treated in the next steps.

3. I added an "id" attribute to every link that points to "notes.html", using the same number as the href, prefixed with "t" (within 015.html only).

4. Due to the weird formatting, it gets a bit tougher in notes.html. First I replaced "<span class="tpublidisa70">&nbsp;</span>" with "&nbsp;", since I see no point in giving a blank a special format, and it makes the following regex easier.

5. Regex (in notes.html only)
Code:
Find: <a id="a(\d\d?\d?\d?\d?)(">)</a>(&nbsp;<span class="tpublidisa71">)<a href="../Text/Text.html#a65">(.+?)</a></span>
Replace: <a href="../Text/Text.html#t\1" id="a\1\2\3\4</span></a>
This uses your "<a href="../Text/Text.html#a65">" as the endpoint (well, in most cases it's just "<a href="">") and tosses it out for good.
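
If you want to try step 5 outside Sigil, here is a rough Python sketch of the same find/replace. It is not what I ran, just an illustration: the function name link_notes_back is made up, the digit run is written as \d{1,5}, and the inner href is matched loosely with [^"]* (my assumption, so that both the "#a65" form and the empty href="" form are caught and thrown away).

```python
import re

# Step 5 pattern. Groups: 1 = note number, 2 = '">', 3 = '&nbsp;<span ...>',
# 4 = the visible link text. The inner <a href="..."> is matched but not kept.
FIND = re.compile(
    r'<a id="a(\d{1,5})(">)</a>'
    r'(&nbsp;<span class="tpublidisa71">)'
    r'<a href="[^"]*">(.+?)</a></span>'  # assumption: any href value, including empty
)
REPLACE = r'<a href="../Text/Text.html#t\1" id="a\1\2\3\4</span></a>'


def link_notes_back(html: str) -> str:
    """Turn each empty note anchor into a link back to the matching t<number> id."""
    return FIND.sub(REPLACE, html)


sample = (
    '<a id="a65"></a>&nbsp;<span class="tpublidisa71">'
    '<a href="../Text/Text.html#a65">65</a></span>'
)
print(link_notes_back(sample))
```

The anchor keeps its id "a65", gains an href pointing at "#t65", and now wraps the span, while the old inner link is dropped.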

done

----------------------------------------------------

Edit: There's no special function within Sigil. It's just dividing the job into small steps and using regex. It is easy with a few hundred links. I guess it's still a tedious job with 181000...
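
To show what "small steps" means, steps 1, 3 and 4 could be sketched in Python roughly like this. Again, this is only an illustration and not what I did in Sigil: the function names are hypothetical, and I am assuming the links look like href="#a12" before step 1 and href="../Text/notes.html#a12" after the split.

```python
import re


def add_filename(html: str, filename: str = "../Text/015.html") -> str:
    """Step 1: prefix bare fragment links (href="#a12") with a filename."""
    return re.sub(
        r'href="(#a\d+)"',
        lambda m: f'href="{filename}{m.group(1)}"',
        html,
    )


def add_return_ids(html: str) -> str:
    """Step 3: give every link into notes.html an id "t<number>"."""
    return re.sub(
        r'<a href="([^"]*notes\.html#a(\d+))"',
        r'<a href="\1" id="t\2"',
        html,
    )


def unwrap_blank_span(html: str) -> str:
    """Step 4: replace the span that only wraps a non-breaking space."""
    return html.replace('<span class="tpublidisa70">&nbsp;</span>', "&nbsp;")


print(add_filename('<a href="#a12">12</a>'))
print(add_return_ids('<a href="../Text/notes.html#a12">12</a>'))
print(unwrap_blank_span('<span class="tpublidisa70">&nbsp;</span>'))
```

Each function does one thing, so you can check the result after every step before moving on to the next one.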

Last edited by mmat1; 04-02-2012 at 04:01 PM.