My lxml expertise is currently somewhat lacking. Is there a known technique, or some sample calibre code I can look at, that reliably identifies matching start and end HTML tags?
My aims are two-fold:
- to create something which will automatically find occurrences of <span class="italic">...</span> and <span class="bold">...</span> and replace them with plain, attribute-free <i>...</i> and <b>...</b> tags.
- to use this as a practical learning exercise to improve my parsing knowledge
P.S. I know a regex can easily convert the non-nested occurrences, but if possible I'd like to create something which also reliably handles the nested ones.
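
For what it's worth, here is a minimal sketch of the tree-based approach, not a definitive implementation. Once the markup is parsed, every start tag is already paired with its end tag as a single element, so nesting needs no special handling: renaming an element changes both tags at once. The function name unwrap_styled_spans and the CLASS_TO_TAG mapping are hypothetical names made up for this illustration, and the sketch assumes well-formed XHTML and spans carrying exactly one class. It falls back to the standard library's ElementTree, which shares the basic API used here, when lxml is not installed.

```python
try:
    from lxml import etree  # the parser calibre itself uses
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same basic API

# Hypothetical mapping; adjust to the class names your books actually use.
CLASS_TO_TAG = {"italic": "i", "bold": "b"}


def unwrap_styled_spans(root):
    """Rename <span class="italic|bold"> elements to <i>/<b> in place."""
    for el in list(root.iter()):
        tag = el.tag
        if not isinstance(tag, str):
            continue  # lxml yields comments and PIs with a non-string tag
        # Split off an optional "{namespace}" prefix, as found in XHTML files.
        prefix, _, local = tag.rpartition("}")
        if local != "span":
            continue
        new_local = CLASS_TO_TAG.get(el.get("class", "").strip())
        if new_local is None:
            continue  # some other span, or several classes: leave it alone
        el.tag = (prefix + "}" if prefix else "") + new_local
        del el.attrib["class"]  # any other attributes are kept as they are
    return root


sample = ('<p>a <span class="italic">b <span class="bold">c</span> d</span>'
          ' e <span class="x">f</span></p>')
root = unwrap_styled_spans(etree.fromstring(sample))
print(etree.tostring(root, encoding="unicode"))
# prints: <p>a <i>b <b>c</b> d</i> e <span class="x">f</span></p>
```

Text that follows a closing tag lives in the element's tail, so it survives the rename untouched. Spans with more than one class (for example class="italic small") are deliberately skipped here; deciding what to do with those is a separate design choice.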