Quote:
Originally Posted by cybmole
but I think your code misses the fact that there's both a chapter number and a chapter title within the dross, both of which should ideally be salvaged.
Yes and no. It disregards attributes, which matter if you do not want to regenerate the Table of Contents; however, it's usually a lot safer to regenerate, since Sigil does this pretty well using the text between the <h> tags.
On a more general note: if you're like me and you just like extremely simple, near-plain-HTML books, the following is quite handy. It is rather dangerous, though, so read and understand it first.
JGsoft syntax:
(?<=</?(h\d|[uod]l|[uisbqpa]|hr|abbr|acronym|address|area|base|basefont|bdo|big|blockquote|body|button|caption|center|cite|code|col|colgroup|dd|del|dfn|dir|div|dt|em|fieldset|font|hr|ins|kbd|label|legend|li|map|object|param|pre|samp|script|select|small|span|strike|strong|sub|sup|table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|var))\s[^<>/]*(?=/?>)
replace : blank
Perl-compatible (e.g. Python) syntax:
(</?)(h\d|[uod]l|[uisbqpa]|hr|abbr|acronym|address|area|base|basefont|bdo|big|blockquote|body|button|caption|center|cite|code|col|colgroup|dd|del|dfn|dir|div|dt|em|fieldset|font|hr|ins|kbd|label|legend|li|map|object|param|pre|samp|script|select|small|span|strike|strong|sub|sup|table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|var)(\s[^<>/]*)(/?>)
replace : \1\2\4
This will strip all attributes from the listed HTML tags, e.g.:
<p class="calibre2"><span class="blarg">This is some text</span></p>
becomes:
<p><span>This is some text</span></p>
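If you would rather script this than run it in an editor, here is a rough Python sketch of the Perl-compatible version above. The names TAG_NAMES, ATTR_RE and strip_attributes are my own, made up for illustration; the tag list is the one from the pattern with the forum's stray spaces removed. Treat it as a starting point and try it on a copy of the book first.

```python
import re

# Tag list copied from the pattern above (illustrative name, not from any library).
TAG_NAMES = (
    r"h\d|[uod]l|[uisbqpa]|hr|abbr|acronym|address|area|base|basefont|bdo|big|"
    r"blockquote|body|button|caption|center|cite|code|col|colgroup|dd|del|dfn|"
    r"dir|div|dt|em|fieldset|font|hr|ins|kbd|label|legend|li|map|object|param|"
    r"pre|samp|script|select|small|span|strike|strong|sub|sup|table|tbody|td|"
    r"textarea|tfoot|th|thead|title|tr|tt|var"
)

# Group 1: "<" or "</", group 2: tag name, group 3: the attributes, group 4: ">" or "/>".
ATTR_RE = re.compile(r"(</?)(" + TAG_NAMES + r")(\s[^<>/]*)(/?>)")


def strip_attributes(html):
    """Drop group 3 (the attributes) and keep the rest of the tag."""
    return ATTR_RE.sub(r"\1\2\4", html)


sample = '<p class="calibre2"><span class="blarg">This is some text</span></p>'
print(strip_attributes(sample))
```

One thing to be aware of: because the attribute part is matched with [^<>/]*, a tag whose attributes contain a slash (an href with a URL or path, for instance) is not matched at all and keeps its attributes.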
You can then apply whatever styles you want directly to all elements. You will usually need two <p> styles, though: one indented and one flush. Also note that it will remove the location markers (such as the id attributes a ToC links to) from your <h1>, <h2>, ... headers, so only use this if you plan on regenerating the ToC. You can remove tags from the 'or' list to leave them untouched. I have most likely forgotten a few header ones in there.
If you know the book you're working with does not contain any 'useful' formatting in the spans, you can use something like </?span[^/>]*> to remove them all. But read the CSS first: spans are often used only to apply italics, bold or underlines, in which case convert those to their HTML tags, such as <i>, before removing the rest.
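As a rough illustration of that order of operations, here is a small Python sketch. The class name "italic" and the helpers spans_to_tag and drop_spans are hypothetical; check your own CSS for the real class names. It assumes the spans are not nested and carry only a class attribute, so anything more complicated needs a proper HTML parser.

```python
import re


def spans_to_tag(html, css_class, tag):
    """Turn <span class="css_class">...</span> into <tag>...</tag>.

    Assumes non-nested spans whose only attribute is the class.
    """
    pattern = re.compile(
        r'<span\s+class="' + re.escape(css_class) + r'"\s*>(.*?)</span>',
        re.DOTALL,
    )
    return pattern.sub(rf"<{tag}>\1</{tag}>", html)


def drop_spans(html):
    """Remove every remaining opening and closing span tag, keeping the text."""
    return re.sub(r"</?span[^/>]*>", "", html)


page = '<p><span class="italic">Emphasis</span> and <span class="calibre3">plain</span></p>'
converted = spans_to_tag(page, "italic", "i")  # keep the useful formatting first
cleaned = drop_spans(converted)                # then drop the leftover spans
print(cleaned)
```

Doing the conversion before the blanket removal is the point: once the spans are gone there is nothing left to tell you which text was italic.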
All in all, it's usually easier to just use the HTMLZ with the CSS set to use tags from the get-go.