Old 10-21-2011, 11:15 AM   #11
Serpentine
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
Quote:
Originally Posted by cybmole
but I think your code misses the fact that there's both a chapter number and a chapter title within the dross, both of which should ideally be salvaged.
Yes and no. It discards the attributes, which matter if you do not want to regenerate the Table of Contents; however, it's usually a lot safer to regenerate, since Sigil does this pretty well using the text between the <h> tags.

On a more general note: if you're like me and just like extremely simple, near-plain-HTML books, the following is quite handy. It is rather dangerous, though, so read and understand it first.

JGsoft syntax:
(?<=</?(h\d|[uod]l|[uisbqpa]|hr|abbr|acronym|address|area|base|basefont|bdo|big|blockquote|body|button|caption|center|cite|code|col|colgroup|dd|del|dfn|dir|div|dt|em|fieldset|font|ins|kbd|label|legend|li|map|object|param|pre|samp|script|select|small|span|strike|strong|sub|sup|table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|var))\s[^<>/]*(?=/?>)
replace : blank

Perl-compatible (e.g. Python) syntax:
(</?)(h\d|[uod]l|[uisbqpa]|hr|abbr|acronym|address|area|base|basefont|bdo|big|blockquote|body|button|caption|center|cite|code|col|colgroup|dd|del|dfn|dir|div|dt|em|fieldset|font|ins|kbd|label|legend|li|map|object|param|pre|samp|script|select|small|span|strike|strong|sub|sup|table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|var)(\s[^<>/]*)(/?>)
replace : \1\2\4

This will strip all attributes from the HTML tags in the list (a tag whose attributes contain a '/', such as a URL in an href, is left untouched), e.g.:
<p class="calibre2"><span class="blarg">This is some text</span></p>
becomes:
<p><span>This is some text</span></p>
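For anyone who would rather run this outside an editor, here is a minimal Python sketch of the Perl-compatible version above. The names TAGS, ATTR_RE and strip_attributes are made up for illustration, and the tag list is the one from the regex with the stray spaces taken out. Treat it as a rough sketch and try it on a copy of the book first.

```python
import re

# Tag names from the alternation in the regex above (spaces removed).
TAGS = (
    r"h\d|[uod]l|[uisbqpa]|hr|abbr|acronym|address|area|base|basefont|bdo|big|"
    r"blockquote|body|button|caption|center|cite|code|col|colgroup|dd|del|dfn|"
    r"dir|div|dt|em|fieldset|font|ins|kbd|label|legend|li|map|object|param|pre|"
    r"samp|script|select|small|span|strike|strong|sub|sup|table|tbody|td|"
    r"textarea|tfoot|th|thead|title|tr|tt|var"
)

# Group 1: "<" or "</", group 2: tag name,
# group 3: the attribute text to drop, group 4: ">" or "/>".
ATTR_RE = re.compile(r"(</?)(" + TAGS + r")(\s[^<>/]*)(/?>)")


def strip_attributes(html):
    """Remove the attribute text from every listed tag (hypothetical helper)."""
    return ATTR_RE.sub(r"\1\2\4", html)


sample = '<p class="calibre2"><span class="blarg">This is some text</span></p>'
print(strip_attributes(sample))
```

Because the attribute part is matched with [^<>/]*, any tag whose attributes contain a '/' (for example an <a> with a URL in its href) does not match and is left as it was.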

You can then apply whatever styles you want directly to all elements; however, you usually need two <p> styles, one indented and one flush. Also note that it will remove the location markers (the anchors the ToC links to) from your <h1>, <h2>, ... headers, so only use this if you plan on regenerating the ToC. You can remove tags from the 'or' list to leave them untouched. I have most likely forgotten a few header ones in there.

If you know the book you're working with does not contain any 'useful' formatting in the spans, you can use something like </?span[^/>]*> (replaced with nothing) to remove them all. But read the CSS first: often the spans are used only to apply italics, bold or underlines, in which case convert those to their HTML tags, like <i>, before stripping.
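As a rough illustration of that order of operations (convert the meaningful spans first, then strip the rest), here is a small Python sketch. The class name "italic" is purely hypothetical, so look up the real one in the book's CSS, and the sketch assumes spans are not nested inside each other.

```python
import re

# Hypothetical class name: replace "italic" with whatever the book's CSS uses.
ITALIC_SPAN_RE = re.compile(r'<span class="italic">(.*?)</span>', re.DOTALL)

# The span-stripping pattern from the post; matches opening and closing tags.
ANY_SPAN_RE = re.compile(r"</?span[^/>]*>")


def convert_then_strip_spans(html):
    """Turn italic spans into <i> tags, then drop every remaining span tag."""
    html = ITALIC_SPAN_RE.sub(r"<i>\1</i>", html)
    return ANY_SPAN_RE.sub("", html)


sample = '<p><span class="italic">Hello</span> <span class="calibre3">world</span></p>'
print(convert_then_strip_spans(sample))
```

Running the two steps the other way round would throw away the italics, which is why the CSS needs checking before anything is stripped.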

All in all, it's usually easier to just use the HTMLZ with the CSS set to use tags from the get-go.