10-29-2011, 07:59 PM   #9
Serpentine
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
Are you planning some sort of conversion for these files, i.e. to XHTML or HTML4? I ask because they don't have many closing tags and the general markup is pretty bad. All of the regex below is Python (PCRE if you remove the mode flags). Notepad++ has a horrible regex syntax and is really not worth the effort.

It's no problem to do what you propose; it just needs something like:
Code:
(<link[^<>]+>|<center>(?=\s*<table)|</?tr>|</?td>|</?table[^<>]*>|<P[^<>]*next[^<>]*>\s*<a[^<>]*>\s*<img[^<>]*>\s*</a>\s*</center>)
It's super messy, but it's the least effort. It assumes you don't have any other tables that you would like to keep! (That could be fixed, but I have a feeling these files don't use many tables, if any.)
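
For anyone running this from a script rather than an editor, here is a minimal sketch of how the pattern above could be applied with Python's re module. The sample markup and the names STRIP, strip_layout and sample are made up for illustration; they are only my guess at what the files look like.

```python
import re

# Pattern copied from the post above, split over two lines for readability.
STRIP = re.compile(
    r"(<link[^<>]+>|<center>(?=\s*<table)|</?tr>|</?td>|</?table[^<>]*>"
    r"|<P[^<>]*next[^<>]*>\s*<a[^<>]*>\s*<img[^<>]*>\s*</a>\s*</center>)"
)


def strip_layout(html):
    """Delete every match of the layout pattern (replace with nothing)."""
    return STRIP.sub("", html)


# Hypothetical sample, invented for this example.
sample = (
    '<link rel="stylesheet" href="style.css">\n'
    '<center><table width="100%"><tr><td>Chapter text</td></tr></table>\n'
    '<P class="next"><a href="ch2.html"><img src="next.gif"></a></center>\n'
)

cleaned = strip_layout(sample)
print(cleaned.strip())
```

Note that the pattern has no (?i) flag, so it only matches the tag case exactly as written (lowercase table tags, uppercase P on the "next" link).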

If you are planning to convert, something like :
Quote:
<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;I assume this is paragraph text
should become :
Quote:
<p>I assume this is paragraph text</p>
And here's some regex for that :
Code:
Find : (?mi)^(<(?P<tag>br|p)>(?P<spaces>(?:&nbsp;| )+)(?P<paratex>[^\n\r]+)$|^<br>$)
Replace : <p>\g<paratex></p>
It preserves bare <br> lines as blank paragraphs. It does not, however, capture the final </p> tag at the bottom of the file; you could clean that up relatively easily with another match.
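
As a rough sketch, the find/replace above could be driven from Python like this. The sample text and the names PARA, to_paragraph and sample are assumptions for the example. I use a replacement function instead of the \g<paratex> template because the bare <br> branch leaves that group unmatched, and older Python versions raise an error on an unmatched group in a template.

```python
import re

# Find pattern copied from the post above.
PARA = re.compile(
    r"(?mi)^(<(?P<tag>br|p)>(?P<spaces>(?:&nbsp;| )+)(?P<paratex>[^\n\r]+)$|^<br>$)"
)


def to_paragraph(match):
    # paratex is None when the bare <br> alternative matched,
    # so fall back to an empty string and emit a blank paragraph.
    return "<p>%s</p>" % (match.group("paratex") or "")


# Hypothetical sample, invented for this example.
sample = (
    "<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;I assume this is paragraph text\n"
    "<BR>\n"
    "<BR>&nbsp;&nbsp;Second paragraph\n"
)

converted = PARA.sub(to_paragraph, sample)
print(converted)
```

On the sample this gives one <p>...</p> per text line and an empty <p></p> for the lone <BR> line.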

The next one will do pretty much the same thing for headings.
Code:
Find : (?mi)^\s*<(?P<tag>h\d)[^<>]*>(?P<heading>[^<>]+)(</\1>)?$
Replace : <\g<tag>>\g<heading></\g<tag>>
After that it should be in good enough shape to run through HTML Tidy or whatever.
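
Here is a small sketch of the heading regex in Python, mainly to show that it closes headings that were left open and leaves already-closed ones intact. The sample headings and the names HEADING and sample are invented for the example.

```python
import re

# Find pattern copied from the post above; \1 refers to the "tag" group.
HEADING = re.compile(
    r"(?mi)^\s*<(?P<tag>h\d)[^<>]*>(?P<heading>[^<>]+)(</\1>)?$"
)

# Hypothetical sample: one unclosed heading with an attribute, one closed.
sample = (
    '<H2 align="center">Chapter One\n'
    "<h3>A closed heading</h3>\n"
)

# Rebuild each heading as <tag>text</tag>, dropping any attributes.
fixed = HEADING.sub(r"<\g<tag>>\g<heading></\g<tag>>", sample)
print(fixed)
```

Two things to be aware of: the tag keeps whatever case it had in the source, and the leading \s* can also swallow blank lines that sit directly before a heading.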

If you are still going slow on Monday, I'll write something to do this in batches - I'm on contract work til then :/

Last edited by Serpentine; 10-29-2011 at 08:01 PM. Reason: mention the heading regex