MobileRead Forums - View Single Post

Serpentine · 10-29-2011, 08:59 PM

Are you planning on some sort of conversion for these files, e.g. to XHTML or HTML4? I ask because they don't have many closing tags and the general markup is pretty bad. All of the regex here is Python (PCRE if you remove the mode flags). Notepad++ has a horrible regex syntax, really not worth the effort.

It's no problem to do what you propose; it just needs something like:

Code:

(<link[^<>]+>|<center>(?=\s*<table)|</?tr>|</?td>|</?table[^<>]*>|<P[^<>]*next[^<>]*>\s*<a[^<>]*>\s*<img[^<>]*>\s*</a>\s*</center>)

It's super messy, but it's the least effort. It assumes you don't have any other tables that you would like to keep! (That could be fixed, but I have a feeling these files don't use (m)any tables.)
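As a rough sketch of how that pattern could be applied from Python: this assumes the intent is simply to delete every match (replace with nothing). The `strip_layout` name and the sample markup are made up for illustration, not taken from the actual files.

```python
import re

# Pattern copied from the post. No flags are set, so it is case-sensitive
# (note the upper-case <P in the last alternative).
STRIP_RE = re.compile(
    r"(<link[^<>]+>|<center>(?=\s*<table)|</?tr>|</?td>|</?table[^<>]*>"
    r"|<P[^<>]*next[^<>]*>\s*<a[^<>]*>\s*<img[^<>]*>\s*</a>\s*</center>)"
)


def strip_layout(html: str) -> str:
    """Delete every match of STRIP_RE (assumed intent: remove, not replace)."""
    return STRIP_RE.sub("", html)


# Hypothetical input; the real files may be laid out differently.
sample = (
    '<link rel="stylesheet" href="style.css">\n'
    '<center>\n<table width="90%"><tr><td>\n'
    '<BR>&nbsp;&nbsp;Some text\n'
    '</td></tr></table>\n'
    '<P class="next"><a href="next.html"><img src="next.gif"></a></center>\n'
)

print(strip_layout(sample).strip())
```

Remember that every table tag goes, so run this only on files where no table needs to survive.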

If you are planning to convert, something like :

Quote:

<BR>     I assume this is paragraph text

should become :

Quote:

<p>I assume this is paragraph text</p>

And here's some regex for that :

Code:

Find : (?mi)^(<(?P<tag>br|p)>(?P<spaces>(?:&nbsp;| )+)(?P<paratex>[^\n\r]+)$|^<br>$)
Replace : <p>\g<paratex></p>

It preserves lone <br> lines as blank paragraphs. It does not, however, catch the stray closing </p> tag at the bottom of the file; you could clean that up relatively easily with another match.
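Here is a minimal sketch of running that find/replace with Python's `re` module. The paragraph pattern and replacement are the ones from the post; the second pass that drops a line consisting only of `</p>` is my own guess at the "another match" cleanup, and the sample text is invented. On Python 3.5 and later, the unmatched `paratex` group in the lone `<br>` case is substituted as an empty string, which is what produces the blank paragraph.

```python
import re

# Find/replace pair from the post.
PARA_RE = re.compile(
    r"(?mi)^(<(?P<tag>br|p)>(?P<spaces>(?:&nbsp;| )+)(?P<paratex>[^\n\r]+)$|^<br>$)"
)
PARA_REPL = r"<p>\g<paratex></p>"

# Assumption, not from the post: after conversion every real paragraph line
# starts with <p>, so a line that starts with </p> is treated as a stray tag.
STRAY_CLOSE_RE = re.compile(r"(?mi)^</p>[ \t]*$\n?")


def convert_paragraphs(text: str) -> str:
    """Wrap indented <br>/<p> lines in <p>...</p>, then drop stray </p> lines."""
    text = PARA_RE.sub(PARA_REPL, text)
    return STRAY_CLOSE_RE.sub("", text)


# Hypothetical input modelled on the quoted example.
sample = (
    "<BR>&nbsp;&nbsp;&nbsp;I assume this is paragraph text\n"
    "<br>\n"
    "<BR>&nbsp; Second paragraph\n"
    "</p>\n"
)

print(convert_paragraphs(sample))
```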

The next one will do pretty much the same thing for headings.

Code:

Find : (?mi)^\s*<(?P<tag>h\d)[^<>]*>(?P<heading>[^<>]+)(</\1>)?$
Replace : <\g<tag>>\g<heading></\g<tag>>
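The heading pass could be sketched the same way. The pattern and replacement are from the post; the function name and sample headings are invented. Note that the tag keeps whatever case it had in the source, and the attributes on the opening tag are dropped.

```python
import re

# Find/replace pair from the post. \1 refers to the first group, which is
# the named group "tag", so an existing closing tag is absorbed if present.
HEADING_RE = re.compile(
    r"(?mi)^\s*<(?P<tag>h\d)[^<>]*>(?P<heading>[^<>]+)(</\1>)?$"
)
HEADING_REPL = r"<\g<tag>>\g<heading></\g<tag>>"


def fix_headings(text: str) -> str:
    """Strip attributes from heading tags and add a closing tag if missing."""
    return HEADING_RE.sub(HEADING_REPL, text)


# Hypothetical input: one unclosed heading with attributes, one closed one.
sample = (
    '<H2 align="center">Chapter One\n'
    "<h3>A Subsection</h3>\n"
)

print(fix_headings(sample))
```

One thing to watch for: the leading `\s*` can also swallow blank lines that sit directly above a heading, so check the spacing afterwards if that matters to you.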

After that it should be in good enough shape to run through HTML Tidy or whatever.

If this is still going slowly for you on Monday, I'll write something to do it in batches. I'm on contract work until then :/