Hi eschwartz,
Yes, I agree. That is why we moved away from Tidy: it cleaned too much and did not fully support all HTML5 tags, SVG tags, and math tags. Regular gumbo serialization changes no whitespace whatsoever. Gumbo prettyprinting will now condense runs of whitespace inside any allowed tag (and that is quite a specific list, to prevent problems) and will trim leading and trailing whitespace from inside p tags only. Everywhere else, people will just have to live with it or edit it by hand.
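To make the whitespace rules concrete, here is a rough Python sketch of the behaviour described above. It is not the actual gumbo prettyprinting code: the function name `tidy_text` and the contents of `CONDENSE_TAGS` are made up for illustration (the real allowed list is more specific), and it assumes p is itself on the allowed list.

```python
import re

# Hypothetical allow-list, for illustration only; the real list of tags
# that gumbo prettyprinting condenses is more specific than this.
CONDENSE_TAGS = {"p", "h1", "h2", "h3", "li", "span"}

# A run of one or more spaces, tabs, or newlines.
_WS_RUN = re.compile(r"[ \t\r\n]+")


def tidy_text(tag, text):
    """Apply the whitespace rules sketched above to the text inside `tag`."""
    if tag not in CONDENSE_TAGS:
        # Any tag not on the list (pre, code, ...) is left untouched.
        return text
    # Condense each run of whitespace to a single space.
    text = _WS_RUN.sub(" ", text)
    if tag == "p":
        # Only p tags also lose leading and trailing whitespace.
        text = text.strip(" ")
    return text
```

With these assumptions, text inside a p tag is condensed and trimmed, text inside a span is condensed but keeps its leading and trailing space, and text inside a pre tag comes back unchanged.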
I am thinking of creating a "Clean-and-Sanitize" plugin using html5lib's parser and sanitize code and a few things like that, just to give people the option of "heavy cleaning" if they end up starting with heaping piles of crap HTML code (i.e., they have unpacked an entire mobi7 book into one huge HTML file with lots of out-of-date HTML 3.2 pieces flying around).
Thanks!
KevinH