Old 09-03-2020, 07:10 AM   #148
geek1011
Wizard
Posts: 2,808
Karma: 7423683
Join Date: May 2016
Location: Ontario, Canada
Device: Kobo Mini, Aura Edition 2 v1, Clara HD
Quote:
Originally Posted by davidfor View Post
But, I am curious to see examples that it makes the HTML worse. If you have some, can you send them to me?
I was referring to things like the recent bug with whitespace (this, IIRC), this one with <br/> tags, and this one with tag ordering. All of these have been fixed in KTE, but kepubify doesn't hit that kind of problem as often, since it does more of its work using a normal HTML parser and then outputs valid polyglot XHTML+HTML4+HTML5 no matter what the original was (though it does preserve XML declarations). It's not so much that it makes the HTML worse as that its processing method is more bug-prone.

Kepubify will also handle EPUB3 HTML5 auto-closing tags correctly: it closes them rather than nesting everything that follows inside them. Remember that a document like <!DOCTYPE html><html lang="en"><title>Document</title><meta charset="utf-8"><div>test</div><p>paragraph<p><i>another</i> is valid HTML5.
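To make the auto-closing point concrete, here is a minimal sketch in Python. It is not kepubify's code (kepubify is written in Go and uses a full HTML parser); ToyXhtmlWriter, to_xhtml, and the abbreviated VOID and CLOSES_P sets are names I made up for illustration. It only covers two of the HTML5 rules: void elements, and a new block-level start tag closing an open <p>. It also doesn't add the implied <head> and <body> that a real HTML5 parser would.

```python
import html
from html.parser import HTMLParser

# Void elements have no closing tag in HTML5 (abbreviated list).
VOID = {"br", "hr", "img", "meta", "link", "input"}
# Abbreviated list of start tags that implicitly close an open <p>.
CLOSES_P = {"p", "div", "ul", "ol", "table", "blockquote", "h1", "h2", "h3"}


class ToyXhtmlWriter(HTMLParser):
    """Hypothetical helper: re-emits markup with every element closed."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True, so text arrives decoded
        self.out = []
        self.stack = []  # names of the currently open elements

    def _close_through(self, tag):
        # Close open elements until `tag` itself has been closed.
        while self.stack:
            top = self.stack.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_starttag(self, tag, attrs):
        if tag in CLOSES_P and "p" in self.stack:
            self._close_through("p")
        rendered = "".join(
            ' {}="{}"'.format(name, html.escape(value or ""))
            for name, value in attrs
        )
        if tag in VOID:
            self.out.append(f"<{tag}{rendered}/>")
        else:
            self.out.append(f"<{tag}{rendered}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:  # ignore stray or void end tags
            self._close_through(tag)

    def handle_data(self, data):
        self.out.append(html.escape(data, quote=False))


def to_xhtml(fragment):
    writer = ToyXhtmlWriter()
    writer.feed(fragment)
    writer.close()  # flush any buffered text
    while writer.stack:  # close whatever is still open at end of input
        writer.out.append(f"</{writer.stack.pop()}>")
    return "".join(writer.out)


print(to_xhtml("<p>paragraph<p><i>another</i>"))
# <p>paragraph</p><p><i>another</i></p>
```

A processor that treats the same input as XML, or that never closes the first <p>, would end up with the second paragraph nested inside the first one.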

The anything->XHTML+HTML4+HTML5 conversion, plus the fact that kepubify parses the HTML as HTML, means that it is generally a bit more consistent with browser engine-based renderers when it comes to parsing bad HTML. I don't have a specific example of this right now, but I've seen a few of these before.

One theoretical example I haven't seen in the wild would be the use of regexps vs the actual tree when cleaning up tags. Basically, this would cause issues if there were missing closing tags, it wouldn't remove tags if they were self-closing (possibly due to processing as pure XML using another tool), and it would cause issues if there happened to be a script dealing with these as strings.
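A rough illustration of the regexp vs tree difference, again in Python and again only a sketch: the pattern and both function names are hypothetical and not taken from KTE or kepubify, and the tree version uses xml.etree.ElementTree, so it assumes the input is already well-formed XML.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern: only matches an explicit open/close pair.
EMPTY_SPAN_RE = re.compile(r"<span[^>]*></span>")


def strip_empty_spans_regex(markup):
    # String-based cleanup: sees characters, not elements.
    return EMPTY_SPAN_RE.sub("", markup)


def strip_empty_spans_tree(markup):
    # Tree-based cleanup: <span></span> and <span/> are the same element.
    root = ET.fromstring(markup)
    for parent in list(root.iter()):
        prev = None  # last child that was kept
        for child in list(parent):
            if child.tag == "span" and len(child) == 0 and not child.text:
                if child.tail:
                    # Keep the text that followed the removed span.
                    if prev is None:
                        parent.text = (parent.text or "") + child.tail
                    else:
                        prev.tail = (prev.tail or "") + child.tail
                parent.remove(child)
            else:
                prev = child
    return ET.tostring(root, encoding="unicode")


sample = '<p>one<span id="a"></span> two<span id="b"/> three</p>'
print(strip_empty_spans_regex(sample))
# <p>one two<span id="b"/> three</p>
print(strip_empty_spans_tree(sample))
# <p>one two three</p>
```

The regexp leaves the self-closing span behind because it never sees a literal </span>, while the tree version removes both since they parse to the same kind of node.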

There are also a few differences in span processing like this.

I don't have these issues myself since I usually regenerate the HTML code for my books when I get them. I have a small script which essentially extracts paragraphs and things, converts them to an internal format similar to FB2, then converts that back to EPUB.

Quote:
The conversion is just assuming an input book. It doesn't have to be epub. And the output is largely what would have been seen with a conversion to epub and then the spans added. If this fails, it will be because the input file was rubbish.
Yes, that's correct. Many people seem to expect it not to touch that kind of thing, though.

Last edited by geek1011; 09-03-2020 at 01:42 PM.