Old 02-04-2017, 12:59 PM   #15
KevinH
Sigil Developer
DiapDealer, I got the same results as you did: after renaming the files to .html, I was able to load them all except the html_p one.

The problem with the html_p.html file is the lack of a DOCTYPE on the file itself. It seems sigil-gumbo actually repairs differently depending on the DOCTYPE. This was something I did not know, but it makes sense now.

With no DOCTYPE on the html_p.html file, Sigil literally needs to clean the file twice to get it to a properly clean state. The first pass cleans up a bunch of garbage but not the table-in-p issue, because without a clearly recognized DOCTYPE, gumbo cleans only to heavily transitional HTML (very weak cleaning). That first pass does add the proper DOCTYPE at the end, though (our Sigil code does that).

The second pass sees the DOCTYPE that the first pass added and then cleans up the table-in-p problem.

If I simply edit html_p.html and add <!DOCTYPE html>, or the epub2 version of that, at the top of the file before trying to load it, gumbo properly cleans everything in one pass.

So it appears that I will need to check for and add in the DOCTYPE inside CleanSource::Mend before passing anything to gumbo so that gumbo will properly repair the whole mess in one pass.
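
As a rough illustration of that pre-check, here is a minimal sketch in plain C++. It is not the actual Sigil code: HasDoctype and EnsureDoctype are hypothetical helper names, it uses std::string rather than Sigil's Qt string types, and it assumes the source has no BOM or leading whitespace before the XML declaration. The epub2 DOCTYPE shown is the standard XHTML 1.1 one.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not part of Sigil): true if a "<!DOCTYPE" declaration
// appears, case-insensitively, before the first "<html" tag.
bool HasDoctype(const std::string &source)
{
    std::string lower(source);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    const std::string::size_type doctype_pos = lower.find("<!doctype");
    if (doctype_pos == std::string::npos) {
        return false;
    }
    const std::string::size_type html_pos = lower.find("<html");
    return html_pos == std::string::npos || doctype_pos < html_pos;
}

// Hypothetical helper (not part of Sigil): returns the source unchanged if it
// already has a DOCTYPE, otherwise inserts one so the parser sees it on the
// first pass. Any XML declaration is kept as the very first thing in the file.
std::string EnsureDoctype(const std::string &source, bool epub3)
{
    if (HasDoctype(source)) {
        return source;
    }
    const std::string doctype = epub3
        ? "<!DOCTYPE html>\n"
        : "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\"\n"
          "  \"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd\">\n";

    std::string::size_type insert_at = 0;
    if (source.compare(0, 5, "<?xml") == 0) {
        const std::string::size_type end = source.find("?>");
        if (end != std::string::npos) {
            insert_at = end + 2;
            // Skip the line break after the XML declaration, if there is one.
            while (insert_at < source.size() &&
                   (source[insert_at] == '\r' || source[insert_at] == '\n')) {
                ++insert_at;
            }
        }
    }
    return source.substr(0, insert_at) + doctype + source.substr(insert_at);
}
```

The idea would be to run something like EnsureDoctype on the raw source at the top of CleanSource::Mend, so that gumbo only ever sees input that already carries a DOCTYPE.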

I will keep playing around with this.

Thanks for the test cases.