MobileRead Forums - View Single Post - Sigil-0.9.8 Released

KevinH · 03-23-2017, 04:12 PM

You can always compile and include your own tidy html5 library inside your plugin (one plugin author already does this), or you can use Sigil's internal gumbo library. Sigil's gumbo parser is a fork of Google's Gumbo parser that autocorrects using the exact same rules as browsers. Sigil's gumbo library is available through the plugin API as well.

FWIW, you might also be better off parsing the file with gumbo first and then doing replacements only on the text parts, so that you do not break the xhtml syntax with stray search-and-replace operations.
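To illustrate the "parse first, then replace only in the text parts" idea, here is a minimal sketch. It uses Python's standard-library `html.parser` purely as a stand-in: it is not Sigil's gumbo binding, it does not autocorrect broken markup the way gumbo does, and the names `TextOnlyReplacer` and `replace_in_text` are made up for this example.

```python
import re
from html.parser import HTMLParser


class TextOnlyReplacer(HTMLParser):
    """Re-emit markup unchanged and apply a regex to text nodes only."""

    def __init__(self, pattern, repl):
        # convert_charrefs=False reports entities such as &amp; as separate
        # events, so they pass through untouched.
        super().__init__(convert_charrefs=False)
        self.pattern = re.compile(pattern)
        self.repl = repl
        self.in_raw = False  # True while inside <script> or <style>
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.in_raw = True
        self.out.append(self.get_starttag_text())  # original tag text

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.in_raw = False
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # The only place where the search and replace is applied.
        if self.in_raw:
            self.out.append(data)
        else:
            self.out.append(self.pattern.sub(self.repl, data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def replace_in_text(markup, pattern, repl):
    parser = TextOnlyReplacer(pattern, repl)
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


src = '<p class="note">a note about notes</p>'
print(replace_in_text(src, r"note", "memo"))
# prints: <p class="note">a memo about memos</p>
```

A plain `re.sub` over the whole file would also have rewritten `class="note"`. One limitation of this sketch: text is delivered in chunks split at tags and entities, so a pattern cannot match across them.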

I think there is a simple example of using Sigil's gumbo in the testme plugin that is documented in the latest Plugin framework Developers guide.

Quote:

Originally Posted by notimp

@devs - I have an inside baseball question - but the GUI plugin API has tickled my interest - and I'm trying to gauge if it is worth an effort on my part to try something. There is an "ignorant" part of that question that could be solved by me installing the current build and finding out - but the follow-up and clarification, not so much, so I'm asking right away.

I've written several regex-based "methods" to automate html parsing for htmls/epubs from certain OCR sources - they rely on Sigil having implemented something like tidy (a checker that auto-propagates/closes tags that were removed by certain replace-alls).

At the beginning of last year, tidy (?) was cut from Sigil with an engine change I believe, and it wasn't "returned" in the immediate months after.

Two questions -

Is html tidy (or something that would serve that function) integrated in current builds? (Not talking about flightcrew - but actual tidy functionality: "fix missing tags on the fly".)

If not - would you consider implementing it again, just on the basis of one guy saying that it would be great to have that functionality back?

-

Here is my rationale for it - I can't solve my issue with minimal matching, because I need all the complexity in the regex to catch probable formatting patterns, and they are complex ones, where minimal matching would match too early.

I also can't compensate for it - because I don't necessarily know where a closing tag that gets replaced was a "needed" one. So I need some sort of integrity parsing after the replacements are done.

Thank you for the response, and I guess kind consideration.