MobileRead Forums - View Single Post

KevinH · 11-06-2025, 03:01 PM

No, but since the gumbo parser is available for plugin use, you should be able to easily create an equivalent that mends xhtml files. Basically you feed the gumbo parser your code and ask for xhtml (or prettified xhtml) out and it will generate it. Look out for how you decide to handle xml headers and any numeric entities, since gumbo will strip both out.

But why not just use an automation list and have it run mend for all files and then have it launch your plugin? Or vice versa. That is what automation was designed for.

Here is a snippet that should mend and prettify the contents of an xhtml file, passed in as a string called "samp" in this example:

Code:

    import sigil_gumbo_bs4_adapter as gumbo_bs4

    samp = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" 
  "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en-US">
<head><title>testing & entities</title></head>
<body>
  <p class="first second">this&nbsp;is*the*<i><b>copyright</i></b> symbol "&copy;"</p>
  <p xmlns:xlink="http://www.w3.org/xlink" class="second" xlink:href="http://www.ggogle.com">this used to test atribute namespaces</p>
</body>
</html>
"""

    # parse (and thereby mend) the markup with gumbo
    soup = gumbo_bs4.parse(samp)
    # serialize the mended tree as prettified xhtml
    newsamp = soup.prettyprint_xhtml()

If you just want to mend but not prettify, that last line should be:

    newsamp = soup.serialize_xhtml()

You will still have to re-encode any characters you wanted as numeric entities, since all numeric entities are replaced by their Unicode equivalents when parsed.
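
For that post-processing step, something along these lines might work. It is only a rough sketch in plain Python: the helper names (restore_xml_declaration, to_numeric_entities), the exact XML declaration text, and the default choice of characters to re-encode (no-break space and the copyright sign) are all assumptions for illustration, not part of Sigil or the gumbo adapter.

```python
# Hypothetical helpers, not part of Sigil's plugin API.
# The declaration text below is an assumption; use whatever your files need.
XML_DECL = '<?xml version="1.0" encoding="utf-8"?>\n'


def restore_xml_declaration(xhtml, decl=XML_DECL):
    """Prepend an XML declaration unless the text already starts with one."""
    body = xhtml.lstrip()
    if body.startswith("<?xml"):
        return xhtml
    return decl + body


def to_numeric_entities(xhtml, chars="\u00a0\u00a9"):
    """Replace each character listed in `chars` with its decimal numeric entity.

    The default covers only the no-break space and the copyright sign
    (an assumption); pass your own string of characters to change that.
    """
    wanted = set(chars)
    return "".join("&#%d;" % ord(c) if c in wanted else c for c in xhtml)
```

You would then run the mended string through both, for example newsamp = to_numeric_entities(restore_xml_declaration(newsamp)). Keeping the character list explicit means only the characters you actually wanted as entities get converted, and everything else stays as plain Unicode.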

Last edited by KevinH; 11-06-2025 at 03:27 PM.