MobileRead Forums

ebray187 03-27-2021 10:11 PM

Problems with Beautifulsoup with custom tags
 
Hi! I'm having trouble adding a custom tag from my plugin using BeautifulSoup.
The code:
Code:

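    # Standalone import; inside a Sigil plugin this would be sigil_bs4 instead.
    from bs4 import BeautifulSoup

    html = '<p id="nt3"><sup>[3]</sup> Note 1. <a href="../Text/Section0001.xhtml#nt3">&lt;&lt;</a></p>'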
   
    ## BeautifulSoup parser
    soup = BeautifulSoup(html, "html.parser")
    orig_soup = str(soup)
    original_tag = soup.p

    dict_atributes = {"xml:lang" : "la"}
    new_tag = soup.new_tag("i", attrs=dict_atributes)
    new_tag.string = "Ibid"
    original_tag.insert(1, " ")
    original_tag.insert(2, new_tag)
    original_tag.insert(3, ".")
   
    print("OUT:\n" + str(original_tag))

Outside Sigil, everything works:
Code:

$ python test.py
OUT:
<p id="nt3"><sup>[3]</sup> <i xml:lang="la">Ibid</i>. Note 1. <a href="../Text/Section0001.xhtml#nt3">&lt;&lt;</a></p>

But from Sigil I get:
Code:

OUT:
<p id="nt3"><sup>[3]</sup> <i attrs="{'xml:lang': 'la'}">Ibid</i>. Note 1. <a href="../Text/Section0001.xhtml#nt3"><<</a></p>

Any ideas? I can't find any information about Sigil's BeautifulSoup implementation.
Thanks!

PS: Using Python 3.8 and Sigil 1.4.3

KevinH 03-27-2021 11:49 PM

What are the two XML-escaped "<" characters in the text for?

How are you getting the OUT?

If you print it from the plugin, it goes through an XML encode and decode pass when it is returned from the plugin process over stdout as XML. So instead of printing to see this value, write it to a log file from the plugin so you can see exactly what BeautifulSoup is generating. My guess is that it is identical to what you see outside Sigil, and that it is just getting decoded while passing back through the stdout XML from the plugin.
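
To illustrate the point about the stdout round trip, here is a minimal sketch. It uses xml.sax.saxutils from the Python standard library as a stand-in for the encode/decode pass; the actual Sigil plugin launcher code is not shown in this thread, and write_log and the log path are hypothetical names.

```python
import os
import tempfile
from xml.sax.saxutils import escape, unescape

# Markup as BeautifulSoup would serialize it: the arrows stay escaped.
fragment = '<a href="#nt3">&lt;&lt;</a>'

# A balanced encode/decode round trip returns the text unchanged.
round_tripped = unescape(escape(fragment))

# One decode pass too many turns the entities into raw "<" characters,
# which is what the printed output in the first post looks like.
over_decoded = unescape(fragment)


def write_log(text, path):
    """Hypothetical helper: write text straight to a file, bypassing stdout."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


# Assumed log location for this sketch; a plugin could use any writable path.
log_path = os.path.join(tempfile.gettempdir(), "bs4_plugin_log.txt")
write_log(fragment, log_path)
```

Reading the log file back shows the string exactly as it was generated, which makes it easier to separate a BeautifulSoup problem from a display problem.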

ebray187 03-28-2021 12:37 AM

Quote:

Originally Posted by KevinH (Post 4106910)
What are the two XML-escaped "<" characters in the text for?

They link back to the note reference in the text, like a back button:
Quote:

[1] This is a note in the notes.xhtml chapter of the book. The arrows on the right return to the note call in chapter.xhtml. <<
In the XHTML they are written as &lt;.

Quote:

Originally Posted by KevinH (Post 4106910)
How are you getting the OUT?

From the output of the print() function shown in the Plugin Runner. That output is consistent with what bk.writefile() writes.

Quote:

Originally Posted by KevinH (Post 4106910)
If you print it from the plugin, it goes through an XML encode and decode pass when it is returned from the plugin process over stdout as XML. So instead of printing to see this value, write it to a log file from the plugin so you can see exactly what BeautifulSoup is generating. My guess is that it is identical to what you see outside Sigil, and that it is just getting decoded while passing back through the stdout XML from the plugin.

I'm getting the same wrong output in a log file:
Quote:

OUT:
<p id="nt3"><sup>[3]</sup> <i attrs="{'xml:lang': 'la'}">Ibid</i>. Note 1. <a href="../Text/Section0001.xhtml#nt3">&lt;&lt;</a></p>
But running it outside Sigil works fine.

Here is the exact code:
Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys, os, re
import xml.etree.ElementTree as ET

try:
    from sigil_bs4 import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup

def run(bk):
    html = '<p id="nt3"><sup>[3]</sup> Note 1. <a href="../Text/Section0001.xhtml#nt3">&lt;&lt;</a></p>'
   
    ## BeautifulSoup parser
    soup = BeautifulSoup(html, "html.parser")
    orig_soup = str(soup)
    original_tag = soup.p

    dict_atributes = {"xml:lang" : "la"}
    new_tag = soup.new_tag("i", attrs=dict_atributes)
    new_tag.string = "Ibid"
    original_tag.insert(1, " ")
    original_tag.insert(2, new_tag)
    original_tag.insert(3, ".")
   
    output = "OUT:\n" + str(original_tag)

    with open("log.txt", "w") as f:
        f.write(output)
   
    print(output)

    return 0

def main():
    html = '<p id="nt3"><sup>[3]</sup> Note 1. <a href="../Text/Section0001.xhtml#nt3">&lt;&lt;</a></p>'
   
    ## BeautifulSoup parser
    soup = BeautifulSoup(html, "html.parser")
    orig_soup = str(soup)
    original_tag = soup.p

    dict_atributes = {"xml:lang" : "la"}
    new_tag = soup.new_tag("i", attrs=dict_atributes)
    new_tag.string = "Ibid"
    original_tag.insert(1, " ")
    original_tag.insert(2, new_tag)
    original_tag.insert(3, ".")
   
    output = "OUT:\n" + str(original_tag)

    with open("log.txt", "w") as f:
        f.write(output)
   
    print(output)

if __name__ == "__main__":
    sys.exit(main())

Thanks a lot!

KevinH 03-28-2021 12:49 PM

If you compare that to your first post, you will see they are not the same. The printed output in the first post shows the &lt;&lt; decoded, which it should not be if the markup is to be used safely.

The issue is that you are trying to assign the attributes as a dict. It is converted to what is needed when run outside the plugin environment but not inside. My guess is that the default dict type is different: one may be an ordered dict collection while the other is not.

Have you tried assigning that attribute in a different way? Sigil's internal bs4 version has many modifications so that it works on older Python 3 versions back to 3.4, so it may be using different types than a recent BS4 version that only runs on a limited set of Python 3 versions.
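
One way to check how a particular BeautifulSoup build treats attrs= is to inspect the signature of new_tag. The sketch below uses only the standard inspect module. old_style and new_style are hypothetical stand-ins, not the real bs4 signatures, and accepts_named_parameter is a made-up helper name.

```python
import inspect


def accepts_named_parameter(func, name):
    """Return True if func declares name as an explicit parameter.

    A catch-all such as **attrs is itself called "attrs" in the signature,
    so the parameter kind is checked as well as the name.
    """
    param = inspect.signature(func).parameters.get(name)
    return param is not None and param.kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )


def old_style(name, **attrs):
    """Stand-in: every keyword argument is collected into attrs."""


def new_style(name, attrs=None, **kwattrs):
    """Stand-in: attrs is a real parameter that takes a dict."""


old_has_attrs = accepts_named_parameter(old_style, "attrs")
new_has_attrs = accepts_named_parameter(new_style, "attrs")
```

In a plugin, the same helper could be called as accepts_named_parameter(soup.new_tag, "attrs"); a False result would suggest that attrs= is being treated as an ordinary attribute name.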

KevinH 03-28-2021 12:53 PM

I did notice this:

Quote:

(This is a new feature in Beautiful Soup 4.4.0.)

What if you need to create a whole new tag? The best solution is to call the factory method BeautifulSoup.new_tag():

soup = BeautifulSoup("<b></b>", 'html.parser')
original_tag = soup.b

new_tag = soup.new_tag("a", href="http://www.example.com")
original_tag.append(new_tag)
original_tag
# <b><a href="http://www.example.com"></a></b>

new_tag.string = "Link text."
original_tag
# <b><a href="http://www.example.com">Link text.</a></b>
So my guess is that Sigil's internal version does not support adding attributes through attrs= the way you do with that method.

KevinH 03-28-2021 01:03 PM

Here are alternative ways to add an attribute ...

Quote:

Attributes
A tag may have any number of attributes. The tag <b id="boldest"> has an attribute “id” whose value is “boldest”. You can access a tag’s attributes by treating the tag like a dictionary:

tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
tag['id']
# 'boldest'
You can access that dictionary directly as .attrs:

tag.attrs
# {'id': 'boldest'}
You can add, remove, and modify a tag’s attributes. Again, this is done by treating the tag as a dictionary:

tag['id'] = 'verybold'
tag['another-attribute'] = 1
tag
# <b another-attribute="1" id="verybold"></b>

del tag['id']
del tag['another-attribute']
tag
# <b>bold</b>

tag['id']
# KeyError: 'id'
tag.get('id')
# None
So I would remove the attrs= parameter from the new_tag() call, and instead create the tag and then either use the new tag in its dict mode to add the needed attributes one by one, or assign a dict to the tag's .attrs if possible.
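
The two-step approach described above can be sketched as a small helper. make_tag and demo are hypothetical names; the helper relies only on new_tag(name), item assignment on the tag, and .string, all of which appear in the documentation quoted in this thread.

```python
def make_tag(soup, name, attributes=None, text=None):
    """Create a tag first, then add its attributes one at a time."""
    tag = soup.new_tag(name)  # deliberately no attrs= keyword here
    for key, value in (attributes or {}).items():
        tag[key] = value  # a plain string key, so "xml:lang" is fine
    if text is not None:
        tag.string = text
    return tag


def demo():
    """Build the <i xml:lang="la">Ibid</i> tag from the original question."""
    # Imported here so make_tag itself has no hard dependency on bs4.
    try:
        from sigil_bs4 import BeautifulSoup
    except ImportError:
        from bs4 import BeautifulSoup

    html = '<p id="nt3"><sup>[3]</sup> Note 1.</p>'
    soup = BeautifulSoup(html, "html.parser")
    soup.p.insert(1, make_tag(soup, "i", {"xml:lang": "la"}, "Ibid"))
    return str(soup.p)
```

Calling demo() inside or outside Sigil and comparing the two results is a quick way to confirm that the attribute is written the same way in both environments.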

KevinH 03-28-2021 01:18 PM

I took a peek at the latest BS4 source on Launchpad, and they have changed how they handle passing the attrs argument.

So doing it in two steps will be more compatible across versions of both bs4 and Python 3.
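
A plausible explanation for the attrs="{...}" output, assuming the older new_tag collected every keyword argument as an attribute (an assumption; Sigil's bundled source is not shown in this thread), can be sketched in plain Python. Both functions below are simplified, hypothetical stand-ins and not the real bs4 code.

```python
def render(name, attributes):
    """Serialize a tag name and an attribute mapping as an empty element."""
    rendered = "".join(
        ' %s="%s"' % (key, value) for key, value in attributes.items()
    )
    return "<%s%s></%s>" % (name, rendered, name)


def new_tag_kwargs_only(name, **attrs):
    # Assumed older behaviour: every keyword argument, including one that
    # is literally named "attrs", becomes an attribute of the tag.
    return render(name, attrs)


def new_tag_with_attrs_param(name, attrs=None, **kwattrs):
    # Assumed newer behaviour: "attrs" is a real parameter whose dict is
    # merged with any extra keyword arguments.
    merged = dict(attrs or {})
    merged.update(kwattrs)
    return render(name, merged)


old_result = new_tag_kwargs_only("i", attrs={"xml:lang": "la"})
new_result = new_tag_with_attrs_param("i", attrs={"xml:lang": "la"})
```

Under that assumption, old_result carries a single attribute named attrs whose value is the dict's string form, which matches the output seen from Sigil, while new_result carries xml:lang="la".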

ebray187 03-28-2021 01:21 PM

Quote:

Originally Posted by KevinH (Post 4107047)
Here are alternative ways to add an attribute ...



So I would remove the attrs= parameter from the new_tag() call, and instead create the tag and then either use the new tag in its dict mode to add the needed attributes one by one, or assign a dict to the tag's .attrs if possible.

I'm having trouble with the ":" character in xml:lang.

KevinH 03-28-2021 01:28 PM

There is a fully HTML5-compliant gumbo parser already there, as well as a very simple serial parser called quickparser, and there is also an html5lib parser that is guaranteed to be available to Sigil plugins.

Surely one of those will do what you need. As for using bs4, as long as you split the new_tag creation from the attribute addition in that piece of code, it works on all versions of BS4 and back to Python 3.4.

DiapDealer 03-28-2021 01:31 PM

It (the colon) should just be a string when used as an attribute name.

tag["xml:lang"] = "la"

That is more compatible with all versions of BeautifulSoup.

ebray187 03-28-2021 01:41 PM

Quote:

Originally Posted by DiapDealer (Post 4107056)
It (the colon) should just be a string when used as an attribute name.

tag["xml:lang"] = "la"

That is more compatible with all versions of BeautifulSoup.

Thanks! It was just a typo... sorry.

Thanks KevinH for your help.

