MobileRead Forums - View Single Post

KevinH · 01-01-2018, 05:15 PM

You would have to parse the file using the gumbo bs4 adapter; each node of the parse tree is then given extra information fields:

Code:

def _add_source_info(obj, original_text, start_pos, end_pos):
    obj.original = _fromutf8(bytes(original_text))
    obj.line = start_pos.line
    obj.col = start_pos.column
    obj.offset = start_pos.offset
    if end_pos:
        obj.end_line = end_pos.line
        obj.end_col = end_pos.column
        obj.end_offset = end_pos.offset

See:

https://github.com/Sigil-Ebook/Sigil...bs4_adapter.py
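To make those fields concrete, here is a minimal stand-alone sketch that mimics the function above outside of Sigil. SourcePos and FakeNode are hypothetical stand-ins I made up for the parser's position record and a bs4 node, the position numbers are invented, and I am assuming _fromutf8 simply decodes UTF-8 bytes to a string. It is not the adapter's actual code path.

```python
from collections import namedtuple

# Hypothetical stand-in for the position record the parser hands over.
SourcePos = namedtuple("SourcePos", ["line", "column", "offset"])


class FakeNode:
    """Hypothetical placeholder for a node of the parse tree."""


def add_source_info(obj, original_text, start_pos, end_pos=None):
    # Assumption: the adapter's _fromutf8 decodes UTF-8 bytes to str.
    obj.original = bytes(original_text).decode("utf-8")
    obj.line = start_pos.line
    obj.col = start_pos.column
    obj.offset = start_pos.offset
    if end_pos:
        obj.end_line = end_pos.line
        obj.end_col = end_pos.column
        obj.end_offset = end_pos.offset


# Made-up positions, purely for illustration.
node = FakeNode()
add_source_info(node, b'<p class="second">', SourcePos(8, 3, 412), SourcePos(8, 70, 479))
print(node.line, node.col, node.offset)  # prints: 8 3 412
```

Note that the end_* fields only exist when an end position was supplied, so code reading them should allow for their absence.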

The testme3 plugin posted at the start of this thread shows how to use the gumbo parser:

Code:

# examples for using the bs4/gumbo parser to process xhtml
print("\nExercising: the gumbo bs4 adapter")
import sigil_gumbo_bs4_adapter as gumbo_bs4
samp = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" 
  "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en-US">
<head><title>testing & entities</title></head>
<body>
  <p class="first second">this&nbsp;is*the*<i><b>copyright</i></b> symbol "&copy;"</p>
  <p xmlns:xlink="http://www.w3.org/xlink" class="second" xlink:href="http://www.ggogle.com">this used to test atribute namespaces</p>
</body>
</html>
"""
soup = gumbo_bs4.parse(samp)
for node in soup.find_all(attrs={'class':'second'}):
    print(node)

So you should be able to access them via node.line, node.col, and node.offset, but I cannot verify that right now as all I have access to is my old iPad.
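Since this is untested, a defensive way to read the fields might look like the sketch below. source_position is a hypothetical helper of my own, not part of the adapter, and the SimpleNamespace objects are stand-ins so the snippet runs outside Sigil. Inside a plugin you would pass it the nodes returned by soup.find_all() instead.

```python
from types import SimpleNamespace


def source_position(node):
    """Return (line, col, offset), with None for any field that is missing."""
    return tuple(getattr(node, name, None) for name in ("line", "col", "offset"))


# Inside a Sigil plugin the loop would presumably be:
#     for node in soup.find_all(attrs={'class': 'second'}):
#         print(source_position(node))

# Stand-ins with made-up values so this runs without Sigil.
tagged = SimpleNamespace(line=8, col=3, offset=412)
untagged = SimpleNamespace()
print(source_position(tagged))    # prints: (8, 3, 412)
print(source_position(untagged))  # prints: (None, None, None)
```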

Please give that a try.