View Single Post
Old 01-01-2018, 05:15 PM   #252
KevinH
Sigil Developer
KevinH ought to be getting tired of karma fortunes by now.KevinH ought to be getting tired of karma fortunes by now.KevinH ought to be getting tired of karma fortunes by now.KevinH ought to be getting tired of karma fortunes by now.KevinH ought to be getting tired of karma fortunes by now.KevinH ought to be getting tired of karma fortunes by now.KevinH ought to be getting tired of karma fortunes by now.KevinH ought to be getting tired of karma fortunes by now.KevinH ought to be getting tired of karma fortunes by now.KevinH ought to be getting tired of karma fortunes by now.KevinH ought to be getting tired of karma fortunes by now.
 
Posts: 8,807
Karma: 6000000
Join Date: Nov 2009
Device: many
You would have to parse the file using the gumbo bs4 adapter, the each node of the parse tree is given extra information fields:

Code:
def _add_source_info(obj, original_text, start_pos, end_pos):
    obj.original = _fromutf8(bytes(original_text))
    obj.line = start_pos.line
    obj.col = start_pos.column
    obj.offset = start_pos.offset
    if end_pos:
        obj.end_line = end_pos.line
        obj.end_col = end_pos.column
        obj.end_offset = end_pos.offset
See:

https://github.com/Sigil-Ebook/Sigil...bs4_adapter.py

And from the testme3 plugin posted at the start of this thread is how to use the gumbo parser:

Code:
# examples for using the bs4/gumbo parser to process xhtml
    print("\nExercising: the gumbo bs4 adapter")
    import sigil_gumbo_bs4_adapter as gumbo_bs4
    samp = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" 
  "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en-US">
<head><title>testing & entities</title></head>
<body>
  <p class="first second">this&nbsp;is*the*<i><b>copyright</i></b> symbol "&copy;"</p>
  <p xmlns:xlink="http://www.w3.org/xlink" class="second" xlink:href="http://www.ggogle.com">this used to test atribute namespaces</p>
</body>
</html>
"""
    soup = gumbo_bs4.parse(samp)
    for node in soup.find_all(attrs={'class':'second'}):
        print(node)
So you should be able to access them via node.line, node.col, and node.offset but I can not prove that now as all I have access to is my old iPad.

Please give that a try.
KevinH is online now   Reply With Quote