Okay, I spent some time investigating this:
1. exiftool requires an external tool and/or a command-line subprocess interface, so that is out.
2. Pillow lets you use getxmp(), but it needs the "defusedxml" module. Luckily defusedxml is pure Python and small, so it can be added to a plugin easily. So Pillow is usable.
BUT:
Pillow made the insane decision to NOT return the raw XML for post-processing. Instead it builds some horrible nested dict containing other dicts and lists, so accessing a single element, or even walking the structure, requires recursion and is a real pain, especially with all of the namespaces used in the official example. Talk about an XML namespace nightmare!
And as far as I can tell, attribute values are lost, case is mutated, etc. It should have just returned the pure XML, since there are many XML parsers and tools like bs4 that could be used to get what is needed.
Especially when you do not know the exact namespaces or structure employed, and especially if you may need access to multiple language versions of the same alt text.
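To illustrate what the getxmp() result looks like (a rough sketch, not taken from any particular file; the exact nesting and key names depend on the image's namespaces and on the Pillow version):
Code:
from PIL import Image  # getxmp() needs Pillow plus defusedxml

im = Image.open("test.jpg")
xmp = im.getxmp()  # a nested dict built from the XMP packet, not raw XML

# Hypothetical access path for the IPTC alt-text property; whether each
# level is a dict or a list depends on how many children the file has,
# so real code ends up recursing and type-checking at every step.
alt = xmp["xmpmeta"]["RDF"]["Description"]["AltTextAccessibility"]["Alt"]["li"]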
3. So that just leaves the following, which I threw together based on fragments I found on the web (Stack Exchange) glued together with a few pieces of my own:
Code:
from bs4 import BeautifulSoup

filename = "test.jpg"
with open(filename, "rb") as f:
    d = f.read()

# Collect every <x:xmpmeta>...</x:xmpmeta> packet found in the raw bytes.
xmp_str = b""
while d:
    xmp_start = d.find(b"<x:xmpmeta")
    xmp_end = d.find(b"</x:xmpmeta")
    if xmp_start == -1 or xmp_end == -1:
        break  # no (further) packet; a -1 from find() would mangle the slices
    xmp_str += d[xmp_start : xmp_end + 12]  # 12 == len(b"</x:xmpmeta>")
    d = d[xmp_end + 12 :]

alt_text_dict = {}
xmpAsXML = BeautifulSoup(xmp_str, "xml")
if xmpAsXML:
    # bs4's xml parser keeps namespace prefixes separate, so this matches
    # Iptc4xmpCore:AltTextAccessibility by its local name.
    node = xmpAsXML.find("AltTextAccessibility")
    if node:
        for element in node.find_all("li"):
            # print(element.prefix, element.namespace, element.name, element['xml:lang'], element.text)
            lang = element.get("xml:lang", "x-default")
            alt_text_dict[lang] = element.text

for k, v in alt_text_dict.items():
    print(k, v)
All of this could be rewritten into a nice routine, but... it literally walks the entire binary file looking for particular start and end strings, and those strings depend on the "x" namespace prefix being declared. If a different prefix is used, the search fails.
This is a mess and very, very time consuming for large images.
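One possible way to dodge the prefix dependence: scan for the xpacket processing-instruction wrapper instead, since it does not depend on any namespace prefix. A sketch, applied to the raw file bytes d before the loop above consumes them, and assuming the wrapper is present (the XMP spec recommends it but does not require it):
Code:
# Scan for the prefix-independent <?xpacket ... ?> wrapper instead of
# the x:xmpmeta tag. Assumes the wrapper exists, which is common but
# not guaranteed.
start = d.find(b"<?xpacket begin")
end = d.find(b"<?xpacket end")
if start != -1 and end != -1:
    end = d.find(b"?>", end) + 2  # include the closing "?>" of the PI
    xmp_str = d[start:end]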
So I will probably have to dig into the Pillow getxmp() implementation code to see how to quickly extract just the XML rather than that horrible nested dictionary.
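For what it's worth, it looks like Pillow stashes the raw packet bytes before parsing them, and recent versions expose them through im.info for JPEG (APP1 segment) and PNG (iTXt chunk). A sketch, untested, and worth verifying against whatever Pillow version the plugin environment ships:
Code:
from bs4 import BeautifulSoup
from PIL import Image

im = Image.open("test.jpg")
# Recent Pillow versions appear to store the undecoded XMP packet here;
# older versions may not, so treat a missing key as "no XMP".
raw_xmp = im.info.get("xmp")
if raw_xmp:
    xmpAsXML = BeautifulSoup(raw_xmp, "xml")  # then proceed as above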
Before doing all of that, I wonder just how many epub images actually carry any XMP metadata at all?
Otherwise this seems to be an exercise in futility: since the metadata takes up room, every image optimizer I know of (and these are regularly run on images before adding them to an epub) strips this metadata completely. Removing all metadata also prevents some image orientation issues.
So not sure if this is worth the work.
What are people's thoughts on this?