Question about alt-text in Access-Aide - Page 2

KevinH · 05-18-2024, 11:07 AM

Okay, I spent some time investigating this:

1. exiftool requires a C library and/or a command-line subprocess interface, so that is out.

2. Pillow will allow you to use getxmp(), but you need the "defusedxml" module. Luckily defusedxml is pure Python and small, and can be added to a plugin easily. So Pillow can be used.

BUT:

Pillow made the insane decision to NOT return the raw XML for post processing, and instead creates some horrible nested dict that contains other dicts and lists, so accessing a single element or even walking the list requires recursion and is a real pain. Especially with all of the namespaces being used in the official example. Talk about an xml namespace nightmare!

And as far as I can tell attribute values are lost, case is mutated, etc. It should have just returned the pure xml since there are many xml parsers and tools like bs4 that could be used to get what is needed.

Especially when you do not know the exact namespaces or structure employed. And especially if you may need access to multiple language versions of the same alt text.

3. So that just leaves the following, which I threw together based on fragments I could find on the web (Stack Exchange), glued together with a few pieces of my own:

Code:

import sys
import os
from bs4 import BeautifulSoup

filename = "test.jpg"
f = open(filename, "rb")
d = f.read()
xmp_str = b""

while d:
    xmp_start = d.find(b"<x:xmpmeta")
    xmp_end = d.find(b"</x:xmpmeta")
    xmp_str += d[xmp_start : xmp_end + 12]
    d = d[xmp_end + 12 :]

alt_text_dict = {}

xmpAsXML = BeautifulSoup(xmp_str, 'xml')
if xmpAsXML:
    node = xmpAsXML.find('AltTextAccessibility')
    if node:
        for element in node.find_all('li'):
            # print(element.prefix, element.namespace, element.name, element['xml:lang'], element.text)
            lang = element.get('xml:lang', 'x-default')
            alt_text_dict[lang] = element.text

for k, v in alt_text_dict.items():
    print(k, v)

All of this could be rewritten into a nice routine but ... it literally walks the entire binary data file looking for particular starting strings (which depend on the x prefix namespace being defined) and ending strings. If a different prefix is used, this search will fail.

This is a mess and very very time consuming for large images.
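For what it's worth, the prefix dependence and the repeated slicing can both be avoided with a single regular-expression search. This is only a rough sketch under my own assumptions: the names XMPMETA_RE and extract_xmp_packet are made up for illustration, it still scans the raw bytes of the file, and it only returns the first packet it finds.

```python
import re

# Match an xmpmeta element with any namespace prefix (or none),
# e.g. <x:xmpmeta ...> ... </x:xmpmeta>. The closing tag must reuse
# the same qualified name as the opening tag (backreference \1).
XMPMETA_RE = re.compile(
    rb"<((?:[A-Za-z_][\w.-]*:)?xmpmeta)\b.*?</\1\s*>",
    re.DOTALL,
)


def extract_xmp_packet(data):
    """Return the first xmpmeta element found in data (bytes), or None."""
    match = XMPMETA_RE.search(data)
    return match.group(0) if match else None
```

Usage would be something like extract_xmp_packet(open("test.jpg", "rb").read()), with the result passed to BeautifulSoup as before.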

So I will probably have to dig into the Pillow getxmp() implementation code to try to more quickly just extract the xml and not some horrible nested dictionary.
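To illustrate why the nested dictionary is a pain: pulling one value out of arbitrarily nested dicts and lists needs a recursive walk along these lines. This is a generic sketch, not Pillow code; find_key is a hypothetical helper and the sample structure is made up, not real getxmp() output.

```python
def find_key(obj, wanted):
    """Yield every value stored under the key `wanted`, at any depth."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == wanted:
                yield value
            # Keep descending: the same key may appear deeper down.
            yield from find_key(value, wanted)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_key(item, wanted)


# Made-up structure for illustration only; not real getxmp() output.
sample = {"a": {"b": [{"target": "first"}, {"c": {"target": "second"}}]}}
print(list(find_key(sample, "target")))
```

Even with a helper like this, any information the dict conversion dropped is still gone, which is why getting at the raw XML is preferable.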

Before doing all of that, I wonder just how many epub images actually have any of the xmp metadata at all?

Otherwise this seems to be an exercise in futility: since the metadata takes up room, all image optimizers I know of (which are regularly run on images before adding them to an epub) remove this metadata completely. Removing all metadata also prevents some image orientation issues.

So not sure if this is worth the work.

What are people's thoughts on this?

oston · 05-18-2024, 11:46 AM

Many thanks for looking into this, Kevin.

Based on what I have seen with the images I regularly encounter, seeing alt-text entries in image metadata is very unusual.

I work with a small non-profit publisher, creating epub versions of their print books and the books often have images. But this is the first time I have ever seen images that contain alt-text entries in their metadata.

It was nice not having to write alt-text, but it was also not at all difficult to copy and paste the alt-text using the very user-friendly alt text feature in Access-Aide.

So I do not think that we need to proceed with this feature.

Another reason not to do anything relating to alt-text in image metadata is that some of the alt-text entries that I saw more properly belonged in an extended description.
See: https://kb.daisy.org/publishing/docs....html#extended
So much of what is actually needed for accessibility depends on the context in which the image is used. So the alt-text in an image metadata might not be suitable for every context in which the image is used. So it's probably better to work directly with each image to provide the best alt-text in the circumstances.

Thanks again, for looking into this, it was very helpful because it prompted a deeper dive into this entire topic.

Jim

Doitsu · 05-18-2024, 12:45 PM

Quote:

Originally Posted by KevinH

What are people's thoughts on this?

Since IPTC metadata seems to be less commonly used than EXIF metadata, a compromise might be grabbing the ImageDescription EXIF metadata entry with Pillow.

This requires only a few lines of code:

Spoiler:

The code will return the string: A Prince looks out between the bars of a prison window.
(It refers to this image provided by the OP.)
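For readers who cannot see the spoiler, reading that entry with Pillow might look roughly like the sketch below. It is not necessarily the code from the spoiler; get_image_description is a made-up name, and the only Pillow calls assumed are Image.open() and Image.getexif(). Tag 270 is the EXIF/TIFF tag number for ImageDescription.

```python
from PIL import Image

IMAGE_DESCRIPTION = 270  # EXIF/TIFF tag number for ImageDescription


def get_image_description(source):
    """Return the EXIF ImageDescription as a string, or "" if absent.

    `source` is a file path or a binary file-like object.
    """
    with Image.open(source) as im:
        value = im.getexif().get(IMAGE_DESCRIPTION, "")
    # Assumption: be defensive in case the value comes back as bytes.
    if isinstance(value, bytes):
        value = value.decode("utf-8", errors="replace")
    return str(value).strip()
```

Calling get_image_description("test.jpg") on an image without that tag simply returns an empty string.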

IMHO, automatically extracting some human generated description with Access-Aide is better than extracting no description at all.

@oston would extracting the ImageDescription information be helpful to you?

oston · 05-18-2024, 01:24 PM

Quote:

Originally Posted by Doitsu

Since IPTC metadata seems to be less commonly used than EXIF metadata, a compromise might be grabbing the ImageDescription EXIF metadata entry with Pillow.
IMHO, automatically extracting some human generated description with Access-Aide is better than extracting no description at all.

@oston would extracting the ImageDescription information be helpful to you?

Thanks for this information, Doitsu.
I am by no means experienced enough to give a valuable answer. I am just trying to learn as much as I can about making accessible epubs.

In the images I have seen, until I saw this latest set of images, I had not seen any Image Descriptions or alt-text in image meta-data.

But hopefully someone who is very experienced with Image Descriptions and accessibility issues will see this and give a more informed answer.

Sorry that I'm not able to be more helpful.

KevinH · 05-18-2024, 02:58 PM

Using exif ImageDescription would be easy to add to AccessAide if that helps.

FWIW, I am just so disappointed that Pillow did not return the xml in their getxmp() method instead of a nested mess of dicts and lists. It really makes specific xmp metadata hard to work with.

KevinH · 05-20-2024, 07:54 AM

The Pillow dev guys nicely gave me a snippet of code that will return the actual xml across all 4 image types that support it now. That makes Pillow the obvious best candidate. So I should be able to query for Alt Text and if not present, fall back to exif ImageDescription.

I think that might be worth adding to a future version of AccessAide.

oston · 05-20-2024, 08:42 AM

Thanks, very much, Kevin. That will be helpful.

DNSB · 05-20-2024, 01:49 PM

It would be handy and possibly, if the gods are kind, save me from manually adding all alt texts.

KevinH · 05-23-2024, 01:57 PM

Access-Aide Version v0.9.5 has now been released. It is available via our Sigil Plugin Index as an attachment or from my github repo:

https://github.com/kevinhendricks/Access-Aide

It now includes the ability to take EMPTY alt attributes and look up the image's own metadata for XMP AltTextAccessibility or, failing that, exif ImageDescription, to auto-fill alt attribute values.

It will NOT overwrite any existing image alt value.
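As a rough illustration of that rule (fill only empty alt values, never touch existing ones), here is a minimal standard-library sketch. It is not the actual Access-Aide code: fill_empty_alts and the lookup_alt callback are hypothetical names, and treating a missing alt attribute the same as an empty one is my own assumption.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def fill_empty_alts(xhtml, lookup_alt):
    """Fill empty img alt attributes using lookup_alt(src) -> str.

    Images that already have a non-empty alt are left untouched.
    """
    # Serialize XHTML elements without an ns0: prefix.
    ET.register_namespace("", XHTML_NS)
    root = ET.fromstring(xhtml)
    for img in root.iter("{%s}img" % XHTML_NS):
        if img.get("alt", "").strip():
            continue  # never overwrite an existing alt value
        text = lookup_alt(img.get("src", ""))
        if text:
            img.set("alt", text)
    return ET.tostring(root, encoding="unicode")
```

Here lookup_alt would be whatever reads the image metadata, for example a wrapper around the XMP and exif lookup.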

Hope this helps,

KevinH

KevinH · 05-23-2024, 02:08 PM

In case anyone else wants to add this feature to their own code, here is the sample code:

Code:

import sys
from bs4 import BeautifulSoup
from PIL import Image

# extract base language from language code
def baselang(lang):
    if len(lang) > 3:
        if lang[2:3] in "-_":
            return lang[0:2]
    return None

def parse_xmpxml_for_alttext(xmpxml):
    xmpmeta = BeautifulSoup(xmpxml, 'xml')
    alt_dict = {}
    if xmpmeta:
        node = xmpmeta.find('AltTextAccessibility')
        if node:
            for element in node.find_all('li'):
                lang = element.get('xml:lang', 'x-default')
                alt_dict[lang] = element.text
                lg = baselang(lang)
                if lg:
                    alt_dict[lg] = element.text
    return alt_dict


def get_image_metadata_alttext(imgpath, tgtlang):
    xmpxml = None
    description = ""
    with Image.open(imgpath) as im:
        if im.format == 'WebP':
            if "xmp" in im.info:
                xmpxml = im.info["xmp"]
        if im.format == 'PNG':
            if "XML:com.adobe.xmp" in im.info:
                xmpxml = im.info["XML:com.adobe.xmp"]
        if im.format == 'TIFF':
            if 700 in im.tag_v2:
                xmpxml = im.tag_v2[700]
        if im.format == 'JPEG':
            for segment, content in im.applist:
                if segment == "APP1":
                    marker, xmp_tags = content.split(b"\x00")[:2]
                    if marker == b"http://ns.adobe.com/xap/1.0/":
                        xmpxml = xmp_tags
                        break
        exif = im.getexif()
        # 270 = ImageDescription
        if exif and 270 in exif:
            description = exif[270]
    if not xmpxml:
        return description
    alt_dict = parse_xmpxml_for_alttext(xmpxml)
    # first try full language code match
    if tgtlang in alt_dict:
        return alt_dict[tgtlang]
    # next try base language code match
    lg = baselang(tgtlang)
    if lg and lg in alt_dict:
        return alt_dict[lg]
    # use default
    if 'x-default' in alt_dict:
        return alt_dict['x-default']
    # otherwise fall back to exif image description
    return description



imgpath = "test.jpg"
lang = 'en-US'
print(get_image_metadata_alttext(imgpath, lang))

BeckyEbook · 05-23-2024, 02:37 PM

@KevinH: It is essential to add try/except from line 482, as it throws an error if there is no metadata in the image.

Spoiler:

KevinH · 05-23-2024, 02:44 PM

I will check for that key first to prevent the keyerror.

Thanks!

KevinH · 05-23-2024, 03:03 PM

Should now be fixed in v0.9.6 just posted.

Thank you @BeckyEbook!

DNSB · 05-23-2024, 10:06 PM

Tested 0.9.6 on 4 ePubs with images. One worked well since it had decent metadata, one worked on 4 out of 10 images, the last two had no useful metadata. Still going to save me time and effort so thanks very much!

KevinH · 05-24-2024, 12:03 PM

FYI: There is an indentation whitespace issue. So a new version of Access Aide (this time 0.9.7) will be coming later this evening fixing that. It only impacts jpeg images with multiple APP1 segments none of which are xmp metadata.

So the alt_text in your 4 epubs should be correct as is.

Update:

Version 0.9.7 just posted has this new fix. Hopefully the last one.
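For anyone adapting the sample code above, here is one defensive way to pick the XMP segment out of a JPEG with several APP1 segments. This is a sketch of the general idea only, not the actual 0.9.7 change; find_xmp_segment is a made-up name, and applist is assumed to be the list of (segment name, bytes) pairs that the sample code iterates over.

```python
XMP_MARKER = b"http://ns.adobe.com/xap/1.0/"


def find_xmp_segment(applist):
    """Return the XMP payload from the first matching APP1 segment, or None."""
    for segment, content in applist:
        if segment != "APP1":
            continue
        # partition() always returns three parts, so a segment without a
        # NUL byte cannot cause an unpacking error the way split() can.
        marker, sep, payload = content.partition(b"\x00")
        if sep and marker == XMP_MARKER:
            return payload
    return None
```

Note that partition() returns everything after the first NUL byte, whereas split(b"\x00")[:2] in the sample code stops at the next NUL.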
