Hi,
It's fixable, but it is not just a bug on Mac OS X. Your test case actually has a valid opf. The bug happens when Sigil passes the contents of the file to sigil_bs4 as a full unicode string. sigil_bs4 doesn't seem to detect properly that the data is already full unicode, because it gets confused by the xml utf-8 header at the top of the contents that are passed in.
If I modify sigil_bs4/builder/_lxml.py to hard-code the encoding to full unicode and remove the xml header before we pass the data in, the lxml-based xml parser works. Alternatively, if I encode the data to utf-8, keep the xml header, and pass it as bytes, things work as well.
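Roughly, the two options look like the sketch below. It only shows how the data would be prepared before the hand-off to the parser. The helper names and the sample opf text are made up for illustration and are not part of Sigil or sigil_bs4.

```python
import re

# Matches a leading xml header such as <?xml version="1.0" encoding="utf-8"?>
# plus any whitespace around it.
_XML_DECL = re.compile(r'^\s*<\?xml[^>]*\?>\s*')


def strip_xml_declaration(text: str) -> str:
    """Option 1 (hypothetical helper): keep the data as full unicode
    and drop the xml header so it cannot contradict the str type."""
    return _XML_DECL.sub("", text, count=1)


def to_utf8_bytes(text: str) -> bytes:
    """Option 2 (hypothetical helper): keep the xml header and hand over
    utf-8 bytes, so the header and the actual encoding agree."""
    return text.encode("utf-8")


# Made-up sample data standing in for the contents of an opf file.
opf = ('<?xml version="1.0" encoding="utf-8"?>\n'
       '<package><title>café</title></package>')

as_text = strip_xml_declaration(opf)   # str without the header
as_bytes = to_utf8_bytes(opf)          # bytes with the header intact
```

Either result is self-consistent: the str no longer claims an encoding, and the bytes really are in the encoding the header names.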
The sigil_bs4 code seems to take the length of the data passed in BEFORE it converts it to its final form (in this case utf-8). It then feeds that to the lxml parser, which runs out of data before the end of the file. The long dc:description has many utf-8 bytes per char, so the utf-8 byte count is much larger than the number of full unicode chars.
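To illustrate the length mismatch, here is a small sketch using made-up sample text. It is not the actual sigil_bs4 code, just the same arithmetic: measure the length in chars, then use that number against the utf-8 bytes.

```python
# Made-up sample: an opf-like fragment whose description is non-ASCII.
description = "日本語の説明 " * 3
xml_text = ('<?xml version="1.0" encoding="utf-8"?>\n'
            '<dc:description>' + description + '</dc:description>')

# Length measured while the data is still a full unicode string.
char_count = len(xml_text)

# Length after conversion to the final utf-8 form.
utf8_data = xml_text.encode("utf-8")
byte_count = len(utf8_data)

# Using the char count against the byte data cuts the document short,
# so the closing tag never reaches the parser.
truncated = utf8_data[:char_count]

print(char_count, byte_count)
print(truncated.endswith(b"</dc:description>"))
```

The more multi-byte chars the text contains, the bigger the gap between the two counts, which fits with a long dc:description triggering the problem.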
So I can fix this. I just need to figure out what is best for xml files that are passed to python as full unicode before they are handed off to the code that does the parsing in sigil_bs4/lxml. It would be a shame to convert from full unicode to utf-8 and then back again for no reason.
I will look into the problem and get a fix in sometime today.
Take care,
KevinH