Quote:
Originally Posted by DenS
Hi @nezih. I ran your script at the windows prompt and was able to convert a .html dictionary to .xml. Next I used pyglossary to convert the .xml to stardict(.ifo). It worked great, Thanks!
But there is a dictionary, actually what I needed most, which I can't convert to .xml. The command I use at the prompt is this:
Code:
mobi2stardict.py --html-file "book.html" --fix-links --dict-name "Grande Dicionário de Português" --author "Porto Editora" --textual --chunked
And the prompt gives me this error:
Code:
Traceback (most recent call last):
  File "D:\Downloads\mobi2stardict\mobi2stardict.py", line 160, in <module>
    convert(args.html_file, args.dict_name, args.author, args.fix_links, args.gls, args.textual, args.chunked)
  File "D:\Downloads\mobi2stardict\mobi2stardict.py", line 115, in convert
    key = ET.SubElement(article, "key").text = entry.HW
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src\lxml\etree.pyx", line 1042, in lxml.etree._Element.text.__set__
  File "src\lxml\apihelpers.pxi", line 748, in lxml.etree._setNodeText
  File "src\lxml\apihelpers.pxi", line 736, in lxml.etree._createTextNode
  File "src\lxml\apihelpers.pxi", line 1541, in lxml.etree._utf8
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
It might be useful to say that to extract the .mobi dictionary to .html I used the KindleUnpack calibre plugin.
To install BeautifulSoup and lxml I used the commands "pip install beautifulsoup4" and "pip install lxml". The Python version I'm using is 3.11.2.
Could you help me figure out what I'm doing wrong?
Hi. If I remember correctly, I ran into this problem recently. Most probably the headwords include control characters. If you convert to gls format only (--gls), it will probably run fine. However, you would still need to replace those characters with the ones they were actually meant to show.
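If you want to confirm that control characters are the culprit before editing anything, a small script can list them. This is only a rough sketch: it flags every character outside the XML 1.0 "Char" range, which I assume is close to what lxml rejects, and the function name and sample headword are made up for illustration.

Code:
```python
import re

# Anything outside the XML 1.0 "Char" production is flagged.
# Tab, line feed and carriage return are allowed by XML, so they are not reported.
_INVALID_XML = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return (offset, 'U+XXXX') pairs for characters XML 1.0 does not allow."""
    return [
        (m.start(), f"U+{ord(m.group()):04X}")
        for m in _INVALID_XML.finditer(text)
    ]


# Hypothetical headword with a stray control character in the middle.
sample = "aba\x02cate"
print(find_invalid_xml_chars(sample))  # [(3, 'U+0002')]
```

Running it over the text of book.html (or over the gls output) should show which code points need replacing.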
Open the gls file in VS Code and look for control characters (you can use \p{C} in Find with regex mode enabled), then replace them with the intended characters. For example, in my problematic file I replaced two such control characters with the characters they were meant to display.
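If there are too many occurrences to fix by hand, the same replacement can be scripted. The sketch below is an assumption-laden example, not part of mobi2stardict: the mapping is a placeholder that you would fill in after checking what each control character was meant to be in your own dictionary, and the function name is mine.

Code:
```python
import unicodedata

# Placeholder mapping (hypothetical): replace these pairs with the ones
# that are correct for your own file.
REPLACEMENTS = {"\x02": "ã", "\x03": "ç"}


def fix_control_chars(text, replacements=REPLACEMENTS):
    """Apply the mapping, then report control characters that are still left."""
    fixed = text.translate(str.maketrans(replacements))
    # Same idea as \p{C} in the editor: any Unicode category starting with "C",
    # except ordinary whitespace controls that a text file legitimately has.
    leftovers = sorted(
        {
            ch
            for ch in fixed
            if unicodedata.category(ch).startswith("C") and ch not in "\t\n\r"
        }
    )
    return fixed, leftovers


fixed, leftovers = fix_control_chars("ma\x02 cora\x03\x07o")
print(fixed)
print([f"U+{ord(ch):04X}" for ch in leftovers])  # ['U+0007']
```

Anything reported as a leftover still needs an entry in the mapping. When reading and writing the actual file, I would open it with an explicit encoding (UTF-8 is my assumption here) so the characters are not mangled again.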