MobileRead Forums - View Single Post

nezih · 03-11-2023, 08:13 PM

Quote:

Originally Posted by DenS

Hi @nezih. I ran your script at the Windows prompt and was able to convert a .html dictionary to .xml. Next I used PyGlossary to convert the .xml to StarDict (.ifo). It worked great, thanks!
But there is a dictionary, actually what I needed most, which I can't convert to .xml. The command I use at the prompt is this:

Code:

mobi2stardict.py --html-file "book.html" --fix-links --dict-name "Grande Dicionário de Português" --author "Porto Editora" --textual --chunked

And the prompt gives me this error:

Code:

Traceback (most recent call last):
  File "D:\Downloads\mobi2stardict\mobi2stardict.py", line 160, in <module>
    convert(args.html_file, args.dict_name, args.author, args.fix_links, args.gls, args.textual, args.chunked)
  File "D:\Downloads\mobi2stardict\mobi2stardict.py", line 115, in convert
    key     = ET.SubElement(article, "key").text = entry.HW
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src\lxml\etree.pyx", line 1042, in lxml.etree._Element.text.__set__
  File "src\lxml\apihelpers.pxi", line 748, in lxml.etree._setNodeText
  File "src\lxml\apihelpers.pxi", line 736, in lxml.etree._createTextNode
  File "src\lxml\apihelpers.pxi", line 1541, in lxml.etree._utf8
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

It might be useful to say that to extract the .mobi dictionary to .html I used the KindleUnpack calibre plugin.
To install BeautifulSoup and lxml I used the commands "pip install beautifulsoup4" and "pip install lxml". The Python version I'm using is 3.11.2.
Could you help me figure out what I'm doing wrong?

Hi. If I remember correctly, I ran into this problem recently. Most probably some headwords contain control characters, which lxml refuses to write into XML (that is what the ValueError is saying). If you convert to gls format only (--gls), it will probably run fine. However, you would still need to replace those characters with the letters they were meant to display.
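To confirm that control characters really are the cause, something like the following could be used. It is only a rough sketch, not part of mobi2stardict: the function name is made up for illustration, and it assumes the offending characters are C0 control characters other than tab, newline and carriage return, which XML 1.0 does not allow in text.

```python
import re
from collections import Counter

# C0 control characters, excluding tab (\x09), newline (\x0a) and
# carriage return (\x0d). Assumption: these are what lxml is rejecting.
XML_INVALID = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def count_invalid_chars(text):
    """Count each XML-incompatible control character found in text.

    Hypothetical helper for illustration; not part of mobi2stardict.
    """
    return Counter(XML_INVALID.findall(text))

# Made-up sample containing BEL (\x07) and ACK (\x06).
sample = "ca\x07e\nmu\x06o\tca\x07ar"
for char, count in count_invalid_chars(sample).items():
    print(f"U+{ord(char):04X} appears {count} time(s)")
```

Running it over the text of the extracted .html or the .gls file should show which code points appear and how often, which tells you what to search for.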
Open the gls file in VS Code and look for control characters such as BEL, ACK, etc. (you can search for \p{C} in Find with regular expressions enabled). Replace those with the intended characters. For example, in my problematic file, I replaced BEL with ll and ACK with ch.
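If there are too many occurrences to fix by hand, the same replacement could probably be scripted. The sketch below is only an illustration: the function name is hypothetical, and the BEL and ACK mapping is the one that happened to be right for my file. Another dictionary may hide different letters behind the same control characters, so check a few entries first.

```python
# Mapping taken from one specific problematic file (an assumption for any
# other dictionary): BEL (\x07) stood for "ll" and ACK (\x06) stood for "ch".
REPLACEMENTS = {"\x07": "ll", "\x06": "ch"}

def fix_control_chars(text, replacements=REPLACEMENTS):
    """Replace each mapped control character with its intended letters.

    Hypothetical helper for illustration; not part of mobi2stardict.
    """
    # str.maketrans accepts single-character keys mapped to strings of any length.
    return text.translate(str.maketrans(replacements))

print(fix_control_chars("ca\x07e"))  # BEL replaced by "ll"
print(fix_control_chars("mu\x06o"))  # ACK replaced by "ch"
```

You would read the gls (or html) file as UTF-8 text, pass its contents through the function, and write the result to a new file, so the original stays untouched.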