MobileRead Forums - View Single Post

ezdiy · 12-04-2019, 02:11 PM

Quote:

Originally Posted by Markismus

@ezdiy Great! Thank you!
What tags are retained in the conversion? Are color-tags removed? Blockquote, ex, abr?

The tags it interprets and encodes as special values are:

Code:

  v24 = "full_name";
  Str = "?xml";
  v23 = "xdxf";
  v29 = "i";
  v25 = "description";
  v26 = "ar"; // this one for each definition entry
  v27 = "k";
  v28 = "b"; // maybe this one matches <br> too, due to the shared prefix?

(See next_tag in the disassembly.)
Unknown tags appear to be stripped, keeping only the text inside them. I *think* so, but I'm not really sure.

Quote:

What is the limiting entity, precisely? I saw with Greek letters, that it isn't bytes: Some accepted entries stayed below 3500 chars, while being 7500 Bytes. But the chars are not exactly 4k either, somewhat less.

&escapes; are unescaped first, i.e. the limit is counted after unescaping. But it recognizes only lt, gt, quot and amp; all other entities are put in the output as-is. This may needlessly waste space in the 64k total when a valid UTF-8 encoding exists, or worse, unknown entities may not be displayed properly (as opposed to their UTF-8 form). The input/output is most certainly UTF-8 only, as it internally performs UTF-8-aware, language-specific collation.

However, you must always count the underlying bytes, NOT characters. That is, bytes::length() is what matters (for both the per-line and per-entry limits). The character length() can be anything and is irrelevant.
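To make the bytes-vs-characters point concrete, here is a small Python sketch (my own illustration, not anything taken from the converter). `utf8_len` is a hypothetical helper name; it plays the role of bytes::length() above.

```python
def utf8_len(text: str) -> int:
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


greek = "\u03bb\u03cc\u03b3\u03bf\u03c2"  # the Greek word "logos"
latin = "logos"

print(len(greek), utf8_len(greek))  # 5 characters, 10 bytes
print(len(latin), utf8_len(latin))  # 5 characters, 5 bytes

# An unrecognized entity is kept verbatim, so it costs its full spelling:
print(utf8_len("&alpha;"), utf8_len("\u03b1"))  # 7 bytes vs 2 bytes
```

So Greek text hits the byte limits at roughly half the character count of plain ASCII text, and a leftover named entity is more expensive still.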

All things considered, here's how you determine the entry limit:
1. Take one <ar> entry.
2. Unescape all entities and strip all tags; for each recognized tag (i, k, b), count an additional byte. Wrap over-long lines with newlines at word boundaries.
3. The resulting text is what would get encoded (64k byte limit for the whole <ar> body, 4k bytes per line).

For this to make any sense at all, you should encode the input similarly, i.e. keep only the i, k and b tags in each <ar>. Convert all entities to UTF-8, except for lt, gt, amp and quot (those can be handled internally by convert).
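A rough Python sketch of that preprocessing, under my own assumptions: `normalize_ar` and the two regexes are hypothetical, the regexes are not a real XML parser, and Python's `html.unescape` stands in for "convert entities to UTF-8". Numeric references that decode to <, >, & or " are re-escaped so the markup stays valid.

```python
import html
import re

KEEP_ENTITIES = {"lt", "gt", "amp", "quot"}  # the only ones the converter unescapes
KEEP_TAGS = {"i", "k", "b"}                  # the only tags it encodes specially
REESCAPE = {"<": "&lt;", ">": "&gt;", "&": "&amp;", '"': "&quot;"}

TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9_]*)\b[^>]*>")
ENTITY_RE = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def _fix_tag(match):
    # Keep recognized tags untouched; drop every other tag but keep its text.
    return match.group(0) if match.group(1) in KEEP_TAGS else ""


def _fix_entity(match):
    if match.group(1) in KEEP_ENTITIES:
        return match.group(0)
    decoded = html.unescape(match.group(0))  # unknown names come back unchanged
    return REESCAPE.get(decoded, decoded)


def normalize_ar(body: str) -> str:
    """Keep only i/k/b tags and turn entities into literal UTF-8 text."""
    return ENTITY_RE.sub(_fix_entity, TAG_RE.sub(_fix_tag, body))


print(normalize_ar('<k>caf&eacute;</k> <c c="red">x &lt; y</c> <i>&alpha;</i>'))
```

Tags are handled before entities, so escaped markup such as &lt;b&gt; is never mistaken for a real tag.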

The limits should be set slightly below 64k and 4k (something like 4k-16 and 64k-16), because the converter relies on some internal slack space and enforces its limits accordingly.
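Putting the counting rules together, a checker could look roughly like the sketch below. It is an estimate, not the converter's actual logic: the constants are just the "something like" values from above, one byte per recognized opening or closing tag is my reading of "additional byte", and newlines are assumed to count towards the entry total. It only reports over-long lines and does no wrapping. All names are made up for the example.

```python
import re

# Assumed values; the exact slack the converter needs is not known.
LINE_LIMIT = 4 * 1024 - 16
ENTRY_LIMIT = 64 * 1024 - 16

RECOGNIZED_TAGS = {"i", "k", "b"}
TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9_]*)\b[^>]*>")
ENTITY_RE = re.compile(r"&(?:lt|gt|quot|amp);")
ENTITY_TEXT = {"&lt;": "<", "&gt;": ">", "&quot;": '"', "&amp;": "&"}


def encoded_line_sizes(ar_body: str) -> list:
    """Estimated encoded size in bytes of each line of one <ar> body."""
    sizes = []
    for line in ar_body.split("\n"):
        # Assumption: every recognized tag (opening or closing) costs one byte.
        markers = sum(1 for m in TAG_RE.finditer(line)
                      if m.group(1) in RECOGNIZED_TAGS)
        text = TAG_RE.sub("", line)  # all tags are dropped from the text
        # Only the four known entities shrink; everything else stays as-is.
        text = ENTITY_RE.sub(lambda m: ENTITY_TEXT[m.group(0)], text)
        sizes.append(len(text.encode("utf-8")) + markers)
    return sizes


def check_entry(ar_body: str) -> list:
    """Return a list of human-readable limit violations (empty if none)."""
    sizes = encoded_line_sizes(ar_body)
    problems = [f"line {n}: {size} bytes > {LINE_LIMIT}"
                for n, size in enumerate(sizes, 1) if size > LINE_LIMIT]
    total = sum(sizes) + len(sizes) - 1  # newlines between lines count too
    if total > ENTRY_LIMIT:
        problems.append(f"entry: {total} bytes > {ENTRY_LIMIT}")
    return problems


print(check_entry("<k>word</k>\n<i>x</i> &amp; y"))  # []
print(check_entry("\u03b1" * 3000))  # one over-long line: 6000 bytes
```

A line of 3000 Greek letters fails the per-line check here, while 3000 ASCII letters pass.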

Quote:

That looks a bit like the de-assembler I used as a kid. (I had to hack CGA games to work on my dad's monochrome Hercules graphics card.) What could I look into for that, nowadays?

It's IDA with the Hex-Rays decompiler.

Quote:

Is there a way to encode for resources? Audio tags for pronunciation? I know Stardict-tools can convert Lingvo audio resources to Stardict format, however, I have no idea how to implement them in xdxf-format, yet. Would be great to use the audio feature of the pocketbook!

Image resources would be nice, too. Maybe with bbencode? I encoded fonts that way into xml when further processing needed it.

No, the format is a dead end for this reason.

Quote:

Originally Posted by Marco77

Ooooh nice work guys~
Suggestion: maybe create an output format for https://github.com/ilius/pyglossary (or penelope, but it's no longer maintained AFAIK) and get rid of that horrible platform-specific and buggy exe?

Patching the exe is much simpler when all you need is a quick & dirty solution. Ultimately, the proper solution is to just use coolreader/koreader and ditch this obscure dictionary format altogether.