MobileRead Forums - View Single Post

EastEriq · 10-06-2020, 05:31 PM

And these are probably a bit better. I used --disable-trim and the following substitution rules to strip all tags except <U> (used for grammar), <M> (syllabification, phonetics), <L> (expansion of abbreviations) and <F> (word senses).

Code:

# Remove most, but not all, of the tags from Lingoes Hebrew dictionaries.
# $1 is the input file, $2 is the output file.

sed -e "s|<U>|<br><i>|g" \
    -e "s|</U>|</i> |g" \
    -e "s|<M>| <b>|g" \
    -e "s|</M>|</b> |g" \
    -e "s|<F>|<br>‣|g" \
    -e "s|</F>| |g" \
    -e "s|<L>| •|g" \
    -e "s|</L>| |g" \
    -e "s|<[/]*[NCIÒ]>||g" \
    -e "s|<H>|<span>|g" \
    -e 's|<H J="rtl">|<span dir="rtl">|g' \
    -e 's|<H J="rtl" />||g' \
    -e 's|<H />||g' \
    -e "s|</H>|</span>|g" \
    -e 's|\\\"|״|g' \
    -e "s|&gt;&gt;|←|g" \
    "$1" > "$2"
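
A quick way to sanity-check a few of these rules is to run them on a single made-up line before converting a whole dictionary. The sample entry below is hypothetical (its tag layout is an assumption, not copied from a real Lingoes file), and only a subset of the rules is applied:

```shell
# Hypothetical sample entry; the tag layout is an assumption for illustration.
sample='<C><U>n.</U><M>sha-LOM</M><H J="rtl">abc</H></C>'

# Apply a subset of the substitution rules, in the same order as the script.
out=$(printf '%s\n' "$sample" | sed \
    -e 's|<U>|<br><i>|g' \
    -e 's|</U>|</i> |g' \
    -e 's|<M>| <b>|g' \
    -e 's|</M>|</b> |g' \
    -e 's|<[/]*[NCI]>||g' \
    -e 's|<H J="rtl">|<span dir="rtl">|g' \
    -e 's|</H>|</span>|g')

# Show the converted line.
printf '%s\n' "$out"
```

The <C> wrapper should disappear, while the content of <U>, <M> and <H> survives inside HTML tags. Note that the lowercase <i> produced by the first rule is not removed by the later [NCI] rule, because the match is case-sensitive.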

ETA: Still sticking to plain substitutions rather than a real XML parser, I've slightly improved the handling of tags, for better formatting in KR, and got pyglossary to recognise the dictionary languages correctly. The remaining issue I see is a bidi one: parentheses around Hebrew text in the Hebrew-English dictionary are misplaced, even when they fall inside a <span dir="rtl"></span>.