10-06-2020, 05:31 PM   #6
EastEriq
And these are probably a bit better. I used --disable-trim and the following substitution rules to strip all tags except <U> (grammar), <M> (syllabification and phonetics), <L> (expansion of abbreviations) and <F> (senses of the word).

Code:
# Remove most, but not all, of the tags from Lingoes Hebrew dictionaries.
# $1 is the input file, $2 is the output file.
# <U>, <M>, <F> and <L> are turned into simple HTML; <H> becomes a <span>;
# the other tags are dropped.

sed -e 's|<U>|<br><i>|g' \
    -e 's|</U>|</i> |g' \
    -e 's|<M>| <b>|g' \
    -e 's|</M>|</b> |g' \
    -e 's|<F>|<br>‣|g' \
    -e 's|</F>| |g' \
    -e 's|<L>| •|g' \
    -e 's|</L>| |g' \
    -e 's|<[/]*[NCIŅ]>||g' \
    -e 's|<H>|<span>|g' \
    -e 's|<H J="rtl">|<span dir="rtl">|g' \
    -e 's|<H J="rtl" />||g' \
    -e 's|<H />||g' \
    -e 's|</H>|</span>|g' \
    -e 's|\\\"|״|g' \
    -e 's|&gt;&gt;|←|g' \
    "$1" > "$2"
ETA: Still using plain substitutions rather than a real XML parser, I've slightly improved the handling of tags for better formatting in KR, and got pyglossary to recognise the dictionary languages correctly. The one remaining issue I see is a bidi one: parentheses around Hebrew text in the Hebrew-English dictionary are misplaced, even when they fall inside a <span dir="rtl"></span>.
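For anyone who wants to see what the substitutions do to a single entry before running them over a whole dictionary, here is a minimal sketch. It uses only a subset of the rules, the function name convert_tags is made up for the example, and the sample line is invented rather than copied from the real Lingoes export (I also left Ņ out of the bracket expression so the demo does not depend on the locale).

```shell
# Hypothetical helper: applies a subset of the substitution rules to stdin.
convert_tags() {
  sed -e 's|<U>|<br><i>|g' \
      -e 's|</U>|</i> |g' \
      -e 's|<M>| <b>|g' \
      -e 's|</M>|</b> |g' \
      -e 's|<[/]*[NCI]>||g' \
      -e 's|<H J="rtl">|<span dir="rtl">|g' \
      -e 's|</H>|</span>|g'
}

# Invented sample entry (an assumption, not real dictionary data).
printf '%s\n' '<C><H J="rtl">shalom</H><M>sha-lom</M><U>n.</U></C>' | convert_tags
# prints (note the trailing space):
# <span dir="rtl">shalom</span> <b>sha-lom</b> <br><i>n.</i> 
```

The order of the rules matters a little: the catch-all that deletes <N>, <C> and <I> is case-sensitive, so it leaves the lowercase <i> produced by the <U> rule alone.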
Attached Files
File Type: zip ViconTag.zip (10.28 MB, 360 views)

Last edited by EastEriq; 10-07-2020 at 02:11 PM. Reason: improved formatting