Looking at these lines, it seems to be an issue with escape characters:
Code:
$ sed -n '275,285p' eng-ukr_Balla_v1.3_reconstructed.xdxf
</ar>
<ar>
<head><k>A one</k></head><def><div style="margin-left:1em"><i class="p"><font color="green">adj</font></i> <i class="p"><font color="green">амер.</font></i><i class="p"><font color="green">,</font></i> <i class="p"><font color="green">розм.</font></i></div>
<div style="margin-left:1em">першокласний, відмінний</div></def>
</ar>
<ar>
<head><k>a posteriori</k></head><def><div style="margin-left:1em"><i class="p"><font color="green">лат.</font></i></div>
[m1]<font color="darkred"><b>1.</b></font> <i class="p"><font color="green">adj</font></i>
<div style="margin-left:1em">апостеріорний, заснований на досвіді</div>
[m1]<font color="darkred"><b>2.</b></font> <i class="p"><font color="green">adv</font></i>
<div style="margin-left:1em">апостеріорі, емпірично, з досвіду</div></def>
I've switched a few toggles that affect the unescaping of HTML characters, and the KOReader-optimized version looks good. You'll have to test the dic-file yourself. It's in the ENG-UKR directory on pCloud.
I've also changed the subroutine escapeHTMLStringForced to skip the contents of tags.
Due to the two-factor authentication on GitHub, I still have to figure out how to push the commits to the remote, though. (Changes pushed to GitHub.)
The new code is:
Code:
our $PossibleTags = qr~/?(def|mbp|c>|c c="|abr>|ex>|kref>|k>|key|rref|f>|!--|!doctype|a|abbr|acronym|address|applet|area|article|aside|audio|b>|b |base|basefont|bb|bdo|big|blockquote|body|/?br|button|canvas|caption|center|cite|code|col|colgroup|command|datagrid|datalist|dd|del|details|dfn|dialog|dir|div|dl|dt|em|embed|eventsource|fieldset|figcaption|figure|font|footer|form|frame|frameset|h[1-6]|head|header|hgroup|hr/|html|i>|i |iframe|img|input|ins|isindex|k|kbd|keygen|label|legend|li|link|map|mark|menu|meta|meter|nav|noframes|noscript|object|ol|optgroup|option|output|p|param|pre|progress|q>|rp|rt|ruby|s>|samp|script|section|select|small|source|span|strike|strong|style|sub|sup|table|tbody|td|textarea|tfoot|th|thead|time|title|tr|track|tt|u>|ul|var|video|wbr)~;
our $HTMLcodes = qr~(lt;|amp;|gt;|quot;|apos;|\#x?[0-9A-Fa-f]{1,6})~;
sub escapeHTMLString{
my $String = shift;
unless( $isEscapeHTMLCharacters ){
info_t("returning without escaping '$String'");
return $String;
}
return( escapeHTMLStringForced($String) );}
sub escapeHTMLStringForced{
my $String = shift;
unless( defined $String ){ die2("Undefined string given to escapeHTMLString."); }
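# Turn the string into an array of tags and text fragments.
my @String;
while( $String =~ s~^([^<>]*)(<[^<>]+>)~~s ){
push @String, $1 if defined $1;
push @String, $2;
}
# Keep any text left after the last tag; without this it is dropped by the join below.
push @String, $String if length $String;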
foreach(@String){
if( m~^<~ ){ next; }
# Convert '<' to '&lt;', but not if it's part of a HTML tag.
s~<(?!\/?$PossibleTags[^>]*>)~&lt;~gs;
# Convert '>' to '&gt;', but not if it's part of a HTML tag.
s~(?<!<$PossibleTags[^>]{0,100})>~&gt;~sg;
# Convert '&' to '&amp;', but not if it is part of an HTML escape sequence.
s~&(?!$HTMLcodes)~&amp;~gs;
s~'~\&apos;~sg;
s~"~\&quot;~sg;
# Convert curly braces to their numeric character references.
s~\{~\&#123;~sg;
s~\}~\&#125;~sg;
}
$String = join( '', @String );
info_t("returning after escaped '$String'");
return $String;}
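
To try the "skip the contents of tags" idea in isolation, here is a minimal, self-contained sketch. It is not the code from the repository: escape_outside_tags is a hypothetical name, and it uses a simplified entity list instead of $PossibleTags and $HTMLcodes. It only illustrates splitting the string into tag and text tokens and escaping the text tokens.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script: split a string into
# tag tokens and text tokens, escape only the text tokens, and rejoin them.
sub escape_outside_tags {
    my $string = shift;
    my @tokens;
    # Peel "text followed by a tag" pairs off the front of the string.
    while ( $string =~ s~^([^<>]*)(<[^<>]+>)~~s ) {
        push @tokens, $1 if length $1;
        push @tokens, $2;
    }
    # Whatever remains is text after the last tag (or a string without tags).
    push @tokens, $string if length $string;
    for (@tokens) {
        next if m~^<~;    # leave tags untouched
        # Escape '&' first, unless it already starts an entity.
        s~&(?!(?:lt|gt|amp|quot|apos|\#x?[0-9A-Fa-f]{1,6});)~&amp;~g;
        s~<~&lt;~g;
        s~>~&gt;~g;
    }
    return join '', @tokens;
}

print escape_outside_tags('<b>R&D</b> costs < 5'), "\n";
```

Note that with this tokenizing pattern, a stray '<' that is later followed by a '>' in the same text (for example '1 < 2 > 0') is treated as a tag and left unescaped, so such input would need separate handling.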