Old 12-08-2024, 06:53 AM   #288
Markismus
Looking at the lines, it seems to be an issue with escape characters:
Code:
$ sed -n '275,285p' eng-ukr_Balla_v1.3_reconstructed.xdxf 
</ar>
<ar>
<head><k>A one</k></head><def><div style="margin-left:1em"><i class="p"><font color="green">adj</font></i&gt; <i class="p"><font color="green">амер.</font></i&gt;<i class="p"><font color="green">,</font></i&gt; <i class="p"><font color="green">розм.</font></i&gt;</div>
<div style="margin-left:1em">першокласний, відмінний</div></def>
</ar>
<ar>
<head><k>a posteriori</k></head><def><div style="margin-left:1em"><i class="p"><font color="green">лат.</font></i&gt;</div>
[m1]<font color="darkred"><b&gt;1.</b></font> <i class="p"><font color="green">adj</font></i&gt;
<div style="margin-left:1em">апостеріорний, заснований на досвіді</div>
[m1]<font color="darkred"><b&gt;2.</b></font> <i class="p"><font color="green">adv</font></i&gt;
<div style="margin-left:1em">апостеріорі, емпірично, з досвіду</div></def>
I've switched a few toggles that affect the unescaping of HTML characters, and the KOReader-optimized version looks good. You'll have to test the dic-file yourself.

It's in the ENG-UKR directory on pCloud.
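
For testing, a rough check along these lines might help. It is only a sketch: count_escaped_tag_closers is a made-up helper, not part of the converter, and it only catches attribute-less tags such as </i&gt; or <b&gt;, which is the pattern visible in the lines above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the converter): counts tags whose
# closing '>' was escaped to '&gt;', e.g. '</i&gt;' or '<b&gt;'.
# Assumption: only attribute-less tags are checked.
sub count_escaped_tag_closers {
    my ($text) = @_;
    # The empty-list assignment forces list context, so we get the match count.
    my $count = () = $text =~ m~</?[A-Za-z][A-Za-z0-9]*&gt;~g;
    return $count;
}

my $sample = '<i class="p"><font color="green">adj</font></i&gt; <b&gt;1.</b>';
print count_escaped_tag_closers($sample), "\n";   # prints 2
```

A count of 0 over the whole converted file would suggest this particular problem is gone. It says nothing about other escaping issues.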

I've also changed the subroutine escapeHTMLStringForced to skip the contents of tags. Due to the 2-factor authentication on Github, I still have to figure out how to push the commits to the remote, though. (Changes pushed to github.)

The new code is:
Code:
our $PossibleTags = qr~/?(def|mbp|c>|c c="|abr>|ex>|kref>|k>|key|rref|f>|!--|!doctype|a|abbr|acronym|address|applet|area|article|aside|audio|b>|b |base|basefont|bb|bdo|big|blockquote|body|/?br|button|canvas|caption|center|cite|code|col|colgroup|command|datagrid|datalist|dd|del|details|dfn|dialog|dir|div|dl|dt|em|embed|eventsource|fieldset|figcaption|figure|font|footer|form|frame|frameset|h[1-6]|head|header|hgroup|hr/|html|i>|i |iframe|img|input|ins|isindex|k|kbd|keygen|label|legend|li|link|map|mark|menu|meta|meter|nav|noframes|noscript|object|ol|optgroup|option|output|p|param|pre|progress|q>|rp|rt|ruby|s>|samp|script|section|select|small|source|span|strike|strong|style|sub|sup|table|tbody|td|textarea|tfoot|th|thead|time|title|tr|track|tt|u>|ul|var|video|wbr)~;
our $HTMLcodes = qr~(lt;|amp;|gt;|quot;|apos;|\#x?[0-9A-Fa-f]{1,6})~;
sub escapeHTMLString{
    my $String = shift;
    unless( $isEscapeHTMLCharacters ){ 
        info_t("returning without escaping '$String'");
        return $String; 
    }
    return( escapeHTMLStringForced($String) );}
sub escapeHTMLStringForced{
    my $String = shift;
    unless( defined $String ){ die2("Undefined string given to escapeHTMLStringForced."); }
    
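    # Turn the string into an array of tags and text segments.
    my @String;
    while( $String =~ s~^([^<>]*)(<[^<>]+>)~~s ){
        push @String, $1 if length $1;
        push @String, $2;
    }
    # Keep whatever text follows the last tag; otherwise it would be lost in the join below.
    push @String, $String if length $String;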
    foreach(@String){
        # Complete tags were split off above; leave their contents untouched.
        if( m~^<[^<>]+>$~s ){ next; }
        # Convert '<' to '&lt;', but not if it's part of an HTML tag.
        s~<(?!\/?$PossibleTags[^>]*>)~&lt;~gs;
        # Convert '>' to '&gt;', but not if it's part of an HTML tag.
        s~(?<!<$PossibleTags[^>]{0,100})>~&gt;~sg;
        # Convert '&' to '&amp;', but not if it's part of an HTML escape sequence.
        s~&(?!$HTMLcodes)~&amp;~gs;
        s~'~\&apos;~sg;
        s~"~\&quot;~sg;
        # Curly braces become numeric character references.
        s~\{~\&#123;~sg;
        s~\}~\&#125;~sg;
    }
    $String = join( '', @String );
    info_t("returning after escaped '$String'");
    return $String;}
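
To show the idea in isolation, here is a stripped-down sketch of "split into tags and text, then escape only the text". It is not the converter's code: escape_text_outside_tags is a hypothetical stand-in, it uses split instead of the while loop, and it ignores $PossibleTags, quotes and braces.

```perl
use strict;
use warnings;

# Hypothetical, simplified stand-in for escapeHTMLStringForced.
# Assumption: anything of the form <...> without nested angle brackets is a tag.
sub escape_text_outside_tags {
    my ($string) = @_;
    # A capturing group in split keeps the tags in the returned list.
    my @parts = split m~(<[^<>]+>)~, $string;
    for my $part (@parts) {
        next if $part =~ m~^<[^<>]+>$~;   # leave complete tags untouched
        # Escape '&' first, so the entities created below are not escaped again.
        $part =~ s~&(?!(?:lt|gt|amp|quot|apos|\#x?[0-9A-Fa-f]{1,6});)~&amp;~g;
        $part =~ s~<~&lt;~g;
        $part =~ s~>~&gt;~g;
    }
    return join '', @parts;
}

print escape_text_outside_tags('<i class="p">a > b & c</i> tail < end'), "\n";
# prints: <i class="p">a &gt; b &amp; c</i> tail &lt; end
```

The tag and its attributes come through unchanged, while the stray '>', '&' and '<' in the text, including the part after the last tag, are escaped.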

Last edited by Markismus; 12-28-2024 at 03:11 PM.