MobileRead Forums - View Single Post

Markismus · 12-08-2024, 06:53 AM

Looking at the lines, it seems to be an issue with escape characters:

Code:

$ sed -n '275,285p' eng-ukr_Balla_v1.3_reconstructed.xdxf 
</ar>
<ar>
<head><k>A one</k></head><def><div style="margin-left:1em"><i class="p"><font color="green">adj</font></i&gt; <i class="p"><font color="green">амер.</font></i&gt;<i class="p"><font color="green">,</font></i&gt; <i class="p"><font color="green">розм.</font></i&gt;</div>
<div style="margin-left:1em">першокласний, відмінний</div></def>
</ar>
<ar>
<head><k>a posteriori</k></head><def><div style="margin-left:1em"><i class="p"><font color="green">лат.</font></i&gt;</div>
[m1]<font color="darkred"><b&gt;1.</b></font> <i class="p"><font color="green">adj</font></i&gt;
<div style="margin-left:1em">апостеріорний, заснований на досвіді</div>
[m1]<font color="darkred"><b&gt;2.</b></font> <i class="p"><font color="green">adv</font></i&gt;
<div style="margin-left:1em">апостеріорі, емпірично, з досвіду</div></def>

I've switched a few toggles that affect the unescaping of HTML characters, and the KOReader-optimized version looks good. You'll have to test the dic-file yourself.
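
If you want to check the output for the same symptom, something along these lines could help. It is only a rough sketch: the helper name find_half_escaped_tags is made up for this post, and it assumes the problem always shows up as a tag that ends in a literal '&gt;' (as in '</i&gt;' or '<b&gt;' above).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the converter): returns the 1-based line
# numbers on which a tag ends in a literal '&gt;' instead of '>'.
sub find_half_escaped_tags {
    my ($text) = @_;
    my @hits;
    my $line_number = 0;
    foreach my $line ( split /\n/, $text ) {
        $line_number++;
        # '<' or '</', a tag name, then '&gt;' directly after the name.
        push @hits, $line_number if $line =~ m~</?[A-Za-z][A-Za-z0-9]*&gt;~;
    }
    return @hits;
}

# Sample lines modelled on the sed output above.
my $sample = join "\n",
    '<head><k>A one</k></head><def><i class="p"><font color="green">adj</font></i&gt;</def>',
    '<div style="margin-left:1em">першокласний, відмінний</div>',
    '[m1]<font color="darkred"><b&gt;1.</b></font>';

print join( ',', find_half_escaped_tags($sample) ), "\n";    # prints 1,3
```

To run it on a real file you would read the file contents into a string first and pass that to the helper.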

It's in the ENG-UKR directory on pCloud.

I've also changed the subroutine escapeHTMLStringForced to skip the contents of tags. Due to the 2-factor authentication on GitHub, I still have to figure out how to push the commits to the remote, though. (Update: changes pushed to GitHub.)

The new code is:

Code:

our $PossibleTags = qr~/?(def|mbp|c>|c c="|abr>|ex>|kref>|k>|key|rref|f>|!--|!doctype|a|abbr|acronym|address|applet|area|article|aside|audio|b>|b |base|basefont|bb|bdo|big|blockquote|body|/?br|button|canvas|caption|center|cite|code|col|colgroup|command|datagrid|datalist|dd|del|details|dfn|dialog|dir|div|dl|dt|em|embed|eventsource|fieldset|figcaption|figure|font|footer|form|frame|frameset|h[1-6]|head|header|hgroup|hr/|html|i>|i |iframe|img|input|ins|isindex|k|kbd|keygen|label|legend|li|link|map|mark|menu|meta|meter|nav|noframes|noscript|object|ol|optgroup|option|output|p|param|pre|progress|q>|rp|rt|ruby|s>|samp|script|section|select|small|source|span|strike|strong|style|sub|sup|table|tbody|td|textarea|tfoot|th|thead|time|title|tr|track|tt|u>|ul|var|video|wbr)~;
our $HTMLcodes = qr~(lt;|amp;|gt;|quot;|apos;|\#x?[0-9A-Fa-f]{1,6})~;
sub escapeHTMLString{
    my $String = shift;
    unless( $isEscapeHTMLCharacters ){ 
        info_t("returning without escaping '$String'");
        return $String; 
    }
    return( escapeHTMLStringForced($String) );}
sub escapeHTMLStringForced{
    my $String = shift;
    unless( defined $String ){ die2("Undefined string given to escapeHTMLString."); }
    
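    # Split the string into an array of tags and the text segments between them.
    my @String;
    while( $String =~ s~^([^<>]*)(<[^<>]+>)~~s ){
        push @String, $1 if length $1;
        push @String, $2;
    }
    # Keep whatever follows the last tag; otherwise trailing text (or a string without any tags) is lost.
    push @String, $String if length $String;
    foreach(@String){
    # Leave complete tags untouched.
    if( m~^<[^<>]+>\z~ ){ next; }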
    # Convert '<' to '&lt;', but not if it's part of a HTML tag.
    s~<(?!\/?$PossibleTags[^>]*>)~&lt;~gs;
    # Convert '>' to '&gt;', but not if it's part of a HTML tag.
    s~(?<!<$PossibleTags[^>]{0,100})>~&gt;~sg;
    # Convert '&' to '&amp;', but not if it is part of an HTML escape sequence.
    s~&(?!$HTMLcodes)~&amp;~gs;
    s~'~\&apos;~sg;
    s~"~\&quot;~sg;
    # Replace curly braces with their numeric character references.
    s~\{~\&#123;~sg;
    s~\}~\&#125;~sg;
    }
    $String = join( '', @String );
    info_t("returning after escaped '$String'");
    return $String;}
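
The idea behind the change (split the string into tags and text, then escape only the text) can be tried in isolation. The following is a simplified stand-in, not the converter's code: the name escape_text_outside_tags is made up, it treats every '<...>' run as a tag instead of checking against $PossibleTags, and it only handles '&', '<', '>' and '"'.

```perl
use strict;
use warnings;

# Simplified, hypothetical stand-in for escapeHTMLStringForced: every
# '<...>' run is copied through unchanged, only the text in between is escaped.
sub escape_text_outside_tags {
    my ($string) = @_;
    my @parts;
    # Peel "text, tag" pairs off the front of the string.
    while ( $string =~ s~^([^<>]*)(<[^<>]+>)~~s ) {
        push @parts, { text => $1, is_tag => 0 } if length $1;
        push @parts, { text => $2, is_tag => 1 };
    }
    # Whatever remains after the last tag is plain text.
    push @parts, { text => $string, is_tag => 0 } if length $string;

    foreach my $part (@parts) {
        next if $part->{is_tag};
        for ( $part->{text} ) {
            # Escape '&' first so the entities added below are not escaped twice.
            s~&(?!(?:lt|amp|gt|quot|apos|\#x?[0-9A-Fa-f]{1,6});)~&amp;~g;
            s~<~&lt;~g;
            s~>~&gt;~g;
            s~"~&quot;~g;
        }
    }
    return join '', map { $_->{text} } @parts;
}

print escape_text_outside_tags('<i class="p">R&D</i> says "hi"'), "\n";
# prints: <i class="p">R&amp;D</i> says &quot;hi&quot;
```

The quotes inside the tag survive because the tag is never touched; only the text parts go through the substitutions.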