11-14-2021, 03:32 PM   #111
Markismus
@Getkey Nice screenshot! If you're ready to test, I am willing to program a fix. I haven't had a Pocketbook for 2 years, so most of the newer Pocketbook features are as yet untested.

This is easily seen from the tail end of the script. "Create Stardict Dictionary" runs from line 1696 to 1745, while "Create Pocketbook Dictionary" runs from line 1749 to 1756. That is about six times as many lines (50 versus 8) for the tested conversion.

Line 375 actually replaces '&' symbols in the text that are _not_ followed by an HTML codepoint with the escape sequence '&amp;'. This prevents errors in parsers that do look at escape sequences.
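
A minimal sketch of that kind of replacement, not the script's actual line 375: the sub name escape_bare_ampersands and the exact pattern are assumptions for illustration. A negative lookahead leaves an '&' alone when a named, decimal or hexadecimal entity follows it.

```perl
use strict;
use warnings;

# Hypothetical helper, not taken from the script.
# Escapes every '&' that does not start an entity such as &amp;, &#160; or &#xE9;.
sub escape_bare_ampersands {
    my ($text) = @_;
    $text =~ s~&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)~&amp;~g;
    return $text;
}

print escape_bare_ampersands("Tom & Jerry &#233; &nbsp; &amp; R&D"), "\n";
```

Existing entities pass through untouched, while the bare '&' in "Tom & Jerry" and "R&D" gets escaped.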

These entries seem to use a decimal escape sequence. So 160->nbsp, 233->é, 232->è, etc. However, 9830->♦, which seems odd.
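
For reference, decoding those decimal values directly in Perl gives the characters listed. 9830 is 0x2666 in hexadecimal, the Unicode diamond suit symbol, so it fits the same decimal scheme. A small sketch; the sub name decode_decimal is only for illustration and is not part of the script.

```perl
use strict;
use warnings;
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: replaces decimal sequences like &#233; with their character.
sub decode_decimal {
    my ($text) = @_;
    $text =~ s~&#([0-9]{1,6});~chr($1)~eg;
    return $text;
}

# Print each decimal codepoint, its hexadecimal form and the decoded character.
for my $codepoint (160, 233, 232, 9830) {
    printf "%d = U+%04X -> %s\n", $codepoint, $codepoint, decode_decimal("&#$codepoint;");
}
```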

Non-breakable spaces (&nbsp;) are in fact converted to numbered sequences (&#160;) at line 971 and further:
Code:
sub convertNonBreakableSpacetoNumberedSequence{
	my $UnConverted = join('',@_);
	debugV("Entered sub convertNonBreakableSpacetoNumberedSequence");
	$UnConverted =~ s~\&nbsp;~&#160;~sg ;
	my @Converted = split(/$/, $UnConverted);
	return( @Converted );}
These are converted to characters in the next subroutine at line 977:
Code:
sub convertNumberedSequencesToChar{
	my $UnConverted = join('',@_);
	debugV("Entered sub convertNumberedSequencesToChar");
	$UnConverted =~ s~\&\#x([0-9A-Fa-f]{1,6});~chr("0x".$1)~seg ;
	$UnConverted =~ s~\&\#([0-9]{1,6});~chr(int($1))~seg ;
	return( split(/(\n)/, $UnConverted) );}
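
One thing that may be worth checking in the hexadecimal branch, offered as an observation rather than a confirmed cause: Perl does not read a string such as "0xE9" as a hexadecimal number in numeric context, so chr("0x".$1) yields chr(0), whereas chr(hex($1)) yields the intended character. The decimal branch is unaffected, so this would not explain decimal sequences such as &#233; remaining. The sketch below compares both forms; the sub names are made up for illustration.

```perl
use strict;
use warnings;

# Same substitution as quoted above; a string "0x..." numifies to 0 in Perl.
sub decode_hex_as_quoted {
    my ($text) = @_;
    no warnings 'numeric';    # silence the "isn't numeric" warning for this demo
    $text =~ s~\&\#x([0-9A-Fa-f]{1,6});~chr("0x".$1)~seg;
    return $text;
}

# Variant using hex(), which parses the captured digits as hexadecimal.
sub decode_hex_with_hex {
    my ($text) = @_;
    $text =~ s~\&\#x([0-9A-Fa-f]{1,6});~chr(hex($1))~seg;
    return $text;
}

printf "quoted form: codepoint %d\n", ord(decode_hex_as_quoted("&#xE9;"));
printf "hex() form:  codepoint %d\n", ord(decode_hex_with_hex("&#xE9;"));
```

The first line reports codepoint 0 and the second 233, so hexadecimal sequences would only decode correctly with the hex() variant.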

So the question is not whether we need another dependency, but why the subroutine is not used or fails for Van Dale FR-NL 2010.

In line 1621 and further, sub removeInvalidChars is defined, and it also replaces some character codepoints. It is also odd that those remain if the subroutine convertNumberedSequencesToChar is called.


So when is it called? Apparently, it is called if the SameTypeSequence is not "h". In line 1682 and further:
Code:
# If SameTypeSequence is not "h", remove &#xDDDD; sequences and replace them with characters.
if ( $SameTypeSequence ne "h" ){
	@xdxf_reconstructed = convertNumberedSequencesToChar(
							convertNonBreakableSpacetoNumberedSequence( @xdxf_reconstructed )
								) ;
}
So if you introduce an extra toggle, e.g.
Code:
my $ForceConvertNumberedSequencesToChar = 1;
.....
if ( $SameTypeSequence ne "h" or $ForceConvertNumberedSequencesToChar or $isCreatePocketbookDictionary){
......
You could test whether the results are nicer for the Pocketbook.
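
One way to compare the output with and without the toggle is to count the numbered sequences that remain. A rough sketch; count_numbered_sequences is a hypothetical helper and the sample strings are made up, neither is part of the script.

```perl
use strict;
use warnings;

# Hypothetical helper: counts leftover decimal or hexadecimal numbered sequences.
sub count_numbered_sequences {
    my ($text) = @_;
    # The empty-list assignment forces list context, so $count is the number of matches.
    my $count = () = $text =~ m~&#(?:[0-9]{1,6}|x[0-9A-Fa-f]{1,6});~g;
    return $count;
}

my $before = "caf&#233; &#160;&#x2666;";           # sample entry before conversion
my $after  = "caf\x{E9} \x{A0}\x{2666}";           # same entry after conversion
print "before: ", count_numbered_sequences($before), "\n";
print "after:  ", count_numbered_sequences($after), "\n";
```

A count of zero on the converted output would suggest the toggle took effect for the Pocketbook path.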

If you aren't able to run the script yourself, I am willing to give it a whirl and send you the result if you're willing to test.