Old 02-05-2020, 03:09 AM   #280
Markismus
Guru
Posts: 959
Karma: 149907
Join Date: Jul 2013
Location: Rotterdam
Device: HiSenseA5ProCC, Cracked OnyxNotePro, Note5, Kobo Glo, Aura
Yes, I see it. Garbage in = garbage out, I am afraid.

This is the entry in chunk c_1 of the archive dict-file:
Code:
<b>A,</b> N.&nbsp;m. [<f>&a;</f>] ou [<f>&â;</f>] Voyelle et première lettre de l'alphabet. <i>Une panse d' <i>a </i>, </i>la première partie d'un petit <i>a </i> dans l'écriture. <f>&os;</f>&nbsp;<i>N'avoir pas fait une panse d' <i>a </i>, </i>c'est-à-dire n'avoir rien écrit. <f>&ns;</f>&nbsp;<i>Prouver par A + B, </i>avec précision et rigueur. <f>&ns;</f>&nbsp;<i>De A à Z, </i>du début à la fin. <f>&ns;</f>&nbsp;<i>A4, </i>format d'une feuille de papier de 21&nbsp; X &nbsp;29,7&nbsp;cm. <i>A3, </i>format 29,7&nbsp; X &nbsp;42&nbsp;cm. <f>&ns;</f>&nbsp;La.
And this is the entry in the Stardict file generated by Penelope:


As you can see, the accented characters are not the problem. Rather, the idx-file contains a doctype definition that defines all those entities:
Code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"[
<!ENTITY ns "♦">
<!ENTITY os "•">
<!ENTITY oo "›">
<!ENTITY co "‹">
<!ENTITY a  "a">
<!ENTITY â  "&#x0251;">
<!ENTITY an "&#x0251;&#x303;">
<!ENTITY b  "b">
<!ENTITY d  "&#x0257;">
<!ENTITY e  "&#x0259;">
<!ENTITY é  "e">
<!ENTITY è  "&#x025B;">
<!ENTITY in "&#x025B;&#x303;">
<!ENTITY f  "f">
<!ENTITY g  "&#x0261;">
<!ENTITY h  "h">
<!ENTITY h2 "&#x0027;">
<!ENTITY i  "i">
<!ENTITY j  "J">
<!ENTITY k  "k">
<!ENTITY l  "l">
<!ENTITY m  "m">
<!ENTITY n  "n">
<!ENTITY gn "&#x0272;">
<!ENTITY ing "&#x0273;">
<!ENTITY o  "o">
<!ENTITY o2 "&#x0254;">
<!ENTITY oe "&#x0276;">
<!ENTITY on "&#x0254;&#x303;">
<!ENTITY eu "&#x0278;">
<!ENTITY un "&#x0276;&#x303;">
<!ENTITY p  "p">
<!ENTITY r  "&#x0280;">
<!ENTITY s  "s">
<!ENTITY ch "&#x0283;">
<!ENTITY t  "t">
<!ENTITY u  "&#x0265;">
<!ENTITY ou "u">
<!ENTITY v  "v">
<!ENTITY w  "w">
<!ENTITY x  "x">
<!ENTITY y  "y">
<!ENTITY z  "z">
<!ENTITY Z  "&#x0292;">]>
<html xml:lang="fr"
	xmlns="http://www.w3.org/1999/xhtml">
	<head>
		<title></title>
	</head>
	<body>
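For what it's worth, here is a rough Python sketch of how those ENTITY declarations could be applied to an entry by hand. It is not what Penelope or any dictionary app actually does. The function names parse_entities and expand_entities are made up for illustration, and the sample uses only a few of the declarations and a shortened entry:

```python
import html
import re

# Matches declarations such as <!ENTITY ns "♦"> and captures name and value.
ENTITY_DECL = re.compile(r'<!ENTITY\s+(\S+)\s+"([^"]*)"\s*>')


def parse_entities(doctype):
    """Map each declared entity name to its replacement text (hypothetical helper)."""
    # html.unescape resolves numeric references such as &#x0251; inside the value.
    return {name: html.unescape(value) for name, value in ENTITY_DECL.findall(doctype)}


def expand_entities(text, entities):
    """Replace &name; with its declared value; unknown references stay untouched."""
    def substitute(match):
        return entities.get(match.group(1), match.group(0))
    return re.sub(r'&([^\s;&#]+);', substitute, text)


# A few declarations copied from the doctype, not the full list.
doctype = '''<!ENTITY ns "♦">
<!ENTITY os "•">
<!ENTITY a  "a">
<!ENTITY â  "&#x0251;">
<!ENTITY an "&#x0251;&#x303;">'''

# Shortened version of the entry from chunk c_1.
entry = "<b>A,</b> N.&nbsp;m. [<f>&a;</f>] ou [<f>&â;</f>] Voyelle. <f>&os;</f>&nbsp;<i>De A à Z</i>"

entities = parse_entities(doctype)
expanded = expand_entities(entry, entities)
print(expanded)
```

The &nbsp; references are left alone on purpose, since they are not declared in the doctype and only make sense to a reader that treats the entry as HTML.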
However, the &nbsp; sequence is standard HTML for a non-breaking space. If you change the sametypesequence in the ifo file from m (plain text) to h (HTML), the sequences still don't disappear in linguae, but Goldendict does switch (as does Koreader):

sametypesequence=m



sametypesequence=h
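If you would rather script that change than edit the ifo file by hand, something like the following Python sketch should do. The helper name set_sametypesequence and the sample ifo content are assumptions for illustration; a real ifo file carries the actual book name and counts:

```python
import re


def set_sametypesequence(ifo_text, new_type):
    """Return ifo_text with the sametypesequence value replaced (hypothetical helper)."""
    updated, count = re.subn(
        r'(?m)^sametypesequence=[^\r\n]*',
        lambda match: 'sametypesequence=' + new_type,
        ifo_text,
    )
    if count == 0:
        raise ValueError("no sametypesequence line found")
    return updated


# Made-up ifo content; only the sametypesequence line matters here.
sample_ifo = (
    "StarDict's dict ifo file\n"
    "version=2.4.2\n"
    "bookname=Example\n"
    "wordcount=1\n"
    "idxfilesize=10\n"
    "sametypesequence=m\n"
)

result = set_sametypesequence(sample_ifo, "h")
print(result)
```

To use it on a real dictionary, read the ifo file as UTF-8 text, pass it through the function, and write the result back, keeping a copy of the original.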

Last edited by Markismus; 02-05-2020 at 04:04 AM.