MobileRead Forums - View Single Post

eureka · 02-06-2012, 11:27 AM

Quote:

Originally Posted by JustAMan

While reading & translating the resulting file I found something that should really be blacklisted - there are many CSS styles (or parts of styles) out there! I don't think it would benefit the end user if a translator (say, not knowing HTML/CSS at all) translates things like "{width}px" to "{width}something".

OK, that string in pillow/en_US/strings/media_player_bar_strings.properties with a bit of CSS will be blacklisted.

Quote:

Originally Posted by JustAMan

eureka,
Your fix to js compilation of media player isn't complete

See screenshot. However, it might not be your fault, but Amazon's... I cannot find where this MessageFormat stuff is defined, so it might even be built-in, which would be bad...

And could anyone tell me where this "Off" button resides? I think I saw it as a picture, so it could be hard to translate, but feel free to prove me wrong.

BTW, I have an issue with the USB plug screen containing no text at all, and the same issue is reported by other users of the ru_RU locale. Any ideas?

UPD
A bit of searching led me to the conclusion that they might be using the MessageFormat class from icu4.jar:com.ibm.icu.text. If that class has a bug translating UTF-8 text to displayed text, then translation of this element in a non-Latin encoding might be doomed... unless we patch this class

Try the new version of the js_resources tool. The quirk with the wrong message encoding may be fixed there.

No messages in the blanket part (USB plug screen etc.) possibly means the ru_RU locale definition is missing. Sorry, I can't say for sure, a KT is far away from me.

And no Java class caused that garbled string. (Pillow HTML/JS rendering is handled by the WebKit engine, not by the JVM.) It's most probably (as I've said) a bug in the js_resources tool, which outputs the string in the wrong encoding (i.e. not UTF-8).
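To illustrate how a wrong output encoding produces a garbled string, here is a minimal Python 3 sketch. It is not the js_resources code: write_resource is a hypothetical helper, the sample string is made up, and cp1251 is only one example of a non-UTF-8 encoding (the actual failure mode in the tool isn't known here).

```python
import os
import tempfile

def write_resource(path, text):
    """Write a resource file, always as UTF-8 (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

message = "Выкл"  # sample non-Latin string, chosen for illustration

# Correct path: UTF-8 bytes survive a UTF-8 read unchanged.
path = os.path.join(tempfile.mkdtemp(), "strings.js")
write_resource(path, message)
with open(path, "rb") as f:
    round_trip = f.read().decode("utf-8")

# Faulty path: bytes written in a legacy code page, then read by a
# consumer that expects UTF-8. Invalid sequences become U+FFFD.
garbled = message.encode("cp1251").decode("utf-8", errors="replace")
```

Passing the encoding explicitly on every write avoids depending on the platform default, which is a common source of this kind of bug.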

BTW, I've committed some HTML sanitizing for JS resources. It now escapes every tag outside a small whitelist (which includes <a>) and strips all attributes from the whitelisted tags. It is based on the html5lib sanitizer, customized to add a "sanitized" prefix to each escaped tag and to handle named HTML entities in text specially (so an &nbsp; occurring in the resources isn't replaced by the Unicode character). html5lib includes (and uses) a full-featured HTML parser, so it should be safer than a regexp-based approach or any simpler parser (like Python's bundled SGMLParser or HTMLParser).

Example:
source: test<script type="text/javascript">alert('pwned!');</script> simple paragraphparagraph with attribute<a href="javascript:alert('pwned too!')">click me!</a><b>bold text</b>

result: test<sanitizedscript type="text/javascript">alert('pwned!');</sanitizedscript> simple paragraphparagraph with attribute<a>click me!</a><sanitizedb>bold text</sanitizedb>
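For readers who want to see the transformation spelled out, here is a rough Python 3 sketch that reproduces the output format of the example above. It is deliberately NOT the html5lib-based implementation: it uses the standard library's HTMLParser (one of the simpler parsers mentioned above) only so that it runs without extra dependencies. The whitelist contents and all names are assumptions, and it should not be treated as a hardened sanitizer.

```python
from html.parser import HTMLParser

# Assumption: only <a> is confirmed as whitelisted; "p" is a guess.
ALLOWED_TAGS = {"a", "p"}

class SketchSanitizer(HTMLParser):
    def __init__(self):
        # convert_charrefs=False keeps entities such as &nbsp; as written.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.out.append("<%s>" % tag)  # all attributes are dropped
        else:
            raw = self.get_starttag_text()
            # Rename the tag, keep the rest of the original markup.
            self.out.append("<sanitized" + tag + raw[1 + len(tag):])

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags: emit the start form only, no extra end tag.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        prefix = "" if tag in ALLOWED_TAGS else "sanitized"
        self.out.append("</%s%s>" % (prefix, tag))

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)  # named entity passed through

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)  # numeric reference passed through

def sanitize(source):
    parser = SketchSanitizer()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)

result = sanitize(
    "test<script type=\"text/javascript\">alert('pwned!');</script>"
    "<a href=\"javascript:alert('pwned too!')\">click me!</a>"
    "<b>bold text</b>"
)
```

Comments and doctype declarations are silently dropped here because the sketch defines no handlers for them; whether the real tool does the same is not stated in the post.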

But still, it is dangerous to include JS resources in the automated build of localization bundles. They contain some URLs (like the default bookmarks or the search URLs in the browser) which could also be misused... I'll have to blacklist all URLs from JS resources.
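One possible shape for such a URL blacklist, as a rough Python 3 sketch. The pattern, the function name and the resource keys are all made up for illustration; the real resource files and the actual blacklist mechanism may look quite different.

```python
import re

# Assumed pattern: explicit schemes plus bare "www." hosts.
# A real filter may need to cover more cases.
URL_PATTERN = re.compile(r"\b(?:https?|ftp)://\S+|\bwww\.\S+", re.IGNORECASE)

def split_by_url(resources):
    """Return (safe, blacklisted) dicts; values containing a URL are blacklisted."""
    safe, blacklisted = {}, {}
    for key, value in resources.items():
        target = blacklisted if URL_PATTERN.search(value) else safe
        target[key] = value
    return safe, blacklisted

# Hypothetical resource keys and values, for illustration only.
resources = {
    "browser.bookmark.default": "http://www.example.com/",
    "browser.search.url": "https://search.example.com/?q={query}",
    "browser.button.back": "Back",
}
safe, blacklisted = split_by_url(resources)
```

Only the entries in safe would then be offered to translators; the blacklisted ones stay as shipped.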