Forgive my ignorance here.
This is an example of an RTL language, correct?
Assuming that the first screenshot is "how it looks" and the second screenshot is "how it should look":
I loaded your sampleembed.epub and I see the second picture, judging by the first word sitting at the far right of the top line.
Is that correct behaviour or broken behaviour?
What version of Sigil are you using and on what platform?