Old 11-27-2013, 04:18 AM   #1
pgfiore
Enthusiast
Posts: 34
Karma: 10
Join Date: Dec 2012
Location: Italy
Device: Kindle
accented chars are weirdly managed

May I ask for some help here?
I'm facing a problem when I edit an Italian epub in Sigil, which shows me text like:
"èacquistabile"
"accessibilitàalle"

It seems that the accented characters break the rendering; the correct text should be:
"è acquistabile"
"accessibilità alle"
  • The epub displays fine when opened in e-book readers.
  • Copying and pasting the wrong-looking text (e.g. into MS Word) is not affected by the error.
  • Moving the cursor through the affected spot with the arrow keys "loses one step": the cursor does not move for one key press.

I have a recent Sony Vaio with an up-to-date NVIDIA GeForce, Windows 7, and Sigil 0.7.4.
Thanks in advance for any help.
P.S. Sorry for my poor English, I did my best.
Old 11-27-2013, 04:49 AM   #2
Toxaris
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
I suspect there is a character after the accented letter that is causing this. You can check it in Code View: position the cursor before the accented character, then press the right arrow key twice. Where is the cursor now? Did it appear to move past the accented character?
Also, how does the text look in Code View?
Old 11-28-2013, 08:17 AM   #3
pgfiore
Ciao Toxaris, sorry, I did not explain it clearly.
I always use Sigil in Code View; the examples I posted were taken from Code View (I did not know about Book View, by the way).
I can now add that in Book View the characters are correct: the space is there, and the cursor moves from left to right with the right arrow with no "gap".

In Code View, on the contrary, I do not see the space, and when moving from left to right I need to press the arrow twice to get past the accented vowel!
Using this example, "èacquistabile" (code view), I need two arrow hits to move from "è" to "a"...
Old 11-28-2013, 09:12 AM   #4
Toxaris
That is what I expected. There is a hidden character between the accented letter and the non-accented letter. What is the source of the document?

It could be a thin space or a zero width joiner.
Old 11-28-2013, 09:13 AM   #5
Doitsu
Grand Sorcerer
Posts: 5,681
Karma: 24031401
Join Date: Dec 2010
Device: Kindle PW2
Quote:
Originally Posted by pgfiore View Post
Using this example, "èacquistabile" (code view), I need two arrow hits to move from "è" to "a"...
That's because the accented è character is composed of two characters:

Code:
e U+0065 LATIN SMALL LETTER E
◌̀ U+0300 COMBINING GRAVE ACCENT
This behavior is by design; most Unicode-aware editors behave this way.
Most likely your display issues are caused by these and other combining characters. Try replacing them with precomposed characters.

For example, replace U+0065 + U+0300 with:
Code:
è U+00E8 LATIN SMALL LETTER E WITH GRAVE
You can pick this and other accented characters from the Windows Character Map.
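
As a rough illustration of the replacement described above, Unicode normalization form NFC performs the same composition programmatically. This is a minimal Python sketch using only the standard library's unicodedata module; the variable names are made up for the example, and it is not something Sigil itself does.

```python
import unicodedata

# "e" followed by U+0300 COMBINING GRAVE ACCENT: two code points, one visible glyph
decomposed = "e\u0300 acquistabile"

# NFC composes base letter + combining mark into the precomposed character where one exists
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))   # the composed string is one code point shorter
print(composed[0] == "\u00e8")          # first character is now U+00E8
```

Sequences that have no precomposed equivalent are left as they are by NFC, so this only helps for common accented letters such as è and à.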

EDIT: To identify the problematic Unicode characters, visit this website, paste a paragraph with accented characters and missing spaces into the Unicode Text box, click Convert and post the results here.
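
If a local check is preferred over a website, a short script can list the code points in a pasted string. This is a hedged sketch in Python; `describe` is a hypothetical helper name, and the sample string is the sequence discussed in this thread.

```python
import unicodedata

def describe(text):
    """Return one 'U+XXXX NAME' line per code point in text."""
    lines = []
    for ch in text:
        # unicodedata.name() takes a default for characters without a name
        name = unicodedata.name(ch, "<unnamed>")
        lines.append(f"U+{ord(ch):04X} {name}")
    return lines

for line in describe("e\u0300 a"):
    print(line)
```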

Last edited by Doitsu; 11-28-2013 at 10:36 AM.
Old 11-28-2013, 09:46 AM   #6
Jellby
frumious Bandersnatch
Posts: 7,543
Karma: 19001583
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
If you don't see a space in Code View, I'd say that's a bug (probably in some Qt component). If the problem is what Doitsu says (è encoded as two characters), I would expect this to happen:

|è acquistabile
(press right arrow)
è| acquistabile
(press right arrow)
è| acquistabile (again!)
(press right arrow)
è |acquistabile

but:

|è acquistabile
(press right arrow)
è| acquistabile
(press "o" key)
eò| acquistabile
(press right arrow)
eò| acquistabile
(press "o" key)
eòo| acquistabile

which makes sense if you consider the encoding sequence is actually:

e` acquistabile

But the combining accent causing the following space to disappear doesn't look correct... unless there's some other catch, like the space being some kind of special space.
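
The second sequence above can be reproduced outside any editor. This is a small Python sketch, assuming the cursor position is simply an index into the code-point sequence (the editor's internals may differ):

```python
import unicodedata

text = "e\u0300 acquistabile"   # e + COMBINING GRAVE ACCENT + space + ...
cursor = 1                       # one right-arrow press: between "e" and the accent

# Typing "o" here puts it before the combining accent, so the accent attaches to the "o"
text = text[:cursor] + "o" + text[cursor:]

# Normalizing to NFC makes the result easy to inspect: "e", then o-with-grave
result = unicodedata.normalize("NFC", text)
print(result)
```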
Old 11-28-2013, 12:51 PM   #7
mrmikel
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
You can right-click the file, choose Open With, and open it in an editor with a hex view, such as PSPad, then look at this particular spot. Check whether there is a byte value there that does not appear between other characters in the document; delete it and see what happens (after making sure you have a backup).

Sigil, in the past, has not shown all characters in Code View. This may have changed now that it uses an updated version of Qt.

This has never happened to me with text created in Sigil, but it has happened when I scraped something off a web page.
Old 11-29-2013, 04:20 AM   #8
pgfiore
"If you don't see a space in Code View": exactly! My issue occurs only in Code View; in Book View the text is perfect...

- CODE VIEW
|èacquistabile
(press right arrow)
|èacquistabile (nothing happened)
(press right arrow)
è|acquistabile (the "|" makes it look as if there were a space, but the two vowels are actually shown joined together)

- HEX
65CC8020616371756973746162696C65 (continuous string, no special html inside)

"CC80"???? They are two, definitely; tea for two bloody chars?

- THE WEBSITE
The HTML conversion of the string (copied from Code View) is: "e$#768; acquistabile" (with $ used in place of &)
BUT please note that the Unicode string *shows* the space when pasted into the box

Thanks to you all for the help
ciao
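
For what it's worth, the numeric reference in that HTML conversion can be decoded to confirm which character it is. This is a minimal Python sketch with the standard html and unicodedata modules, written with the real & rather than $:

```python
import html
import unicodedata

converted = "e&#768; acquistabile"

# html.unescape() resolves numeric character references; 768 decimal is 0x300
decoded = html.unescape(converted)

print(unicodedata.name(decoded[1]))   # name of the character behind &#768;
print(decoded[2] == " ")              # the space is present in the data
```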

Last edited by pgfiore; 11-29-2013 at 04:26 AM.
Old 11-29-2013, 05:09 AM   #9
Jellby
Quote:
Originally Posted by pgfiore View Post
- HEX
65CC8020616371756973746162696C65 (continuous string, no special html inside)

"CC80"???? They are two, definitely; tea for two bloody chars?
OK, that is correct. CC 80 is the combining grave accent (U+0300) encoded in UTF-8. And yes, UTF-8 uses two bytes for many characters outside basic ASCII; it uses three bytes for most CJK characters and four for characters outside the Basic Multilingual Plane. There's just no way you can use only one byte for every character when you have many more than 256 of them.

http://www.fileformat.info/info/unic...0300/index.htm

So the string is "e", "combining grave accent", "space", "a", "c", etc. Nothing wrong with that, and there's definitely a bug in code view if it's not showing the space. But it's not necessarily a bug in Sigil, it could be in some of the libraries it's using.

As you have been told already, the easiest solution is to use the precomposed character "è", instead of "e"+"combining grave accent". That's:

C3 A8 20 61 63 71 75 69 73 74 61 62 69 6C 65

http://www.fileformat.info/info/unic...00e8/index.htm
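
Both byte sequences can be checked with a few lines of Python. This is only a sketch based on the hex string posted above; nothing here is specific to Sigil, and the variable names are illustrative.

```python
import unicodedata

posted_hex = "65CC8020616371756973746162696C65"

# Decode the posted bytes as UTF-8: e, U+0300, space, "acquistabile"
decomposed = bytes.fromhex(posted_hex).decode("utf-8")

# Recompose to the precomposed form and look at the resulting bytes
composed = unicodedata.normalize("NFC", decomposed)
composed_hex = composed.encode("utf-8").hex(" ").upper()   # hex(sep) needs Python 3.8+

print([f"U+{ord(c):04X}" for c in decomposed[:3]])
print(composed_hex)
```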
Old 11-29-2013, 08:44 AM   #10
pgfiore
Coding in COBOL, I never used more than... well, probably a couple of dozen characters, upper case only, that goes without saying! ;-)
64 ought to be enough for anybody!!!

And now?
I asked the author of "[...] implement it yourself and submit a pull request on GitHub.", but it seems they don't accept patches in COBOL.
Old 11-29-2013, 11:09 AM   #11
Toxaris
One question remains unanswered. What is the source of the ePUB or HTML used?
Old 12-04-2013, 04:44 AM   #12
pgfiore
Well Toxaris, I'm not sure I caught the real meaning of your question, but I'll give it a try.
The source is:
- a mobi converted directly to epub with one of the latest calibre versions.
- a piece of code like the following, with the affected words in yellow (please note I cannot guarantee the "è" is still encoded correctly after two copy/pastes; the real hex is in my post above):

<?xml version='1.0' encoding='utf-8'?>

<html xmlns="http://www.w3.org/1999/xhtml">

<head>

<title>Mio Titolo
</title>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

<link href="stylesheet.css" rel="stylesheet" type="text/css"/>

<link href="page_styles.css" rel="stylesheet" type="text/css"/>

</head>

<body class="calibre">
<p id="filepos305" class="calibre1">
<span class="calibre2">
<span class="bold">MIO TITOLO
</span>
</span>
</p>
<p class="calibre1">
<span class="calibre3">AUTHOR
</span>
</p>
<p class="calibre1">
<span class="calibre3">TRATTO DA UNA REALTÀ ATTUALE
</span>
</p>
<p class="calibre4" style="margin:0pt; border:0pt; height:1em">*
</p>
<p class="calibre4" style="margin:0pt; border:0pt; height:1em">*
</p>
<p class="calibre5">
<span class="calibre3"> Il romanzo, cartaceo e ebook (Epub e Mobi), è acquistabile sul sito*
</span>
<a href="www.miosito.it">
<span class="calibre3">
<span class="calibre6">
<span class="underline">www.miosito.it
</span>
</span>
</span>
</a>
<span class="calibre3">

</span>
</p>
<p class="calibre5">
<span class="calibre3"> Versione Epub per l’accessibilità alle persone ipovedenti e non vedenti (lettura audio, braille digitale, e a caratteri ingranditi).
</span>
</p>
<span class="calibre7">
</span>
</p>
<p class="calibre1">
<span class="calibre3">Obi Wan</span>
</p>
<p class="calibre4" style="margin:0pt; border:0pt; height:1em">*
</p>
<div class="mbppagebreak" id="calibre_pb_0">
</div>

</body>
</html>
Old 12-04-2013, 08:24 AM   #13
Jellby
Quote:
Originally Posted by pgfiore View Post
<span class="calibre3"> Versione Epub per l’accessibilità alle persone ipovedenti e non vedenti (lettura audio, braille digitale, e a caratteri ingranditi).
Funny that encoding the accents as combining characters is more likely to break text-to-speech
Old 12-06-2013, 03:54 AM   #14
pgfiore
That sentence was written by the editor, while the epub was generated by calibre from a mobi.
I never saw the original epub; it could be different, couldn't it?
Old 12-06-2013, 04:16 AM   #15
Doitsu
Quote:
Originally Posted by Toxaris View Post
One question remains unanswered. What is the source of the ePUB or HTML used?
Actually, the answer is irrelevant, because what the OP described is either a Sigil bug or a Qt 5.1.0 bug.

Any word ending in a base character followed by a combining character, a space, and additional characters will be displayed without the space in Code View. The only known fix is to replace all base + combining character sequences with their precomposed Unicode equivalents.

You can easily test this by looking at my test file. The first word, Thé, ends in e + 'COMBINING ACUTE ACCENT' (U+0301) followed by a space which is not displayed in Code View mode.
Most likely nobody noticed this bug, because combining accents aren't really necessary anymore and most Unicode texts contain (pre-combined) accented characters.
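
To find the affected spots in a text before fixing them, something like the following could work. It is a sketch, not a tested Sigil plugin: `find_hidden_spaces` and `fix_text` are hypothetical names, and it assumes the text has already been read into a Unicode string.

```python
import unicodedata

def find_hidden_spaces(text):
    """Return indexes of spaces that directly follow a combining character."""
    hits = []
    for i in range(1, len(text)):
        # unicodedata.combining() returns a non-zero class for combining marks
        if text[i] == " " and unicodedata.combining(text[i - 1]) != 0:
            hits.append(i)
    return hits

def fix_text(text):
    """Replace base + combining sequences with precomposed characters where possible."""
    return unicodedata.normalize("NFC", text)

sample = "The\u0301 test"   # "e" + COMBINING ACUTE ACCENT, then a space
print(find_hidden_spaces(sample))
print(find_hidden_spaces(fix_text(sample)))
```

Sequences without a precomposed equivalent stay decomposed after NFC, so a second pass with `find_hidden_spaces` shows whether anything is left to fix by hand.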
Attached Files
File Type: epub unicode_test.epub (1.8 KB, 251 views)