Old 10-16-2019, 07:09 PM   #15
Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by phossler
@BR -- good tip. Would have saved a lot of head scratching.

That EOPA char was 0-width and invisible in the edit window. At least it showed as an empty square in Preview, or I never would have known I had a problem.

In Sigil, if you double-click on the character in the report, it should insert it into Find.

In Calibre, it doesn't put it in Find, but it does jump you to the next spot the character occurs.

Side Note: And I also take a quick look at that Report on every book for any strange characters. Usually things like soft hyphens stand out (if the thousands of red squigglies didn't give it away).

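One of the latest books I worked on accidentally had the Cyrillic capital letter С (U+0421) instead of the Latin capital C (U+0043). They look identical on screen, so the Report is about the only place you'd catch it.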
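If you want the same kind of character report outside of Sigil/Calibre, here is a rough Python sketch. This is NOT how either editor builds its Report; the function name `character_report` and the sample string are made up for illustration. It just counts everything outside printable ASCII and labels it with its Unicode name:

```python
import unicodedata
from collections import Counter

def character_report(text):
    """List every character outside printable ASCII with its code point, name, and count.

    Hypothetical helper for illustration only; not the Sigil or Calibre implementation.
    """
    counts = Counter(
        ch for ch in text
        if ord(ch) > 0x7E or (ord(ch) < 0x20 and ch not in "\t\n\r")
    )
    report = []
    for ch, n in sorted(counts.items()):
        # C1 control codes such as U+0097 have no formal Unicode name,
        # so fall back to a placeholder label.
        name = unicodedata.name(ch, "<control or unnamed>")
        report.append((f"U+{ord(ch):04X}", name, n))
    return report

# Sample with a Cyrillic С, a soft hyphen, and a stray U+0097.
sample = "\u0421hapter One: a co\u00adoperative effort\u0097at last"
for row in character_report(sample):
    print(row)
```

The Cyrillic С shows up as CYRILLIC CAPITAL LETTER ES, which is a lot harder to miss than the glyph itself.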

Quote:
Originally Posted by phossler
@Tex2002ans -- thanks for the info. I'm going to assume that the EOPA was intended to be an em dash and not some weird old control character.

I can't find the link now, but a few of the sites discussed the technical encoding issues between:

- Windows-1252
- ISO-8859-1
- Unicode

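While they're mostly the same... the obscure control code points (the 0x80-0x9F range) just so happen to be where many differences lie. Windows-1252 puts printable characters like curly quotes and dashes there, while ISO-8859-1 and Unicode reserve that range for control codes. So when you make (wrong) assumptions about encoding:

EN DASH (Windows-1252 byte 0x96) -> "Start of Protected Area" (Unicode U+0096)
EM DASH (Windows-1252 byte 0x97) -> "End of Protected Area" (Unicode U+0097)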

Programs botch encoding along the way!
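To see the mix-up described above in action, here is a tiny Python sketch. The sample bytes are made up; `cp1252` and `latin-1` are Python's standard codec names for Windows-1252 and ISO-8859-1. The same byte decodes to two different code points depending on which encoding the program assumes:

```python
# Byte 0x97 as a Windows-1252 program would write an em dash (made-up sample text).
raw = b"wait\x97what"

as_cp1252 = raw.decode("cp1252")   # correct assumption
as_latin1 = raw.decode("latin-1")  # wrong assumption

print(f"U+{ord(as_cp1252[4]):04X}")  # U+2014, EM DASH
print(f"U+{ord(as_latin1[4]):04X}")  # U+0097, the invisible control code

# Same story for the en dash at byte 0x96.
print(f"U+{ord(bytes([0x96]).decode('cp1252')):04X}")   # U+2013, EN DASH
print(f"U+{ord(bytes([0x96]).decode('latin-1')):04X}")  # U+0096
```

Once the text has been saved back out as Unicode with U+0097 in it, the original byte is gone and every later program just sees the control code.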

https://stackoverflow.com/questions/...h-151-and-8212
https://stackoverflow.com/questions/...nicode-in-java
https://unix.stackexchange.com/quest...cter-in-a-file
https://stackoverflow.com/questions/...rea-characters

Doesn't help that many browsers/renderers also decide to be helpful and assume you were a dunce... and display those characters instead of keeping them invisible (look at the "Browser" column):

https://www.fileformat.info/info/uni...ement/list.htm

So it can easily still LOOK like an EM DASH (U+2014), even though under the surface it's the END OF GUARDED AREA (U+0097), which is just another name for the "End of Protected Area" control code.
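If you'd rather fix these in bulk than hunt them down one at a time, something like this rough Python sketch should work. It assumes (like phossler did) that every stray U+0080-U+009F character was meant to be its Windows-1252 counterpart, which may not hold for every book. The function name `repair_c1_controls` is made up:

```python
def repair_c1_controls(text):
    """Replace C1 control characters with the Windows-1252 character at the same byte value.

    Hypothetical helper; assumes the controls came from a Windows-1252/ISO-8859-1 mix-up.
    """
    out = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                # Bytes 0x81, 0x8D, 0x8F, 0x90, 0x9D are undefined in Python's
                # cp1252 codec, so leave those characters untouched.
                pass
        out.append(ch)
    return "".join(out)

fixed = repair_c1_controls("wait\u0097what")
print(f"U+{ord(fixed[4]):04X}")
```

I'd still run the character report afterwards and eyeball a few of the replacements before trusting it on a whole book.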

Last edited by Tex2002ans; 10-16-2019 at 07:19 PM.