Quote:
Originally Posted by phossler
@BR -- good tip. Would have saved a lot of head scratching.
That EOPA char was 0-width and invisible in the edit window. At least it showed as an empty square in Preview, or I never would have known I had a problem.
|
In Sigil, if you double-click on the character in the report, it should insert it into Find.
In Calibre, it doesn't put it in Find, but it does jump you to the next spot where the character occurs.
Side Note: I also take a quick look at that Report on every book for any strange characters. Usually things like soft hyphens stand out (if the thousands of red squigglies didn't give it away).
One of the latest books I worked on accidentally had the Cyrillic capital letter С (U+0421) instead of the Latin capital C (U+0043).
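If you want the same kind of check outside Sigil/Calibre, here is a rough sketch in Python. The function name `unusual_characters` and the sample string are made up for illustration; this is not how either editor builds its report, just a quick way to list every non-ASCII character with its Unicode name so lookalikes and invisible characters stand out.

```python
import unicodedata
from collections import Counter

def unusual_characters(text):
    """List each non-ASCII character as (code point, Unicode name, count)."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    report = []
    for ch, n in sorted(counts.items()):
        # Control characters have no formal name, so fall back to a placeholder.
        name = unicodedata.name(ch, "<no name, probably a control character>")
        report.append((f"U+{ord(ch):04X}", name, n))
    return report

# Hypothetical sample: Cyrillic С, a soft hyphen, and a stray U+0097.
sample = "\u0421hapter One: a hy\u00ADphen and a dash\u0097maybe"
for code, name, n in unusual_characters(sample):
    print(code, name, n)
```

The Cyrillic letter shows up as CYRILLIC CAPITAL LETTER ES, which is a lot easier to spot in a list than in the text itself.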
Quote:
Originally Posted by phossler
@Tex2002ans -- thanks for the info. I'm going to assume that the EOPA was intended to be an em dash and not some weird old control character
|
I can't find the link now, but a few of the sites discussed the technical encoding issues between:
- Windows-1252
- ISO-8859-1
- Unicode
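While they're mostly the same, the obscure control code points (bytes 0x80-0x9F, where Windows-1252 puts printable characters and ISO-8859-1/Unicode put control codes) just so happen to be where many of the differences lie. So when you make (wrong) assumptions about encoding: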
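EN DASH (original, byte 0x96 in Windows-1252) -> U+0096 "Start of Protected Area" / "Start of Guarded Area" (Unicode)
EM DASH (original, byte 0x97 in Windows-1252) -> U+0097 "End of Protected Area" / "End of Guarded Area" (Unicode)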
Programs botch encoding along the way!
https://stackoverflow.com/questions/...h-151-and-8212
https://stackoverflow.com/questions/...nicode-in-java
https://unix.stackexchange.com/quest...cter-in-a-file
https://stackoverflow.com/questions/...rea-characters
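To see the botched assumption in action, here is a minimal Python sketch. The helper `code_points` is just something I made up for display; the point is that the same two bytes come out as dashes or as control characters depending on which encoding the program assumes.

```python
def code_points(text):
    """Show each character as U+XXXX so invisible ones can be seen."""
    return [f"U+{ord(ch):04X}" for ch in text]

# The bytes a Windows-1252 program writes for an en dash and an em dash.
raw = b"\x96\x97"

right = raw.decode("cp1252")   # correct assumption: Windows-1252
wrong = raw.decode("latin-1")  # wrong assumption: ISO-8859-1

print(code_points(right))  # ['U+2013', 'U+2014']
print(code_points(wrong))  # ['U+0096', 'U+0097']
```

Nothing errors out in the second case, which is why this kind of damage slips through unnoticed.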
Doesn't help that many browsers/renderers also decide to be helpful and assume you were a dunce... and display the Windows-1252 characters for those code points instead of keeping the control characters invisible (look at the "Browser" column):
https://www.fileformat.info/info/uni...ement/list.htm
So it can easily still LOOK like an EM DASH (U+2014), even though under the surface it's the END OF GUARDED AREA (U+0097).
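If you decide (like phossler did) that the stray control characters were really meant to be Windows-1252 punctuation, one possible cleanup is to map each one back through Windows-1252. This is only a sketch under that assumption, and `repair_c1_controls` is a name I made up. If the controls in your file are genuine, this would be the wrong fix.

```python
def repair_c1_controls(text):
    """Replace U+0080..U+009F with the character Windows-1252 has at that byte."""
    fixed = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                # A few bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) are unassigned
                # in Python's cp1252 codec, so leave those untouched.
                pass
        fixed.append(ch)
    return "".join(fixed)

broken = "wait\u0097what"  # U+0097 hiding where an em dash should be
print(repair_c1_controls(broken) == "wait\u2014what")  # True
```

Run it on a copy first, then check the character report again to make sure the U+0096/U+0097 entries are gone.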