
View Full Version : Extended characters


jbenny
10-09-2007, 11:31 PM
At the request of JSWolf, I am posting a new thread concerning the use of extended ASCII characters like curly quotes, em-dashes, apostrophes, etc.

I took "An Intimate Study of Sherlock Holmes", recently posted by RWood and tried to open it in FBReader. Like a lot of programs, FBReader didn't display the curly quotes and em-dashes correctly. Thinking that this was due to the use of the extended ASCII characters, instead of the equivalent HTML tags, I used Amber Palm Converter to get some HTML to experiment with. Although my experiment worked, it appeared that the Amber software makes substantial changes to the HTML that it creates.

Below, I have attached a Zip file that contains an HTML file that hopefully is closer to the original. I used a program called MakeDoc to extract this file from the posted PRC. I took this HTML file and replaced all curly quotes, apostrophes and em-dashes with the HTML tags. This second file (also in the Zip) displayed correctly in FBReader.

I don't mean to pick on just FBReader. I have also seen other programs not display extended ASCII characters correctly. I would think that using the HTML tags for these characters should display correctly, in all cases. As I mentioned in the other thread, I think this problem is due to the different interpretation of these characters, depending on the language and code page used (or the improper interpretation by the software - I don't know which).
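
To illustrate the code-page point, here is a minimal Python sketch showing how a single byte reads differently depending on the encoding the software assumes. The byte values are the Windows-1252 ones; the helper name `decode_or_flag` is made up for this example, and nothing here is taken from FBReader itself.

```python
# One byte, several readings: the result depends on the code page assumed.
raw_quote = b"\x93"  # left double curly quote in Windows-1252
raw_grave = b"\xe0"  # "a" with grave accent in Windows-1252


def decode_or_flag(data: bytes, encoding: str) -> str:
    """Decode bytes, returning a marker instead of raising on invalid input."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return "<invalid>"


for encoding in ("cp1252", "latin-1", "cp1251", "utf-8"):
    quote = decode_or_flag(raw_quote, encoding)
    grave = decode_or_flag(raw_grave, encoding)
    print(f"{encoding:8} quote={quote!r} grave={grave!r}")
```

In ISO-8859-1 (latin-1) the byte 0x93 is an invisible control code rather than a quote, and in the Cyrillic code page cp1251 the byte 0xE0 is a Cyrillic letter, so the same file can look fine in one reader and wrong in another.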

jbenny
10-09-2007, 11:36 PM
Although I mentioned this in the other post, I thought I would duplicate this here.

Single, left and right curly quotes are:
‘
’

Double, left and right curly quotes are:
“
”

The em-dash is:
—

Other extended ASCII characters and some foreign characters also have equivalent HTML tags.
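
The replacement itself can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used for the attached files: it assumes numeric character references (the named forms such as `&ldquo;` would work the same way in HTML), and the function names are hypothetical.

```python
# Map each typographic character to an HTML numeric character reference.
ENTITY_FOR_CHAR = {
    "\u2018": "&#8216;",  # left single curly quote
    "\u2019": "&#8217;",  # right single curly quote / apostrophe
    "\u201c": "&#8220;",  # left double curly quote
    "\u201d": "&#8221;",  # right double curly quote
    "\u2014": "&#8212;",  # em-dash
}
_TABLE = str.maketrans(ENTITY_FOR_CHAR)


def quotes_to_entities(text: str) -> str:
    """Replace only the five characters listed above."""
    return text.translate(_TABLE)


def all_non_ascii_to_entities(text: str) -> str:
    """Replace every non-ASCII character with a numeric reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = "\u201cElementary,\u201d he said \u2014 voil\u00e0."
print(quotes_to_entities(sample))
print(all_non_ascii_to_entities(sample))
```

The second function covers the other extended and foreign characters too, at the cost of making the HTML source harder to read by eye.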

JSWolf
10-09-2007, 11:38 PM
If I do a simple PRC file then it works in FBReader. I didn't clean this up at all. Just loaded it and made it.

This is the simple-PRC made in BD. The Mobipocket PRC still had the same problems. So I think it's not the actual characters, but whatever format is being output by BD that is not fully compatible with FBReader.

jbenny
10-09-2007, 11:43 PM
The file you posted did not display correctly for me in FBReader. Perhaps BD is substituting the actual character for the HTML tag. Maybe there is an option to disable this?

jbenny
10-09-2007, 11:49 PM
I just used MakeDoc to extract the HTML from that last PRC you posted. It did not contain the HTML tags that I used. As far as I know, MakeDoc doesn't change anything, it just extracts the files. So, it looks to me like BD is doing some character substitution.

jbenny
10-10-2007, 12:18 AM
OK, latest test. I took the Study2.html file that has the HTML tags that I added and used MakeDoc to create a Palm ebook. This opened and displayed correctly in FBReader.

Considering that a MobiPocket ebook is basically HTML, wrapped in a Palm PRC file, I don't know how different what MakeDoc created is from a real MobiPocket ebook. In any event, this seems to show that at least MakeDoc doesn't mess with the HTML tags for extended ASCII characters, like BD does.

jbenny
10-10-2007, 12:30 AM
JSWolf, one other thing we could try is for you to again create a MobiPocket ebook in BD, using that second file I provided (with the HTML tags). Only this time, disable the use of compression when creating the MobiPocket file (if you can). The resulting file can then be looked at with a hex editor (I have one if you don't) to see whether BD did indeed substitute characters on us, as I believe is happening.
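
For anyone without a hex editor, the same check could be sketched in Python. This assumes the file is uncompressed and uses a single-byte Windows-1252 encoding (in UTF-8 the same byte values can appear inside multi-byte sequences, so the counts would be misleading). The function name `survey` and the output layout are made up for this example.

```python
import re

# Windows-1252 byte values for the characters discussed in this thread.
RAW_CP1252 = {
    0x91: "left single quote",
    0x92: "right single quote",
    0x93: "left double quote",
    0x94: "right double quote",
    0x97: "em-dash",
}
# Matches numeric (&#8220; or &#x201C;) and named (&ldquo;) references.
ENTITY_PATTERN = re.compile(rb"&(?:#\d+|#x[0-9A-Fa-f]+|[A-Za-z]+);")


def survey(data: bytes) -> dict:
    """Count raw extended bytes and HTML character references in the data."""
    raw = {name: data.count(bytes([value])) for value, name in RAW_CP1252.items()}
    entities = len(ENTITY_PATTERN.findall(data))
    return {"raw": raw, "entities": entities}


# Usage: pass the bytes of the uncompressed file, e.g. open(path, "rb").read()
print(survey(b"&#8220;Hello\x94 &mdash; world"))
```

If the source HTML contained only references but the output shows raw bytes and no references, the converter substituted the characters.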

jbenny
10-10-2007, 12:49 AM
Never mind. Deleted.

HarryT
10-10-2007, 01:47 AM
I've moved this thread to "Upload Help" so it doesn't show up in the book index.

jbenny
10-10-2007, 02:49 AM
Harry, thanks for moving this.

Not that it matters for the current problem, but I see from the file RWood originally posted that BD uses a code of "TEXtREAd", which corresponds to "Palm DOC" and not "MobiPocket". This is according to the MobileRead Wiki: http://wiki.mobileread.com/wiki/PDB. I don't know whether it matters to reader software when displaying the ebook.
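
For reference, that code can be read straight from the file header. The sketch below assumes the standard Palm database layout, where the 4-byte type sits at offset 60 and the 4-byte creator at offset 64. The function name and the small lookup table are only for illustration; "BOOKMOBI" is the pair normally used for MobiPocket.

```python
import struct

# Type+creator pairs relevant to this thread.
KNOWN_CODES = {
    "TEXtREAd": "Palm DOC",
    "BOOKMOBI": "MobiPocket",
}


def pdb_type_creator(header: bytes) -> str:
    """Return the 8-character type+creator code from a PDB/PRC header."""
    if len(header) < 68:
        raise ValueError("need at least the first 68 bytes of the file")
    type_code, creator = struct.unpack_from(">4s4s", header, 60)
    return (type_code + creator).decode("ascii", errors="replace")


# Usage with a real file: header = open(path, "rb").read(78)
fake_header = b"\x00" * 60 + b"TEXtREAd" + b"\x00" * 10
code = pdb_type_creator(fake_header)
print(code, "->", KNOWN_CODES.get(code, "unknown"))
```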

HarryT
10-10-2007, 08:26 AM
FWIW, the curly quotes of the original PRC show up fine in MobiPocket Reader on both the PC and Pocket PC versions. I'll try it on my iLiad when I get home from work.

JSWolf
10-10-2007, 08:36 AM
The problem is that when reading the PRC in FBReader, you get garbage instead of some of the proper characters. If we use BD to create a simple-PRC instead of a Mobipocket PRC, FBReader displays these fine. Do you know the difference between Mobipocket and simple-PRC for most books made with BD?

HarryT
10-10-2007, 08:49 AM
Simple PRC is basic PalmDoc. Plain text - no hyperlinks, no styles (bold, italic, centering, etc); no anything, basically :). Not something you want to use.

JSWolf
10-10-2007, 09:36 AM
I just noticed that as I was fiddling with it. What formats can FBReader read that would keep styles and curly quotes?

jbenny
10-10-2007, 04:19 PM
FBReader can read quite a few formats. However, I don't think that is the problem. Besides, asking people to create yet another format for submissions here seems counterproductive. Like I said, I think that the problem can be easily solved (in any format) by using the HTML tags, instead of the extended characters. I have seen this type of problem in other software before.

Attached is an ebook. I used the HTML that I extracted from the original ebook. I replaced the curly-quotes, curly apostrophes and em-dashes with HTML tags. I then used MakeDoc to create a PRC (without the images). This displays correctly in FBReader.

From looking at the PRC that BD created, both BD and MakeDoc seem to be wrapping the HTML in a PRC file, with compression. My understanding is that this is essentially what a MobiPocket ebook is. The only real difference that I can see is that I used the HTML tags, so that those characters displayed correctly.

jbenny
10-10-2007, 04:22 PM
Please also try the ebook that I just posted on your various readers. I'm curious to know if using the tags works in all cases (I think they will).

jbenny
10-10-2007, 05:17 PM
OK, I just downloaded and installed BD. BD is converting these HTML tags to the actual character when loading the source file. I looked, but can find no way to tell BD not to do this.

JSWolf
10-10-2007, 05:32 PM
Well, I tried your HTML samples in BD, and the Mobipocket PRC did not display properly in FBReader. The Simple-PRC worked fine. FB2 works fine as well, but the format is kind of like Simple-PRC.

jbenny
10-10-2007, 05:42 PM
Exactly what settings are you using in BD to get a PRC that displays correctly? No matter what I do, I still get little squares. I am trying this on the latest Windows version of FBReader.

jbenny
10-10-2007, 05:47 PM
It seems that FB2 is no good, as you lose the em-dashes and the grave-accent "a" character.

jbenny
10-10-2007, 05:56 PM
I'm fresh out of ideas for now. The only thing that seems to work correctly for me is to extract the HTML file from the original PRC, change the characters to HTML tags, then use MakeDoc to re-create the PRC. Since MakeDoc isn't doing any character conversion like BD, things display fine in FBReader.

JSWolf
10-10-2007, 06:10 PM
I used Simple-PRC which loses the formatting so that's not a good idea.

What I could try is writing out HTML from BD, then using Mobipocket Desktop to make the Mobi file and trying FBReader on it.

jbenny
10-10-2007, 06:27 PM
In addition, you might take the HTML file I posted (with the tags) and run that through MobiPocket Desktop, since we know that BD is doing character substitution. Please post both PRCs, so I can try them, too.

HarryT
10-11-2007, 01:19 AM
I can confirm that the original book displays fine on the iLiad MobiPocket Reader too, so the problem seems to be specific to FBReader.

JSWolf
10-11-2007, 01:50 PM
Here are the results of importing as is into Mobipocket for Windows.

jbenny
10-11-2007, 04:08 PM
OK, this shows that MobiPocket for Windows isn't messing with the source file. That is a good thing.

JSWolf
10-11-2007, 04:14 PM
But what are those aaaaa I see?

jbenny
10-11-2007, 05:49 PM
If you mean at the start of a paragraph, I'm not seeing those in the last two files you posted. The only time I saw them was when I was trying different character-set encodings in the HTML source. What language do you have FBReader set to? Given that it doesn't have a selection for "English", I chose "Other".

JSWolf
10-12-2007, 09:03 AM
It was set to Russian. I didn't know about the language setting.

I set it to Other and then reloaded one of the books and the aaaa is gone now. Thanks.