10-18-2003, 04:44 PM | #1 |
Junior Member
Posts: 4
Karma: 10
Join Date: Oct 2003
Device: Palm IIIx
|
iSiloXC misconverting character entities
With both iSiloXC 3.x and 4.0, certain character entities are misconverted. A sample ixl file follows with one such offending document. Look at the first paragraph. With iSilo 4.0, I see text that reads "had neuroscience?indeed, all the life sciences?in a chokehold". If you view it in a browser, you will see that those question marks are actually em dashes.
The converter for iSilo 2.x converts the document just fine.

<?xml version="1.0"?>
<iSiloXDocumentList>
  <iSiloXDocument>
    <Source>
      <Sources>
        <Path>http://www.firstthings.com/ftissues/ft0305/reviews/dembski.html</Path>
      </Sources>
    </Source>
    <Destination>
      <Title>Book review</Title>
      <Files>
        <Path>/tmp/foo.pdb</Path>
      </Files>
    </Destination>
    <LinkOptions>
      <MaximumDepth value="0"/>
    </LinkOptions>
    <ImageOptions>
      <AltText value="exclude"/>
      <Images value="exclude"/>
    </ImageOptions>
    <TableOptions>
      <IgnoreTables value="yes"/>
    </TableOptions>
  </iSiloXDocument>
</iSiloXDocumentList>
|
10-19-2003, 07:35 AM | #2 |
Fully Converged
Posts: 18,163
Karma: 14021202
Join Date: Oct 2002
Location: Switzerland
Device: Too many to count here.
|
The problem here is that the site in question does not specify its text encoding and yet uses non-standard characters (the em dash).
To fix this, go to your channel settings -> tab Document -> click on Text Encoding Options -> and choose "Western European (ISO-8859-1)" in the second field. You can also make this default for all sites that do not specify their encoding. To do so, go to iSiloX -> Document -> Default Properties. |
10-19-2003, 04:25 PM | #3 |
Junior Member
Posts: 4
Karma: 10
Join Date: Oct 2003
Device: Palm IIIx
|
Thanks for your suggestion, but I don't think that is it. When the charset is not specified, iSilo will let you specify it on the PDA itself. But no encoding option allows the correct display of these characters.
The page I pointed to is almost certainly Windows-1252. At least, other pages on that site specify that encoding. But I have tried specifying that encoding, both in iSilo and in the ixl file for iSiloXC, to no avail. Also, there is no ISO-8859-1 encoding offered, either for iSilo or for iSiloXC. So there does not seem to be a way for iSilo 3.x/4.0 to display these character entities, at least with their "native" formats. Again, the old iSilo 2.x converter converts these characters fine. |
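For readers following along, here is a minimal Python sketch of why the encoding choice matters for this particular character. It rests on one fact and one assumption: byte 0x97 is the em dash in Windows-1252 but a non-printable control character in ISO-8859-1 (fact), and a viewer that cannot display a character substitutes "?" (assumption; the `show` helper is hypothetical and is not how iSilo itself works).

```python
# The sentence from the first post, with the em dashes stored as the
# single byte 0x97, as a Windows-1252 page would store them.
raw = b"had neuroscience\x97indeed, all the life sciences\x97in a chokehold"

as_cp1252 = raw.decode("cp1252")   # 0x97 -> U+2014 (em dash)
as_latin1 = raw.decode("latin-1")  # 0x97 -> U+0097 (C1 control character)


def show(text: str) -> str:
    """Hypothetical viewer: replace anything non-printable with '?'."""
    return "".join(ch if ch.isprintable() else "?" for ch in text)


print(show(as_cp1252))  # dashes survive
print(show(as_latin1))  # dashes come out as question marks
```

Decoded as Windows-1252, the dashes are kept; decoded as ISO-8859-1, the same bytes produce the "neuroscience?indeed" text described in the first post.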
10-20-2003, 02:56 AM | #4 |
Fully Converged
Posts: 18,163
Karma: 14021202
Join Date: Oct 2002
Location: Switzerland
Device: Too many to count here.
|
I converted your page,
http://www.firstthings.com/ftissues/...s/dembski.html with the correct text encoding and it looks just fine in iSilo. I have attached both the converted database and its corresponding ixl. |
10-20-2003, 03:21 AM | #5 |
Developer
Posts: 33
Karma: 2314
Join Date: Oct 2002
Location: US
|
Specifying the character set in iSilo (as opposed to specifying it in iSiloX during conversion) is really only useful for the case where the text of the document is raw such as plain text files or iSilo 2.x/3.x documents that definitely do not have any embedded encoding information (but even then the results might not be perfect in some instances).
For documents generated by iSiloX/C 4.0, if the source content itself does not specify its encoding, you really do need to specify what the source encoding is during conversion if you want predictable results. Follow what Alexander suggests. It does work. |
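The rule described here (use the encoding the page declares, otherwise rely on one supplied at conversion time) can be sketched in Python. This is only an illustration of the idea, not iSiloX's actual logic: `decode_page`, its `source_charset` parameter, and the simple `charset=` sniffing are all assumptions made for the example.

```python
import re
from typing import Optional


def decode_page(raw: bytes, source_charset: Optional[str] = None) -> str:
    """Decode HTML bytes, preferring a charset declared in the page itself."""
    # Naive sniffing: look for charset=... near the top of the document.
    match = re.search(rb'charset=["\']?([A-Za-z0-9_\-]+)', raw[:2048])
    if match:
        return raw.decode(match.group(1).decode("ascii"))
    # Nothing declared: the caller has to say what the input encoding is.
    if source_charset is None:
        raise ValueError("page declares no charset; supply source_charset")
    return raw.decode(source_charset)


# A page with no declared charset, converted with an explicit input encoding.
page = b"<html><body>sciences\x97in a chokehold</body></html>"
print(decode_page(page, source_charset="cp1252"))
```

The point of raising an error instead of guessing is the same one made above: without a declared or supplied input encoding, the result is not predictable.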
10-25-2003, 09:46 PM | #6 |
Junior Member
Posts: 4
Karma: 10
Join Date: Oct 2003
Device: Palm IIIx
|
Thanks, you're both right
Thanks to both of you for replying. You're right: the CharSet setting did the trick. My mistake was to use it only under DocumentOptions, when I needed to use it under Sources to specify the input encoding.
|