Old 06-16-2010, 10:38 AM   #1
rheostaticsfan
Connoisseur
Posts: 93
Karma: 591
Join Date: May 2008
Device: kindle, ipad, ipod touch, Blackberry
how to tell the character encoding???

I have problems with files in Calibre from time to time. Sometimes they're prc's that I "convert" to mobi to embed my new metadata. Sometimes they're epub or .pdb files which I convert to mobi.

Often the em dashes and/or the apostrophes and sometimes even the quotation marks are replaced with squares in the converted file.

After doing some digging here I gather that this may be an "input character encoding" problem and I need to put the appropriate encoding type into my preferences.

I cannot understand how I'm supposed to determine what my character encoding is? I tried cp1252 which I gather is common. That didn't help me, so I guess it's a different codec: but I have no idea how to figure out which one.

Can anyone help?
Old 06-16-2010, 10:55 AM   #2
kovidgoyal
creator of calibre
Posts: 24,823
Karma: 4369673
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
The easiest solution is to simply use the transliterate unicode characters option, which will replace these special characters with their plain ASCII equivalents.
Old 06-16-2010, 10:59 AM   #3
rheostaticsfan
I actually tried that...but it simply REMOVED all of the apostrophes in the document.
Old 06-16-2010, 11:04 AM   #4
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by rheostaticsfan View Post
After doing some digging here I gather that this may be an "input character encoding" problem and I need to put the appropriate encoding type into my preferences.

I cannot understand how I'm supposed to determine what my character encoding is? I tried cp1252 which I gather is common. That didn't help me, so I guess it's a different codec: but I have no idea how to figure out which one.

Can anyone help?
Now you see the problem that Calibre has - how does one determine the character encoding if it isn't specified? Calibre has tools to try to do it automatically, but they aren't perfect. You could look at the problem characters, try to figure out what they ought to be, then find an encoding that matches, then see if all the other characters look OK.

Most people don't do it that way, however. They just try reasonable options until one seems to work. Here are the ones I usually try:
cp1252
cp1251
latin1
utf-8
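The trial-and-error approach above can be sketched in a few lines of Python. This is only an illustration, not what Calibre does internally: the function name `preview_decodings` and the sample bytes are made up for the example, and the candidate list is just the four encodings named above.

```python
# Candidate encodings from the list above. UTF-8 goes first because it is
# strict: it rejects most byte sequences that are not really UTF-8.
CANDIDATES = ["utf-8", "cp1252", "cp1251", "latin1"]


def preview_decodings(raw, candidates=CANDIDATES):
    """Return {encoding: decoded text, or None if the bytes are invalid}."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None  # these bytes cannot be this encoding
    return results


# Hypothetical sample: an apostrophe stored as byte 0x92 and an
# em dash stored as byte 0x97, as a cp1252 file would have them.
sample = b"don\x92t stop \x97 ever"
results = preview_decodings(sample)
for name, text in results.items():
    print(f"{name:8} -> {text!r}")
```

An encoding that fails outright can be ruled out. One that decodes to real punctuation (U+2019, U+2014) is a plausible match. One that decodes to invisible control characters such as '\x92', which is what latin1 gives here, is probably wrong even though it raised no error.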
Old 06-16-2010, 11:10 AM   #5
theducks
Grand Sorcerer
Posts: 13,611
Karma: 5126946
Join Date: Aug 2009
Location: The (original) Silicon Valley, USA
Device: Galaxy Tab 2, Astak Pocket Pro, K4NT
Quote:
Originally Posted by rheostaticsfan View Post
I actually tried that...but it simply REMOVED all of the apostrophes in the document.
A similar thing happened to me on XP SP3 when converting a simple TXT file.

Curly quotes (0x93, 0x94) and apostrophes (0x92) were DELETED when converted to EPUB with Transliterate enabled. This is a simple TXT file, so nothing inside it declares the charset that was used.
Old 06-16-2010, 11:12 AM   #6
rheostaticsfan
It seems that the book might have broken at the input stage.

I bought the book then ran it through ereader2html then input the html into calibre. The output of ereader2html looks fine, but when I click V in calibre it shows me a winzip folder (it imported as zip). If I open the book file in there the em dashes and apostrophes are replaced with squares.

At that point any converting I do won't help.

So is there a way to import html without losing emdashes and apostrophes?

For now I found the workaround of opening the html file in mobipocket creator and outputting a prc file.

It's adding extra steps to what is already a fairly arduous process. Is there a way to streamline?
Old 06-16-2010, 11:21 AM   #7
kovidgoyal
http://calibre-ebook.com/user_manual/faq.html#id15
Old 06-16-2010, 02:29 PM   #8
Starson17
Quote:
Originally Posted by theducks View Post
Curly quotes (0x93, 0x94) and apostrophes (0x92) were DELETED when converted to EPUB with Transliterate enabled. This is a simple TXT file, so nothing inside it declares the charset that was used.
I found the same thing going txt->txt. I expected it to convert curly double quotes (0x93, 0x94) to the ordinary double quote (0x22), and curly single quotes/apostrophes (0x91, 0x92) to the ordinary single quote (0x27). Bug?
Old 06-16-2010, 02:40 PM   #9
kovidgoyal
Did you specify the correct encoding for the TXT file in input encoding?
Old 06-16-2010, 02:45 PM   #10
Starson17
Quote:
Originally Posted by kovidgoyal View Post
Did you specify the correct encoding for the TXT file in input encoding?
Nope. I left it blank. In retrospect, it's obvious I needed to specify an encoding. Plus, I tried importing an html file with smart quotes. It kept all the smart quotes in the html, although it did convert the original file so that each byte now has an associated 00 byte.

Edit: Yes, adding the correct encoding CP1252 caused it to convert as expected.
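A rough Python sketch of why the input encoding matters for transliteration. The mapping table `ASCII_EQUIVALENTS` and the function `to_plain_ascii` are made up for illustration, since Calibre's own transliteration table is not shown in this thread. The last lines show that ASCII text saved as UTF-16 has a 00 byte after every character, which would be consistent with the "associated 00 byte" observation, though that is a guess.

```python
# Hypothetical mapping from "smart" punctuation to plain ASCII.
ASCII_EQUIVALENTS = {
    "\u2018": "'",   # left single quote (cp1252 byte 0x91)
    "\u2019": "'",   # right single quote / apostrophe (0x92)
    "\u201c": '"',   # left double quote (0x93)
    "\u201d": '"',   # right double quote (0x94)
    "\u2014": "--",  # em dash (0x97)
}
TABLE = str.maketrans(ASCII_EQUIVALENTS)


def to_plain_ascii(raw, encoding):
    """Decode raw bytes with the given encoding, then replace smart punctuation."""
    return raw.decode(encoding).translate(TABLE)


raw = b"\x93It\x92s fine\x94 \x97 she said"

# Correct input encoding: the bytes become real quotes and dashes,
# so the mapping can replace them.
good = to_plain_ascii(raw, "cp1252")
print(good)

# Wrong input encoding: the same bytes become C1 control characters,
# which the mapping does not know about, so nothing is replaced.
bad = to_plain_ascii(raw, "latin1")
print(repr(bad))

# ASCII text stored as UTF-16 (little endian): a 00 byte follows each character.
utf16 = "It's".encode("utf-16-le")
print(utf16)
```

With the wrong encoding the punctuation never reaches the transliteration step as punctuation, so a tool may drop or mangle it. That would fit the deleted apostrophes reported earlier, but it is an inference, not something confirmed in the thread.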

Last edited by Starson17; 06-16-2010 at 02:55 PM.
Old 06-16-2010, 03:03 PM   #11
Starson17
Quote:
Originally Posted by rheostaticsfan View Post
when I click V in calibre it shows me a winzip folder (it imported as zip).
That's normal - html files are grouped into a zip file to keep them together.

Quote:
If I open the book file in there the em dashes and apostrophes are replaced with squares.

At that point any converting I do won't help.
Possibly, but not certainly. I have lots of programs that show squares whenever they can't decide the encoding. To be sure, you have to use a hex editor and look. Even my hex editor has problems. If I look at the file in hex mode, there is an adjacent window that's supposed to show ASCII. It shows single smart quotes correctly, but the double smart quotes both appear as the right double quote. In text mode, it has even more problems, and this is all inside the same program!

Quote:
So is there a way to import html without losing emdashes and apostrophes?
My test of smart double and single quotes (characters 0x91-0x94) showed correct importing into Calibre. What encoding/hex values for em dashes did you use? I tested 0x97 (em dash) and that worked.
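If a hex editor is not handy, a short Python script can do the same inspection. This is a sketch under assumptions: `high_byte_report` is a made-up helper and the sample bytes are invented. It counts every byte at or above 0x80 and shows a little context around the first occurrence of each.

```python
from collections import Counter


def high_byte_report(raw, context=8):
    """Count non-ASCII bytes and keep a slice of context around each one's first use."""
    counts = Counter(b for b in raw if b >= 0x80)
    first_seen = {}
    for offset, b in enumerate(raw):
        if b >= 0x80 and b not in first_seen:
            start = max(0, offset - context)
            first_seen[b] = raw[start:offset + context + 1]
    return counts, first_seen


# Hypothetical sample with cp1252-style quotes, em dash and apostrophe.
sample = b"He said \x93wait\x94 \x97 then didn\x92t."
counts, first_seen = high_byte_report(sample)
for value, n in sorted(counts.items()):
    print(f"0x{value:02X} x{n}  near {first_seen[value]!r}")
```

For a real file, read it in binary mode first, for example `raw = open("book.html", "rb").read()` (the file name is a placeholder). Lone bytes in the 0x91-0x97 range where quotes and dashes belong point toward cp1252 or a close relative, while three-byte runs such as E2 80 99 point toward UTF-8.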

Last edited by Starson17; 06-16-2010 at 03:09 PM.
Old 06-16-2010, 03:47 PM   #12
theducks
Quote:
Originally Posted by Starson17 View Post
Nope. I left it blank. In retrospect, it's obvious I needed to specify an encoding. Plus, I tried importing an html file with smart quotes. It kept all the smart quotes in the html, although it did convert the original file so that each byte now has an associated 00 byte.

Edit: Yes, adding the correct encoding CP1252 caused it to convert as expected.
Duh! Where do you set the encoding for a TXT file? I looked all over the TXT Input tab of preferences and the individual section on the convert. Spaces, paragraphs... yes. Encoding?
V .7.2
Old 06-16-2010, 04:20 PM   #13
Starson17
Quote:
Originally Posted by theducks View Post
Duh! Where do you set the encoding for a TXT file? I looked all over the TXT Input tab of preferences and the individual section on the convert. Spaces, paragraphs... yes. Encoding?
V .7.2
Look & Feel | Input character encoding
Old 06-16-2010, 04:57 PM   #14
theducks
Quote:
Originally Posted by Starson17 View Post
Look & Feel | Input character encoding
(and again)

Old 06-16-2010, 07:36 PM   #15
rheostaticsfan
Quote:
Originally Posted by kovidgoyal View Post
I saw that. It seems to tell me to input the proper encoding. But that's how I started the thread. How do I know what the proper encoding is?

Or is there something I'm missing???
All times are GMT -4.