
Go Back   MobileRead Forums > E-Book Software > Calibre

Old 07-25-2009, 07:06 AM   #1
mjmcleod
Connoisseur
Posts: 55
Karma: 11501
Join Date: Jul 2009
Location: Australia
Device: Galaxy Tab
Converting eReader books

I have a collection of eReader books that I've built up over quite a few years. Having "liberated" one using ereader2html.py I'm trying to convert it to EPUB.

What I'm finding is that the resulting output has lots of odd characters, particularly next to things like quote marks and emdashes, but also next to letters that are supposed to be accented.

Searching around I found the suggestion to make sure that "cp1252" is specified in the "source encoding" field when doing a conversion. I haven't found this to work.

The one thing I have found that works reliably is to use Mobipocket Creator to convert the book from HTML to MOBI, and then Calibre has no problem converting the result to EPUB. Creator is correctly identifying the source as being in CP1252 and takes care of it. But this is not exactly an automated process.

Is there something else I'm missing? Some other step I could be taking to fix this? I've tried both the previous (0.5.something?) and current (0.6.0) versions of Calibre.
Old 07-25-2009, 07:46 AM   #2
user_none
Sigil & calibre developer
Posts: 2,433
Karma: 950001
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
Open a bug at http://calibre.kovidgoyal.net and either attach the file you're having trouble with or PM/email it to me so I can figure out why it's not converting correctly.
Old 07-25-2009, 09:06 AM   #3
mjmcleod
Righto, have done, ticket is #2923.
Old 07-25-2009, 01:20 PM   #4
Jane A
Addict
Posts: 211
Karma: 39127
Join Date: Jan 2009
Location: SoCal
Device: PocketPro/NookSTG
That's interesting. The files I've liberated using ereader2html and then tried to convert to LIT also have odd characters.
Old 07-25-2009, 02:00 PM   #5
user_none
The file is encoded in cp1252. The original PML is in cp1252 and eReader2html preserves this. The problem is detecting the encoding. For printable characters, cp1252 is a superset of the Latin-1 encoding, and the ASCII range of both is byte-for-byte identical to utf-8. calibre internally uses utf-8. When detecting file encoding, only the first few bytes of the file are tested. The first few bytes of a book converted with eReader2html will be <html>, which is plain ASCII and therefore valid in Latin-1 and utf-8 alike, so the document is assumed to be encoded in utf-8.
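To make the detection problem concrete, here is a minimal Python sketch. The sample string is made up for illustration and is not taken from a real book; it only assumes that the file contains cp1252 curly quotes, an accented letter, and an em-dash after an ASCII <html> prefix.

```python
# Illustrative sample, not taken from a real book.
sample = '<html><body><p>\u201cCaf\u00e9\u201d \u2014 fin</p></body></html>'
raw = sample.encode('cp1252')  # bytes as eReader2html is described as writing them

# The first six bytes are plain ASCII, so they decode identically either way.
head = raw[:6]
same_prefix = head.decode('utf-8') == head.decode('cp1252')

# A strict utf-8 decode of the whole file fails at the first curly quote (0x93).
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Ignoring the errors silently drops the quotes, the accent, and the dash.
lossy = raw.decode('utf-8', errors='ignore')

# Decoding with the right codec round-trips cleanly.
correct = raw.decode('cp1252')

# The reverse mistake (utf-8 bytes read as cp1252) gives the "odd characters".
mojibake = '\u2014'.encode('utf-8').decode('cp1252')
```

The two failure modes line up with what people report in this thread: punctuation that vanishes when the bytes are forced through a utf-8 decode, and extra characters next to quotes and dashes when utf-8 bytes are read as cp1252.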

There is no good or easy way to determine the actual encoding of the file. The two options are to check for any of a number of cp1252-specific characters within the file, or to try decoding with every codepage and see if it succeeds. Both are time consuming and wasteful.
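For illustration, the two options can be combined into a rough heuristic. This is only a sketch and not how calibre detects encodings; the function name guess_encoding and the choice to test just utf-8, cp1252 and Latin-1 are assumptions made for the example.

```python
# Bytes 0x80-0x9F: mostly printable punctuation in cp1252,
# but C1 control codes in Latin-1.
CP1252_ONLY = frozenset(range(0x80, 0xA0))


def guess_encoding(data: bytes) -> str:
    """Return 'utf-8', 'cp1252', or 'latin-1' as a best guess for data."""
    # Option 2 from the post: try a strict decode and see if it succeeds.
    try:
        data.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        pass
    # Option 1 from the post: look for cp1252-specific byte values.
    if any(b in CP1252_ONLY for b in data):
        return 'cp1252'
    return 'latin-1'
```

Note that both checks have to read the whole file, which is exactly the cost described above, and the result is still only a guess.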

One other option would be to modify eReader2html to encode the file as utf-8.
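Short of patching eReader2html itself, the same effect can be had by re-encoding its output afterwards. A minimal sketch, assuming the input really is cp1252; the function name reencode_to_utf8 is hypothetical, and any charset declaration inside the HTML is left untouched.

```python
from pathlib import Path


def reencode_to_utf8(src: Path, dst: Path, source_encoding: str = 'cp1252') -> None:
    """Read src in source_encoding and write the same text to dst as utf-8."""
    text = src.read_bytes().decode(source_encoding)
    dst.write_bytes(text.encode('utf-8'))
```

Working on bytes rather than opening the files in text mode keeps line endings exactly as they were in the original file.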

For the time being you will just have to specify --input-encoding="cp1252" when converting.
Old 07-25-2009, 09:16 PM   #6
Statch
Connoisseur
Posts: 95
Karma: 2084
Join Date: Aug 2008
Location: Georgia, USA
Device: Kindle PW2, Samsung Galaxy 3, Kindle Fire HD
Just a note to say that I am having a similar problem. All the books I've converted using ereader2html either have the funny characters when I use Calibre to convert them to ePub (or Mobi), or are missing pieces of punctuation altogether (em-dashes and quotes), depending on whether I tell it to use cp1252 or not.

Thanks to the previous poster for suggesting using Mobipocket Creator as an interim step. I'll give that a try. Reading the book with either the missing punctuation or the extra characters is a bit distracting.
Old 07-25-2009, 10:06 PM   #7
mjmcleod
I've done a little further poking, and it looks like conversion from the command line (e.g., "ebook-convert book.html book.epub --input-encoding=cp1252") works. It's just doing it from the GUI that doesn't.
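Since the command line works, the conversion can be scripted for a whole folder. The following is a rough sketch that assumes ebook-convert is on the PATH and that every .html file in the folder is cp1252; the helper names build_command and convert_all are made up for the example.

```python
import subprocess
from pathlib import Path


def build_command(html_file: Path, encoding: str = 'cp1252') -> list[str]:
    """Build the ebook-convert argument list for one HTML file."""
    epub_file = html_file.with_suffix('.epub')
    return ['ebook-convert', str(html_file), str(epub_file),
            f'--input-encoding={encoding}']


def convert_all(folder: Path) -> None:
    """Convert every .html file in folder to an .epub next to it."""
    for html_file in sorted(folder.glob('*.html')):
        # check=True stops the batch at the first failed conversion.
        subprocess.run(build_command(html_file), check=True)
```

Calling convert_all(Path('.')) would then run the same command as above once per book in the current folder.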
Old 07-25-2009, 11:05 PM   #8
ficbot
Wizard
Posts: 2,389
Karma: 4115574
Join Date: Sep 2008
Device: Kindle Paperwhite/iOS Kindle App
I used to be able to just run the converter and off I went, but lately the eReader files I have tried have needed much more work. I don't know if the eReader people have changed something or what, but the punctuation is often variable, as people have said, and the text is always centered. I used to run a complex conversion whereby I exported it to epub, saved it as RTF, tidied up the RTF, then saved it as HTML again. But lately that isn't working either: I am getting three lines between paragraphs every time. I have tried editing these out in KompoZer, and sometimes it works and sometimes it does not, because there will be too much extra formatting that has to be taken out manually and I just don't have the time.

I did some playing around with RTF today. I had to set the font to 16 for it to even be readable, it didn't justify and the files were massive. I am at my wit's end here. I have about 15 books I want to convert. Ideas, please? I am on a Mac if that helps.
Old 07-25-2009, 11:21 PM   #9
user_none

@ficbot, how complex are the ebooks in question? If they are run-of-the-mill novels you could convert them to txt, use Markdown syntax to set up your formatting, then convert the txt file to your preferred format. This might or might not be less work than you're currently doing.
Old 07-25-2009, 11:37 PM   #10
FizzyWater
You kids get off my lawn!
Posts: 2,894
Karma: 5613301
Join Date: Aug 2007
Location: Columbus, Ohio
Device: Dell Axim, PRS350/650, Nook Glow, PB Touch Lux 623
Quote:
Originally Posted by mjmcleod
I've done a little further poking, and it looks like conversion from the command line (e.g., "ebook-convert book.html book.epub --input-encoding=cp1252") works. It's just doing it from the GUI that doesn't.
Are you saying that when you change the encoding in the GUI, it doesn't work for you? Or that you don't know how to do it via the GUI? Because I've been using that feature in Calibre (most of my DRM books are eReader) and it works for me.
Old 07-26-2009, 12:16 AM   #11
mjmcleod
Quote:
Originally Posted by FizzyWater
Are you saying that when you change the encoding in the GUI, it doesn't work for you? Or that you don't know how to do it via the GUI? Because I've been using that feature in Calibre (most of my DRM books are eReader) and it works for me.
I'm saying that when I change the encoding in the GUI ("Look & Feel" page of the conversion dialog, "Input character encoding") it doesn't work for me, but if I use the CLI ("ebook-convert ... --input-encoding=cp1252") it does work.

I suspect it may be down to a quirk of the packaging for the OS X version which includes its own python binary. When called from the shell ebook-convert is using the system python (at least in the first instance, there's some fancy footwork going on that I'm not sufficiently familiar with Python to be sure about) while the GUI is using the python binary that ships with Calibre.
Old 07-26-2009, 12:26 AM   #12
FizzyWater
Wow. What a pain. I'm using Windows XP, and I'm beginning to wonder if conflicting versions of Python are causing my problem with installing 0.6...
Old 07-26-2009, 01:03 AM   #13
mjmcleod
I may be completely wrong about Python version mismatches, etc. I'm definitely seeing the behaviour I've described (works from the CLI but not the GUI), but on re-reading ebook-convert I see that it's calling the bundled python interpreter to do the real work.
Old 07-26-2009, 07:38 AM   #14
Statch
I just tried using the command line instead of the GUI for a book that I had previously had trouble converting to ePub or Mobi in Calibre after using ereader2html on it. It worked perfectly.

Using the GUI, I had tried specifying cp1252, which produced funny characters in front of the quotes and em-dashes, and I tried not specifying anything, which removed the quotes and em-dashes altogether. When I did it from the command line, using the syntax the previous poster mentioned, I got perfect output.

Now I have what is probably a stupid question. One of the many things I love about Calibre is being able to have all formats of the book show up in one place. (Meaning, when I click on the title in Calibre, I can see which formats of it I have and open any of them.) When I use the command line to produce the epub format from the html format, and then use the GUI to add the book to the library, it sees it as another title with the same name, rather than as an alternate format of the same title. (Am I making sense?) How can I make that right?
Old 07-26-2009, 07:56 AM   #15
Statch
Just a note to say I did also try converting the ereader2html output to .mobi first using Mobipocket Creator, as mjmcleod suggested, and it also worked perfectly.

I should also specify that I've been having this problem with all books I've used ereader2html on, and they are just standard run-of-the-mill books published by large publishing houses.


All times are GMT -4. The time now is 04:05 AM.


MobileRead.com is a privately owned, operated and funded community.