Old 01-26-2010, 03:04 PM   #16
Valloric
Created Sigil, FlightCrew
Posts: 1,982
Karma: 350515
Join Date: Feb 2008
Device: Kobo Clara HD
Quote:
Originally Posted by charleski View Post
The simplest is to install Sigil. Just open the epub in Sigil and save it, don't do anything else. Sigil strips out a lot of code that isn't strictly necessary, and will also strip out the faulty encoding parameter.
Actually no. Sigil (or rather the embedded HTML Tidy) fixes most markup errors and also extracts any inline CSS into a style tag, but it won't strip out code, although Tidy can refactor some parts of the markup on rare occasions. It will also pretty-print it, but that's just whitespace.

I put great emphasis on preserving the user's original code.
Old 01-26-2010, 09:10 PM   #17
charleski
Wizard
Posts: 1,196
Karma: 1281258
Join Date: Sep 2009
Device: PRS-505
Quote:
Originally Posted by Valloric View Post
Actually no. Sigil (or rather the embedded HTML Tidy) fixes most markup errors and also extracts any inline CSS into a style tag, but it won't strip out code, although Tidy can refactor some parts of the markup on rare occasions. It will also pretty-print it, but that's just whitespace.

I put great emphasis on preserving the user's original code.
Sorry, but while Sigil is a very useful program, it (or HTML Tidy) refactors code under certain assumptions, and this can result in some elements being lost. In the vast majority of cases this has no impact, or (as here) results in errors being automatically fixed.

But it has the possibility to introduce errors. I've attached 2 tiny epubs I made to demonstrate this.

The first ('Sigil test Original.epub') was deliberately created using ANSI encoding in Notepad++ and (correctly) specifies ISO-8859-1 encoding in the XML declaration. The accented 'é' appears correctly in both ADE and Calibre's epub reader.

The second ('Sigil test opened in Sigil.epub') is the same file, simply opened in Sigil and immediately saved without any editing. The 'é' has now become a '?' in ADE and Calibre, because Sigil assumed that the encoding was UTF-8, disregarding the encoding specified in the file, and changed the encoding attribute in the declaration.
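
The symptom described above can be reproduced without any epub tool. The following is a minimal Python sketch, assuming the original file stored 'é' as the single ISO-8859-1 byte 0xE9. It only illustrates the general failure mode of reading such bytes as UTF-8 and is not a reconstruction of Sigil's code path; the sample word is made up for the demonstration.

```python
# 'é' saved in ISO-8859-1 ("ANSI" in Notepad++) is the single byte 0xE9.
raw = "caf\u00e9".encode("iso-8859-1")

# Decoding with the encoding the file declares recovers the text.
as_latin1 = raw.decode("iso-8859-1")

# Treating the same bytes as UTF-8 fails: 0xE9 opens a three-byte sequence
# that never completes, so a lenient decoder substitutes U+FFFD, which many
# readers then display as '?'.
as_utf8 = raw.decode("utf-8", errors="replace")

print(repr(as_latin1))
print(repr(as_utf8))
```

The bytes on disk are the same in both cases; only the assumption about their encoding differs.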

I don't know what you'd call this, but I'd say that was a significant change in the code. I don't think it's something that necessarily needs to be fixed, and as I said before, this behaviour can be used to fix sloppy mistakes without the user needing to know much about what they're doing. But it is something that Sigil users need to be aware of - Sigil rigidly assumes that all the text it processes is UTF-8, and any edits need to be made with that in mind. For Western languages this isn't an issue, and in fact the use of UTF-8 should be encouraged - there's no reason for people to be using ancient ANSI encoding in epubs. But it might be a problem for those who need to use UTF-16.

Sigil also strips out metadata elements in the body text xhtml that are irrelevant. Again, not a big problem for most users, though if you have a workflow that uses custom metadata fields it's something you really need to know about. If you look at the html inside the two epubs you'll see that's happened here, the custom metadata has been stripped.
Attached Files
File Type: epub Sigil test Original Ansi.epub (1.9 KB, 275 views)
File Type: epub Sigil test opened in sigil.epub (1.9 KB, 265 views)
Old 01-27-2010, 08:13 AM   #18
Valloric
Created Sigil, FlightCrew
Posts: 1,982
Karma: 350515
Join Date: Feb 2008
Device: Kobo Clara HD
Quote:
Originally Posted by charleski View Post
The first ('Sigil test Original.epub') was deliberately created using ANSI encoding in Notepad++ and (correctly) specifies ISO-8859-1 encoding in the XML declaration. The accented 'é' appears correctly in both ADE and Calibre's epub reader.

The second ('Sigil test opened in Sigil.epub') is the same file, simply opened in Sigil and immediately saved without any editing. The 'é' has now become a '?' in ADE and Calibre, because Sigil assumed that the encoding was UTF-8, disregarding the encoding specified in the file, and changed the encoding attribute in the declaration.
Sigil doesn't just "change" the encoding attribute and then pretend everything will work right. Come on. It tries to recognize the original encoding of the file and convert the text to UTF-8. After some recent changes, it's actually become pretty successful at this.

Quote:
Originally Posted by charleski View Post
I don't know what you'd call this, but I'd say that was a significant change in the code. I don't think it's something that necessarily needs to be fixed, and as I said before, this behaviour can be used to fix sloppy mistakes without the user needing to know much about what they're doing. But it is something that Sigil users need to be aware of - Sigil rigidly assumes that all the text it processes is UTF-8, and any edits need to be made with that in mind.
That's patently false. Read this blog post.

But bugs are always possible. I'll check what's going on with this file and report back.

In general, you should report any problems on the tracker so they get scheduled and fixed. You're not doing anyone any favors (least of all yourself) by not reporting bugs. I've heard people feel like reporting bugs or missing features is "undue criticism": couldn't be farther from the truth. The more bug reports, the better.

Quote:
Originally Posted by charleski View Post
Sigil also strips out metadata elements in the body text xhtml that are irrelevant. Again, not a big problem for most users, though if you have a workflow that uses custom metadata fields it's something you really need to know about. If you look at the html inside the two epubs you'll see that's happened here, the custom metadata has been stripped.
Actually this is something that's being worked on. Sigil should preserve your custom metadata, I completely agree. See this thread for the metadata discussion.
Old 01-27-2010, 08:31 AM   #19
Valloric
Created Sigil, FlightCrew
Posts: 1,982
Karma: 350515
Join Date: Feb 2008
Device: Kobo Clara HD
Quote:
Originally Posted by charleski View Post
The first ('Sigil test Original.epub') was deliberately created using ANSI encoding in Notepad++ and (correctly) specifies ISO-8859-1 encoding in the XML declaration. The accented 'é' appears correctly in both ADE and Calibre's epub reader.

The second ('Sigil test opened in Sigil.epub') is the same file, simply opened in Sigil and immediately saved without any editing. The 'é' has now become a '?' in ADE and Calibre, because Sigil assumed that the encoding was UTF-8, disregarding the encoding specified in the file, and changed the encoding attribute in the declaration.
And there's your problem.

Try loading the XHTML from the first epub file directly. The accented "e" is preserved, since the encoding is correctly detected and the file is converted to UTF-8, just like I said.

If you load the epub file, the accented "e" becomes a question mark. Why?

Because what you have is an ISO-8859-1 encoded XHTML file inside an epub file, and that's against the epub specification. The only encodings the specification allows for XHTML files are UTF-8 and UTF-16. You are not allowed to use anything else (like ISO-8859-1).

Quote:
Originally Posted by OPS Specification
1.4.1.2: XHTML Content Document Requirements

A conformant XHTML Content Document must meet these conditions:
  1. it is a well-formed XML document (as defined by XML 1.1); and
  2. it is encoded in UTF-8 or UTF-16; and
  3. it is a valid XML document according to the NVDL schema interaction provided in Appendix A; and
  4. it has a MIME media type of either application/xhtml+xml or text/x-oeb1-document (deprecated); and
  5. all XHTML elements and attributes not contained in an Inline XML Island are drawn from the XHTML subset identified in this document.
So your file is bad. Sigil is doing the correct thing by assuming the XHTML files in the epub will be either UTF-8 or UTF-16.
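
Condition 2 of the quoted requirements can be checked mechanically. Below is a rough Python sketch, assuming the epub is available as bytes and that content documents can be recognised by their file extension. The function names `declared_encoding` and `nonconformant_files` are made up for this illustration, and the check only looks at the encoding a file claims (byte order mark or XML declaration), not at whether the bytes really are in that encoding. It is not how Sigil or epubcheck do it.

```python
import io
import re
import zipfile

# Encodings the OPS specification allows for XHTML content documents.
ALLOWED = {"utf-8", "utf-16"}
XML_DECL = re.compile(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')

def declared_encoding(data: bytes) -> str:
    """Return the encoding a file claims, defaulting to 'utf-8' if it claims none."""
    # A UTF-16 file starts with a byte order mark; its declaration is not ASCII.
    if data[:2] in (b"\xff\xfe", b"\xfe\xff"):
        return "utf-16"
    match = XML_DECL.search(data[:200])
    return match.group(1).decode("ascii").lower() if match else "utf-8"

def nonconformant_files(epub_bytes: bytes) -> dict:
    """Map each (X)HTML file with a disallowed declared encoding to that encoding."""
    bad = {}
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                enc = declared_encoding(archive.read(name))
                if enc not in ALLOWED:
                    bad[name] = enc
    return bad
```

Run against an epub like the first test file, this would be expected to report the ISO-8859-1 document, since that is the encoding its declaration names.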

But I'm going to change that. I'm going to perform the same encoding detection analysis on XHTML files in the epubs as I do when an (X)HTML file is loaded directly. Why? Because someone not familiar with the epub spec will do the same thing you did and expect everything to work. Sigil should be able to detect this error and correct it, as it can for markup.

And it will, next version onwards.
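
To give an idea of what "detect the encoding and convert to UTF-8" can look like, here is a simplified Python sketch. It is an assumption-laden illustration rather than Sigil's actual algorithm: it trusts a byte order mark first, then the XML declaration, and falls back to ISO-8859-1 when decoding fails. The function name `to_utf8` is invented for the example.

```python
import codecs
import re

XML_DECL = re.compile(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')

def to_utf8(data: bytes) -> bytes:
    """Decode XHTML bytes with a best-guess encoding and re-encode as UTF-8."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        text = data.decode("utf-16")  # the BOM selects the byte order
    elif data.startswith(codecs.BOM_UTF8):
        text = data.decode("utf-8-sig")
    else:
        match = XML_DECL.search(data[:200])
        declared = match.group(1).decode("ascii") if match else "utf-8"
        try:
            text = data.decode(declared)
        except (LookupError, UnicodeDecodeError):
            # Last resort: ISO-8859-1 maps every byte to a character,
            # so this decode cannot fail (though it may guess wrong).
            text = data.decode("iso-8859-1")
    # Rewrite the declaration, if present, so it matches the new encoding.
    text = re.sub(r'(<\?xml[^>]*encoding=["\'])[^"\']+', r"\g<1>UTF-8", text, count=1)
    return text.encode("utf-8")
```

The important part is the order of operations: the bytes are decoded with the detected encoding before the declaration is rewritten, so the accented characters survive the conversion.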

EDIT: This is now in trunk.

Last edited by Valloric; 01-27-2010 at 09:28 AM.
Old 01-27-2010, 08:43 PM   #20
charleski
Wizard
Posts: 1,196
Karma: 1281258
Join Date: Sep 2009
Device: PRS-505
Just calm down Valloric. This is getting ridiculous. As far as I'm concerned, this behaviour is a feature, and I presented it as such.

If you want to fix the issue with stripping metadata, fine, but I'd regard that as a very low priority. And yes, you're right about the spec, the file passed epubcheck 1.0.4, but that version contains a bug that doesn't spot encoding errors (which has been fixed, but not released, doh).

I just did a quick test using an epub with some xhtml encoded in (and specified as) UTF-16 (displayed fine in ADE) and Sigil just stripped all the text and output a UTF-8 file. So yeah, that probably needs to be fixed, and if I'd confirmed the UTF-16 problem before now I'd certainly have listed it on your issues tracker. In the circumstances of the issue which this thread concerns, Sigil's assumption that the contents of an ePub are UTF-8 is a useful attribute that makes it easy to fix an apparently common problem.

Both my points were correct, but you've chosen to reply in an overly aggressive manner. I know how easy it is to get worked up over personal projects, but frankly I didn't expect to get attacked for recommending Sigil as a tool to solve people's problems. I regard this thread as closed, I won't be looking at it again.
Old 01-28-2010, 07:14 AM   #21
Valloric
Created Sigil, FlightCrew
Posts: 1,982
Karma: 350515
Join Date: Feb 2008
Device: Kobo Clara HD
Quote:
Originally Posted by charleski View Post
Just calm down Valloric. This is getting ridiculous.

...

you've chosen to reply in an overly aggressive manner. I know how easy it is to get worked up over personal projects, but frankly I didn't expect to get attacked for recommending Sigil as a tool to solve people's problems. I regard this thread as closed, I won't be looking at it again.
Umm... aggressive? I certainly didn't mean to be. I'm perfectly calm. If I came across as aggressive or argumentative, you have my heartfelt apologies, honestly.

But I've reread my responses now a few times and I still can't see any "aggression".

Quote:
Originally Posted by charleski View Post
Both my points were correct
I can't agree here.

Quote:
Originally Posted by charleski View Post
I just did a quick test using an epub with some xhtml encoded in (and specified as) UTF-16 (displayed fine in ADE) and Sigil just stripped all the text and output a UTF-8 file. So yeah, that probably needs to be fixed, and if I'd confirmed the UTF-16 problem before now I'd certainly have listed it on your issues tracker. In the circumstances of the issue which this thread concerns, Sigil's assumption that the contents of an ePub are UTF-8 is a useful attribute that makes it easy to fix an apparently common problem.
Well, create an issue on the tracker for that (and attach the epub). I'm deeply interested; UTF-16 should be working just fine. I'm using a Qt function for loading Unicode encoded text files in the current version, and this loads all UTF variants without problems... at least the last time I tested it.
Old 02-01-2010, 07:01 AM   #22
mag1
Junior Member
Posts: 3
Karma: 10
Join Date: Jan 2010
Location: Hampshire, United Kingdom
Device: Sony Reader PRS505
Hi guys,

Thank you for all your comments and assistance. It is very much appreciated. However these replies just go to illustrate, in my mind at least, that we still have a long way to go before e-books are a reliable medium suitable for anyone. If I've understood the problem correctly the issue seems to be with the encoding, so presumably no matter where I buy the book from it will have the same encoding errors which stem from the master copy put out by the publisher?

I would imagine that most people who buy e-books just want to read them. The fact that it becomes necessary to dismantle the files, download and install various other programs to overcome the DRM and correct the errors, and then reassemble the files seems vastly overcomplicated. The publisher should simply acknowledge their errors and supply the customer with a correctly formatted file. Waterstones, the online bookstore I bought the novel from, were hopeless when it came to providing technical support. In fact, after several weeks of waiting for a reply, all they did was reimburse my money without a single word of explanation. So I still have no idea if it would be safe to download again, either from them or from another website. Contrast this with how easy it is to exchange a book in a real shop, which is done in minutes.

Macmillan clearly have a quality control problem if they can release books where every page has encoding errors. I think my next step will be to write to Macmillan, explain the situation and see what their response will be (if any). I will then return to this thread and update it with their reply.

Kind regards

Mark