View Full Version : epub edited - now it's messed up


NASCARaddicted
08-16-2010, 11:10 AM
Hello everybody

I have a strange, new (new for me) problem with an epub file.

Now, before I start, I just want to say: in the past, I edited many xhtml files and turned them into epubs with Calibre, and that was no problem. But this problem is different ...

I've got this epub file from a "friend" ... However, I noticed that there are 2 things that I want to change. In Germany (and Austria), quotation marks look like this: »text«. In Switzerland, however (or at least in the part where they speak German), they use them like this: «text»
Since this is very unusual for me, I wanted to change that. I also noticed that in many places there are 3-4 spaces between two words in a normal sentence. Normally, there should be only 1 space.

So I unpacked the epub file and edited all the html files (about 65) with Notepad++. I changed the quotation marks and replaced every run of multiple spaces with a single space.
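For anyone who would rather script these two edits than do them by hand, here is a minimal sketch in Python. It assumes the goal is to turn Swiss-style «text» into »text« and to collapse runs of spaces; the names `clean_text` and `clean_file` are made up for this example. It reads and writes raw bytes with an explicit UTF-8 codec, so the editor's encoding guess never comes into play and the line endings stay untouched.

```python
import re
from pathlib import Path

# Swap both guillemets in a single pass, so «text» becomes »text«.
SWAP_GUILLEMETS = str.maketrans({"«": "»", "»": "«"})


def clean_text(text: str) -> str:
    """Swap guillemet direction and collapse runs of spaces to one space."""
    text = text.translate(SWAP_GUILLEMETS)
    return re.sub(r" {2,}", " ", text)


def clean_file(path: Path) -> None:
    """Rewrite one file in place, decoding and encoding as UTF-8 explicitly."""
    original = path.read_bytes().decode("utf-8")  # raises if not valid UTF-8
    path.write_bytes(clean_text(original).encode("utf-8"))
```

To run it over an unpacked book, something like `for p in Path("book").rglob("*.html"): clean_file(p)` should do, where `book` is a placeholder folder name. Note that it also collapses spaces inside `<pre>` blocks, which may not be wanted.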

I also edited the css file to add justification. And I noticed that some p rules in the css set a font size. I think font size should be chosen by the reader, so I removed that, too. Then I packed the file again and opened it with Calibre's ebook viewer. Everything looked fine.
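One thing worth ruling out when repacking by hand: the EPUB container format expects the `mimetype` entry to be the first file in the zip and to be stored without compression. Whether that matters for the problem in this thread is unknown. The sketch below shows one way to repack with Python's standard `zipfile` module; the function name `pack_epub` and the folder layout are assumptions for illustration.

```python
import zipfile
from pathlib import Path


def pack_epub(source_dir: Path, epub_path: Path) -> None:
    """Zip an unpacked EPUB folder with `mimetype` first and uncompressed."""
    with zipfile.ZipFile(epub_path, "w") as archive:
        # The container format wants this entry first and stored as-is.
        archive.write(source_dir / "mimetype", "mimetype",
                      compress_type=zipfile.ZIP_STORED)
        for file in sorted(source_dir.rglob("*")):
            relative = file.relative_to(source_dir)
            if file.is_file() and relative != Path("mimetype"):
                archive.write(file, relative.as_posix(),
                              compress_type=zipfile.ZIP_DEFLATED)
```

Write the output file outside `source_dir`, otherwise the half-written epub could be picked up by the directory walk.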

Then I put it on my ebook reader ...

The first chapter starts normally with the headline. On the right side, you see the ADE page number 6 (that fits, because there are some other pages before chapter 1 starts).
But then it becomes strange. After the headline, there is the first line, and then a new ADE page number. And right below that, there is another ADE page number.

I took another look at the unedited epub file, and that one looked all right ...

So, what do you think I did wrong? Could it be the toc file? Or because I removed the double spaces? Or the css stuff?

Thanks for your help.

charleski
08-16-2010, 01:42 PM
I've encountered a very similar problem and found that it was caused by Notepad++ failing to detect the encoding properly. If it edits the file in ANSI mode then it will insert codes causing strange behaviour in ADE. Open the source files again and check to see if Notepad++ is recognising them as utf-8. If not, convert to utf-8 before editing.
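If you want a second opinion that does not depend on the editor's guess, a few lines of Python can tell you whether a file's bytes are valid UTF-8 and whether they start with a BOM. This is only a rough sketch; `describe_encoding` is a made-up helper name, and "not utf-8" just means the bytes are in some other encoding (on Windows, often an ANSI code page).

```python
import codecs


def describe_encoding(data: bytes) -> str:
    """Classify raw file bytes as ASCII, UTF-8 (with or without BOM), or other."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not utf-8"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"
    if data.isascii():
        return "plain ascii (also valid utf-8)"
    return "utf-8 without BOM"
```

Usage would be along the lines of `describe_encoding(open("chapter1.html", "rb").read())`, where the file name is a placeholder.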

NASCARaddicted
08-16-2010, 10:33 PM
Hello Charleski

I just opened one of the html files with Notepad++. The encoding is shown as ANSI as UTF-8 (or UTF-8 without BOM, as it is called in the menu). So this could really be the problem. I think I will convert the first few files to UTF-8 and see if that works (I don't want to convert ALL files before I know that it is worth the work), but not right now, maybe later today.

Thanks so far.

Also I noticed that the "end of the line" is set to Unix. Usually I use Windows. But I am not sure if that is a problem.

Dark123
08-16-2010, 11:01 PM
I have noticed that some characters, such as —, do not get converted properly by Calibre when they appear directly in the original xhtml file. For this reason I need to use the HTML code &mdash; instead; that gets converted by Calibre and works fine.
For you, try using
&raquo; for »
and
&laquo; for «

Edit: I have tested it, and using the HTML codes works. For more HTML codes, go to http://www.ascii.cl/htmlcodes.htm
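If there are many such characters, the replacement can be automated. The sketch below converts every non-ASCII character to a numeric character reference using Python's built-in `xmlcharrefreplace` error handler. It uses numeric references rather than names because named ones like the guillemet entities rely on the DTD, while numeric ones work in any XML file. The function name `to_char_refs` is made up for this example.

```python
def to_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print(to_char_refs("»Hallo«"))  # prints &#187;Hallo&#171;
```

Be aware that this also converts umlauts, which may not be what you want if you prefer to keep those as direct characters.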

charleski
08-17-2010, 06:18 AM
To make sure Notepad++ uses utf-8 correctly, go to Settings->Preferences->New Document tab. Set New Document Encoding to UTF-8 without BOM and check 'Apply to opened ANSI files'.

NASCARaddicted
08-17-2010, 09:52 AM
I always use html codes for special characters like –, » and «, because a normal German keyboard doesn't have these keys. The original document, however, used the special characters directly. The only special characters that I type directly are the German umlauts, because they are on every German keyboard.

@Charleski thanks for your hint. Every day, I learn something new. And when I think how I started: I had absolutely no knowledge about html. In the beginning, I used the normal Notepad that comes with Windows. I still remember, when I changed German umlauts from their html codes to the direct characters, the search and replace in Notepad took like 20 seconds for 4000 replacements. With Notepad++ it takes like 5 seconds ... I love this program.

Dark123
08-17-2010, 11:13 AM
Change it in the .html files inside the ePub; that should work. Load the original, change the characters to their HTML codes, and put the files back. It should then display fine.

NASCARaddicted
08-17-2010, 11:14 AM
just to keep you updated:

I took the first 5 html files and looked at the encoding: they were all ansi as utf8. I converted them to utf8 and repacked the files. Then I put it on my ebook reader, but it still doesn't work.

Maybe I have to convert all the remaining html files.

Dark123
08-17-2010, 08:33 PM
Don't use plain UTF8. You need to convert them to UTF-8 without BOM; it's under Encoding in Notepad++. Try that and see if it helps.
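The same conversion can be sketched outside the editor. The function below decodes with Python's `utf-8-sig` codec, which drops a leading BOM if there is one, and writes back plain UTF-8. The fallback to `cp1252` is an assumption about what "ANSI" means on a Western European Windows system, and `to_utf8_without_bom` is a made-up name.

```python
def to_utf8_without_bom(data: bytes, fallback: str = "cp1252") -> bytes:
    """Return the same text encoded as UTF-8 with no byte order mark."""
    try:
        # "utf-8-sig" accepts UTF-8 with or without a BOM and strips the BOM.
        text = data.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Assumption: anything that is not UTF-8 is in the ANSI code page.
        text = data.decode(fallback)
    return text.encode("utf-8")
```

A file that is already UTF-8 without BOM comes back byte for byte unchanged, so it is safe to run over all files of the book.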

NASCARaddicted
08-18-2010, 10:27 PM
Now, before I become totally confused:

"utf8 without bom" is the same as "ansi as utf8", right?

Because when I open one of the files in Notepad++, the right corner says ansi as utf8, but under Encoding it says utf8 without bom.

So it seems the files are already encoded as utf8 without bom.

By the way, the epubs that I created with Calibre are all based on utf8 files. Should I change them to utf8 without bom? I googled, and all I found was the recommendation to use utf8. No page ever mentioned a BOM.


Oh, but to throw that in: I know the source file for an epub has to be valid xhtml 1.1. So I guess the same applies to the html files inside the epub? Because I just opened one with a validator and it gave me 8 errors.

The 2 most interesting ones: the <!DOCTYPE> tag is missing.

And something is also wrong with the media content.

I just looked at the original source html file that was also included, and the DOCTYPE is missing there too. So I let the validator check the source file, and within 3300 lines it found 106 errors. Some of them are very strange, like: it used div style when it should use div class ....

I guess that is the main problem. I expected the whole thing to be valid xhtml ... so when it comes to epubs, never rely on others using valid xhtml files.

Dark123
08-19-2010, 07:44 AM
Sorry, I meant: extract the HTML files from the ePub and convert them to UTF-8 without BOM, otherwise some of the characters do not display on my eBook reader (I tested it).
I think it would be a lot better if you added the ePub to Calibre, clicked Convert, opened ePub Output (on the left side), ticked "Do not split on page breaks" and set "Split files larger than" to 999999 KB.
This way Calibre will convert it into an ePub that has only 1 HTML file in it. You can then treat this as your original, edit it the way you like, and use Calibre to make it into an ePub afterwards.

charleski
08-19-2010, 08:27 AM
I wouldn't get too worried about the BOM. If it says 'UTF-8 without BOM' under the Encodings menu then you're fine. ePub readers shouldn't need a BOM (which is a bit archaic) anyway.


You'll want the top of each xhtml file to have

<?xml version="1.0" encoding="utf-8" standalone="no"?>

at the very least, and more properly

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

But if there are strange errors in the source then that might be the root of your problems. Calibre isn't very strict about checking the syntax, which is why I always use Sigil instead.
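With about 65 files, checking each one by eye for these two lines gets tedious. As a rough sketch, the Python function below reports which of the two parts is missing from the start of a file. It is not a validator, it only looks at the first 1024 characters, and `missing_prolog_parts` is a made-up name.

```python
import re


def missing_prolog_parts(text: str) -> list:
    """Return which of the XML declaration and DOCTYPE are absent."""
    head = text[:1024]
    missing = []
    # The XML declaration only counts if it sits at the very start of the file.
    if not head.startswith("<?xml"):
        missing.append("xml declaration")
    if re.search(r"<!DOCTYPE\s+html\b", head) is None:
        missing.append("doctype")
    return missing
```

An empty list means both parts were found. Anything else is worth a closer look with a real validator.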

NASCARaddicted
08-19-2010, 09:38 AM
Yeah. I noticed that in the past. Small errors can cause big problems.

When I started with epub, I knew nothing about html. xhtml? css? Valid files? I had never heard of stuff like that.

I used Mobipocket to convert pdf files to html, then I used Calibre to convert them to epub. I put them on my ebook reader, and to my surprise, some lines were longer than the screen itself. So half a word, or maybe even 1 or 2 words, were missing. In the beginning, I didn't know what was wrong, so I took a closer look at the html file and noticed that <p> tags and <br> tags were totally mixed up. I replaced all the p tags with br, and in the end the problem with the "too-long" lines was gone, but of course the book didn't look good. I mean, no text indent, and the text was not justified. When I found out what you can do with p tags, it got better, and then I learned how important validity is when it comes to xhtml files.

@ebooknewbie: thanks for your hint about how to create epubs with just 1 html file. I know, in general, split html files are no problem, but when you have to edit them ... you helped me a lot :-)

Dark123
08-19-2010, 12:04 PM
It's nothing. Hopefully it'll help you fix the problem that you're having.
I know how you feel about ebooks displaying weirdly; I had to learn a bit of CSS and XML (I knew HTML). My worst was a header (Chapter 1) followed by half the ebook reader's screen taken up by <br> tags.