MobileRead Forums - View Single Post - this epub has HTML files that are not well formed or are missing a doctype..

Tex2002ans · 09-16-2022, 02:42 AM

Quote:

Originally Posted by stumped

Meanwhile yes I can tweak my workflow. It's not understanding what's happening at code level that bugs me.

If you open an XHTML (or HTML) file in Sigil, at the very top, you'll see something like this:

Code:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
  "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">

That DOCTYPE is the code that is being added.
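
If it helps to see "add the missing header" as code, here's a rough Python sketch. This is not Sigil's actual code: the function name `ensure_doctype` is made up for illustration, and it assumes the file has already been read into a string and that XHTML 1.1 (EPUB2) is the DOCTYPE you want.

```python
import re

XML_DECL = '<?xml version="1.0" encoding="utf-8"?>'
XHTML11_DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"\n'
    '  "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">'
)

# An XML declaration, if present, must be the first thing in the file.
_XML_DECL_RE = re.compile(r"\A\s*<\?xml[^>]*\?>")
# Naive check: any "<!DOCTYPE " counts, even inside a comment.
_DOCTYPE_RE = re.compile(r"<!DOCTYPE\s", re.IGNORECASE)


def ensure_doctype(text: str) -> str:
    """Return text with an XML declaration and XHTML 1.1 DOCTYPE up front."""
    if _DOCTYPE_RE.search(text):
        return text  # already has a DOCTYPE, leave the file alone
    match = _XML_DECL_RE.match(text)
    # Keep an existing XML declaration; otherwise use the default one.
    decl = match.group(0).strip() if match else XML_DECL
    body = text[match.end():] if match else text
    return f"{decl}\n{XHTML11_DOCTYPE}\n\n{body.lstrip()}"
```

Running it twice on the same text changes nothing the second time, which is the behaviour you'd want from a "fix my file" button.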

- - -

Note: In your very specific Modify EPUB example, there's a single file—the new cover file—that's missing this DOCTYPE.

- - -

What's That DOCTYPE Code Saying?

"Hey! I'm written in XHTML 1.1."

Programs can use this to determine what the heck is in the file, instead of trying to completely guess.

Reason 1: Rules

This then sets up some "rules" the program can follow. For example:

In XHTML, everything that opens needs to close: a <p> needs its </p>.
In HTML, you can have super ugly stuff like a <p> without ever closing it.
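
You can watch the two rule sets disagree using nothing but Python's standard library. This is only a stand-in for what an EPUB tool does: `xml.etree.ElementTree` plays the strict XHTML side, `html.parser.HTMLParser` plays the forgiving HTML side, and the helper names are made up for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def is_well_formed_xml(markup: str) -> bool:
    """True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Lenient HTML parser that just records every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


closed = "<div><p>one</p><p>two</p></div>"
unclosed = "<div><p>one<p>two</div>"  # neither <p> is ever closed

print(is_well_formed_xml(closed))    # strict parser is happy
print(is_well_formed_xml(unclosed))  # strict parser refuses

collector = TagCollector()
collector.feed(unclosed)             # lenient parser carries on without complaint
collector.close()
print(collector.start_tags)
```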

If the DOCTYPE is missing, the program will have to guess based on filenames... or just assume it's HTML.
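
That guessing game might look something like the sketch below. It is purely illustrative and not how Sigil or Calibre actually decide; `guess_markup` and its order of checks are my own assumptions.

```python
import re
from pathlib import PurePosixPath

# Captures whatever follows "<!DOCTYPE html" (the public/system identifiers).
_DOCTYPE_RE = re.compile(r"<!DOCTYPE\s+html\b([^>]*)>", re.IGNORECASE)


def guess_markup(filename: str, text: str) -> str:
    """Return 'xhtml' or 'html' as a best guess for a file's contents."""
    match = _DOCTYPE_RE.search(text[:1024])  # a DOCTYPE lives near the top
    if match:
        identifiers = match.group(1).upper()
        if "XHTML" in identifiers:
            return "xhtml"  # the file told us outright
        if "DTD HTML" in identifiers:
            return "html"
        # A bare <!DOCTYPE html> fits both, so fall through to the filename.
    if PurePosixPath(filename).suffix.lower() in {".xhtml", ".xht"}:
        return "xhtml"
    return "html"  # nothing to go on: just assume HTML
```

With the XHTML 1.1 DOCTYPE in place the answer comes from the file itself; without it, `cover.xhtml` and `cover.html` get different answers purely because of their names.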

- - -

Sigil, when opening and popping up that warning, is saying:

"Hey! This thing isn't 100% correct according to the specs. Add a DOCTYPE!"

Calibre is saying:

"Meh, it's just HTML or XHTML."

- - -

Reason 2: Named (or Numbered) Entities

"Named Entities" are stuff like:

&amp; = &
&lt; = <
&gt; = >

"Numbered Entities" are stuff like:

&#8220; = “
&#8221; = ”
&#8224; = †

In EPUB2, both versions are allowed.

In EPUB3, only the numbers are allowed (apart from the five names built into XML itself: &amp;, &lt;, &gt;, &quot;, and &apos;).
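
The reason, as far as I understand it: EPUB3 files are XHTML5, which gets parsed as XML, and the short <!DOCTYPE html> doesn't bring along a DTD that defines names like nbsp. Here's a small Python sketch of the effect. The standard library XML parser is only a stand-in for a real EPUB reader, and `parses_as_xml` is a made-up helper name.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup: str) -> bool:
    """True if a plain XML parser (no DTD loaded) accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


named = "<p>a&nbsp;b</p>"      # name is undefined without a DTD
numbered = "<p>a&#160;b</p>"   # numbers always work
builtin = "<p>a &amp; b</p>"   # one of the five names XML itself defines

print(parses_as_xml(named))
print(parses_as_xml(numbered))
print(parses_as_xml(builtin))
```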

What Sigil/Calibre do, when fixing your files, is:

1. Convert all entities into their proper Unicode characters.
2. In EPUB3, convert to the numbered version if needed.

So everything changes into Unicode Characters:

&quot; → "
&ldquo; → “
&rdquo; → ”
&dagger; → †

And only a few change to numbered form. The most famous one is:

&nbsp; → &#160; = Non-Breaking Space

Those are pretty much the major code changes that Sigil/Calibre make when you say "Fix my files."
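
For the curious, the two steps above can be sketched in a few lines of Python. This is a simplification under my own assumptions, not what Sigil or Calibre actually run: `fix_entities` is a made-up name, it only knows the HTML 4 entity names in `html.entities.name2codepoint`, and it works on raw text with a regex instead of a real parser.

```python
import re
from html.entities import name2codepoint

# & < > must stay escaped, or the markup itself would break.
MUST_STAY_ESCAPED = {0x26, 0x3C, 0x3E}
# Kept in numbered form so it stays visible in the source.
KEEP_NUMBERED = {0xA0}  # non-breaking space

_ENTITY_RE = re.compile(r"&(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")


def _codepoint(ref):
    """Turn 'nbsp', '#160' or '#xA0' into a code point (None if unknown)."""
    if ref[:2] in ("#x", "#X"):
        return int(ref[2:], 16)
    if ref.startswith("#"):
        return int(ref[1:])
    return name2codepoint.get(ref)


def fix_entities(text: str) -> str:
    """Replace entities with Unicode characters, except the special cases."""
    def replace(match):
        cp = _codepoint(match.group(1))
        if cp is None or cp in MUST_STAY_ESCAPED or cp > 0x10FFFF:
            return match.group(0)  # unknown or unsafe: leave untouched
        if cp in KEEP_NUMBERED:
            return f"&#{cp};"      # e.g. &nbsp; becomes &#160;
        return chr(cp)
    return _ENTITY_RE.sub(replace, text)


print(fix_entities("&ldquo;Hi&rdquo;&nbsp;&dagger; &amp; more"))
```

One caveat with a regex approach like this: turning &quot; into a bare " is fine in running text but would break an attribute value, which is one reason real tools do this on the parsed document and not on raw text.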

- - -

Note: If you want to read even more of the technical details on DOCTYPE, see the HTML specs:

HTML5 > 8. The HTML syntax > 8.1.1 The DOCTYPE

- - -

Side Note: There have also been tons of topics over the years where we've discussed stuff like:

Differences between HTML vs. XHTML
Named vs. Numbered Entities

Just type into your favorite search engine:

Code:

whatever term you want Tex2002ans site:mobileread.com

and you'll probably stumble across all those topics over the years.

Quote:

Originally Posted by stumped

Maybe I should read an idiots guide to doc types and why they matter, or don't matter , in epub?

Just push the button and trust Sigil. :P

It's like going from 99.999% correct to 100% correct.

KevinH (+ Sigil) says follow the specs.

Kovid (+ Calibre) says that .001% doesn't matter in reality.

(This is all going off memory. It's buried somewhere in those previous DOCTYPE MobileRead topics.)