Old 09-16-2022, 01:42 AM   #26
Tex2002ans
Quote:
Originally Posted by stumped
Meanwhile yes I can tweak my workflow. It's not understanding what's happening at code level that bugs me.
If you open an XHTML (or HTML) file in Sigil, at the very top, you'll see something like this:

Code:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
  "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
That DOCTYPE is the code that is being added.
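
For the curious, here's a rough Python sketch of what "adding the DOCTYPE" boils down to. This is not Sigil's actual code. The function name add_doctype and the choice of the XHTML 1.1 DOCTYPE (the one shown above) are assumptions made for illustration:

```python
import re

# Assumption: we always add the XHTML 1.1 DOCTYPE shown in the post.
XHTML11_DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"\n'
    '  "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">\n'
)

# Matches an optional leading XML declaration such as <?xml version="1.0"?>
XML_DECL_RE = re.compile(r'^\s*<\?xml[^>]*\?>\s*')


def add_doctype(text: str) -> str:
    """Return text with a DOCTYPE inserted, unless it already has one."""
    if re.search(r'<!DOCTYPE', text, re.IGNORECASE):
        return text  # already there, nothing to do

    match = XML_DECL_RE.match(text)
    if match:
        # Keep the XML declaration first, then the DOCTYPE, then the rest.
        declaration = text[:match.end()].rstrip() + "\n"
        return declaration + XHTML11_DOCTYPE + "\n" + text[match.end():]

    # No XML declaration: the DOCTYPE simply goes at the very top.
    return XHTML11_DOCTYPE + "\n" + text


if __name__ == "__main__":
    cover = (
        '<?xml version="1.0" encoding="utf-8"?>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml"><body/></html>'
    )
    print(add_doctype(cover))
```

Running it a second time on its own output changes nothing, because it bails out as soon as it sees an existing DOCTYPE.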

- - -

Note: In your very specific Modify EPUB example, there's a single file—the new cover file—that's missing this DOCTYPE.

- - -

What's That DOCTYPE Code Saying?

"Hey! I'm written in XHTML 1.1."

Programs can use this to determine what the heck is in the file, instead of trying to completely guess.
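
As a minimal sketch of that "read the label instead of guessing" idea, here's how a program could pull the identifier out of the DOCTYPE. The function name sniff_doctype and its return convention are hypothetical, and real programs are more thorough than one regex:

```python
import re
from typing import Optional

# Captures the quoted public identifier, if the DOCTYPE has one.
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+html(?:\s+PUBLIC\s+"([^"]*)")?',
    re.IGNORECASE,
)


def sniff_doctype(text: str) -> Optional[str]:
    """Return the public identifier, "" for a bare <!DOCTYPE html>,
    or None when the file has no DOCTYPE at all."""
    match = DOCTYPE_RE.search(text)
    if match is None:
        return None
    return match.group(1) or ""


if __name__ == "__main__":
    header = (
        '<?xml version="1.0" encoding="utf-8"?>\n'
        '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"\n'
        '  "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">\n'
    )
    print(sniff_doctype(header))             # the XHTML 1.1 identifier
    print(sniff_doctype("<html></html>"))    # None: the program has to guess
```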

Reason 1: Rules

This then sets up some "rules" the program can follow. For example:
  • In XHTML, everything that opens <p> needs to close </p>.
  • In HTML, you can have super ugly stuff like <p> <p> <p> without ever closing it.

If the DOCTYPE is missing, the program will have to guess based on filenames... or just assume it's HTML.
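
You can see the two rule sets side by side with nothing but Python's standard library: a strict XML parser chokes on unclosed <p> tags, while a forgiving HTML parser shrugs and keeps going. This is a small sketch, and the helper names are made up for the example:

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def is_well_formed_xml(markup: str) -> bool:
    """True if a strict XML parser accepts the markup (the XHTML rules)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Records every start tag the lenient HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.opened = []

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)


def html_start_tags(markup: str) -> list:
    """Feed markup to the lenient HTML parser and list the tags it saw."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.opened


if __name__ == "__main__":
    closed = "<div><p>one</p><p>two</p></div>"
    unclosed = "<div><p>one<p>two</div>"

    print(is_well_formed_xml(closed))    # True
    print(is_well_formed_xml(unclosed))  # False: XML insists on </p>
    print(html_start_tags(unclosed))     # ['div', 'p', 'p'], no complaint
```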

- - -

Sigil, when opening and popping up that warning, is saying:

"Hey! This thing isn't 100% correct according to the specs. Add a DOCTYPE!"

Calibre is saying:

"Meh, it's just HTML or XHTML."

- - -

Reason 2: Named (or Numbered) Entities

"Named Entities" are stuff like:
  • & = &amp;
  • < = &lt;
  • > = &gt;

"Numbered Entities" are stuff like:
  • “ = &#8220;
  • ” = &#8221;
  • † = &#8224;

In EPUB2, both versions are allowed.

In EPUB3, only the numbers are allowed (plus the five entities built into XML itself: &amp; &lt; &gt; &quot; &apos;).

What Sigil/Calibre do, when fixing your files, is:

1. Convert entities into their proper Unicode characters (except &amp; and &lt;, which have to stay escaped or the markup breaks).
2. In EPUB3, convert to the numbered version if needed.

So everything changes into Unicode Characters:
  • &quot; → "
  • &ldquo; → “
  • &rdquo; → ”
  • &dagger; → †

And only a few change to numbered form. The most famous one is:
  • &nbsp; → &#160; = Non-Breaking Space

Those are pretty much the major code changes that Sigil/Calibre make when you say "Fix my files."
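
To make the entity conversion concrete, here's a hedged Python sketch of the general idea. It is not what Sigil or Calibre actually run. The function name fix_entities and the two sets below are assumptions, and it only handles plain text content (turning &quot; into a raw quote inside an attribute value would break the attribute):

```python
import re
from html.entities import name2codepoint

# Assumption: these must stay escaped, or the markup itself breaks.
MUST_STAY_ESCAPED = {"amp", "lt", "gt"}

# Assumption: invisible characters are kept as numbers so you can still
# see them in the code. Non-breaking space is the famous one.
KEEP_NUMERIC = {"nbsp"}

ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def fix_entities(text: str) -> str:
    """Turn named entities into Unicode characters or numbered entities."""

    def replace(match):
        name = match.group(1)
        if name in MUST_STAY_ESCAPED or name not in name2codepoint:
            return match.group(0)       # leave it exactly as written
        codepoint = name2codepoint[name]
        if name in KEEP_NUMERIC:
            return f"&#{codepoint};"    # &nbsp; becomes &#160;
        return chr(codepoint)           # &ldquo; becomes the real character

    return ENTITY_RE.sub(replace, text)


if __name__ == "__main__":
    print(fix_entities("&ldquo;Hi&rdquo;&nbsp;&dagger; Tom &amp; Jerry"))
```

Unknown names are left untouched on purpose, so a typo like &nbps; stays visible instead of silently vanishing.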

- - -

Note: If you want to read even more of the technical details on DOCTYPE, see the HTML specs.

- - -

Side Note: There have also been tons of topics over the years where we've discussed stuff like:
  • Differences between HTML vs. XHTML
  • Named vs. Numbered Entities

Just type into your favorite search engine:

Code:
whatever term you want Tex2002ans site:mobileread.com
and you'll probably stumble across all those topics over the years.

Quote:
Originally Posted by stumped
Maybe I should read an idiot's guide to doctypes and why they matter, or don't matter, in EPUB?
Just push the button and trust Sigil. :P

It's like going from 99.999% correct to 100% correct.

KevinH (+ Sigil) says follow the specs.

Kovid (+ Calibre) says that .001% doesn't matter in reality.

(This is all going off memory. It's buried somewhere in those previous DOCTYPE MobileRead topics.)

Last edited by Tex2002ans; 09-16-2022 at 02:05 AM.