(x)html ebook specification

rogue_ronin · 05-18-2009, 01:17 AM

I'd like to get some brains on this subject.

I'm (slowly) assembling and editing a giant library of HTML ebooks. I've been using an idiosyncratic mix of HTML 3.2 and XHTML that I've picked up over the years.

I use an obsolete reader -- but a very functional one! (The REB1100.) I'm looking to upgrade, though, soon. And I'd like to do this only once, the editing/organizing.

I use an awesome text editor, NoteTab Pro, which lets you assemble libraries of macros -- anything you can do to text, you can do with the macros, it's got an enormous language. So I've built a library with a hundred or two macros, that do everything from regex to boilerplate to file manipulation and database entries.

So I need some advice on a better 'spec' for the format -- I should be able to rewrite the macros to the new one, and write a few that auto-adapt the old stuff I've done already. Creating macros to write CSS for any reader should be dead-simple, or writing converters to straight HTML, also -- once the format is set and consistent. There are a lot of ideas out there, and I have my own, which I'll start with:

This spec should use XHTML, and CSS. But the document markup should be as simple as possible.

Here are the elements that I think are important in an ebook, primarily fiction books -- a mix of meta-data and structure -- the meta is often explicitly expressed in the book: please add on if I've missed something.

Book Meta: Author(s), Illustrator(s), Publisher, ISBN, Publishing Date, Publishing City, Copyright Owner, Copyright Date, Series Name, Title, Sub-Title

File Meta: Version Number, Version Date, Original Conversion Date, Scanner, Proofreader(s), Original Source

Structure: Cover, Front Matter, Title Page, Verso Page (book meta info page), Inscription, Acknowledgments, Preface, Foreword, Table of Illustrations/Maps, Table of Contents, Prologue, Parts, Chapters, Epigrams, Sections, Sub-Sections, Paragraphs, Epilogue, Afterword, Endnotes, Glossary, Index, End Matter

If I've missed anything, please add or suggest. In my next post, I'm going to add my current methods, and ask for advice on improvements.

Thanks for reading!

m a r

pepak · 05-18-2009, 01:54 AM

Quote:

Originally Posted by rogue_ronin

Creating macros to write CSS for any reader should be dead-simple,

You don't need macros for that - as long as your (X)HTML files are created with styling in mind, the CSS will be a completely standalone part of your book. Most likely you will have most of the CSS for all books in one file, with perhaps an additional "supporting" CSS file for particular books. The whole point of CSS is that if you want to change the appearance of an HTML file, you make your changes only in the separate CSS file, nowhere else.

As for organizing everything neatly, you may want to look at my "Calibre preprocessor" H2LRF.

rogue_ronin · 05-18-2009, 06:50 AM

Read your thread/link. It's funny, I already do something like that, too, with all the meta-data. More than what you've got. Anal-retentively more.

'Course, I'm not using Calibre (yet.) And the utility I use is my text editor, not a pre-processor to call Calibre... I think what I'm shooting for here is a utility/software/reader neutral way to present an ebook in simplest HTML -- consistency being the key. Then anyone can take it and convert it pretty easily. Heck, you could use it for your h2lrf, almost without effort -- just a minor change to the meta you look for.

As for the CSS macros -- I think what I'm talking about is that each ebook reader (hardware) should probably have its own CSS, right? I mean, what looks good on a 5" JetBook, probably doesn't look as good on an 11" DS1000. So you'd want to answer a few questions in a dialog (well, I would) about what sort of reader you're trying to make an EPUB for. Then, boom, CSS created. Maybe it calls some common defaults in a common.css file or such.

Now that I re-read my thoughts above, I think I'm talking about two separate things. Common, single CSS will be all that's necessary. Later in the process, when I want to make a package for a hardware reader, then an additional CSS macro might be necessary. You're right, as usual pepak.

Anyway, gonna dig out my old Barsoom folder, and grab A Princess of Mars to use as an example for the next stuff.

Back later,

m a r

JSWolf · 05-18-2009, 09:38 AM

Since you are coding XHTML, why not have a look at ePub? It would do quite well for you (IMHO).

rogue_ronin · 05-18-2009, 11:01 AM

Yeah, I can't seem to find a clear set of guidelines/tutorial for ePub -- they all (so far) seem to assume a level of familiarity with XML that I don't have. I'm never happy if I just mimic without understanding -- and that takes a while.

I think I might use this thread to teach myself how to do it (ePub) properly, and modularly, with great metadata, and good in-book navigation. Just do it a piece at a time, and hope folks chime in when I'm screwing up.

It's the XML spine, etc. where I start to get truly lost. 'Course, if I figure it out once, I can just macro the heck out of it.

m a r

DaleDe · 05-18-2009, 12:39 PM

The wiki can help in your research. It has most of the topics you have expressed interest in and can provide a starting point. If you find any deficiencies you can correct them! or ask for help.

Dale

pepak · 05-18-2009, 04:34 PM

Quote:

Originally Posted by JSWolf

Since you are coding XHTML, why not have a look at ePub? It would do quite well for you (IMHO).

You can convert XHTML to anything easily. You can't do that with EPUB, even though it uses XHTML as its basis (e.g. with EPUB your converter needs to be able to handle multiple source files).

gwynevans · 05-18-2009, 07:36 PM

The way I've been doing it when creating ePubs has been to just create the XHTML as below and run it through Calibre to create the actual ePub, as that way I can work with a single source but use Calibre to do the file-splitting & 'twiddly bits' to create a valid ePub.

rogue_ronin · 05-19-2009, 02:42 AM

Okay, yeah, after being swayed by the wind, I'm back in the simple (x)HTML camp. I want to produce single-file ebooks (other than images/sounds, of course.)

So, let me get started: I dug out my old file of A Princess of Mars (it's in the public domain) and updated it -- it was still in my old format (easy but not trivial to fix.)

I'm going to go through it a piece at a time (I won't post entire chapters, just relevant stuff.)

Here's the current start of the file, through the head:

Code:

<html>

<head>

<!-- Conversion Started May/20/2004 -->
<!-- Revision # 0.80 on May/18/2009 -->

<!-- META INFO USED BY THE REB1100 FOR DISPLAY ON THE ABOUT PAGE -->

<title>A Princess of Mars</title>
<meta name="author" content="Burroughs, Edgar Rice">
<meta name="publisher" content="Found Text">
<meta name="genre" content="Science Fiction::General">
<meta name="ISBN" content="Found Text: #0085 v. 0.80">

<!-- META INFO USED BY THE REB1100 NOTETAB CLIPBOOK -->

<meta name="theme" content="Negative">
<meta name="number" content="0085">
<meta name="name" content="APrincessOfMars">
<meta name="version" content="0.80">
<meta name="title" content="A Princess of Mars">
<meta name="subtitle" content="Barsoom #01">
<meta name="series" content="Barsoom">
<meta name="seriesnumber" content="01">
<meta name="authorlast" content="Burroughs">
<meta name="authorfirst" content="Edgar">
<meta name="authormiddle" content="Rice">
<meta name="authorfull" content="Burroughs, Edgar Rice">
<meta name="rebgenre" content="Science Fiction::General">
<meta name="conversiondate" content="May/20/2004">
<meta name="source" content="University of Virginia Electronic Text Center">
<meta name="scanner" content="Judy Boss">
<meta name="proofer" content="Kelly Tetterton, Peter-John Byrnes, Found Text">
<meta name="revisiondate" content="May/18/2009">
<meta name="shortpath" content="REB1100\eBookProjects\Found Text\Burroughs_Edgar_Rice\Barsoom\01_APrincessOfMars\APrincessOfMars.html">

<!-- ENDNOTE COUNT -->

<meta name="endnotecount" content="0">

<!-- GOTO MENU -->

<meta name="rocket-menu" content="Table of Contents=#toc">
<meta name="rocket-menu" content="About this Version=#verso">

</head>

I looked at gwynevans' source code, which is sweet-and-clean:

Code:

<?xml version="1.0" encoding="utf-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<HTML>
<HEAD>
    <META NAME="Author" CONTENT="Konrath, J.A"/>
    <style type="text/css">
        <!--
        p { text-indent: 1em; margin: 0; }
        h1 { page-break-before: always; font-style: italic; }
        div.drink { page-break-before: always; }
        div.drink p { text-indent: 8em; }
        -->
    </style>
</HEAD>

Now, of course, my stuff has a bunch of kludges in there, specific to the REB1100. But it also has some good stuff, I think. And I have a few comments about gwynevans' code (not critical of you, gwynevans, everyone has their own preferences) to use as a jumping off point.

Let's start with a couple of questions:

1) Is there a reason to prefer XHTML 1.0 over XHTML 1.1?

2) How to choose the encoding? ie: what's best, most universal, least hassle? I'm an English speaker (my Thai is terrible! my Japanese has faded, my French really sucks, but I can read a little of all of them -- I could be talked into some single way of seeing everything, as long as it won't hamper sharing, or over-complicate.)

3) gwynevans -- did you whip this out as an example, or was it something you have? I ask because it has no title, for instance.

4) I'm thinking to move all CSS to a separate file, any reason I shouldn't? (BTW, the page-break-before: always; part -- is that specific to ebooks, or part of XHTML? 'Cause I've been wondering about how to hard-code that.)

And now to rip apart my gunk:

A) It has no DTD or -- what is it called when you specify html vs xml, etc?

B) For readability of the source (something that is very important to me) I keep a lot of sectioning with vertical space. I see (and have seen elsewhere) horizontal tabbing as a visual aid. Any good reason to prefer one over the other? Or to not combine them? I understand why the tabbing is there, but glancing at a page it is hard to find related sections -- they just don't stand out. And when there are a lot of sections to a document, I find that I have to right-scroll a lot, or that word wrap wrecks the layout.

C) Lower case tags: correct usage for XHTML, right?

D) Version information in comments -- I think this is a good practice, but for sharing would there be a better method? For instance, I don't name where to find the history/source, or what the numbers mean. I do have a set of guidelines for the numbers, should I include the guidelines? Or something else? It's going to be repeated later, but isn't it nice to open a file, and see the version, boom!, right there? Should I keep a list of all updates, instead of just first and latest?

E) REB1100 meta-info: well the <title> has to stay! And the next four <meta> tags are staying too, I think. (The ISBN tag is one I hijacked to display the collection number -- it would simply be returned to its original function.) Just gonna get rid of the comment, and merge the <meta> tags into a more general Meta section. Reasonable, right?

F) NoteTab ClipBook meta-info: For the goals of this thread, I am certain these <meta> tags are mixed up, and that as a pure source file, some should not be there. Just a comment on the meta/genre tag -- I found a collection of what looked like standard, official book-seller classifications online, and I wrote a macro to give me a drop-down list of genres. So it's got some universal sensibility, not just my personal conceits.

i) <meta name="theme"... : I have sets of icons and images that are used as links (ie: next chapter, previous chapter, toc, an end-of-book image that "closes" the book by linking to the cover, etc.) This just names it for the macro-library, and could be used to repair a broken folder, although I don't do that now. It should go, it's not necessary. But I still want to include such images!
ii) <meta name="number"... : this is a project number for the "publisher", in this example named Found Text. The "publisher" is just a conceit, but it makes it convenient to group books, by genre or author, whatever. Still, not necessary? Or could it be adapted into a DocumentID, UID or something like that?
iii) <meta name="name"... : Both the filename(.html) and the project name in my database/filetree. I don't know. I can sort of see this either way. Useful? Or redundant? Need a better name itself? I always camel-case the title of the book and remove spaces.
iv) <meta version, title, subtitle, series, seriesnumber... : I think these stay. Can't see any reason not to have them. Maybe they need better names?
v) <meta authorlast, authormiddle, authorfirst, authorfull ... : Why do I not just use "author"? Well, sometimes you need to manipulate the name for display, other times you need to sort by last name. I think FBreader, for one, lets you choose the sorting tag. Also, the macros I write let me collect names as I add them, so why have to re-type "Edgar" when I add Poe to my collection? I could see adding an author <meta>, or re-jiggering authorfull, and changing the name of the current authorfull to something else (author-by-last, authoralphabetical?) My current rule is when you have an initial, you don't use a period. I let the macros sort it out. But that may not be best. Does this all make sense? Regardless, I don't think a simple <meta name="author" content="Edgar Rice Burroughs"> is enough.
vi) <meta rebgenre ... : It's just here because I wanted to keep the REB1100 functional stuff separated from the NoteTab Clip stuff. It made it a lot easier to parse the file when updating. Redundant, gone.
vii) <meta conversiondate, source, scanner, proofer, revisiondate ... : All necessary, I think. Maybe need better names? For example, should conversiondate, be initialconversiondate? or xhtmlconversiondate? Or something else, others?
viii) <meta shortpath ... : this is specific to the filetree, and makes setting things up in the macros easier. Unnecessary. Gone.
ix) The entire endnote part should go, I think. I keep the number to allow for adding new endnotes as the text is processed. It's really just a backup (in a way, all these meta-info are, given that I keep it all in a database, too.) Gone.
x) GOTO-MENU section: Gone. This is a section for the REB1100 (and specifically for rbmake, I believe); it allows for up to seven (that's right, folks, a whole seven!) pop-up TOC links.
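
As an illustration of point v) above (one name for display, another for sorting), here is a minimal Python sketch, not the NoteTab macros themselves. The function name author_forms and the keys "display" and "sort" are made up for this example; it only assumes that the name is stored as separate first/middle/last parts, as in the meta tags shown earlier.

```python
def author_forms(first, middle, last):
    """Build a display form and a sort form from separate name parts.

    Empty parts (e.g. no middle name) are simply skipped.
    """
    given = " ".join(part for part in (first, middle) if part)
    display = " ".join(part for part in (given, last) if part)
    sort_form = f"{last}, {given}" if given else last
    return {"display": display, "sort": sort_form}


if __name__ == "__main__":
    print(author_forms("Edgar", "Rice", "Burroughs"))
```

Storing the parts and deriving both forms means the display name and the sort name can never drift apart, which is the main argument for keeping more than a single "author" value.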

So what do you think? I know this is a super-long post: don't feel that you have to respond to everything, just whatever you think is good or bad. I'm looking to develop a best-practice here, and I don't see a lot of discussion about embedding meta-info.

Thanks for reading,

m a r

gwynevans · 05-19-2009, 05:24 AM

> 1) Is there a reason to prefer XHTML 1.0 over XHTML 1.1?

None that I know of - I guess I just had the 1.0 header to hand & for this particular use, I don't think there was any difference between 1.0 & 1.1.

> 3) gwynevans -- did you whip this out as an example, or was it something you have? I ask because it has no title, for instance.

At the time I'd not considered custom metadata & pre-processing, so just had a 'build_ePub.bat' in the folder which set some of the metadata via the Calibre command-line, e.g. 'html2epub --margin-right=10 --level1-toc="//h2" --chapter="//h2" --cover="Konrath, J.A - Jack Daniels 01 - Whiskey Sour.png" -t "Whiskey Sour" -a "Konrath, J.A" "Konrath, J.A - Jack Daniels 01 - Whiskey Sour.html"'

> 4) I'm thinking to move all CSS to a separate file, any reason I shouldn't?

If you've come up with a standard set of styles that you want to reuse, then it's worth considering, although the main reason to do so in the web-site case is to allow global changes by editing the one file, which may be less of an issue in this particular usage.

> (BTW, the page-break-before: always; part -- is that specific to ebooks, or part of XHTML? 'Cause I've been wondering about how to hard-code that.

Standard, but it's less well known as it's focussing on the print side of things - http://www.w3schools.com/Css/pr_print_pagebb.asp

pepak · 05-19-2009, 05:41 AM

Quote:

Originally Posted by rogue_ronin

1) Is there a reason to prefer XHTML 1.0 over XHTML 1.1?

XHTML 1.1 is a bit "cleaner" (from the technical point of view), which makes it a bit more restrictive. That's a good thing, IMHO.

Quote:

2) How to choose the encoding? ie: what's best, most universal, least hassle?

If you want "one encoding to rule them all", go with UTF-8. Personally, I use the encoding that is best suited to each book in my OS (e.g. us-ascii for english books, windows-1250 for czech books). It helps with your "readability of source" - it is far easier to read "â" than a sequence of two special symbols.

Quote:

4) I'm thinking to move all CSS to a separate file, any reason I shouldn't?

You definitely should!

Quote:

(BTW, the page-break-before: always; part -- is that specific to ebooks, or part of XHTML? 'Cause I've been wondering about how to hard-code that.

It's a CSS2 specification - every reader that supports CSS2 should be able to handle it, provided that the display medium can (e.g. page-breaks don't make sense with an "endless" screen of browsers).

Quote:

B) For readability of the source (something that is very important to me) I keep a lot of sectioning with vertical space. I see (and have seen elsewhere) horizontal tabbing as a visual aid. Any good reason to prefer one over the other? Or to not combine them?

It doesn't matter to XHTML - blank space is reduced to one space by a standard-compliant reader. Choose what appeals to you more.

Quote:

C) Lower case tags: correct usage for XHTML, right?

Yes. XHTML requires proper (which tends to be lower) case.

Quote:

D) Version information in comments -- I think this is a good practice, but for sharing would there be a better method?

Put it into metadata. <meta name="version" content="..." />

pdurrant · 05-19-2009, 06:10 AM

Quote:

Originally Posted by pepak

If you want "one encoding to rule them all", go with UTF-8. Personally, I use the encoding that is best suited to each book in my OS (e.g. us-ascii for english books, windows-1250 for czech books). It helps with your "readability of source" - it is far easier to read "â" than a sequence of two special symbols.

I'm moving to using UTF-8 entirely. But then, I have a good UTF-8 text editor - BBEdit. This means I have access to all of unicode, and see the characters as they should be, and don't have to use the entities. If any particular reader software needs entities rather than UTF-8, the production process can do the substitution. I much prefer editing with “ and ” rather than &ldquo; and &rdquo;
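
The "production process can do the substitution" step could look something like this in Python. This is only a sketch, and the function name to_numeric_entities is made up here. Note that it emits numeric character references (&#8220;) rather than named entities, on the assumption that any reader that understands entities at all will accept the numeric form.

```python
def to_numeric_entities(text):
    """Replace every non-ASCII character with a numeric character reference.

    ASCII characters, including existing markup, pass through unchanged.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(to_numeric_entities("\u201cKaor!\u201d she said."))
```

The source file stays in readable UTF-8; only the copy handed to a fussy reader gets the entities.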

rogue_ronin · 05-19-2009, 07:07 AM

Okay, thanks for the link, I've started to read the tutorial on CSS.

I've found this example of a 1.1 header elsewhere:

Code:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

I'll use it for now, until I know more.

I understand about using Calibre. That makes a lot of sense as a way to just get it done quickly and adequately.

'Course, I'm goin' all ballistic on this right now...

So, looking at my first post, and my second long post, and a new idea or two here's my proposed start of a new file with good meta-info (using the old data, and faking where necessary):

Code:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<head>

<title>A Princess of Mars</title>

<!-- BEGIN: DOCUMENT HISTORY -->

<!-- Created on 20/May/2004 -->
<!-- Revision # 0.10 on 30/Jun/2004 -->
<!-- Revision # 0.20 on 28/Dec/2004 -->
<!-- Revision # 0.40 on 13/Apr/2005 -->
<!-- Revision # 0.70 on 30/Sep/2006 -->
<!-- Current Revision # 0.80 on 18/May/2009 -->

<!-- END: DOCUMENT HISTORY -->

<!-- BEGIN: EBOOK META INFORMATION -->

<meta name="filename" content="APrincessOfMars" />
<meta name="fileid" content="FoundText0085" />
<meta name="filecreationdate" content="20/May/2004" />
<meta name="fileversion" content="0.80" />
<meta name="filerevisiondate" content="18/May/2009" />
<meta name="filesource" content="University of Virginia Electronic Text Center" />
<meta name="filescanner" content="Judy Boss" />
<meta name="fileproofer" content="Kelly Tetterton, Peter-John Byrnes, Found Text" />

<meta name="title" content="A Princess of Mars" />
<meta name="subtitle" content="Barsoom #01" />
<meta name="series" content="Barsoom" />
<meta name="seriesnumber" content="01" />
<meta name="genre" content="Science Fiction::General" />

<meta name="author" content="Edgar Rice Burroughs" />
<meta name="authorlast" content="Burroughs" />
<meta name="authorfirst" content="Edgar" />
<meta name="authormiddle" content="Rice" />
<meta name="authoralpha" content="Burroughs, Edgar Rice" />

<meta name="illustrator" content="Frank Frazetta" />
<meta name="illustratorlast" content="Frazetta" />
<meta name="illustratorfirst" content="Frank" />
<meta name="illustratormiddle" content="" />
<meta name="illustratoralpha" content="Frazetta, Frank" />

<meta name="publisher" content="Found Text" />
<meta name="publicationdate" content="08/Jul/2010" />
<meta name="publicationcity" content="Honolulu" />

<meta name="copyrightholder" content="" />
<meta name="copyrightdate" content="" />

<meta name="isbn" content="" />

<!-- END: EBOOK META INFORMATION -->

</head>

So, a lot cleaner than what I started with. Title, CreationDate and RevisionDate are repeated, but that seems reasonable to me: I think that whatever parser you might use on the meta shouldn't have to make an exception for the title, and I still think that a doc history makes sense at the top.

If there is more than one author or illustrator, append ## to the name attribute: ie, author01 for the 2nd author, illustrator03 for the 4th illustrator. I think 100 authors and illustrators is enough. Let the parser figure it out. Or should I start with author01? Or author00? I don't think so, but...

The only inconsistency this leaves is with the proofer: I can't see a reason though why you might need more than a simple, comma-separated list. Can anyone?

And, I guess, sometimes publishers have more than one city -- but a simple list would do there, too, wouldn't it?

All dates in dd/mmm/yyyy format. Use leading 0's for all numbers less than 10.

I really do appreciate any input to this typing-out-loud, thanks,

m a r

rogue_ronin · 05-19-2009, 07:17 AM

Hmmm, another thought. Does it make sense to include the following two things?

#1: versioning info. Often when I get a file, the version number is basically meaningless. Here are the versioning ranks I use:

Code:

0.10 Initial Conversion
0.20 Cover and Frontispiece
0.30 Sections, Chapters and TOC
0.40 Endnotes and/or Blockquotes
0.50 Initial Spellcheck
0.60 Mdashes and Hyphens and Ellipses
0.70 Italics, Bold, and Pre-Formatted Text
0.80 Reading Proof
0.90 Checked against Canonical Source
1.00 Touched Up and Packaged For Release
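
If the ranks above travel with the file, a tool can turn a bare version number back into something meaningful. Here is a rough Python sketch of that lookup. The names STAGES and describe_version are invented for this example, and the table is just the list above typed in.

```python
# Versioning ranks copied from the list above.
STAGES = {
    "0.10": "Initial Conversion",
    "0.20": "Cover and Frontispiece",
    "0.30": "Sections, Chapters and TOC",
    "0.40": "Endnotes and/or Blockquotes",
    "0.50": "Initial Spellcheck",
    "0.60": "Mdashes and Hyphens and Ellipses",
    "0.70": "Italics, Bold, and Pre-Formatted Text",
    "0.80": "Reading Proof",
    "0.90": "Checked against Canonical Source",
    "1.00": "Touched Up and Packaged For Release",
}


def describe_version(version):
    """Return the stage name for a version string such as "0.80" or "0.8"."""
    normalized = f"{float(version):.2f}"  # "0.8" and "0.80" mean the same rank
    return STAGES.get(normalized, "Unknown stage")


if __name__ == "__main__":
    print(describe_version("0.80"))
```

Anything that is not one of the ten ranks comes back as "Unknown stage" instead of being guessed at.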

#2: structure hints. Should someone have to examine the entire source for a clue to how it's been assembled? Or could you add something like this as a comment?:

Code:

Title: h1 class="title"
Subtitle: h3 class="subtitle"
Chapter: h3 class="chapter"
Paragraph: p class="normal"
Epigram: p class="epigram"
etc. etc.

m a r

pepak · 05-19-2009, 07:21 AM

Quote:

Originally Posted by rogue_ronin

The only inconsistency this leaves is with the proofer: I can't see a reason though why you might need more than a simple, comma-separated list. Can anyone?

Theoretically, and keeping in mind that I do all my books myself and for myself (they are still covered by copyright, so I can't distribute them), you may want to contact one proofer or another - so you need an e-mail and possibly some IM. You can add them to the list (in parentheses, or something), but as the amount of data increases, the list will become even less readable.

Do you have any specific reason why you don't want to use multiple meta's?

Code:

<meta name="proofer" content="Person A" />
<meta name="proofer" content="Person B" />
<meta name="proofer" content="Person C" />

Consistently, you could do all multiple-value metas this way, e.g. for author. Reason being, it makes sense to place a book by A. Smith and B. Johnson both among "Smith books" and "Johnson books". A book can belong to multiple series (e.g. Feist's Daughter of the Empire belongs to Empire Saga as book 1, but can also be considered to belong to Midkemia series as book 4(5)).
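
To show that repeated metas are no harder to read back than numbered ones, here is a small Python sketch using the standard library's html.parser. The names MetaCollector and collect_meta are invented for this example, and it assumes every multi-value field is written as repeated tags, as in the proofer example above.

```python
from collections import defaultdict
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect every <meta name=... content=...> into a list per name."""

    def __init__(self):
        super().__init__()
        self.meta = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closed tags such as <meta ... />.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()].append(content)


def collect_meta(source):
    """Return {name: [content, ...]} for all named meta tags in source."""
    parser = MetaCollector()
    parser.feed(source)
    parser.close()
    return dict(parser.meta)


if __name__ == "__main__":
    sample = (
        '<head><meta name="proofer" content="Person A" />'
        '<meta name="proofer" content="Person B" /></head>'
    )
    print(collect_meta(sample))
```

Every value ends up in a list, in document order, so a single-author book and a five-author anthology are handled by the same code with no author01/author02 bookkeeping.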

Quote:

All dates in dd/mmm/yyyy format. Use leading 0's for all numbers less than 10.

Personally, I prefer the SQL-standard of yyyy-mm-dd, which is easy to sort even in textual form.
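
Converting existing dd/mmm/yyyy dates is mechanical. Here is a Python sketch, where MONTHS and to_iso_date are names made up for this example. It assumes English month names, accepts either "Jul" or "July", and does not handle the older mmm/dd/yyyy order.

```python
# English month abbreviations mapped to month numbers.
MONTHS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}


def to_iso_date(text):
    """Convert "20/May/2004" to "2004-05-20"."""
    day, month, year = text.strip().split("/")
    month_number = MONTHS[month[:3].lower()]  # "July" and "Jul" both work
    return f"{int(year):04d}-{month_number:02d}-{int(day):02d}"


if __name__ == "__main__":
    dates = ["18/May/2009", "20/May/2004", "30/Sep/2006"]
    # Once converted, a plain text sort is also a chronological sort.
    print(sorted(to_iso_date(d) for d in dates))
```

An unknown month name raises a KeyError rather than producing a wrong date, which is probably what you want when batch-converting a library.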

05-19-2009, 02:42 AM	#9
rogue_ronin Banned Posts: 475 Karma: 796 Join Date: Sep 2008 Location: Honolulu Device: Nokia 770 (fbreader)	I think this is a discussion of META! But not a meta-discussion (but this title is!) Okay, yeah, after being swayed by the wind, I'm back in the simple (x)HTML camp. I want to produce single-file ebooks (other than images/sounds, of course.) So, let me get started: I dug out my old file of A Princess of Mars (it's in the public domain) and updated it -- it was still in my old format (easy but not trivial to fix.) I'm going to go through it a piece at a time (I won't post entire chapters, just relevant stuff.) Here's the current start of the file, through the head: Code: <html> <head> <!-- Conversion Started May/20/2004 --> <!-- Revision # 0.80 on May/18/2009 --> <!-- META INFO USED BY THE REB1100 FOR DISPLAY ON THE ABOUT PAGE --> <title>A Princess of Mars</title> <meta name="author" content="Burroughs, Edgar Rice"> <meta name="publisher" content="Found Text"> <meta name="genre" content="Science Fiction::General"> <meta name="ISBN" content="Found Text: #0085 v. 
0.80"> <!-- META INFO USED BY THE REB1100 NOTETAB CLIPBOOK --> <meta name="theme" content="Negative"> <meta name="number" content="0085"> <meta name="name" content="APrincessOfMars"> <meta name="version" content="0.80"> <meta name="title" content="A Princess of Mars"> <meta name="subtitle" content="Barsoom #01"> <meta name="series" content="Barsoom"> <meta name="seriesnumber" content="01"> <meta name="authorlast" content="Burroughs"> <meta name="authorfirst" content="Edgar"> <meta name="authormiddle" content="Rice"> <meta name="authorfull" content="Burroughs, Edgar Rice"> <meta name="rebgenre" content="Science Fiction::General"> <meta name="conversiondate" content="May/20/2004"> <meta name="source" content="University of Virginia Electronic Text Center"> <meta name="scanner" content="Judy Boss"> <meta name="proofer" content="Kelly Tetterton, Peter-John Byrnes, Found Text"> <meta name="revisiondate" content="May/18/2009"> <meta name="shortpath" content="REB1100\eBookProjects\Found Text\Burroughs_Edgar_Rice\Barsoom\01_APrincessOfMars\APrincessOfMars.html"> <!-- ENDNOTE COUNT --> <meta name="endnotecount" content="0"> <!-- GOTO MENU --> <meta name="rocket-menu" content="Table of Contents=#toc"> <meta name="rocket-menu" content="About this Version=#verso"> </head> I looked at gwynevans source code, which is sweet-and-clean: Code: <?xml version="1.0" encoding="utf-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <HTML> <HEAD> <META NAME="Author" CONTENT="Konrath, J.A"/> <style type="text/css"> <!-- p { text-indent: 1em; margin: 0; } h1 { page-break-before: always; font-style: italic; } div.drink { page-break-before: always; } div.drink p { text-indent: 8em; } --> </style> </HEAD> Now, of course, my stuff has a bunch of kludges in there, specific to the REB1100. But it also has some good stuff, I think. 
And I have a few comments about gwynevans (not critical of you, gwynevans, everyone has their own preferences) to use as a jumping off point. Let's start with a couple of questions: 1) Is there a reason to prefer XHTML 1.0 over XHTML 1.1? 2) How to choose the encoding? ie: what's best, most universal, least hassle? I'm an english speaker (my thai is terrible! my japanese has faded, my french really sucks, but I can read a little of all of them -- I could be talked into some single way of seeing everything, as long as it won't hamper sharing, or over-complicate.) 3) gwynevans -- did you whip this out as an example, or was it something you have? I ask because it has no title, for instance. 4) I'm thinking to move all CSS to a separate file, any reason I shouldn't? (BTW, the page-break-before: always; part -- is that specific to ebooks, or part of XHTML? 'Cause I've been wondering about how to hard-code that. And now to rip apart my gunk: A) It has no DTD or -- what is it called when you specify html vs xml, etc? B) For readability of the source (something that is very important to me) I keep a lot of sectioning with vertical space. I see (and have seen elsewhere) horizontal tabbing as a visual aid. Any good reason to prefer one over the other? Or to not combine them? I understand why the tabbing is there, but glancing at a page it is hard to find related sections -- they just don't stand out. And when there are a lot of sections to a document, I find that I have to right-scroll a lot, or that word wrap wrecks the layout. C) Lower case tags: correct usage for XHTML, right? D) Version information in comments -- I think this is a good practice, but for sharing would there be a better method? For instance, I don't name where to find the history/source, or what the numbers mean. I do have a set of guidelines for the numbers, should I include the guidelines? Or something else? 
It's going to be repeated later, but isn't it nice to open a file, and see the version, boom!, right there? Should I keep a list of all updates, instead of just first and latest?
E) REB1100 meta-info: well the <title> has to stay! And the next four <meta> tags are staying too, I think. (The ISBN tag is one I hijacked to display the collection number -- it would simply be returned to its original function.) Just gonna get rid of the comment, and merge the <meta> tags into a more general Meta section. Reasonable, right?
F) NoteTab ClipBook meta-info: For the goals of this thread, I am certain these <meta> tags are mixed up, and that as a pure source file, some should not be there. Just a comment on the meta/genre tag -- I found a collection of what looked like standard, official book-seller classifications online, and I wrote a macro to give me a drop-down list of genres. So it's got some universal sensibility, not just my personal conceits.
i) <meta name="theme"... : I have sets of icons and images that are used as links (i.e., next chapter, previous chapter, toc, an end-of-book image that "closes" the book by linking to the cover, etc.) This just names it for the macro-library, and could be used to repair a broken folder, although I don't do that now. It should go, it's not necessary. But I still want to include such images!
ii) <meta name="number"... : this is a project number for the "publisher", in this example named Found Text. The "publisher" is just a conceit, but it makes it convenient to group books, by genre or author, whatever. Still, not necessary? Or could it be adapted into a DocumentID, UID or something like that?
iii) <meta name="name"... : Both the filename(.html) and the project name in my database/filetree. I don't know. I can sort of see this either way. Useful? Or redundant? Need a better name itself? I always camel-case the title of the book and remove spaces.
iv) <meta version, title, subtitle, series, seriesnumber... : I think these stay. Can't see any reason not to have them. Maybe they need better names?
v) <meta authorlast, authormiddle, authorfirst, authorfull ... : Why do I not just use "author"? Well, sometimes you need to manipulate the name for display, other times you need to sort by last name. I think FBreader, for one, lets you choose the sorting tag. Also, the macros I write let me collect names as I add them, so why have to re-type "Edgar" when I add Poe to my collection? I could see adding an author <meta>, or re-jiggering authorfull, and changing the name of the current authorfull to something else (author-by-last, authoralphabetical?) My current rule is when you have an initial, you don't use a period. I let the macros sort it out. But that may not be best. Does this all make sense? Regardless, I don't think a simple <meta name="author" content="Edgar Rice Burroughs"> is enough.
vi) <meta rebgenre ... : It's just here because I wanted to keep the REB1100 functional stuff separated from the NoteTab Clip stuff. It made it a lot easier to parse the file when updating. Redundant, gone.
vii) <meta conversiondate, source, scanner, proofer, revisiondate ... : All necessary, I think. Maybe need better names? For example, should conversiondate be initialconversiondate? or xhtmlconversiondate? Or something else, others?
viii) <meta shortpath ... : this is specific to the filetree, and makes setting things up in the macros easier. Unnecessary. Gone.
ix) The entire endnote part should go, I think. I keep the number to allow for adding new endnotes as the text is processed. It's really just a backup (in a way, all these meta-info are, given that I keep it all in a database, too.) Gone.
x) GOTO-MENU section: Gone. This is a section for the REB1100 (and specifically for rbmake, I believe); it allows for up to seven (that's right, folks, a whole seven!) pop-up TOC links.
So what do you think? I know this is a super-long post: don't feel that you have to respond to everything, just whatever you think is good or bad. I'm looking to develop a best-practice here, and I don't see a lot of discussion about embedding meta-info. Thanks for reading, m a r
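To make the author question in (v) concrete, here is a rough Python sketch of what a macro could do with the three name parts. It is only an illustration, not the NoteTab macros themselves: the function names are made up, "author" is assumed to hold the display form, and "authoralphabetical" (one of the candidate names floated above) is assumed to hold the sort form. The no-period rule for initials is applied when the parts are cleaned.

```python
from html import escape


def clean_part(part: str) -> str:
    """Trim whitespace and drop the period from a bare initial ("R." -> "R")."""
    part = part.strip()
    if len(part) == 2 and part.endswith("."):
        return part[0]
    return part


def author_meta(first: str, middle: str, last: str) -> dict:
    """Build display and sort forms from the separate name parts."""
    first, middle, last = (clean_part(p) for p in (first, middle, last))
    given = " ".join(p for p in (first, middle) if p)
    return {
        "authorfirst": first,
        "authormiddle": middle,
        "authorlast": last,
        # Assumed display form, e.g. for a title page.
        "author": " ".join(p for p in (given, last) if p),
        # Assumed sort form; the field name is only a candidate from the post.
        "authoralphabetical": ", ".join(p for p in (last, given) if p),
    }


def meta_tags(fields: dict) -> str:
    """Render non-empty fields as XHTML <meta /> elements, one per line."""
    return "\n".join(
        f'<meta name="{name}" content="{escape(value, quote=True)}" />'
        for name, value in fields.items()
        if value
    )


if __name__ == "__main__":
    print(meta_tags(author_meta("Edgar", "Rice", "Burroughs")))
```

Keeping the parts separate and deriving the combined forms means a reader or converter can pick whichever tag it sorts on, and the parts never have to be re-typed.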

05-19-2009, 07:17 AM	#14
rogue_ronin Banned Posts: 475 Karma: 796 Join Date: Sep 2008 Location: Honolulu Device: Nokia 770 (fbreader)	Hmmm, another thought. Does it make sense to include the following two things?
#1: versioning info. Often when I get a file, the version number is basically meaningless. Here's the versioning ranks I use:
Code:
    0.10 Initial Conversion
    0.20 Cover and Frontispiece
    0.30 Sections, Chapters and TOC
    0.40 Endnotes and/or Blockquotes
    0.50 Initial Spellcheck
    0.60 Mdashes and Hyphens and Ellipses
    0.70 Italics, Bold, and Pre-Formatted Text
    0.80 Reading Proof
    0.90 Checked against Canonical Source
    1.00 Touched Up and Packaged For Release
#2: structure hints. Should someone have to examine the entire source for a clue to how it's been assembled? Or could you add something like this as a comment?:
Code:
    Title: h1 class="title"
    Subtitle: h3 class="subtitle"
    Chapter: h3 class="chapter"
    Paragraph: p class="normal"
    Epigram: p class="epigram"
    etc. etc.
m a r
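Both ideas in this post are easy to mechanise. The Python sketch below is one possible reading, not part of the original macros: it looks up which rank a version number has reached, and it writes the structure hints out as an HTML comment. The rank table and hint lines are copied from the post; the function names and the treatment of in-between numbers (0.65 counts as having reached 0.60) are assumptions.

```python
from decimal import Decimal

# Versioning ranks copied from the post, lowest first.
VERSION_RANKS = [
    (Decimal("0.10"), "Initial Conversion"),
    (Decimal("0.20"), "Cover and Frontispiece"),
    (Decimal("0.30"), "Sections, Chapters and TOC"),
    (Decimal("0.40"), "Endnotes and/or Blockquotes"),
    (Decimal("0.50"), "Initial Spellcheck"),
    (Decimal("0.60"), "Mdashes and Hyphens and Ellipses"),
    (Decimal("0.70"), "Italics, Bold, and Pre-Formatted Text"),
    (Decimal("0.80"), "Reading Proof"),
    (Decimal("0.90"), "Checked against Canonical Source"),
    (Decimal("1.00"), "Touched Up and Packaged For Release"),
]

# Structure hints copied from the post.
STRUCTURE_HINTS = {
    "Title": 'h1 class="title"',
    "Subtitle": 'h3 class="subtitle"',
    "Chapter": 'h3 class="chapter"',
    "Paragraph": 'p class="normal"',
    "Epigram": 'p class="epigram"',
}


def stage_for(version: str) -> str:
    """Return the description of the highest rank that `version` has reached."""
    value = Decimal(version)  # Decimal avoids float rounding on 0.10, 0.20, ...
    reached = [label for rank, label in VERSION_RANKS if rank <= value]
    if not reached:
        raise ValueError(f"version {version} is below the first rank")
    return reached[-1]


def hints_comment(hints: dict) -> str:
    """Render the structure hints as a single HTML comment block."""
    lines = [f"{part}: {markup}" for part, markup in hints.items()]
    return "<!--\n" + "\n".join(lines) + "\n-->"


if __name__ == "__main__":
    print(stage_for("0.50"))
    print(hints_comment(STRUCTURE_HINTS))
```

Placed near the top of the file, a comment like this would let anyone (or any converter) see the markup conventions without scanning the whole book.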

05-18-2009, 01:17 AM	#1
rogue_ronin Banned Posts: 475 Karma: 796 Join Date: Sep 2008 Location: Honolulu Device: Nokia 770 (fbreader)	(x)html ebook specification I'd like to get some brains on this subject. I'm (slowly) assembling and editing a giant library of HTML ebooks. I've been using an idiosyncratic mix of HTML 3.2 and XHTML that I've picked up over the years. I use an obsolete reader -- but a very functional one! (The REB1100.) I'm looking to upgrade, though, soon. And I'd like to do this only once, the editing/organizing. I use an awesome text editor, NoteTab Pro, which lets you assemble libraries of macros -- anything you can do to text, you can do with the macros, it's got an enormous language. So I've built a library with a hundred or two macros, that do everything from regex to boilerplate to file manipulation and database entries. So I need some advice on a better 'spec' for the format -- I should be able to rewrite the macros to the new one, and write a few that auto-adapt the old stuff I've done already. Creating macros to write CSS for any reader should be dead-simple, or writing converters to straight HTML, also -- once the format is set and consistent. There are a lot of ideas out there, and I have my own, which I'll start with: This spec should use XHTML, and CSS. But the document markup should be as simple as possible. Here are the elements that I think are important in an ebook, primarily fiction books -- a mix of meta-data and structure -- the meta is often explicitly expressed in the book: please add on if I've missed something. 
Book Meta: Author(s), Illustrator(s), Publisher, ISBN, Publishing Date, Publishing City, Copyright Owner, Copyright Date, Series Name, Title, Sub-Title File Meta: Version Number, Version Date, Original Conversion Date, Scanner, Proofreader(s), Original Source Structure: Cover, Front Matter, Title Page, Verso Page (book meta info page), Inscription, Acknowledgments, Preface, Foreword, Table of Illustrations/Maps, Table of Contents, Prologue, Parts, Chapters, Epigrams, Sections, Sub-Sections, Paragraphs, Epilogue, Afterword, Endnotes, Glossary, Index, End Matter If I've missed anything, please add or suggest. In my next post, I'm going to add my current methods, and ask for advice on improvements. Thanks for reading! m a r

05-18-2009, 06:50 AM	#3
rogue_ronin Banned Posts: 475 Karma: 796 Join Date: Sep 2008 Location: Honolulu Device: Nokia 770 (fbreader)	Read your thread/link. It's funny, I already do something like that, too, with all the meta-data. More than what you've got. Anal-retentively more. 'Course, I'm not using Calibre (yet.) And the utility I use is my text editor, not a pre-processor to call Calibre... I think what I'm shooting for here is a utility/software/reader neutral way to present an ebook in simplest HTML -- consistency being the key. Then anyone can take it and convert it pretty easily. Heck, you could use it for your h2lrf, almost without effort -- just a minor change to the meta you look for. As for the CSS macros -- I think what I'm talking about is that each ebook reader (hardware) should probably have its own CSS, right? I mean, what looks good on a 5" JetBook, probably doesn't look as good on an 11" DS1000. So you'd want to answer a few questions in a dialog (well, I would) about what sort of reader you're trying to make an EPUB for. Then, boom, CSS created. Maybe it calls some common defaults in a common.css file or such. Now that I re-read my thoughts above, I think I'm talking about two separate things. Common, single CSS will be all that's necessary. Later in the process, when I want to make a package for a hardware reader, then an additional CSS macro might be necessary. You're right, as usual pepak. Anyway, gonna dig out my old Barsoom folder, and grab A Princess of Mars to use as an example for the next stuff. Back later, m a r
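The "answer a few questions, then boom, CSS created" idea might look something like the Python sketch below. Everything specific in it is a placeholder: the 6-inch cutoff, the margin and font-size values, and the function name are invented for illustration, and only the common.css file name comes from the post. The one firm point is that the @import rule goes first, so the device rules override the shared defaults.

```python
def device_css(screen_inches: float, common_file: str = "common.css") -> str:
    """Return a small per-device stylesheet layered on top of the common one."""
    # Placeholder rule of thumb: tighter margins on small screens.
    if screen_inches <= 6:
        margin, font_size = "0.5em", "100%"
    else:
        margin, font_size = "2em", "110%"
    return (
        f'@import url("{common_file}");\n'  # shared defaults load first
        f"body {{ margin: {margin}; font-size: {font_size}; }}\n"
    )


if __name__ == "__main__":
    print(device_css(5))    # e.g. a 5" reader
    print(device_css(11))   # e.g. an 11" reader
```

This keeps the two things separate, as concluded above: one common stylesheet for the source, plus a tiny generated override only when packaging for a particular reader.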

05-18-2009, 09:38 AM	#4
JSWolf Resident Curmudgeon Posts: 81,222 Karma: 150263711 Join Date: Nov 2006 Location: Roslindale, Massachusetts Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3	Since you are coding XHTML, why not have a look at ePub? It would do quite well for you (IMHO).

05-18-2009, 11:01 AM	#5
rogue_ronin Banned Posts: 475 Karma: 796 Join Date: Sep 2008 Location: Honolulu Device: Nokia 770 (fbreader)	Yeah, I can't seem to find a clear set of guidelines/tutorial for ePub -- they all (so far) seem to assume a level of familiarity with XML that I don't have. I'm never happy if I just mimic without understanding -- and that takes a while. I think I might use this thread to teach myself how to do it (ePub) properly, and modularly, with great metadata, and good in-book navigation. Just do it a piece at a time, and hope folks chime in when I'm screwing up. It's the XML spine, etc. where I start to get truly lost. 'Course, if I figure it out once, I can just macro the heck out of it. m a r

05-18-2009, 12:39 PM	#6
DaleDe Grand Sorcerer Posts: 11,470 Karma: 13095790 Join Date: Aug 2007 Location: Grass Valley, CA Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7	The wiki can help in your research. It has most of the topics you have expressed interest in and can provide a starting point. If you find any deficiencies you can correct them! or ask for help. Dale

05-19-2009, 05:24 AM	#10
gwynevans Wizzard Posts: 1,402 Karma: 2000000 Join Date: Nov 2007 Location: UK Device: iPad 2, iPhone 6s, Kindle Voyage & Kindle PaperWhite	> 1) Is there a reason to prefer XHTML 1.0 over XHTML 1.1?
None that I know of - I guess I just had the 1.0 header to hand & for this particular use, I don't think there was any difference between 1.0 & 1.1.
> 3) gwynevans -- did you whip this out as an example, or was it something you have? I ask because it has no title, for instance.
At the time I'd not considered custom metadata & pre-processing, so just had a 'build_ePub.bat' in the folder which set some of the metadata via the Calibre command-line, e.g.
'html2epub --margin-right=10 --level1-toc="//h2" --chapter="//h2" --cover="Konrath, J.A - Jack Daniels 01 - Whiskey Sour.png" -t "Whiskey Sour" -a "Konrath, J.A" "Konrath, J.A - Jack Daniels 01 - Whiskey Sour.html"'
> 4) I'm thinking to move all CSS to a separate file, any reason I shouldn't?
If you've come up with a standard set of styles that you want to reuse, then it's worth considering, although the main reason to do so in the web-site case is to allow global changes by editing the one file, which may be less of an issue in this particular usage.
> (BTW, the page-break-before: always; part -- is that specific to ebooks, or part of XHTML? 'Cause I've been wondering about how to hard-code that.
Standard, but it's less well known as it's focussing on the print side of things - http://www.w3schools.com/Css/pr_print_pagebb.asp
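If the metadata lived in the HTML itself, the batch file would not need the title and author hard-coded. The Python sketch below shows one way that pre-processing step could work, under several assumptions: the meta names "title" and "authorfull" are borrowed from rogue_ronin's list earlier in the thread, the function and class names are made up, and only flags that appear in the command quoted above (--cover, -t, -a) are used. It just builds the argument list; it does not run anything, and it has not been checked against any particular Calibre release.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an (X)HTML file."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are routed here by HTMLParser too.
        if tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content is not None:
                self.meta[name] = content


def html2epub_args(html_text: str, html_path: str, cover_path: str = "") -> list:
    """Build an html2epub argument list from metadata embedded in the HTML."""
    collector = MetaCollector()
    collector.feed(html_text)
    args = ["html2epub"]
    if cover_path:
        args.append(f"--cover={cover_path}")
    if "title" in collector.meta:          # assumed meta name
        args += ["-t", collector.meta["title"]]
    if "authorfull" in collector.meta:     # assumed meta name
        args += ["-a", collector.meta["authorfull"]]
    args.append(html_path)
    return args


if __name__ == "__main__":
    sample = (
        '<html><head><meta name="title" content="Whiskey Sour" />'
        '<meta name="authorfull" content="Konrath, J.A" /></head>'
        "<body></body></html>"
    )
    print(html2epub_args(sample, "book.html", cover_path="cover.png"))
```

Returning a list rather than one long string sidesteps the quoting problems that titles and author names with spaces or commas cause on the command line.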
