Road to epub

radius · 11-26-2009, 12:09 PM

I posted my first e-book yesterday and user Tony Foley asked me for an outline of how I put it together so I thought I would write this explanation...

The short form is: take source, convert to HTML, run that through Calibre or Sigil to create an epub.

About me:

In general I'm conservative on formatting, preferring stability and portability over using all the features of a particular device. I'm also more comfortable using a plain text editor (like, say, TextPad or UltraEdit) than MS Word. If you like Word, then you will probably be better off reading the excellent Book Designer sticky.

Sources:

I have a wide variety of initial source material.

I started reading on PalmOS and Psion devices in the mid-to-late nineties and have always kept long-term data storage in mind. This means that the majority of books I keep are in plain text, HTML, or RTF, in that order of preference, so that they can be opened and manipulated by the widest variety of tools possible. I convert MS Word documents, or other word processing documents importable by Word, into plain text or RTF (although with the recent, cleaner HTML export from Word I may start trying that instead).

I also obtained a number of books in prc or pdb containers. These containers could hold any one of several e-book formats (plain Palm DOC, Mobipocket, eReader etc.), and I converted them to plain text, usually with a command-line program. Since formats which are easier to deal with have become popular in the last few years, I no longer download Palm-based e-books, so I haven't had to do this conversion in a long time and don't remember the name of the conversion tool.

Working copy:

Originally, my preference was for completely plain ASCII text, with no linebreaks inside paragraphs, and double linebreaks between paragraphs; single linebreaks between lines of poems/songs/quotes. This can be handled by pretty much any software on any platform and will almost certainly still be easily readable fifty years from now.

So if the source was HTML, then I basically deleted anything between angle brackets to create a text-only version. If the source was Word, I did a copy and paste from Word into a text editor.
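As a rough illustration of the "delete anything between angle brackets" step, here is a minimal Python sketch. The name strip_tags is made up for this example, and decoding entities afterwards is an assumption about what a text-only version should contain, not something the original workflow is known to have done.

```python
import html
import re

# Matches one tag: an opening angle bracket up to the next closing one.
TAG_RE = re.compile(r"<[^>]*>")

def strip_tags(markup: str) -> str:
    """Delete everything between angle brackets, then decode entities."""
    text = TAG_RE.sub("", markup)
    # Turn entities such as &amp; back into plain characters.
    return html.unescape(text)

sample = "<p>She said it was <i>very</i> important &amp; urgent.</p>"
print(strip_tags(sample))
```

A blunt pattern like this keeps the contents of script and style elements as text and can be confused by a ">" inside a comment or attribute value, so it only suits simple, well-behaved pages.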

Lately, I have decided to use HTML/CSS as the working copy. That combination is still mostly human-readable text, while HTML lets me tag meaning onto particular bits of the text (i.e., this is a paragraph, this is a chapter heading, this is a quote, and so on) and CSS gives some control over the appearance of the book (e.g., paragraphs should be indented and have no margin above or below, chapter headings should be displayed in a sans-serif font, and so on).
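To make the two CSS examples concrete, here is a small Python sketch that wraps a body fragment in a complete HTML file with an embedded stylesheet. The function name wrap_document, the 1.5em indent, and the choice to embed the CSS rather than link a separate file are all assumptions made for illustration; the chapter_heading class is the one used elsewhere in this post.

```python
import html

# Paragraphs indented with no margin above or below;
# chapter headings in a sans-serif font.
STYLESHEET = """\
p {
  text-indent: 1.5em;   /* the indent size is an arbitrary choice */
  margin-top: 0;
  margin-bottom: 0;
}
h2.chapter_heading {
  font-family: sans-serif;
}
"""

def wrap_document(title: str, body: str, css: str = STYLESHEET) -> str:
    """Return a full HTML document containing the stylesheet and body."""
    return (
        "<!DOCTYPE html>\n"
        '<html>\n<head>\n<meta charset="utf-8"/>\n'
        f"<title>{html.escape(title)}</title>\n"
        f'<style type="text/css">\n{css}</style>\n'
        "</head>\n<body>\n"
        f"{body}\n"
        "</body>\n</html>\n"
    )

print(wrap_document("Sample", '<h2 class="chapter_heading">One</h2>\n<p>Text.</p>'))
```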

There is also a wide variety of tools for manipulating and editing HTML.

Also, I initially split books into sections (at chapter breaks, in roughly 100-200kb pieces) because of the low memory and processing power of the devices I was reading on, but lately I have been keeping the entire book in a single file for easier editing and file management.

So if my original source was plain text, converting to HTML is straightforward. Tag all the special bits like the book title and chapter headings, then wrap all the paragraphs in <p> tags. I don't care very much about things like curly quotes, em-dashes and so on. I do have a preference for Unicode over HTML entities.
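The paragraph-wrapping step can be sketched in a few lines of Python, assuming the plain-text convention described earlier (a blank line between paragraphs, single linebreaks only inside poems, songs and quotes). The name text_to_html and the use of <br/> for the single linebreaks are choices made for this example only.

```python
import html
import re

def text_to_html(text: str) -> str:
    """Wrap blank-line-separated paragraphs in <p> tags."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    parts = []
    for para in paragraphs:
        # Escape &, < and > so the text cannot be mistaken for markup.
        escaped = html.escape(para.strip(), quote=False)
        # Single linebreaks (lines of a poem or song) become <br/>.
        escaped = escaped.replace("\n", "<br/>\n")
        parts.append(f"<p>{escaped}</p>")
    return "\n".join(parts)

print(text_to_html("First paragraph.\n\nRoses are red\nViolets are blue"))
```

Titles and chapter headings still need tagging by hand or with a separate rule, since nothing in the plain text marks them reliably.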

If the original source was HTML, I strip out anything not germane to the text (like JavaScript or advertising, if it came from the web), and I try to change any HTML that controls presentation so that it indicates only meaning. Some simple examples:

- If text is italicized, I might change that to <em>, indicating emphasis
- But for other italicized text, I might change that instead to <span class="book_title">, indicating that the reason it appears in italic text is because it is the title of a publication
- Bold text, I might change to <strong>
- Or I might change to <h2 class="chapter_heading">, indicating a chapter heading, if that's why the text was bold

My text editor supports regular expressions in search-and-replace, so this isn't quite as much work as it sounds.
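The same kind of search-and-replace can be done outside an editor. The sketch below, in Python, applies the default rules only: every <i> becomes <em> and every <b> becomes <strong>. The patterns and the name semanticize are assumptions for illustration, not the actual expressions used for the books discussed here.

```python
import re

# (pattern, replacement) pairs. \b stops <i from matching <img
# and <b from matching <br or <body. Attributes are dropped.
RULES = [
    (re.compile(r"<(/?)i\b[^>]*>", re.IGNORECASE), r"<\1em>"),
    (re.compile(r"<(/?)b\b[^>]*>", re.IGNORECASE), r"<\1strong>"),
]

def semanticize(markup: str) -> str:
    """Replace presentational <i>/<b> tags with <em>/<strong>."""
    for pattern, replacement in RULES:
        markup = pattern.sub(replacement, markup)
    return markup

print(semanticize('He <i>never</i> said <B class="x">that</B>.<br>'))
```

Deciding whether a given italic run is emphasis or a book title, or whether bold text is really a chapter heading, still takes a human eye; a blanket rule like this only handles the default case.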

Occasionally, I skim through the book using a web browser to see if I've missed closing a tag (e.g., if I marked some text as emphasized using <em> but forgot to mark where the emphasis ends using </em>, then everything from that point onwards will appear italicized in the web browser).
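A script can also catch the forgotten closing tag before it shows up in the browser. This is a minimal sketch built on Python's standard html.parser module; the class name TagChecker, the function find_unclosed, and the short list of void elements are assumptions made for the example.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (not an exhaustive list).
VOID = {"br", "hr", "img", "meta", "link", "input"}

class TagChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []      # (tag, line) for each tag still open
        self.problems = []   # human-readable messages

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if tag not in [name for name, _ in self.stack]:
            line = self.getpos()[0]
            self.problems.append(f"line {line}: </{tag}> has no opening tag")
            return
        # Anything opened after the matching tag was never closed.
        while self.stack:
            name, opened_at = self.stack.pop()
            if name == tag:
                break
            self.problems.append(f"line {opened_at}: <{name}> never closed")

def find_unclosed(markup: str) -> list:
    checker = TagChecker()
    checker.feed(markup)
    checker.close()
    leftovers = [f"line {line}: <{tag}> never closed" for tag, line in checker.stack]
    return checker.problems + leftovers

print(find_unclosed("<p>This is <em>important.</p>\n<p>Next paragraph.</p>"))
```

The check is strict: HTML that relies on optional closing tags (a <p> or <li> left open on purpose) will be reported as well.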

I'm still in the process of working on some CSS stylesheets to control formatting, and of coming up with a good set of classes. I've taken some ideas from forum users like rogue_ronin, llahsram, and hadrien, but still haven't found the right balance for me. On the one hand, I want to do the minimum possible marking up so that it is convenient and easy for me. On the other hand, more markup is probably better for long-term storage, as I will have more information about/inside the book and can more easily control its appearance.

Another reason I alternate between stylesheets is that I don't like every book to look exactly the same (one of the very few things I don't like about Feedbooks). For example, I might set the appearance of chapter headings to be flush left, sans-serif, bold in one book but centred, monospace, italic in another. I also try to choose a font that I think will fit the story better. For example, something clean for a science-fiction story. Maybe something more delicate for historical fiction. Something robust for a political or espionage thriller. And so on.

For creating or editing the working copy, I might use an HTML editor instead of a text editor if I want a live preview or auto-completion of tags etc. On the Mac I like TacoEdit. On Windows, I haven't found anything lightweight and free that I really like, but Dreamweaver is great if you have that installed.

If I'm doing a lot of search-and-replace I stick with TextPad, Notepad++ or UltraEdit on Windows.

Target format:

What conversion program you use to create something suitable for reading depends on which device you have, but almost all of them will accept HTML as input.

In the specific case of creating ePubs, I am still deciding what my favourite solution is.

I am too lazy to hand-code everything like some forum members have (that's what computers are for!).

On the other hand, both Calibre and Sigil do some things I don't like.

Calibre is very convenient, especially if you also use it as your library program. However, I find that it can modify your original HTML source quite a bit, which means that readers of the resulting epub lose some of the information you put into it. If you are only creating an epub to put onto your own reader device then obviously this doesn't matter.
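For anyone who would rather script the conversion than use the Calibre window, Calibre also installs a command-line converter called ebook-convert, which takes an input file and an output file. The Python sketch below only builds and runs that command; the function names are hypothetical, the file names are placeholders, and --title and --authors are the only options used. It will change your HTML in the same way the graphical converter does.

```python
import shutil
import subprocess

def build_convert_command(source_html, target_epub, title=None, authors=None):
    """Build the argument list for Calibre's ebook-convert tool."""
    command = ["ebook-convert", source_html, target_epub]
    if title:
        command += ["--title", title]
    if authors:
        command += ["--authors", authors]
    return command

def convert(source_html, target_epub, **metadata):
    """Run the conversion; requires Calibre to be installed."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found on PATH; is Calibre installed?")
    subprocess.run(build_convert_command(source_html, target_epub, **metadata), check=True)

# Show the command that would be run for a placeholder book.
print(build_convert_command("book.html", "book.epub", title="My Book"))
```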

Sigil does a better job of preserving your HTML as you wrote it. However, it is still in early development stages and I have found it buggier than Calibre for now. It also seems to be going in a different direction than I would prefer. I think Valloric is steering it more towards becoming a WYSIWYG epub editor, whereas I am looking for something more like Mobipocket Creator.

Hope this is helpful.

HarryT · 11-26-2009, 12:43 PM

Quote:

Originally Posted by radius

Originally, my preference was for completely plain ASCII text, with no linebreaks inside paragraphs, and double linebreaks between paragraphs; single linebreaks between lines of poems/songs/quotes. This can be handled by pretty much any software on any platform and will almost certainly still be easily readable fifty years from now.

So if the source was HTML, then I basically deleted anything between angle brackets to create a text-only version. If the source was Word, I did a copy and paste from Word into a text editor.

It's people like you who are the reason that I have to spend hundreds of hours proof-reading a book to put back all the formatting (italics, accented letters, etc) that you've removed from it.

Seriously, things like italics are integral to a text, and losing them removes a great deal from a book.

radius · 11-26-2009, 03:49 PM

Quote:

Originally Posted by HarryT

It's people like you who are the reason that I have to spend hundreds of hours proof-reading a book to put back all the formatting (italics, accented letters, etc) that you've removed from it.

Seriously, things like italics are integral to a text, and losing them removes a great deal from a book.

Hi Harry,

In my first paragraph, I said that I had just released my *first* book to the public. Prior to that, I had been assembling books for my personal use only, or adapting downloaded books to my personal taste. :P I *did* say that I am conservative with regard to formatting...

In any case, at the time, I strongly agreed with Project Gutenberg's original mandate to render everything in ASCII if possible, representing accented characters using regular punctuation (for example e' for e with an acute accent, or c, for c-cedilla in French) and so on, because in those days the word processor wars were still ongoing. I have seen too many data formats disappear and become difficult to read.
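To show what that punctuation convention looks like in practice, here is a small Python sketch that rewrites accented letters the same way. It covers only the two marks mentioned above (acute accent and cedilla); the function name accents_to_ascii and the table are made up for this example, and any other accent is left as it was.

```python
import unicodedata

# Combining mark -> ASCII punctuation placed after the base letter.
MARKS = {
    "\u0301": "'",  # combining acute accent, giving e'
    "\u0327": ",",  # combining cedilla, giving c,
}

def accents_to_ascii(text: str) -> str:
    """Replace acute accents and cedillas with trailing punctuation."""
    # NFD splits each accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    replaced = "".join(MARKS.get(ch, ch) for ch in decomposed)
    # NFC puts back together any accents that were not in the table.
    return unicodedata.normalize("NFC", replaced)

print(accents_to_ascii("café français"))
```

The conversion loses information as soon as the text also contains real apostrophes or commas after a letter, which is one reason going back the other way takes so much proof-reading.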

Also, non-ASCII characters were difficult for most people to display because Unicode fonts were not widely available (and not that many are even today) and language encoding was still a rather esoteric subject.

Now that HTML is a widespread markup format, and Unicode is an established standard in wide use, I find it safer to trust my data to them.

I always kind of liked ASCII style /italics/, _underlining_ and *bold* too ^_^
