MobileRead Forums - View Single Post

cerement · 04-26-2008, 04:34 AM

-- First, grab the text file for your book of choice from Project Gutenberg
If an HTML version is available (and you can live with the formatting), grab that and save yourself some hassle

-- Open the file up in the text editor of your choice
After trying out several, I've become comfortable with EmEditor Free, but it really is just personal choice. Just make sure that it supports regular expressions; this will save you no end of hassle (I'm also going to use a lot of regex notation further down).

-- Plan for a nice long session of searching and replacing
Before you begin, get to know your text (and more importantly, the idiosyncrasies of the transcriber who prepped it for PG - did they use underlines or asterisks or something else for italics, did they create a funky system to indicate accented characters?) Is the book verse or prose (or worse, a mix)? If you're lucky, Google Books has the scanned copy so you can get a visual idea of how the book looked.

Decide what you want to do with the Gutenberg boilerplate (if you can figure out how to work this thing into your layout, then you're a better man than I am)
Remove double spaces, spaces at the beginning of lines, spaces at the end of lines. (Double spaces were a leftover artifact of typewriters.)
Remove excess blank lines, convert everything into paragraphs (markup will come later). Simplest way is 3 steps: search and replace double returns \n\n with something uncommon like @@@@@, search and replace single returns \n with a space, then search and replace your marker @@@@@ with double returns \n\n
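For anyone who would rather script the whitespace cleanup and the three-step paragraph join than do it by hand, here is a rough Python sketch. The function name reflow is made up for illustration, and it assumes a prose text: run it on verse and you will lose the line breaks you wanted to keep.

```python
import re

def reflow(text: str, marker: str = "@@@@@") -> str:
    """Turn hard-wrapped Gutenberg text into one line per paragraph."""
    if marker in text:
        raise ValueError("marker already occurs in the text; pick another")
    text = text.replace("\r\n", "\n")
    # Remove spaces at the beginning and end of every line.
    text = re.sub(r"(?m)^[ \t]+|[ \t]+$", "", text)
    # Remove excess blank lines: three or more returns become two.
    text = re.sub(r"\n{3,}", "\n\n", text)
    text = text.replace("\n\n", marker)   # 1. protect paragraph breaks
    text = text.replace("\n", " ")        # 2. join the wrapped lines
    text = text.replace(marker, "\n\n")   # 3. restore paragraph breaks
    # Remove double spaces (including any created by the join).
    text = re.sub(r" {2,}", " ", text)
    return text.strip()

sample = "  It was the best  of times,  \nit was the worst of times.\n\n\n\nSecond paragraph\nhere.\n"
print(reflow(sample))
```

The marker check is there because the trick only works if @@@@@ (or whatever you pick) never appears in the book itself.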
Escape out three special characters: & to &amp;, < to &lt;, > to &gt; (replace the ampersand first!)
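A minimal Python sketch of the escaping step, showing why the ampersand has to go first. The function name escape_markup is just for illustration; the standard library's html.escape does the same job when called with quote=False.

```python
import html

def escape_markup(text: str) -> str:
    # Ampersand first: otherwise the "&" inside the freshly written
    # &lt; and &gt; would itself be escaped a second time.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text

line = "Q&A: is 1 < 2 > 0?"
print(escape_markup(line))
# The standard library gives the same result when quotes are left alone.
print(escape_markup(line) == html.escape(line, quote=False))
```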
Encode emdashes as &mdash;, endashes as &ndash;, and ellipses as &hellip;. Watch out for a special construct in older texts that refers to someone anonymously by the first letter of their name: H---- (initial plus two emdashes).
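A rough Python sketch of this step. It rests on a few assumptions that you should check against your own text: that the transcriber typed emdashes as a double hyphen, that a hyphen between two digits is an endash (this will also catch phone numbers and the like), and that ellipses are three plain dots. The function name encode_dashes is made up. Run it after the ampersand escaping, not before, or the new entities will get escaped too.

```python
import re

def encode_dashes(text: str) -> str:
    # Literal Unicode characters, in case the file already has them.
    text = text.replace("\u2014", "&mdash;")
    text = text.replace("\u2013", "&ndash;")
    text = text.replace("\u2026", "&hellip;")
    # Assumption: "--" stands for an emdash, so "H----" becomes two of them.
    text = text.replace("--", "&mdash;")
    # Assumption: a single hyphen between digits is a range, i.e. an endash.
    text = re.sub(r"(?<=\d)-(?=\d)", "&ndash;", text)
    # Assumption: an ellipsis is typed as three unspaced dots.
    text = text.replace("...", "&hellip;")
    return text

print(encode_dashes("Mr. H---- waited--1914-18..."))
```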
Encode any other special characters and accented characters (search for [\xA0-\xFF] initially to find them). While you're at it, make sure that there are no characters in the ranges [\x00-\x1F] and [\x7F-\x9F]: apart from tab, line feed, and carriage return (\x09, \x0A, \x0D), these are control characters in the Latin-1 character set, not printable text.
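Here is one way the two searches might look in Python, assuming the file was read in as Latin-1. The names find_controls and encode_high are made up for illustration. The control-character pattern deliberately leaves out tab, line feed, and carriage return, since those are expected in a text file.

```python
import re
from html.entities import codepoint2name

# Control characters that should not appear (tab, LF, CR are allowed).
CONTROL = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]")
# The upper half of Latin-1: accented letters and special symbols.
HIGH = re.compile(r"[\xA0-\xFF]")

def find_controls(text: str) -> list:
    """Return (offset, hex code) for every stray control character."""
    return [(m.start(), hex(ord(m.group()))) for m in CONTROL.finditer(text)]

def encode_high(text: str) -> str:
    """Replace each high Latin-1 character with an HTML entity."""
    def entity(m):
        cp = ord(m.group())
        name = codepoint2name.get(cp)
        # Fall back to a numeric entity if no named one is known.
        return f"&{name};" if name else f"&#{cp};"
    return HIGH.sub(entity, text)

print(find_controls("a\x07b\n"))
print(encode_high("caf\xe9"))
```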
Now the hardest part: converting all the " and ' marks to curly quotes! There are a LOT of special cases that have to be caught before the primary conversion. A few of the special cases include measurements 5'10", abbreviated years '78, "'nested' quotes", and 'British' vs. "American" quotes (if it's British quotes, some of the steps below will have to be reversed)
1. Number followed by quote \d', \d" (figure out some way to mask it for later)
2. Opening nested quotes "' to &ldquo;&lsquo;
3. Whitespace single quote number \s'\d to &rsquo;
4. Whitespace single quote \s' to &lsquo;
5. Line beginning single quote ^' to &lsquo;
6. Leftover single quotes to &rsquo;
7. Whitespace double quote \s" to &ldquo;
8. Line beginning double quote ^" to &ldquo;
9. Leftover double quotes to &rdquo;
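The nine steps above can be strung together in Python roughly like this. It is a sketch for American-style quoting, not a finished tool. The masking in step 1 uses two private-use characters as placeholders (my choice, assumed not to occur in the book) and puts the straight marks back at the end. The name curl_quotes is made up.

```python
import re

# Placeholders for masked measurement marks; assumed absent from the text.
PRIME, DPRIME = "\uE000", "\uE001"

STEPS = [
    (r"(?<=\d)'", PRIME),               # 1. number followed by ' (mask)
    (r'(?<=\d)"', DPRIME),              # 1. number followed by " (mask)
    ("\"'", "&ldquo;&lsquo;"),          # 2. opening nested quotes
    (r"(\s)'(?=\d)", r"\1&rsquo;"),     # 3. whitespace, single quote, number
    (r"(\s)'", r"\1&lsquo;"),           # 4. whitespace, single quote
    (r"^'", "&lsquo;"),                 # 5. line beginning single quote
    (r"'", "&rsquo;"),                  # 6. leftover single quotes
    (r'(\s)"', r"\1&ldquo;"),           # 7. whitespace, double quote
    (r'^"', "&ldquo;"),                 # 8. line beginning double quote
    (r'"', "&rdquo;"),                  # 9. leftover double quotes
]

def curl_quotes(text: str) -> str:
    for pattern, repl in STEPS:
        # MULTILINE so that ^ matches at the start of every line.
        text = re.sub(pattern, repl, text, flags=re.MULTILINE)
    # Unmask the measurement marks as plain straight quotes.
    return text.replace(PRIME, "'").replace(DPRIME, '"')

print(curl_quotes("\"I saw it in '78,\" she said."))
print(curl_quotes("He was 5'10\" tall."))
```

The order of the list matters: each step assumes the earlier special cases are already out of the way. Step 1 will also mask a closing quote that happens to follow a digit, so those still need a manual check.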
Search for double returns and add in paragraph marks <p>
Add in the HTML header and footer
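A minimal Python sketch of these two steps, assuming the text is already down to one line per paragraph with a blank line between paragraphs. The header and footer here are a bare-bones guess at what you need, not a required template, and the name build_html is made up.

```python
# Bare-bones skeleton; the empty style section is a place for CSS rules.
HEADER = """<html>
<head>
<title>{title}</title>
<style type="text/css">
</style>
</head>
<body>
"""
FOOTER = "\n</body>\n</html>\n"

def build_html(text: str, title: str) -> str:
    # Split on double returns and wrap each paragraph in <p>...</p>.
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    body = "\n".join(f"<p>{p}</p>" for p in paras)
    return HEADER.format(title=title) + body + FOOTER

print(build_html("First paragraph.\n\nSecond paragraph.", "My Book"))
```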
Mark up your italics <em>, bold <strong>, superscript <sup>, and subscript <sub>
Go back and look for items with special line breaks, indents, and blockquotes. If you don't want a paragraph to indent, add class="noind" to your <p> tag and add the line p.noind {text-indent:0} to your style section. Space above a line can be adjusted by adding a height attribute, ex. <h4 height="2em">
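If your transcriber marked italics with a single delimiter character on each side, a Python sketch like this can do the bulk of the work. The underscore default is an assumption (check your text first), the name mark_italics is made up, and you will still want to eyeball the results for stray delimiters such as footnote asterisks.

```python
import re

def mark_italics(text: str, delim: str = "_") -> str:
    """Wrap delimiter-marked runs in <em>; delim must be one character."""
    d = re.escape(delim)
    # Shortest run between two delimiters, never crossing a line break.
    pattern = rf"{d}([^{d}\n]+?){d}"
    return re.sub(pattern, r"<em>\1</em>", text)

print(mark_italics("It was _very_ odd."))
print(mark_italics("It was *very* odd.", delim="*"))
```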
Mark up headings and chapter heads <h1> to <h6>
Link the table of contents to chapters, mark the table of contents with <a name="toc">
Mark your starting page with <a name="start">
Prep and link in any images (restrict image sizes to 600 pixels wide by 800 pixels tall)
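To work out what size to shrink an image to, the arithmetic is just a matter of taking the smaller of the two scale factors. A quick Python sketch (the name fit_within is made up; it never enlarges an image that already fits):

```python
def fit_within(width: int, height: int, max_w: int = 600, max_h: int = 800):
    """Scale (width, height) down to fit max_w x max_h, keeping proportions."""
    # The tighter of the two limits decides; 1.0 means no enlargement.
    scale = min(max_w / width, max_h / height, 1.0)
    return round(width * scale), round(height * scale)

print(fit_within(1200, 1000))
print(fit_within(900, 1600))
print(fit_within(300, 400))
```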
If chapters are decent length, they can be separated by pagebreaks using the <mbp:pagebreak /> tag
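A rough Python sketch for dropping in the pagebreaks, assuming every chapter starts with the same heading tag (h2 here, which is my assumption, so change it to match your markup). It writes the tag with the colon, mbp:pagebreak, and skips the first heading so the book does not open on a blank page. The name add_pagebreaks is made up.

```python
import re

def add_pagebreaks(html_text: str, tag: str = "h2") -> str:
    """Insert a Mobipocket page break before each chapter heading but the first."""
    pattern = re.compile(rf"<{tag}\b")
    seen = 0

    def repl(m):
        nonlocal seen
        seen += 1
        # Leave the first heading alone; break before all later ones.
        return m.group() if seen == 1 else "<mbp:pagebreak />\n" + m.group()

    return pattern.sub(repl, html_text)

print(add_pagebreaks("<h2>I</h2>\n<p>a</p>\n<h2>II</h2>"))
```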

And at this point, you should be ready to head over to HarryT's tutorial with an HTML file ready to be converted into a Mobipocket file for your Kindle.

Last edited by cerement; 04-26-2008 at 04:38 AM.