Prepping texts for conversion?

cerement · 04-22-2008, 02:58 AM

Has anyone created or found a decent howto for converting Gutenberg text files to HTML in preparation for conversion to ebooks?

What I've been able to come up with so far for things to watch (a large majority of these taken from Gutenberg's website):

Remove Gutenberg boilerplate
Escape out existing characters <, >, &
Encode accented characters
Convert and encode special characters endash, emdash, ellipses
Convert quotes to curly quotes, apostrophes
Clean up double spaces, remove spaces at end of lines
Convert multiple blank lines to a rule (from Gutenberg)
Add in HTML header and footer
Add in paragraph marks
Mark up headings
Clean up special line breaks and indents
Italics and bold
Images
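
The first few items on this checklist can be sketched in Python. This is only a rough outline, not a finished converter: the `*** START OF` / `*** END OF` marker patterns are an assumption (the exact wording varies between Gutenberg files), the function names are made up for illustration, and collapsing runs of spaces also flattens any indentation the transcriber used on purpose.

```python
import html
import re

# Assumed marker lines; check them against the file actually being converted.
START_RE = re.compile(r"^\*\*\* ?START OF.*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* ?END OF.*$", re.MULTILINE)


def strip_boilerplate(text):
    """Keep only the text between the START and END marker lines."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    body_start = start.end() if start else 0
    body_end = end.start() if end else len(text)
    return text[body_start:body_end].strip("\n")


def tidy_whitespace(text):
    """Remove spaces at end of lines, then collapse runs of spaces."""
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    return re.sub(r" {2,}", " ", text)


def to_html_body(text):
    """Escape markup characters, then add paragraph marks and rules."""
    text = html.escape(text, quote=False)  # handles &, < and > in a safe order
    text = tidy_whitespace(text)
    # Three or more blank lines become a rule between paragraphs.
    text = re.sub(r"\n{4,}", "\n\n<hr/>\n\n", text)
    parts = []
    for block in re.split(r"\n{2,}", text.strip("\n")):
        if block == "<hr/>":
            parts.append(block)
        else:
            parts.append("<p>" + " ".join(block.split("\n")) + "</p>")
    return "\n".join(parts)
```

Headings, italics, images and the HTML header and footer are left out here; they depend too much on how each transcriber marked them up.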

Does anyone have a decent search-replace for handling curly quotes? Found a couple algorithms online but they all seem to run into special cases (nested quotes, quotes inside brackets) and all seem to just give up when trying to deal with something like 5'10" inside a quote ...

WillAdams · 04-22-2008, 09:18 AM

It's not possible to algorithmically handle intermingled quotes &c., which is why compleat document tagging schemes explicitly mark up beginning and ending quotes &c.

A search-replace which marks the first set replaced, then replaces the other, then re-replaces the marked set will get one most of the way through, esp. if one uses GREP to exclude likely candidates for apostrophes.

Then, look at and determine which ones need to be quotes, which apostrophes and which ones primes as a final confirming check.

William
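
A minimal sketch of the mark, replace, re-replace idea for single quotes, written in Python rather than a GREP tool. The placeholder characters are an assumption (private-use code points that should not occur in a Gutenberg text), and the patterns are only guesses at "likely" apostrophes and openers, so the manual check described above is still needed afterwards.

```python
import re

# Hypothetical placeholders; assumed not to appear in the source text.
APOS_MARK = "\uE000"
OPEN_MARK = "\uE001"


def curl_single_quotes(text):
    # Pass 1: mark likely apostrophes (a straight quote between two letters).
    text = re.sub(r"(?<=[A-Za-z])'(?=[A-Za-z])", APOS_MARK, text)
    # Pass 2: mark likely opening quotes (at the start of the text, or after
    # whitespace, an opening bracket or a double quote).
    text = re.sub(r"(^|[\s(\[\"])'", "\\g<1>" + OPEN_MARK, text)
    # Pass 3: whatever is still straight is treated as a closing quote.
    text = text.replace("'", "\u2019")
    # Pass 4: re-replace the marked sets with the real characters.
    return text.replace(OPEN_MARK, "\u2018").replace(APOS_MARK, "\u2019")
```

Leading apostrophes such as the one in 'tis are marked as opening quotes by pass 2, which is exactly the sort of case that has to be caught by eye.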

JSWolf · 04-22-2008, 09:24 AM

Quote:

Originally Posted by cerement

Does anyone have a decent search-replace for handling curly quotes? Found a couple algorithms online but they all seem to run into special cases (nested quotes, quotes inside brackets) and all seem to just give up when trying to deal with something like 5'10" inside a quote ...

Converting 5'10" is easy. You convert the " based on the space after it and the fact that no space exists before it. As for the ', you convert it based on the fact that there are no spaces around it and no other ' to pair with it.
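
One way to read this rule, as a hedged Python sketch: treat digit-quote-digits-doublequote as feet and inches first, then decide opening versus closing from the character before each remaining quote. The function name and the choice of prime characters are assumptions, and the sketch trusts that the quotes in the source are correct.

```python
import re


def curl_quotes(text):
    # Measurements first: 5'10" becomes prime and double prime.
    text = re.sub(r"(\d)'(\d+)\"", "\\g<1>\u2032\\g<2>\u2033", text)
    # A double quote at the start, or after whitespace or an opening
    # bracket, is taken as an opening quote; the rest are closing quotes.
    text = re.sub(r'(^|[\s(\[])"', "\\g<1>\u201c", text)
    text = text.replace('"', "\u201d")
    # Same test for single quotes; everything else is treated as an
    # apostrophe or closing quote (the same character).
    text = re.sub(r"(^|[\s(\[\u201c])'", "\\g<1>\u2018", text)
    return text.replace("'", "\u2019")
```

For example, `curl_quotes('He said, "I\'m 5\'10" tall."')` keeps the measurement out of the quote pairing. A missing quote in the source still throws off everything after it.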

HarryT · 04-22-2008, 10:17 AM

Trouble is, though, that those methods assume that all the quotes in the source file are "correct". If one's missing, you'll mess everything up.

What I do personally is leave whatever quotes the source doc has well alone - if they're curly quotes they stay curly; if they're straight, they stay straight.

Jellby · 04-23-2008, 08:52 AM

I check the quotes manually. First look for apostrophes and ensure they are correct, then look for quotation marks and ensure they come in pairs (except when there are paragraph breaks inside a quote) and are properly nested. It's time consuming, but I've always found automatic search and replace leads to more errors, especially, as HarryT says, when the source file is sloppy.
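
The pairing check can be partly automated as a review aid. The Python sketch below (function name hypothetical) only lists paragraphs with an odd number of straight double quotes. It changes nothing, and a quote that runs over several paragraphs will be flagged even though it is correct.

```python
import re


def unbalanced_paragraphs(text):
    """Return (paragraph number, paragraph) pairs with an odd count of '"'."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    flagged = []
    for number, para in enumerate(paragraphs, start=1):
        if para.count('"') % 2:
            flagged.append((number, para))
    return flagged
```

Each flagged paragraph still has to be read by a person to decide whether a quote is missing or the speech simply continues.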

ricdiogo · 04-24-2008, 07:41 PM

Quote:

Originally Posted by cerement

Has anyone created or found a decent howto for converting Gutenberg text files to HTML in preparation for conversion to ebooks?

If downloading the already PG-to-HTML-converted etexts from Manybooks.net isn't enough for you or you still like doing it yourself, you can use a piece of free software called GutenMark, specially developed for converting PG's ebooks.

cerement · 04-24-2008, 08:20 PM

I've tried GutenMark (and gut.pl) and the results are decent but need too much hand-editing afterwards. The ManyBooks editions do a good job of the Table of Contents but they don't do any typographic cleanup (dashes, ellipses, quotes, etc.)

I've been playing around with the list I've got above and eventually I'm going to update it (things like escaping '&' must occur before anything else). The process will always require human intervention (simply because each Gutenberg transcription was by a person) and an important part of the process is learning the original transcriber's style.
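
A small Python illustration of why the ordering matters. Both function names are made up, and treating `--` as an em dash and `...` as an ellipsis is an assumption about the transcriber's style that needs checking per text.

```python
def escape_then_encode(text):
    # '&' goes first; otherwise the '&' in '&lt;' gets escaped a second time.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;").replace(">", "&gt;")
    # Entities for dashes and ellipses are added only after escaping,
    # so their own '&' is left alone.
    text = text.replace("--", "&mdash;")
    return text.replace("...", "&hellip;")


def escape_in_wrong_order(text):
    # Shown only to demonstrate the failure: '<' becomes '&lt;' and is
    # then mangled into '&amp;lt;'.
    text = text.replace("<", "&lt;").replace(">", "&gt;")
    return text.replace("&", "&amp;")
```

With the wrong order, `a < b` comes out as `a &amp;lt; b`, which a reader displays as the literal text `&lt;`.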

Two items that help: a reference on the MobileRead wiki to

Code:

<mbp:pagebreak/>

for layout control, and the Google Books archive of scanned books (for checking the original layout and typography).
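
As a rough sketch of how that tag might be applied, the Python below puts a page break in front of every chapter heading. Marking chapters with `<h2>` is an assumption about the converted HTML, and the function name is hypothetical.

```python
import re

PAGEBREAK = "<mbp:pagebreak/>"


def add_pagebreaks(html_text, heading_tag="h2"):
    """Insert a Mobipocket page break before each heading of the given tag."""
    pattern = re.compile(r"(<%s\b)" % re.escape(heading_tag), re.IGNORECASE)
    return pattern.sub(PAGEBREAK + r"\n\1", html_text)
```

This also adds a break before the very first heading, which may or may not be wanted depending on what precedes it.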

ricdiogo · 04-25-2008, 09:08 PM

Quote:

Originally Posted by cerement

I've tried GutenMark (and gut.pl) and the results are decent but too much hand-editing afterwards.

I know what you mean. I suggest you subscribe to gutvol-d, the discussion group at PG and ask for some help from other volunteers.

Since you're going to produce those HTML files yourself, we would be very thankful if you could send them to PG's Posting Team.

cerement · 04-26-2008, 02:59 AM

In this case, the text I was working with already had an HTML version available on Gutenberg (Round About the Carpathians), but the HTML version provided was created for a large screen, attempted to convey more accurate layout, and was trying to stay within PG's guidelines for clean markup.

In the long run, what I wanted was to end up with a version optimized for my new Kindle (smaller screen, readable rather than accurate layout, Mobipocket and Kindle specific markup). And being the contrary person that I am, I figured it would take less time to start from the text file provided rather than trying to mangle the provided HTML.


