04-22-2008, 02:58 AM | #1 |
Groupie
Posts: 170
Karma: 2000
Join Date: Apr 2008
Location: San José, CA
Device: Amazon Kindle 1, Sony PRS-300, Amazon Kindle 3
|
Prepping texts for conversion?
Has anyone created or found a decent howto for converting Gutenberg text files to HTML in preparation for conversion to ebooks?
Here's what I've come up with so far as things to watch for (most of them taken from Gutenberg's website):
Does anyone have a decent search-replace for handling curly quotes? I found a couple of algorithms online, but they all run into special cases (nested quotes, quotes inside brackets) and seem to just give up on something like 5'10" inside a quote ... |
04-22-2008, 09:18 AM | #2 |
Wizard
Posts: 1,234
Karma: 3350652
Join Date: Feb 2008
Device: Amazon Kindle Paperwhite (300ppi), Samsung Galaxy Book 12
|
It's not possible to algorithmically handle intermingled quotes &c., which is why compleat document tagging schemes explicitly mark up beginning and ending quotes &c.
A search-replace which marks the first set replaced, then replaces the other, then re-replaces the marked set will get one most of the way there, though, esp. if one uses GREP to exclude likely candidates for apostrophes. Then, as a final confirming check, look at the results and determine which marks need to be quotes, which apostrophes, and which primes. William |
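The mark, replace, re-replace approach described above could be sketched roughly like this in Python. This is only an illustration under assumptions, not anyone's actual script: the function name `curl_quotes`, the placeholder characters, and the "what counts as an opening context" rule are all made up for the example, and it assumes double quotes on the outside with single quotes nested inside.

```python
import re

# Private-use placeholders (hypothetical choice), assumed absent from the source text.
MARK_APOS, MARK_FOOT, MARK_INCH = "\ue000", "\ue001", "\ue002"

# A quote counts as "opening" when it is NOT preceded by anything other than
# whitespace, an opening bracket, a dash, or an already-placed opening quote.
_BEFORE = r"(?<![^\s(\[{\u2014\u201c-])"


def curl_quotes(text: str) -> str:
    """Heuristically convert straight quotes to curly ones in three passes."""
    # Pass 1: mark the characters that are not quotation marks at all.
    # Feet-and-inches such as 5'10" go first, because the ' in 5'10 also
    # sits between two word characters and would otherwise look like an apostrophe.
    text = re.sub(
        r"(\d)'(\d+)\"",
        lambda m: m.group(1) + MARK_FOOT + m.group(2) + MARK_INCH,
        text,
    )
    # Apostrophes inside words: don't, it's, o'clock.
    text = re.sub(r"(?<=\w)'(?=\w)", MARK_APOS, text)

    # Pass 2: everything still straight is treated as a quotation mark.
    # Opening quotes need an opening context before and a non-space after;
    # whatever remains is taken to be a closing quote.
    text = re.sub(_BEFORE + r'"(?=\S)', "\u201c", text)
    text = text.replace('"', "\u201d")
    text = re.sub(_BEFORE + r"'(?=\S)", "\u2018", text)
    text = text.replace("'", "\u2019")

    # Pass 3: re-replace the marked set with the final characters.
    return (
        text.replace(MARK_APOS, "\u2019")
        .replace(MARK_FOOT, "\u2032")
        .replace(MARK_INCH, "\u2033")
    )
```

Leading apostrophes ('tis, 'em) still come out as opening single quotes here, and a missing quote in the source will throw off everything after it, so the final look-through by eye is still needed.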
04-22-2008, 09:24 AM | #3 | |
Resident Curmudgeon
Posts: 73,983
Karma: 128903378
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Quote:
|
|
04-22-2008, 10:17 AM | #4 |
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
|
Trouble is, though, that those methods assume that all the quotes in the source file are "correct". If one's missing, you'll mess everything up.
What I do personally is leave whatever quotes the source doc has well alone - if they're curly quotes they stay curly; if they're straight, they stay straight. |
04-23-2008, 08:52 AM | #5 |
frumious Bandersnatch
Posts: 7,516
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
I check the quotes manually. First I look for apostrophes and make sure they are correct, then I look for quotation marks and make sure they come in pairs (except when there are paragraph breaks inside a quote) and are properly nested. It's time-consuming, but I've always found that automatic search and replace leads to more errors, especially, as HarryT says, when the source file is sloppy.
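A small helper can at least point the manual check at the right places. The following is a rough sketch, assuming a plain-text source with straight double quotes and blank lines between paragraphs; the function name `suspicious_paragraphs` is invented for the example. It lists paragraphs with an odd number of double quotes, skipping the case where the next paragraph reopens the quote.

```python
import re


def suspicious_paragraphs(text: str) -> list[int]:
    """Return 1-based numbers of paragraphs whose double quotes look unpaired."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    flagged = []
    for number, para in enumerate(paragraphs, start=1):
        if para.count('"') % 2 == 0:
            continue  # quotes come in pairs, nothing to report
        # A quote that continues across a paragraph break is left unclosed
        # here, and the following paragraph starts with a fresh opening quote.
        following = paragraphs[number] if number < len(paragraphs) else ""
        if following.lstrip().startswith('"'):
            continue
        flagged.append(number)
    return flagged
```

It will miss a genuinely missing closing quote when the next paragraph happens to begin with dialogue, and it will flag inch marks such as 5'10", so it only narrows down where to look.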
|
04-24-2008, 07:41 PM | #6 | |
Gutenberger
Posts: 142
Karma: 700
Join Date: Jul 2007
Location: Lisbon, Portugal
Device: Cybook Gen 3
|
Quote:
|
|
04-24-2008, 08:20 PM | #7 |
Groupie
Posts: 170
Karma: 2000
Join Date: Apr 2008
Location: San José, CA
Device: Amazon Kindle 1, Sony PRS-300, Amazon Kindle 3
|
I've tried GutenMark (and gut.pl); the results are decent but need too much hand-editing afterwards. The ManyBooks editions do a good job with the table of contents, but they don't do any typographic cleanup (dashes, ellipses, quotes, etc.).
I've been playing around with the list I've got above and eventually I'm going to update it (for example, escaping '&' must occur before anything else). The process will always require human intervention (simply because each Gutenberg transcription was done by a person), and an important part of the process is learning the original transcriber's style. Two items that help: a reference on the MobileRead wiki to Code:
<mbp:pagebreak/> |
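To show why escaping '&' has to occur before anything else, here is a minimal Python sketch. The function name `to_html_text` is made up, and using the named entities `&mdash;` and `&hellip;` for "--" and "..." is an assumption about what the target converter accepts; numeric references would work the same way.

```python
import html


def to_html_text(line: str) -> str:
    """Escape a plain-text line for HTML, then apply typographic cleanup."""
    # Step 1: escape &, < and > while every '&' in the line is still a
    # literal ampersand from the source text.
    line = html.escape(line, quote=False)
    # Step 2: only now introduce entities of our own, so their '&' is
    # never escaped a second time.
    line = line.replace("--", "&mdash;")
    line = line.replace("...", "&hellip;")
    return line


# Doing the steps in the other order double-escapes the new entities:
wrong_order = html.escape("a--b".replace("--", "&mdash;"), quote=False)
```

Here `wrong_order` ends up as `a&amp;mdash;b`, which displays the literal text "&mdash;" instead of a dash.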
04-25-2008, 09:08 PM | #8 | |
Gutenberger
Posts: 142
Karma: 700
Join Date: Jul 2007
Location: Lisbon, Portugal
Device: Cybook Gen 3
|
Quote:
Since you're going to produce those HTML files yourself, we would be very thankful if you could send them to PG's Posting Team. |
|
04-26-2008, 02:59 AM | #9 |
Groupie
Posts: 170
Karma: 2000
Join Date: Apr 2008
Location: San José, CA
Device: Amazon Kindle 1, Sony PRS-300, Amazon Kindle 3
|
In this case, the text I was working with already had an HTML version available on Gutenberg (Round About the Carpathians), but the HTML version provided was created for a large screen, attempted to convey more accurate layout, and was trying to stay within PG's guidelines for clean markup.
In the long run, what I wanted was to end up with a version optimized for my new Kindle (smaller screen, readable rather than accurate layout, Mobipocket- and Kindle-specific markup). And being the contrary person that I am, I figured it would take less time to start from the text file provided than to try to mangle the provided HTML. |
|