Old 04-22-2008, 02:58 AM   #1
cerement
Prepping texts for conversion?

Has anyone created or found a decent howto for converting Gutenberg text files to HTML in preparation for conversion to ebooks?

What I've been able to come up with so far for things to watch (a large majority of these taken from Gutenberg's website):
  1. Remove Gutenberg boilerplate
  2. Escape existing <, >, and & characters
  3. Encode accented characters
  4. Convert and encode special characters (en dash, em dash, ellipsis)
  5. Convert straight quotes and apostrophes to curly quotes
  6. Clean up double spaces, remove spaces at end of lines
  7. Convert multiple blank lines to a rule (from Gutenberg)
  8. Add in HTML header and footer
  9. Add in paragraph marks
  10. Mark up headings
  11. Clean up special line breaks and indents
  12. Italics and bold
  13. Images
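
A few of the mechanical steps in the list above (boilerplate removal, escaping, whitespace cleanup, paragraph marks) can be sketched in Python. This is only a rough sketch under stated assumptions: the function names are made up for illustration, and it assumes the file uses marker lines beginning with `*** START OF` and `*** END OF`, which not every Gutenberg text does.

```python
import html
import re

# Assumption: the boilerplate is delimited by "*** START OF ..." / "*** END OF ..." lines.
START_RE = re.compile(r"^\*\*\* ?START OF .*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* ?END OF .*$", re.MULTILINE)


def strip_boilerplate(text):
    """Keep only the text between the START and END marker lines, if both exist."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    if start and end and start.end() < end.start():
        return text[start.end():end.start()]
    return text  # markers not found: leave the text alone for manual review


def text_to_html_body(text):
    """Escape markup characters, tidy whitespace, and wrap paragraphs in <p>."""
    text = strip_boilerplate(text).replace("\r\n", "\n")
    text = html.escape(text, quote=False)                     # &, <, > before anything else
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)   # spaces at end of lines
    text = re.sub(r" {2,}", " ", text)                        # runs of spaces
    paragraphs = [p for p in re.split(r"\n{2,}", text.strip()) if p]
    # Join the hard-wrapped lines of each paragraph and wrap it in <p>...</p>.
    return "\n\n".join("<p>" + " ".join(p.split("\n")) + "</p>" for p in paragraphs)
```

Note that collapsing runs of spaces also flattens indented verse, and that this sketch treats any run of blank lines as a plain paragraph break rather than a rule, so both of those still need handling by hand or by a separate pass.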

Does anyone have a decent search-replace for handling curly quotes? Found a couple algorithms online but they all seem to run into special cases (nested quotes, quotes inside brackets) and all seem to just give up when trying to deal with something like 5'10" inside a quote ...
Old 04-22-2008, 09:18 AM   #2
WillAdams
It's not possible to algorithmically handle intermingled quotes &c., which is why complete document tagging schemes explicitly mark up beginning and ending quotes &c.

A search-replace which marks the first set replaced, then replaces the other, then re-replaces the marked set will get one most of the way there, though, esp. if one uses GREP to exclude likely candidates for apostrophes.

Then, as a final confirming check, look through the results and determine which marks need to be quotes, which apostrophes, and which primes.
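
One possible reading of the mark, replace, re-replace approach described above, as a Python sketch. The placeholder string, the function name, and the choice of which characters count as "opening context" are all assumptions made for illustration, not an established algorithm.

```python
import re

OPEN_MARK = "\x00OPEN\x00"  # hypothetical placeholder; must not occur in the text

# A quote counts as an opener at the start of the text or right after whitespace,
# an opening bracket, a dash, or an already-converted opening quote (for nesting).
OPENER_CONTEXT = r"(?<![^\s(\[{\u2014\u201c\u2018-])"


def curl_quotes(text):
    # Settle apostrophes inside words (don't, o'clock) first, so they are
    # excluded from the quote passes.
    text = re.sub(r"(?<=\w)'(?=\w)", "\u2019", text)
    for straight, opener, closer in (('"', "\u201c", "\u201d"),
                                     ("'", "\u2018", "\u2019")):
        # Pass 1: mark the likely openers.
        text = re.sub(OPENER_CONTEXT + re.escape(straight), OPEN_MARK, text)
        # Pass 2: everything still straight is treated as a closer.
        text = text.replace(straight, closer)
        # Pass 3: re-replace the marked set with real opening quotes.
        text = text.replace(OPEN_MARK, opener)
    return text
```

Leading apostrophes ('tis, 'em) come out as opening quotes, and measurements such as 5'10" are mangled unless primes are dealt with beforehand, so the final manual check is still needed.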

William
Old 04-22-2008, 09:24 AM   #3
JSWolf
Quote:
Originally Posted by cerement
Does anyone have a decent search-replace for handling curly quotes? Found a couple algorithms online but they all seem to run into special cases (nested quotes, quotes inside brackets) and all seem to just give up when trying to deal with something like 5'10" inside a quote ...
Converting 5'10" is easy. You convert the " based on the fact that there is a space after it and no space before it. As for the ', you convert it based on the fact that there are no spaces around it and no other ' to pair with it.
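
A minimal sketch of that kind of rule in Python, under the assumption that a measurement always looks like digit, ', one or two digits, " with no spaces inside and no letter or digit directly after. The function name and the pattern are illustrative only. A bare digit followed by " is left alone, because it could just as well be a closing quote (as in "Go to room 5").

```python
import re

PRIME, DOUBLE_PRIME = "\u2032", "\u2033"

# digit ' one-or-two digits " , not followed by a letter or digit
FEET_INCHES = re.compile(r"(\d)'(\d{1,2})\"(?!\w)")


def mark_primes(text):
    """Replace the ' and " of a feet-and-inches measurement with prime marks."""
    return FEET_INCHES.sub(r"\1" + PRIME + r"\2" + DOUBLE_PRIME, text)
```

Running this before any quote-curling pass takes the primes out of the way, so the remaining straight quotes are only real quotes and apostrophes.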
Old 04-22-2008, 10:17 AM   #4
HarryT
Trouble is, though, that those methods assume that all the quotes in the source file are "correct". If one's missing, you'll mess everything up.

What I do personally is leave whatever quotes the source doc has well alone - if they're curly quotes they stay curly; if they're straight, they stay straight.
Old 04-23-2008, 08:52 AM   #5
Jellby
I check the quotes manually. First look for apostrophes and ensure they are correct, then look for quotation marks and ensure they come in pairs (except when there are paragraph breaks inside a quote) and are properly nested. It's time-consuming, but I've always found automatic search and replace leads to more errors, especially, as HarryT says, when the source file is sloppy.
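
A manual check like this can be narrowed down with a small script that only reports where to look. The following Python sketch (the function name is hypothetical) flags paragraphs with an odd number of straight double quotes, and skips the case where the next paragraph opens with a quote, which is taken here to mean the quotation continues across the paragraph break.

```python
import re


def suspicious_paragraphs(text):
    """Return (index, paragraph) pairs whose straight double quotes don't pair up."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    flagged = []
    for i, para in enumerate(paragraphs):
        if para.count('"') % 2 == 0:
            continue
        # A quotation that runs on into the next paragraph is conventionally left
        # open here and reopened there, so that case is not reported.
        continues = i + 1 < len(paragraphs) and paragraphs[i + 1].startswith('"')
        if not continues:
            flagged.append((i, para))
    return flagged
```

It does not fix anything and it will miss a dropped closing quote that happens to be followed by new dialogue, so it only shortens the manual pass rather than replacing it.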
Old 04-24-2008, 07:41 PM   #6
ricdiogo
Quote:
Originally Posted by cerement
Has anyone created or found a decent howto for converting Gutenberg text files to HTML in preparation for conversion to ebooks?
If downloading the already PG-to-HTML-converted etexts from Manybooks.net isn't enough for you or you still like doing it yourself, you can use a piece of free software called GutenMark, specially developed for converting PG's ebooks.
Old 04-24-2008, 08:20 PM   #7
cerement
I've tried GutenMark (and gut.pl) and the results are decent but need too much hand-editing afterwards. The ManyBooks editions do a good job with the table of contents but they don't do any typographic cleanup (dashes, ellipses, quotes, etc.)

I've been playing around with the list I've got above and eventually I'm going to update it (things like escaping '&' must occur before anything else). The process will always require human intervention (simply because each Gutenberg transcription was by a person) and an important part of the process is learning the original transcriber's style.
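
To illustrate why the ordering matters, here is a short Python sketch. It assumes the source text uses `--` for an em dash and three periods for an ellipsis, and the function names are made up for the example. If the entities are inserted first, the later escaping pass re-escapes their ampersands.

```python
import html
import re


def typographic_entities(text):
    """Escape &, <, > first, then introduce entities for dashes and ellipses."""
    text = html.escape(text, quote=False)
    text = text.replace("--", "&mdash;")
    text = re.sub(r"\.\.\.|\. \. \.", "&hellip;", text)
    return text


def wrong_order(text):
    """Same steps with escaping last: the new entities get double-escaped."""
    text = text.replace("--", "&mdash;")
    text = re.sub(r"\.\.\.|\. \. \.", "&hellip;", text)
    return html.escape(text, quote=False)
```

For the input `Q&A--wait...`, the first function gives `Q&amp;A&mdash;wait&hellip;`, while the second gives `Q&amp;A&amp;mdash;wait&amp;hellip;`.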

Two items that help: a reference on the MobileRead wiki to
Code:
<mbp:pagebreak/>
for layout control, and the Google Books archive of scanned books (for checking the original layout and typography).
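
As a sketch of how that tag might be applied in bulk, the Python below inserts it before every chapter heading. It assumes chapters are marked up with `<h2>`, which is a choice made for this example and not something the wiki prescribes; the function name is also hypothetical.

```python
import re

PAGEBREAK = "<mbp:pagebreak/>"


def add_pagebreaks(body_html, heading_tag="h2"):
    """Insert a Mobipocket page break before each opening heading tag."""
    # Zero-width lookahead: match the position just before "<h2", not the tag itself.
    pattern = re.compile(r"(?=<" + re.escape(heading_tag) + r"\b)", re.IGNORECASE)
    return pattern.sub(PAGEBREAK + "\n", body_html)
```

This also puts a break before the very first heading, which may or may not be wanted depending on what precedes it.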
Old 04-25-2008, 09:08 PM   #8
ricdiogo
Quote:
Originally Posted by cerement
I've tried GutenMark (and gut.pl) and the results are decent but too much hand-editing afterwards.
I know what you mean. I suggest you subscribe to gutvol-d, the discussion group at PG, and ask for some help from other volunteers.

Since you're going to produce those HTML files yourself, we would be very thankful if you could send them to PG's Posting Team.
Old 04-26-2008, 02:59 AM   #9
cerement
In this case, the text I was working with already had an HTML version available on Gutenberg (Round About the Carpathians), but the HTML version provided was created for a large screen, attempted to convey more accurate layout, and was trying to stay within PG's guidelines for clean markup.

In the long run, what I wanted was to end up with a version optimized for my new Kindle (smaller screen, readable rather than accurate layout, Mobipocket- and Kindle-specific markup). And being the contrary person that I am, I figured it would take less time to start from the text file provided rather than trying to mangle the provided HTML.
All times are GMT -4.