Old 09-20-2007, 03:05 PM   #1
phrodod
Enthusiast
Posts: 43
Karma: 28
Join Date: Aug 2007
Device: Sony Reader PRS-500
GutenMark to convert PG books to HTML

For those converting Project Gutenberg (PG) e-books for use in Book Designer (BD) or other HTML-based conversion software (read: HTML2LRF), check out GutenMark (http://www.sandroid.org/GutenMark/). This program takes PG plain-text files and automatically converts them to HTML. It isn't perfect, and it's a command-line program, but it would sure save someone like HarryT hours and hours of changing _words_ to italicized words.

It's independent from PG, but PG does link to it from their site.

I've only tried it quickly, but in my quick tests, it appears to handle:
  • Changing a wide variety of italics substitutes to real italics
  • Changing regular quotes to curly quotes
  • Changing double-dashes (--) to em dashes
  • Highlighting chapter titles as headers
  • Converting uppercase titles to mixed case

The final output looks pretty good, and would sure save hours of reformatting in BD. Instead, you'd start pretty far along the process, and just use BD for final touches (TOC, Title Page, etc).
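
For anyone curious what two of the substitutions listed above involve, here is a minimal Python sketch of the double-dash and curly-quote steps. It is not GutenMark's actual algorithm; the function names are made up for illustration, and the quote rule (an opening quote follows whitespace, an opening bracket, or an em dash; everything else closes) is a simplifying assumption that will misfire on unusual punctuation. Apostrophes and single quotes are left alone.

```python
import re

EM_DASH = "\u2014"
OPEN_QUOTE = "\u201c"
CLOSE_QUOTE = "\u201d"


def dashes_to_em(text: str) -> str:
    """Replace exactly two hyphens with an em dash; longer runs are left alone."""
    return re.sub(r"(?<!-)--(?!-)", EM_DASH, text)


def curl_quotes(text: str) -> str:
    """Turn straight double quotes into curly ones using a simple heuristic."""
    # Assumption: a quote at the start of the text, or right after whitespace,
    # an opening bracket, or an em dash, is an opening quote.
    text = re.sub(r'(^|[\s(\[\u2014])"', "\\1" + OPEN_QUOTE, text)
    # Every straight double quote that remains is treated as a closing quote.
    return text.replace('"', CLOSE_QUOTE)


def typographic_cleanup(text: str) -> str:
    """Apply the dash step first so quotes that follow a dash open correctly."""
    return curl_quotes(dashes_to_em(text))


if __name__ == "__main__":
    print(typographic_cleanup('"Wait--what?" she said.'))
```

The dash step runs first on purpose: a quotation that begins right after a double dash would otherwise be read as a closing quote.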

Has anyone used it already? Were the results good, bad, or ugly? Any hints or suggestions on the "best" way to run it?

Thanks, and enjoy!
Phrodod
Old 09-20-2007, 04:15 PM   #2
andym
Groupie
Posts: 189
Karma: 793
Join Date: Oct 2006
Hi, thanks for the tip. I had been trying to download this for some time without success. I must admit I'm still put off by the fact that it's a 'command-line application' - whatever that is (I'm on OS X and I keep well away from the Terminal).

Oh, and you can do a lot of this with a text editor that understands regular expressions (is anyone else using these, BTW?). With a fairly simple expression, changing _word_ to an italicized word can be done in seconds.
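
As a rough illustration of the kind of expression meant here, the following Python sketch wraps text between underscores in `<i>` tags. The pattern and the function name are assumptions for the example, not something taken from a particular editor or from GutenMark. The pattern only matches italics that start and end on the same line and skips underscores inside words.

```python
import re

# An underscore-delimited span on one line; the lookarounds keep underscores
# that sit inside a word (such as some_file_name) from being matched.
ITALIC = re.compile(r"(?<![A-Za-z0-9])_([^_\n]+?)_(?![A-Za-z0-9])")


def underscores_to_italics(text: str) -> str:
    """Wrap _marked_ text in <i>...</i> tags."""
    return ITALIC.sub(r"<i>\1</i>", text)


if __name__ == "__main__":
    print(underscores_to_italics("It was a _very_ long book."))
```

In an editor with regex search and replace, the same idea is a search for the pattern and a replacement that refers back to the captured group; the exact back-reference syntax varies by editor.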
Old 09-20-2007, 05:43 PM   #3
phrodod
Enthusiast
Posts: 43
Karma: 28
Join Date: Aug 2007
Device: Sony Reader PRS-500
Quote:
Originally Posted by andym View Post
Oh, and you can do a lot of this with a text editor that understands regular expressions (is anyone else using these, BTW?). With a fairly simple expression, changing _word_ to an italicized word can be done in seconds.
If you're already comfortable with an editor that understands regular expressions (REs), why are you afraid of the terminal? REs are a nightmare to most people, and even to many people who are very comfortable at the command line, they're a bit of voodoo black magic. If you can master REs, a terminal should be a piece of cake.

The basic syntax for running GutenMark is to open a terminal, change to the directory where your book is stored, and type "gutenmark file.txt file.html". When you hit Enter, the program writes a few diagnostic lines to the terminal, writes your file, and exits. When the program exits, your terminal gives you a prompt, letting you know that it's ready for more commands.
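
If you have a whole folder of PG text files, the same command can be repeated from a small script. The Python sketch below is only an illustration: it assumes a `gutenmark` executable is on your PATH and uses nothing beyond the "gutenmark file.txt file.html" form described above. The function names and the `run` parameter are made up for the example.

```python
import subprocess
import sys
from pathlib import Path


def build_command(txt_path: Path) -> list[str]:
    """Return the argument list for 'gutenmark file.txt file.html'."""
    # Assumption: the executable is called 'gutenmark' and is on the PATH.
    return ["gutenmark", str(txt_path), str(txt_path.with_suffix(".html"))]


def convert_all(folder: Path, run=subprocess.run) -> list[Path]:
    """Run the converter once for every .txt file in folder, in name order."""
    outputs = []
    for txt_path in sorted(folder.glob("*.txt")):
        # check=True raises an error if the program exits with a failure code.
        run(build_command(txt_path), check=True)
        outputs.append(txt_path.with_suffix(".html"))
    return outputs


if __name__ == "__main__":
    target = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
    for html_path in convert_all(target):
        print("wrote", html_path)
```

The `run` parameter exists only so the loop can be tried out with a stand-in function before any real conversion is attempted.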

While each of these items is individually pretty easy to do, it took my computer about twenty seconds to reformat Anna Karenina, while the same task took me about 3 hours, even with some other tools to help me out. On the reader, Anna Karenina is an almost 2000 page book at small font size, or over 4000 pages at large size. Given that, I think that the 20 seconds time frame is pretty amazing.

I haven't yet taken the time to run the result through either BD or HTML2LRF, so I can't guarantee that it'll look great, but it sure looks good in my web browser!

Phrodod
Old 09-20-2007, 09:38 PM   #4
ricdiogo
Gutenberger
Posts: 142
Karma: 700
Join Date: Jul 2007
Location: Lisbon, Portugal
Device: Cybook Gen 3
Guiguts

We also use a tool called Guiguts for auto-generating HTML.
Guiguts isn't a command-line program, so you get a graphical user interface.
See the installation instructions for: Windows, Mac, and Linux.
Once Guiguts is working, copy and paste PG's text into it, save, and run Fixup/HTML Fixup/Autogenerate HTML.

Last edited by ricdiogo; 09-20-2007 at 09:53 PM. Reason: installation and usage
Old 09-30-2007, 09:17 PM   #5
angelyne
Enthusiast
Posts: 47
Karma: 50
Join Date: Jun 2007
Device: Sony Reader PRS-500
I've been trying all evening, but it always says there are "7 users on" and that I should try again later.
Old 10-01-2007, 02:43 AM   #6
andym
Groupie
Posts: 189
Karma: 793
Join Date: Oct 2006
Quote:
Originally Posted by phrodod View Post
If you're already comfortable with an editor that understands regular expressions (REs), why are you afraid of the terminal? REs are a nightmare to most people, and even to many people who are very comfortable at the command line, they're a bit of voodoo black magic. If you can master REs, a terminal should be a piece of cake.

While each of these items is individually pretty easy to do, it took my computer about twenty seconds to reformat Anna Karenina, while the same task took me about 3 hours, even with some other tools to help me out. On the reader, Anna Karenina is an almost 2000 page book at small font size, or over 4000 pages at large size. Given that, I think that the 20 seconds time frame is pretty amazing.
Yes I'm probably just being a baby avoiding the Terminal.

20 seconds is pretty good (though 3 hours sounds like rather a lot).

I am wondering, though: does it replace quotes with curly quotes, and text marked with forward slashes (or underscores or block capitals) with italics, as claimed on the features page? The reason for asking is that manybooks.net uses GutenMark, and the texts I've seen don't have curly quotes etc.
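
One way to check a converted file without reading through it by eye is to count the relevant markup. The Python sketch below is a hedged example, not a GutenMark feature: it counts curly quotes, em dashes, and italic tags in an HTML string, looking for both the literal characters and the usual HTML entities, since which form a converter emits is an assumption here. The function name is made up for the example.

```python
import re
import sys
from pathlib import Path

# Curly quotes as literal characters or as named/numeric HTML entities.
CURLY = re.compile(
    "[\u201c\u201d\u2018\u2019]"
    "|&(?:ldquo|rdquo|lsquo|rsquo|#8220|#8221|#8216|#8217);"
)
# Em dashes as a literal character or as an entity.
EM_DASH = re.compile("\u2014|&(?:mdash|#8212);")
# Opening <i> or <em> tags, with or without attributes.
ITALIC_TAG = re.compile(r"<(?:i|em)\b", re.IGNORECASE)


def typography_report(html: str) -> dict[str, int]:
    """Count the typographic features present in an HTML string."""
    return {
        "curly_quotes": len(CURLY.findall(html)),
        "em_dashes": len(EM_DASH.findall(html)),
        "italic_tags": len(ITALIC_TAG.findall(html)),
    }


if __name__ == "__main__":
    text = Path(sys.argv[1]).read_text(encoding="utf-8", errors="replace")
    print(typography_report(text))
```

All zeros for a long novel would suggest the conversion was run without those features, or that the text was altered afterwards.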
Old 10-02-2007, 07:47 AM   #7
FangornUK
Addict
Posts: 205
Karma: 317
Join Date: Oct 2006
Location: England
Device: Sony PRS-505, iPad, Kindle 3
Don't forget gutlrf - http://www.mobileread.com/forums/showthread.php?t=8532 It uses GutenMark to automatically convert text PG titles into LRF files.
Old 10-02-2007, 01:28 PM   #8
phrodod
Enthusiast
Posts: 43
Karma: 28
Join Date: Aug 2007
Device: Sony Reader PRS-500
Quote:
Originally Posted by FangornUK View Post
Don't forget gutlrf - http://www.mobileread.com/forums/showthread.php?t=8532 It uses GutenMark to automatically convert text PG titles into LRF files.
Very cool! I don't know how I missed this one, but I'll try it out tonight when I get a chance!

Phrodod
Old 10-02-2007, 01:28 PM   #9
phrodod
Enthusiast
Posts: 43
Karma: 28
Join Date: Aug 2007
Device: Sony Reader PRS-500
Quote:
Originally Posted by andym View Post
Yes I'm probably just being a baby avoiding the Terminal.

20 seconds is pretty good (though 3 hours sounds like rather a lot).

I am wondering, though: does it replace quotes with curly quotes, and text marked with forward slashes (or underscores or block capitals) with italics, as claimed on the features page? The reason for asking is that manybooks.net uses GutenMark, and the texts I've seen don't have curly quotes etc.
I'll try to double-check this tonight to determine the answer.

I admit that my 3-hour attempt on Anna Karenina was attempting to do more than simply replace underscores with italics. I was playing around with splitting the file up to make a reasonably-sized TOC for the reader, with internal TOCs for each "part" (of which AK has 8).
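
For the splitting step, a short script can do the tedious part. The Python sketch below is only an illustration and not the method used for the 3-hour attempt: it assumes each part begins with an all-caps heading such as "PART ONE" on a line by itself, which may not match a given PG text, and the function name is made up for the example.

```python
import re

# Assumption: part headings look like "PART ONE" alone on a line.
PART_HEADING = re.compile(r"^(PART [A-Z]+)[ \t]*$", re.MULTILINE)


def split_into_parts(text: str) -> list[tuple[str, str]]:
    """Return (heading, body) pairs, one for each part heading found."""
    # With a capturing group, split() yields
    # [front matter, heading 1, body 1, heading 2, body 2, ...].
    pieces = PART_HEADING.split(text)
    return [(pieces[i], pieces[i + 1].strip()) for i in range(1, len(pieces), 2)]


if __name__ == "__main__":
    sample = "Title page\n\nPART ONE\n\nChapter 1\n...\n\nPART TWO\n\nChapter 1\n...\n"
    for heading, body in split_into_parts(sample):
        print(heading, "-", len(body), "characters")
```

Each body could then be written to its own file, with the list of headings serving as the top-level TOC.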

While it worked out well, I agree that 3 hours is excessive, which is why I had been looking for an alternative.

Phrodod

Last edited by phrodod; 10-02-2007 at 01:32 PM.
Old 10-02-2007, 01:35 PM   #10
VillageReader
Evangelist
Posts: 430
Karma: 2718
Join Date: May 2006
Device: Iliad
PG is actually in the process of converting many of their texts to HTML, and most of the converted texts include the images from the original book or magazine (some of the early Scientific American magazines are a blast from the past). Depending on the book, that might be one image in the front or up to several images per chapter.
All times are GMT -4. The time now is 05:33 AM.