09-20-2007, 03:05 PM | #1 |
Enthusiast
Posts: 43
Karma: 28
Join Date: Aug 2007
Device: Sony Reader PRS-500
|
GutenMark to convert PG books to HTML
For those converting Project Gutenberg (PG) e-books for use in Book Designer (BD) or other HTML-based conversion software (read: HTML2LRF), check out GutenMark (http://www.sandroid.org/GutenMark/). This program takes PG plain-text files and automatically converts them to HTML. It isn't perfect, and it's a command-line program, but it would sure save someone like HarryT hours and hours of changing _words_ to words.
It's independent from PG, but PG does link to it from their site. I've only tried it quickly, but in my quick tests, it appears to handle:
The final output looks pretty good, and would sure save hours of reformatting in BD. Instead, you'd start pretty far along in the process, and just use BD for the final touches (TOC, title page, etc.). Has anyone used it already? Were the results good, bad, or ugly? Any hints or suggestions on the "best" way to run it? Thanks, and enjoy! Phrodod |
09-20-2007, 04:15 PM | #2 |
Groupie
Posts: 189
Karma: 793
Join Date: Oct 2006
|
Hi, thanks for the tip. I had been trying to download this for some time without success. I must admit I'm still put off by the fact that it's a 'command-line application', whatever that is (I'm on OS X and I keep well away from the Terminal).
Oh, and you can do a lot of this with a text editor that understands regular expressions (is anyone else using these, BTW?). With a fairly simple expression, changing _word_ to word can be done in seconds. |
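For anyone curious what such an expression might look like, here is a minimal Python sketch. It assumes PG's convention of marking emphasis with underscores and that the target markup is HTML <i> tags; the function name is made up for illustration. It deliberately skips underscores inside words, and it does not handle emphasis that wraps across a line break.

```python
import re

# _text_ where the underscores hug the text and are not embedded in a word,
# so identifiers like snake_case_name are left alone.
UNDERSCORE_ITALIC = re.compile(
    r"(?<![A-Za-z0-9_])_(?=\S)([^_\n]+)(?<=\S)_(?![A-Za-z0-9_])"
)

def underscores_to_italics(text: str) -> str:
    """Replace _emphasised_ spans with <i>emphasised</i> (hypothetical helper)."""
    return UNDERSCORE_ITALIC.sub(r"<i>\1</i>", text)
```

Most regex-aware text editors accept a similar pattern in their find-and-replace dialog, though the replacement syntax (\1 versus $1) varies by editor.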
09-20-2007, 05:43 PM | #3 | |
Enthusiast
Posts: 43
Karma: 28
Join Date: Aug 2007
Device: Sony Reader PRS-500
|
Quote:
The basic syntax for running GutenMark is to open a terminal, change to the directory where your book is stored, and type "gutenmark file.txt file.html". When you hit Enter, the program writes a few diagnostic lines to the terminal, writes your file, and exits. When the program exits, your terminal gives you a prompt, letting you know that it's ready for more commands. Each of these steps is pretty easy to do, and the payoff is large: it took my computer about twenty seconds to reformat Anna Karenina, while the same task took me about 3 hours by hand, even with some other tools to help me out. On the reader, Anna Karenina is an almost 2000-page book at small font size, or over 4000 pages at large size. Given that, I think the 20-second time frame is pretty amazing. I haven't yet taken the time to run the result through either BD or HTML2LRF, so I can't guarantee that it'll look great, but it sure looks good in my web browser! Phrodod |
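If you have a whole folder of PG texts, the same command can be repeated by a small script. The following is only a sketch in Python: it assumes the gutenmark executable is on your PATH and that the plain two-argument form quoted above is all you need. The function names are invented for this example and no extra GutenMark options are used.

```python
import subprocess
from pathlib import Path

def gutenmark_command(txt_file: Path) -> list[str]:
    """Build the 'gutenmark file.txt file.html' command for one book."""
    html_file = txt_file.with_suffix(".html")
    return ["gutenmark", str(txt_file), str(html_file)]

def convert_folder(folder: Path) -> None:
    """Run GutenMark once for every .txt file found in the folder."""
    for txt_file in sorted(folder.glob("*.txt")):
        # check=True raises CalledProcessError if gutenmark reports a failure
        subprocess.run(gutenmark_command(txt_file), check=True)
```

Calling convert_folder(Path(".")) from the directory that holds your books would leave a matching .html file next to each .txt file.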
|
09-20-2007, 09:38 PM | #4 |
Gutenberger
Posts: 142
Karma: 700
Join Date: Jul 2007
Location: Lisbon, Portugal
Device: Cybook Gen 3
|
Guiguts
We also use a tool called Guiguts for auto-generating HTML.
Guiguts isn't a command-line program, so you get a user interface. See the installation instructions for Windows, Mac, and Linux. Once Guiguts is working, copy and paste PG's text into it, save it, and run Fixup/HTML Fixup/Autogenerate HTML. Last edited by ricdiogo; 09-20-2007 at 09:53 PM. Reason: installation and usage |
09-30-2007, 09:17 PM | #5 |
Connoisseur
Posts: 57
Karma: 50
Join Date: Jun 2007
Device: Sony Reader PRS-500
|
I've been trying all evening, and it always says there are "7 users on" and that I should try again later.
|
10-01-2007, 02:43 AM | #6 | |
Groupie
Posts: 189
Karma: 793
Join Date: Oct 2006
|
Quote:
20 seconds is pretty good (though 3 hours sounds like rather a lot). I am wondering, though: does it replace straight quotes with curly quotes, and text marked with forward slashes (or underscores or block capitals) with italics, as claimed on the features page? The reason for asking is that manybooks.net uses GutenMark, and the texts I've seen there don't have curly quotes etc. |
|
10-02-2007, 07:47 AM | #7 |
Addict
Posts: 205
Karma: 317
Join Date: Oct 2006
Location: England
Device: Sony PRS-505, iPad, Kindle 3
|
Don't forget gutlrf: https://www.mobileread.com/forums/showthread.php?t=8532 It uses GutenMark to automatically convert PG text titles into LRF files.
|
10-02-2007, 01:28 PM | #8 | |
Enthusiast
Posts: 43
Karma: 28
Join Date: Aug 2007
Device: Sony Reader PRS-500
|
Quote:
Phrodod |
|
10-02-2007, 01:28 PM | #9 | |
Enthusiast
Posts: 43
Karma: 28
Join Date: Aug 2007
Device: Sony Reader PRS-500
|
Quote:
I admit that my 3-hour session on Anna Karenina was an attempt to do more than simply replace underscores with italics. I was playing around with splitting the file up to make a reasonably sized TOC for the reader, with internal TOCs for each "part" (of which AK has 8). While it worked out well, I agree that 3 hours is excessive, which is why I had been looking for an alternative. Phrodod Last edited by phrodod; 10-02-2007 at 01:32 PM. |
|
10-02-2007, 01:35 PM | #10 |
Evangelist
Posts: 430
Karma: 2718
Join Date: May 2006
Device: Iliad
|
PG is actually in the process of converting many of their texts to HTML, and in most cases they are adding the images from the original book or magazine (some of the early Scientific American magazines are a blast from the past). Depending on the book, that might be one image at the front, or up to several images per chapter.
|