
Old 11-02-2007, 05:08 PM   #1
bowerbird
Banned
Posts: 269
Karma: -273
Join Date: Sep 2006
Location: los angeles
What "Cleaning Up" Do Project Gutenberg Texts Need [closed]

Editor's Note: bowerbird graciously allowed us to move this post to its own thread; it was originally posted in another thread.
-- NatCh


vivaldirules said:
> I had dreamy visions of downloading all of Project Gutenberg and
> carrying a fair fraction of all mankind's knowledge and wisdom with me
> everywhere. This was to be a monumental step-change in my life.
> The first thing I did was to download a book from there in TXT format,
> import and copy it to my Reader, and with great anticipation I sat down
> to enjoy it. I was instantly dismayed by how distracting it was to read
> with broken lines, forced hyphenation, poor pagination, and no indexing.

this has been something i've been working on for some time...

and, ironically, i just sent a message to the p.g. listserves yesterday,
constructing a list of the things that need to be done to a p.g. e-text
in order to make it typographically beautiful, and asking for input...
so i will repeat the list -- and the request for input -- here, for you...

> the idea is that you've loaded a plain-ascii p.g. e-text into
> your word-processor or desktop-publishing program with
> the objective of making it beautiful. what exactly do you do?
>
> please add to this, the start of a list, off the top of my head:
>
> 1. get rid of that ugly legalese at the top of the file.
> 2. make the title-page and front-matter look nice.
> 3. hotlink the table of contents. make one if necessary.
> 4. make all the headers big, bold, and distinctive, and
> 5. start chapters on a new page, maybe even a recto.
> 6. get rid of the empty lines between paragraphs, and
> 7. use book-style indents on each paragraph instead.
> 8. use full justification. or at least half-ragged.
> 9. use a reasonable line-width. full-screen is too wide.
> 10. white-space is free in an e-book, so use it liberally.
> 11. make block-quotes distinctive, for remix purposes.
> 12. links are great, but spare us the ugly blue underlines.
> 13. is an unlucky number.
> 14. don't put pagenumbers inside the text/paragraphs.
> 15. turn pg-ascii underscored text into _real_ italics.
> 16. pictures (even doodad thingees) enliven the text.
> 17. navigation aids among chapters are quite useful.
> 18. footnotes should have links going _both_ ways.
> 19. if it works better that way, turn a table on its side.
> 20. resize tables and images so they fit on one screen.
> 21. give your readers the luxury of generous leading!
> 22. block-quotes should be indented on the left and right.
> 23. create running heads and/or footers on each page.
> 24. (leaving some space for you...)
> 25. (leaving some space for you...)
> 26. show where we are in the book (page 39 of 208).
> 27. make the framework of the document _obvious_.
> 28. what the heck, just for the fun of it, make an index!
> 29. make the typesize big enough to be read easily!
> 30. get rid of that ugly legalese at the bottom of the file.
>
> these are general strategies. not all of them will be
> applicable to any one specific situation, and some
> (e.g., #8) are up to the preferences of the individual.
>
> and obviously, some of these could be fragmented
> into a very large number of sub-points, like #10...

again, if there's anything you can add, i would appreciate it.

my aim is to write a program that will do a mass-beautification
of the entire project gutenberg library. i've made good progress.
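
A minimal sketch of items 1 and 30 on the list, not bowerbird's actual program. It assumes the file carries marker lines of the form `*** START OF THIS PROJECT GUTENBERG EBOOK ...` and `*** END OF THIS PROJECT GUTENBERG EBOOK ...`; older e-texts use different wording, so the patterns and the name `strip_legalese` are illustrative only.

```python
import re

# Assumed marker shapes; real files vary, so treat these as a starting point.
START_RE = re.compile(
    r"^\*\*\*\s*START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_legalese(text: str) -> str:
    """Return only the text between the start and end markers, if both are found."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    if start is None or end is None or end.start() < start.end():
        return text  # markers missing or out of order: leave the file untouched
    return text[start.end():end.start()].strip("\n") + "\n"
```

Returning the text unchanged when the markers are missing is deliberate: with inconsistent input, a file that is left alone is easier to spot and fix by hand than one that was cut in the wrong place.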


> I don't have to tell you what happened when I then tried PDF files
> from the Internet Archive.

i wish you would've told us. i assume the text was too small to read.

-bowerbird
Old 11-02-2007, 05:18 PM   #2
vivaldirules
When's Doughnut Day?
Posts: 10,047
Karma: 13675425
Join Date: Jul 2007
Location: Houston, TX, US
Device: Sony PRS-505, iPad
Quote:
Originally Posted by bowerbird
> i wish you would've told us. i assume the text was too small to read.

Yes, all the PDFs there are from scanned images that are far larger than the Sony screen and require some enhancement from PDFLRF (another useful workaround by MR folks) or something similar. That means a lot more time and effort fooling around and ends with huge files when all I was after in the first place was some simple well-formatted text.

Nice list of wants from PG. I'll give this more thought.
Old 11-02-2007, 05:27 PM   #3
DaleDe
Grand Sorcerer
Posts: 9,780
Karma: 5072196
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2
Quote:
Originally Posted by bowerbird
> again, if there's anything you can add, i would appreciate it.
>
> my aim is to write a program that will do a mass-beautification
> of the entire project gutenberg library. i've made good progress.
>
> > I don't have to tell you what happened when I then tried PDF files
> > from the Internet Archive.
>
> i wish you would've told us. i assume the text was too small to read.
>
> -bowerbird

Part of beautification I do is to make curly quotes (double and single), curly apostrophes, and dashes that are really dashes. Sometimes I adjust hyphenation to make better looking lines.
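
A rough sketch of the quote and dash conversion described above. The rules are a simple assumed heuristic (a quote at the start of the text, or after whitespace, an opening bracket, or a dash, is treated as an opening quote; every other quote closes), and the name `smarten` is made up for illustration.

```python
import re


def smarten(text: str) -> str:
    """Convert straight quotes and double hyphens to typographic characters."""
    # Double hyphens become em dashes (U+2014).
    text = text.replace("--", "\u2014")
    # A double quote at the start, or after whitespace, "(", "[" or a dash, opens.
    text = re.sub(r'(^|[\s(\[\u2014])"', "\\1\u201c", text)
    # Every remaining double quote closes.
    text = text.replace('"', "\u201d")
    # Same rule for single quotes; the rest, apostrophes included, curl right.
    text = re.sub(r"(^|[\s(\[\u2014])'", "\\1\u2018", text)
    text = text.replace("'", "\u2019")
    return text
```

Leading apostrophes (as in 'tis) will be curled the wrong way by this rule, which is one reason the result still needs a human read-through.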

On small screens the text needs to be almost as wide as the screen, if not as wide. Too many page changes interrupt the reading pleasure.

Images should be adjusted for color (perhaps converted to gray scale) and adjusted for the screen.

For Gutenberg texts you need to check paragraph splits, scan errors, and other items. I am afraid you almost have to read the whole book to get it right.

Dale
Old 11-02-2007, 05:45 PM   #4
JSWolf
Resident Curmudgeon
Posts: 38,483
Karma: 19300555
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Aura H2), Sony PRS-650, Sony PRS-T1, nook STR, iPad 1, iPhone 5
The thing to do is to take the book and convert it, then read it, fix it, and post the fixes. If someone else is reading it, please post any corrections you find so they can be fixed; then a new version can be made and posted. It's not feasible to read every book before posting. With Music of the Spheres, for example, I fixed a few things I found right away, then fixed others as I read, until I was done and had fixed everything I found.
Old 11-02-2007, 05:46 PM   #5
NatCh
Gizmologist
Posts: 11,605
Karma: 926222
Join Date: Jan 2006
Location: Republic of Texas Embassy at Jackson, TN
Device: Nook STGR
What you say is true, JSWolf, but bowerbird is trying to come up with an app to fix as much of the obvious stuff automagically as possible.
Old 11-02-2007, 05:49 PM   #6
JSWolf
Quote:
> 26. show where we are in the book (page 39 of 208)

Isn't that up to the software doing the displaying of the file as to how it displays the page number?
Old 11-02-2007, 05:51 PM   #7
JSWolf
Quote:
Originally Posted by NatCh
> What you say is true, JSWolf, but bowerbird is trying to come up with an app to fix as much of the obvious stuff automatically as possible.

It should not be too hard to make such an app. to do some of what's needed an initial clean up. The hardest thing, I think, for a lot of people is the page numbers. I know Word and Book Designer have regexp support. But if you don't know regexps, or don't know you can use them to remove the page numbers, then you will have to remove them manually, leave them in, or not convert at all.
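
As an illustration of the regexp approach, here is a small sketch. The marker shapes `[Pg 123]`, `[Page 123]`, and `{123}` are assumptions about how a given e-text embeds page numbers; check the file at hand before relying on them. The name `remove_page_numbers` is hypothetical.

```python
import re

# Hypothetical marker shapes: "[Pg 123]", "[Page 123]" or "{123}".
# Leading spaces or tabs are consumed so no double space is left behind.
PAGE_MARK_RE = re.compile(r"[ \t]*(?:\[(?:Pg|Page)\.?\s*\d+\]|\{\d+\})")


def remove_page_numbers(text: str) -> str:
    """Delete inline page-number markers from the text."""
    return PAGE_MARK_RE.sub("", text)
```

A marker that sits alone on its own line leaves an empty line behind, which a later rewrapping step could mistake for a paragraph break, so that case deserves a separate check.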
Old 11-02-2007, 07:31 PM   #8
bowerbird
dalede said:
> curly quotes (double and single), curly apostrophes,
> and dashes that are really dashes.

curly-quotes and em-dashes, how could i forget those? :+)
oh well, like i said, it was a list off the top of my head...

***

jswolf said:
> Isn't that up to the software
> doing the displaying of the file as to
> how it displays the page number?

yes sir, there is a mixture of things in the list,
some of which aren't relevant for all situations,
and some of which are geared to functionality,
not beauty. (except functionality _is_ beautiful.)
practically none of them are fully cut-and-dried.

for that particular item, i was thinking of .pdf,
and other formats of the fixed-page persuasion,
where the number of total pages is known for
any particular conversion (e.g., at textsize=12),
so as to be a basis for a relative comparison
with another conversion (e.g., at textsize=16)...

something like "page 180" has very little meaning
if we don't know if there are 200 or 800 or 1600
pages in any one particular conversion of a book.


> It should not be too hard to make such an app.
> to do some of what's needed an initial clean up.

actually, it's not really easy, for the simple reason that
p.g. e-texts have maddeningly inconsistent formatting...

even where there is a straightforward rule on something --
e.g., there should be 4 blank lines before a chapter heading
-- consistency checks weren't done to ensure that's the case.
so even something as relatively simple as finding headings
must be engineered to catch inconsistencies, and of course,
since you don't _know_ all the ways they were inconsistent,
it's not that simple to know what code you have to engineer.
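
To make the problem concrete, here is one possible heading finder. It is only a sketch under assumed rules, not bowerbird's code: it accepts a line as a heading when it follows at least `min_blank` blank lines (fewer than the nominal 4, to tolerate sloppy files), is short, is followed by a blank line, and is either all caps or starts with a word such as "chapter". All names and thresholds are hypothetical.

```python
import re

HEADING_WORD_RE = re.compile(r"^(chapter|book|part|section)\b", re.IGNORECASE)


def find_headings(lines, min_blank=2, max_len=60):
    """Return indices of lines that look like chapter headings."""
    headings = []
    blanks = 0  # consecutive blank lines seen just before the current line
    for i, line in enumerate(lines):
        stripped = line.strip()
        if not stripped:
            blanks += 1
            continue
        followed_by_blank = i + 1 >= len(lines) or not lines[i + 1].strip()
        looks_like_title = (
            stripped.isupper() or HEADING_WORD_RE.match(stripped) is not None
        )
        if (
            blanks >= min_blank
            and len(stripped) <= max_len
            and followed_by_blank
            and looks_like_title
        ):
            headings.append(i)
        blanks = 0
    return headings
```

Each extra condition catches one more kind of inconsistency and misses another, which is exactly the open-ended engineering described above.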

the worst part of all -- as i'm sure you volunteers who have
converted p.g. e-texts already know -- is that p.g. employed
no method to inform us what lines should _not_ be rewrapped,
such as lines of poetry, lines in a table, lines in address-blocks,
and so on. finding and fixing these lines can be time-consuming.
and writing routines that can root them out is not entirely trivial.
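
A sketch of what such a routine might look like, using three assumed signals that a block should keep its line breaks: indentation, wide gaps inside a line (table columns), and lines that all stop well short of the wrap width (verse, addresses). The function names and the 70-column and 0.7 thresholds are guesses for illustration.

```python
import re


def should_keep_breaks(block, wrap_width=70):
    """Guess whether a block of lines (poetry, table, address) must not be rewrapped."""
    if any(line[:1] in (" ", "\t") for line in block):
        return True  # indented lines often mark verse or quoted matter
    if any(re.search(r"\S {3,}\S", line) for line in block):
        return True  # wide internal gaps suggest table columns
    # All lines but the last ending far short of the width suggests verse or an address.
    body = block[:-1]
    return len(block) > 1 and all(len(line) < wrap_width * 0.7 for line in body)


def rewrap(text, wrap_width=70):
    """Join hard-wrapped paragraphs into single lines, leaving protected blocks alone."""
    out = []
    # Simplification: any run of blank lines is treated as one paragraph break.
    for para in re.split(r"\n\s*\n", text.strip("\n")):
        block = para.split("\n")
        if should_keep_breaks(block, wrap_width):
            out.append("\n".join(line.rstrip() for line in block))
        else:
            out.append(" ".join(line.strip() for line in block))
    return "\n\n".join(out) + "\n"
```

Because blank-line runs are collapsed here, heading detection would have to happen before this step. A short two-line paragraph of ordinary prose will also be left unwrapped, so the output still needs checking.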

my intention is to fix the files, and mount a mirror with my files.
michael hart graciously agreed to provide diskspace/bandwidth...

-bowerbird

p.s. i agree entirely that whatever beautification is done needs to
be subjected to the quality-control process of being read by people.
my hope is that the beauty of the files will be an alluring invitation.
for my part, i intend to make error-reporting and feedback _much_
more simple and responsive than it is with project gutenberg itself.
it will be more wiki-like, in that reports will be immediately visible...
Old 11-02-2007, 07:49 PM   #9
Nate the great
Sir Penguin of Edinburgh
Posts: 10,604
Karma: 3586209
Join Date: Apr 2007
Location: DC Metro area
Device: Shake a stick plus 1
This first set will be fairly easy to implement (if outputting in HTML).
> 1. get rid of that ugly legalese at the top of the file.
> 3. hotlink the table of contents. make one if necessary.
> 4. make all the headers big, bold, and distinctive, and
> 6. get rid of the empty lines between paragraphs, and
> 7. use book-style indents on each paragraph instead.
> 8. use full justification. or at least half-ragged.
> 10. white-space is free in an e-book, so use it liberally.
> 11. make block-quotes distinctive, for remix purposes.
> 12. links are great, but spare us the ugly blue underlines.
> 15. turn pg-ascii underscored text into _real_ italics.
> 18. footnotes should have links going _both_ ways.
> 22. block-quotes should be indented on the left and right.
> 29. make the typesize big enough to be read easily!
> 30. get rid of that ugly legalese at the bottom of the file.
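
For instance, item 15 together with basic paragraph markup might be sketched like this. It assumes italics are marked as `_word_` or `_several words_` with the underscores at word boundaries; the function name `paragraph_to_html` is made up for the example.

```python
import html
import re

# An underscore pair counts as italics only when it hugs the enclosed text and
# is not inside a word, so names like snake_case_id are left alone.
ITALIC_RE = re.compile(r"(?<!\w)_(?=\S)(.+?)(?<=\S)_(?!\w)")


def paragraph_to_html(para: str) -> str:
    """Join a hard-wrapped paragraph, escape it, and mark up _italics_."""
    escaped = html.escape(" ".join(para.split()), quote=False)
    return "<p>" + ITALIC_RE.sub(r"<i>\1</i>", escaped) + "</p>"
```

Indents, justification, and block-quote margins (items 7, 8, and 22) would then be a matter for the stylesheet rather than for the converter.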


I am not sure what these mean. Can someone elaborate?
> 27. make the framework of the document _obvious_.
> 9. use a reasonable line-width. full-screen is too wide.
> 14. don't put pagenumbers inside the text/paragraphs.
> 17. navigation aids among chapters are quite useful.
> 21. give your readers the luxury of generous leading!


The following are indeterminate because they are dependent on input or output. For instance, PG ASCII files don't really have tables. Creating a definition sufficiently broad to cover all possibilities but not screw with the surrounding text will be an interesting exercise.
> 5. start chapters on a new page, maybe even a recto.
> 16. pictures (even doodad thingees) enliven the text.
> 19. if it works better that way, turn a table on its side.
> 20. resize tables and images so they fit on one screen.
> 23. create running heads and/or footers on each page.
> 26. show where we are in the book (page 39 of 208).
> 28. what the heck, just for the fun of it, make an index! (This last one might take a lot of computing power.)

@bowerbird
Will the output be in HTML? That's the closest I know to a universal file type. You could create a BD file (in HTML). I don't know yet if the specs are accessible.

Are you familiar with flex and yacc? They are what I would use to do this.


EDIT: @bowerbird I did not see your post until after I posted mine. Some of my questions have been answered.

Last edited by Nate the great; 11-02-2007 at 07:52 PM.
Old 11-02-2007, 09:32 PM   #10
bowerbird
nate said:
> Will the output be in HTML? That's the closest I know to a universal file type.

i will transform the pg-ascii files into my own format -- z.m.l. --
which stands for "zen markup language", a light-markup system.

i designed z.m.l. based on pg-ascii, to speed the transformation.

or, more accurately, to tell the story the way it really evolved,
i originally wrote a viewer-program for project gutenberg e-texts.

what happened was that writing the viewer-program was easy,
but resolving inconsistencies in the p.g. e-texts made it complex.

at some point, i threw up my hands and said, "it will be easier to
create a set of dirt-simple rules, and then make all the e-texts be
consistent with that rule-set, than continue efforts at overcoming
never-ending p.g. inconsistencies." fortunately, i'd already coded
routines to resolve the bulk of the inconsistencies, so it was simple
to "convert" an e-text just by outputting a _consistent_ version of it.
(for example, making sure it had 4 blank lines before each header.)
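
The "output a consistent version" step could look something like the sketch below. It is not the z.m.l. converter, whose rules are not given here beyond the 4-blank-line example; it assumes the heading line indices have already been found by some other routine, and the name `normalize_headings` is hypothetical.

```python
def normalize_headings(lines, heading_indices, blank_before=4):
    """Emit lines with exactly `blank_before` blank lines ahead of each heading."""
    headings = set(heading_indices)
    out = []
    for i, line in enumerate(lines):
        if i in headings:
            # Drop whatever blank lines preceded the heading, then add the exact count.
            while out and not out[-1].strip():
                out.pop()
            out.extend([""] * blank_before)
        out.append(line.rstrip())
    return out
```

Once every file has passed through a step like this, a viewer or converter can rely on the blank-line count alone and skip the guesswork.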

in the end, it's better to have a consistent library, because then other
developers can concentrate on adding value, instead of parsing text...

z.m.l. is a set of simple rules for expressing the _structure_ of a book.
that is to say, all the structural elements are indicated in a unique way.

this means a zml-viewer can take a z.m.l. file as _input_ and render it.

it also means that converter-routines can transform a z.m.l. file into
outputs of various types. thus far, i have focused on .html and .pdf...

z.m.l. viewer-apps are easy to program, because the z.m.l. format is simple.
i've already written various iterations in 3 languages, including basic and perl.

i'm biased, of course, but i think my viewer kicks other programs to the curb...
so, eventually, if you've got a book in z.m.l., you won't even want to convert it,
because there will be a zml-viewer-program that will run wherever it's needed...

long-run, i even expect browsers to accept z.m.l. and display it correctly.

here's a webpage where you can see zml-to-html canned demos:
> http://z-m-l.com/go/vl3.pl
click the linked book-titles to see the z.m.l., or the button to convert it.

if you're brave, you can even experiment with live zml-to-html conversion:
> http://z-m-l.com/go/zmldingus093.pl
click in the same canned demos from the url above, or click "skeleton"
to bring in a skeleton book, which you can edit, and then click "do it"...

-bowerbird
Old 11-02-2007, 09:40 PM   #11
jbenny
Addict
Posts: 323
Karma: 358
Join Date: May 2007
Device: Tablet PC and Nokia N800
Two existing tools that do most of what you want: Gutenmark and HTML Book Fixer. Both free. They can be used separately or in combination, as they each have unique capabilities.
Old 11-03-2007, 02:58 AM   #12
bowerbird
jbenny said:
> Two existing tools that do most of what you want: Gutenmark and HTML Book Fixer.

um, no. not even close, really. honorable efforts, but my intention is much wider.

-bowerbird
Old 11-03-2007, 03:20 AM   #13
Robert Marquard
Delphi-Guy
Posts: 285
Karma: 1151
Join Date: May 2006
Location: Berlin, Germany
Device: iLiad, Palm T3
Please, bowerbird has been pestering PG with his ZML for a long time. Distributed Proofreaders banned him recently. Please do not let him promote his inadequate format here.
Old 11-03-2007, 05:04 AM   #14
HappyMartin
Martin Kristiansen
Posts: 1,460
Karma: 6786378
Join Date: Aug 2007
Location: Johannesburg
Device: Kindle International Ipad 2
A lot of this stuff is way over my head. The only thing I'm not keen on is the "white space is free so use it liberally". I have some books with so much white space that I need to change pages very frequently. Drains the battery and slows everything down with the slower e ink refresh rates. Just a suggestion.
Old 11-03-2007, 10:36 AM   #15
RWood
Technogeezer
Posts: 7,233
Karma: 1601464
Join Date: Nov 2006
Location: Virginia, USA
Device: Sony PRS-500
So his questions were really a set-up for his ZML scheme.

Just what we need, another new format.

I'll stick to what I use now thank you.


All times are GMT -4.

