|  11-02-2007, 04:08 PM | #1 | 
| Banned    Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles | 
				
				What "Cleaning Up" Do Project Gutenberg Texts Need [closed]
			 
			
Editor's Note: bowerbird graciously allowed us to move this post to its own thread, it was originally posted here. -- NatCh

vivaldirules said:
> I had dreamy visions of downloading all of Project Gutenberg and
> carrying a fair fraction of all mankind's knowledge and wisdom with me
> everywhere. This was to be a monumental step-change in my life.
> The first thing I did was to download a book from there in TXT format,
> import and copy it to my Reader, and with great anticipation I sat down
> to enjoy it. I was instantly dismayed by how distracting it was to read
> with broken lines, forced hyphenation, poor pagination, and no indexing.

this has been something i've been working on for some time...

and, ironically, i just sent a message to the p.g. listserves yesterday, constructing a list of the things that need to be done to a p.g. e-text in order to make it typographically beautiful, and asking for input...

so i will repeat the list -- and the request for input -- here, for you...

> the idea is that you've loaded a plain-ascii p.g. e-text into
> your word-processor or desktop-publishing program with
> the objective of making it beautiful. what exactly do you do?
>
> please add to this, the start of a list, off the top of my head:
>
> 1. get rid of that ugly legalese at the top of the file.
> 2. make the title-page and front-matter look nice.
> 3. hotlink the table of contents. make one if necessary.
> 4. make all the headers big, bold, and distinctive, and
> 5. start chapters on a new page, maybe even a recto.
> 6. get rid of the empty lines between paragraphs, and
> 7. use book-style indents on each paragraph instead.
> 8. use full justification. or at least half-ragged.
> 9. use a reasonable line-width. full-screen is too wide.
> 10. white-space is free in an e-book, so use it liberally.
> 11. make block-quotes distinctive, for remix purposes.
> 12. links are great, but spare us the ugly blue underlines.
> 13. is an unlucky number.
> 14. don't put pagenumbers inside the text/paragraphs.
> 15. turn pg-ascii underscored text into _real_ italics.
> 16. pictures (even doodad thingees) enliven the text.
> 17. navigation aids among chapters are quite useful.
> 18. footnotes should have links going _both_ ways.
> 19. if it works better that way, turn a table on its side.
> 20. resize tables and images so they fit on one screen.
> 21. give your readers the luxury of generous leading!
> 22. block-quotes should be indented on the left and right.
> 23. create running heads and/or footers on each page.
> 24. (leaving some space for you...)
> 25. (leaving some space for you...)
> 26. show where we are in the book (page 39 of 208).
> 27. make the framework of the document _obvious_.
> 28. what the heck, just for the fun of it, make an index!
> 29. make the typesize big enough to be read easily!
> 30. get rid of that ugly legalese at the bottom of the file.
>
> these are general strategies. not all of them will be
> applicable to any one specific situation, and some
> (e.g., #8) are up to the preferences of the individual.
>
> and obviously, some of these could be fragmented
> into a very large number of sub-points, like #10...

again, if there's anything you can add, i would appreciate it.

my aim is to write a program that will do a mass-beautification of the entire project gutenberg library. i've made good progress.

> I don't have to tell you what happened when I then tried PDF files
> from the Internet Archive.

i wish you would've told us. i assume the text was too small to read.

-bowerbird |
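Editor's Note: as a rough illustration of items 1, 15, and 30 in the list above, here is a minimal Python sketch. It is not bowerbird's program. The marker wording in `START_RE` and `END_RE` is an assumption (many Project Gutenberg files use lines of this shape, but older files differ), the function names are made up for this example, and HTML `<i>` tags are just one possible target for "real" italics.

```python
import re

# Assumed marker wording: many PG files bracket the book with lines such as
# "*** START OF THIS PROJECT GUTENBERG EBOOK ... ***"; older files differ.
START_RE = re.compile(r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG.*$",
                      re.MULTILINE | re.IGNORECASE)
END_RE = re.compile(r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG.*$",
                    re.MULTILINE | re.IGNORECASE)

# _word_ or _several words_ on a single line, not glued to other word
# characters (so identifiers like snake_case_name are left alone).
ITALIC_RE = re.compile(r"(?<!\w)_([^_\n]+)_(?!\w)")


def strip_legalese(text: str) -> str:
    """Drop everything before the start marker and after the end marker.

    If a marker is missing, that end of the text is left untouched.
    """
    start = START_RE.search(text)
    if start:
        text = text[start.end():]
    end = END_RE.search(text)
    if end:
        text = text[:end.start()]
    return text.strip("\n") + "\n"


def underscores_to_italics(text: str) -> str:
    """Turn pg-ascii _underscored_ runs into <i>...</i>."""
    return ITALIC_RE.sub(r"<i>\1</i>", text)
```

Italic runs that span a line break are deliberately skipped here; a real converter would need to handle them, probably after paragraphs have been unwrapped.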
|   | 
|  11-02-2007, 04:18 PM | #2 | |
| When's Doughnut Day?            Posts: 10,059 Karma: 13675475 Join Date: Jul 2007 Location: Houston, TX, US Device: Sony PRS-505, iPad | Quote: 
 Nice list of wants from PG. I'll give this more thought. | |
|   | 
|  11-02-2007, 04:27 PM | #3 | |
| Grand Sorcerer            Posts: 11,470 Karma: 13095790 Join Date: Aug 2007 Location: Grass Valley, CA Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7 | Quote: 
 On small screens the margin needs to be almost as wide or as wide as the screen. Too many page changes interrupt the reading pleasure. Images should be adjusted for color (perhaps converted to gray scale) and adjusted for the screen. For gutenberg you need to check paragraph splits, bad scan errors and other items. You almost have to read the whole book, I am afraid, to get it right. Dale | |
|   | 
|  11-02-2007, 04:45 PM | #4 | 
| Resident Curmudgeon            Posts: 80,675 Karma: 150249619 Join Date: Nov 2006 Location: Roslindale, Massachusetts Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3 | 
			
			The thing to do is take the book and convert it, then read it, fix it, and post the fixes. If someone else is reading it, please post any corrections you find so they can be fixed; then a new version can be made and posted. It's not feasible to read the whole book before posting. But, as with Music of the Spheres, I fixed a few things I found right away and fixed the rest as I read, until I was done and had fixed everything I found.
		 | 
|   | 
|  11-02-2007, 04:46 PM | #5 | 
| Gizmologist            Posts: 11,615 Karma: 929550 Join Date: Jan 2006 Location: Republic of Texas Embassy at Jackson, TN Device: Pocketbook Touch HD3 | 
			
			What you say is true, JSWolf, but bowerbird is trying to come up with an app to fix as much of the obvious stuff automagically as possible.     | 
|   | 
|  11-02-2007, 04:49 PM | #6 | |
| Resident Curmudgeon            Posts: 80,675 Karma: 150249619 Join Date: Nov 2006 Location: Roslindale, Massachusetts Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3 | Quote: 
 | |
|   | 
|  11-02-2007, 04:51 PM | #7 | 
| Resident Curmudgeon            Posts: 80,675 Karma: 150249619 Join Date: Nov 2006 Location: Roslindale, Massachusetts Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3 | 
			
			It should not be too hard to make such an app to do some of what's needed for an initial clean up. The hardest thing, I think, for a lot of people is the page numbers. I know Word and Book Designer have regexp. But if you don't know regexp, or don't know you can use it to remove the page numbers, then you will have to remove them manually, leave them in, or not convert.
		 | 
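Editor's Note: to make the regexp suggestion concrete, here is a small Python sketch of page-number removal. The marker styles it looks for (`[Pg 12]`, `[Page 12]`, `{12}`) are assumptions for illustration; each e-text has to be checked for the style it actually uses, and the function name is made up for this example.

```python
import re

# Hypothetical page-marker styles: "[Pg 12]", "[Page 12]", "{12}".
MARKER = r"(?:\[(?:Pg|Page)\.?\s*\d+\]|\{\d+\})"

# A marker alone on its line: remove the whole line, newline included,
# so no stray blank line (and no false paragraph break) is left behind.
OWN_LINE_RE = re.compile(r"^[ \t]*" + MARKER + r"[ \t]*\n", re.MULTILINE)

# A marker inside running text: remove it together with the spaces before it.
INLINE_RE = re.compile(r"[ \t]*" + MARKER)


def remove_page_numbers(text: str) -> str:
    """Strip page markers of the assumed styles from a plain-text book."""
    text = OWN_LINE_RE.sub("", text)
    return INLINE_RE.sub("", text)
```

Removing markers that sit on their own line first matters: deleting only the marker would leave an empty line, which later steps would read as a paragraph break.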
|   | 
|  11-02-2007, 06:31 PM | #8 | 
| Banned    Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles | 
			
dalede said:
> curly quotes (double and single), curly apostrophes,
> and dashes that are really dashes.

curly-quotes and em-dashes, how could i forget those? :+)

oh well, like it said, it was a list off the top of my head...

***

jswolf said:
> Isn't that up to the software
> doing the displaying of the file as to
> how it displays the page number?

yes sir, there is a mixture of things in the list, some of which aren't relevant for all situations, and some of which are geared to functionality, not beauty. (except functionality _is_ beautiful.) practically none of them are fully cut-and-dried.

for that particular item, i was thinking of .pdf, and other formats of the fixed-page persuasion, where the number of total pages is known for any particular conversion (e.g., at textsize=12), so as to be a basis for a relativistic comparison with another conversion (e.g., at textsize=16)... something like "page 180" has very little meaning if we don't know if there are 200 or 800 or 1600 pages in any one particular conversion of a book.

> It should not be too hard to make such an app.
> to do some of what's needed an initial clean up.

actually, it's not really easy, for the simple reason that p.g. e-texts have maddeningly inconsistent formatting...

even where there is a straightforward rule on something -- e.g., there should be 4 blank lines before a chapter heading -- consistency checks weren't done to ensure that's the case.

so even something as relatively simple as finding headings must be engineered to catch inconsistencies, and of course, since you don't _know_ all the ways they were inconsistent, it's not that simple to know what code you have to engineer.

the worst part of all -- as i'm sure you volunteers who have converted p.g. e-texts already know -- is that p.g. employed no method to inform us what lines should _not_ be rewrapped, such as lines of poetry, lines in a table, lines in address-blocks, and so on. finding and fixing these lines can be time-consuming. and writing routines that can root them out is not entirely trivial.

my intention is to fix the files, and mount a mirror with my files. michael hart graciously agreed to provide diskspace/bandwidth...

-bowerbird

p.s. i agree entirely that whatever beautification is done needs to be subjected to the quality-control process of being read by people. my hope is that the beauty of the files will be an alluring invitation. for my part, i intend to make error-reporting and feedback _much_ more simple and responsive than it is with project gutenberg itself. it will be more wiki-like, in that reports will be immediately visible... |
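Editor's Note: the rewrapping problem described above can be sketched as a heuristic. This is not bowerbird's routine; it is a guess at one simple approach. It assumes that a block is verse, a table, or an address if any of its lines is indented or if all of its lines except the last are short. The 55-character threshold and the function names are invented for this example, and the heuristic will misjudge some blocks, which is exactly why a human read-through is still needed.

```python
import re


def looks_preformatted(lines: list[str], min_prose_width: int = 55) -> bool:
    """Guess whether a block should NOT be rewrapped."""
    # Indented lines usually signal verse, tables, or address blocks.
    if any(line[:1] in (" ", "\t") for line in lines):
        return True
    # In hard-wrapped prose every line but the last is close to full width.
    return len(lines) > 1 and max(len(l) for l in lines[:-1]) < min_prose_width


def unwrap_block(block: str) -> str:
    """Join the lines of one prose block; leave other blocks untouched."""
    body = block.rstrip("\n")
    tail = block[len(body):]          # keep any trailing newline as it was
    lines = body.split("\n")
    if not body or looks_preformatted(lines):
        return block
    return " ".join(line.strip() for line in lines if line.strip()) + tail


def unwrap(text: str) -> str:
    """Unwrap prose paragraphs, preserving the blank-line runs between blocks."""
    parts = re.split(r"(\n{2,})", text)   # even indexes: blocks, odd: separators
    for i in range(0, len(parts), 2):
        parts[i] = unwrap_block(parts[i])
    return "".join(parts)
```

The separators are kept verbatim so that a rule such as "4 blank lines before a chapter heading" survives the unwrapping step.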
|   | 
|  11-02-2007, 06:49 PM | #9 | 
| Sir Penguin of Edinburgh            Posts: 12,375 Karma: 23555235 Join Date: Apr 2007 Location: DC Metro area Device: Shake a stick plus 1 | 
			
This first set will be fairly easy to implement (if outputting in HTML).
> 1. get rid of that ugly legalese at the top of the file.
> 3. hotlink the table of contents. make one if necessary.
> 4. make all the headers big, bold, and distinctive, and
> 6. get rid of the empty lines between paragraphs, and
> 7. use book-style indents on each paragraph instead.
> 8. use full justification. or at least half-ragged.
> 10. white-space is free in an e-book, so use it liberally.
> 11. make block-quotes distinctive, for remix purposes.
> 12. links are great, but spare us the ugly blue underlines.
> 15. turn pg-ascii underscored text into _real_ italics.
> 18. footnotes should have links going _both_ ways.
> 22. block-quotes should be indented on the left and right.
> 29. make the typesize big enough to be read easily!
> 30. get rid of that ugly legalese at the bottom of the file.

I am not sure what these mean. Can someone elaborate?
> 27. make the framework of the document _obvious_.
> 9. use a reasonable line-width. full-screen is too wide.
> 14. don't put pagenumbers inside the text/paragraphs.
> 17. navigation aids among chapters are quite useful.
> 21. give your readers the luxury of generous leading!

The following are indeterminate because they are dependent on input or output. For instance, PG ASCII files don't really have tables. Creating a definition sufficiently broad to cover all possibilities but not screw with the surrounding text will be an interesting exercise.
> 5. start chapters on a new page, maybe even a recto.
> 16. pictures (even doodad thingees) enliven the text.
> 19. if it works better that way, turn a table on its side.
> 20. resize tables and images so they fit on one screen.
> 23. create running heads and/or footers on each page.
> 26. show where we are in the book (page 39 of 208).
> 28. what the heck, just for the fun of it, make an index!
(This last one might take a lot of computing power.)

@bowerbird
Will the output be in HTML? That's the closest I know to a universal file type. You could create a BD file (in HTML). I don't know yet if the specs are accessible.

Are you familiar with flex and yacc? They are what I would use to do this.

EDIT: @bowerbird I did not see your post until after I posted mine. Some of my questions have been answered.

Last edited by Nate the great; 11-02-2007 at 06:52 PM. |
|   | 
|  11-02-2007, 08:32 PM | #10 | 
| Banned    Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles | 
			
nate said:
> Will the output be in HTML? That's the closest I know to a universal file type.

i will transform the pg-ascii files into my own format -- z.m.l. -- which stands for "zen markup language", a light-markup system. i designed z.m.l. based on pg-ascii, to speed the transformation.

or, more accurately, to tell the story the way it really evolved, i originally wrote a viewer-program for project gutenberg e-texts. what happened was that writing the viewer-program was easy, but resolving inconsistencies in the p.g. e-texts made it complex.

at some point, i threw up my hands and said, "it will be easier to create a set of dirt-simple rules, and then make all the e-texts be consistent with that rule-set, than continue efforts at overcoming never-ending p.g. inconsistencies."

fortunately, i'd already coded routines to resolve the bulk of the inconsistencies, so it was simple to "convert" an e-text just by outputting a _consistent_ version of it. (for example, making sure it had 4 blank lines before each header.)

in the end, it's better to have a consistent library, because then other developers can concentrate on adding value, instead of parsing text...

z.m.l. is a set of simple rules for expressing the _structure_ of a book. that is to say, all the structural elements are indicated in a unique way.

this means a zml-viewer can take a z.m.l. file as _input_ and render it. it also means that converter-routines can transform a z.m.l. file into outputs of various types. thus far, i have focused on .html and .pdf...

z.m.l. viewer-apps are easy to program, because the z.m.l. format is simple. i've already written various iterations in 3 languages, including basic and perl. i'm biased, of course, but i think my viewer kicks other programs to the curb...

so, eventually, if you've got a book in z.m.l., you won't even want to convert it, because there will be a zml-viewer-program that will run wherever it's needed... long-run, i even expect browsers to accept z.m.l. and display it correctly.

here's a webpage where you can see zml-to-html canned demos:
> http://z-m-l.com/go/vl3.pl

click the linked book-titles to see the z.m.l., or the button to convert it.

if you're brave, you can even experiment with live zml-to-html conversion:
> http://z-m-l.com/go/zmldingus093.pl

click in the same canned demos from the url above, or click "skeleton" to bring in a skeleton book, which you can edit, and then click "do it"...

-bowerbird |
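Editor's Note: the one concrete rule mentioned above, 4 blank lines before each header, can be sketched as a normalization pass. This is only an illustration and says nothing about how z.m.l. or bowerbird's converter actually works. The heading pattern (`CHAPTER`, `BOOK`, or `PART` followed by a number) is an assumption, and real e-texts would need a much broader detector.

```python
import re

# Assumed heading shape: "CHAPTER IV", "BOOK 2", "PART III. The Return", ...
HEADING_RE = re.compile(r"^(?:CHAPTER|BOOK|PART)\s+[IVXLC\d]+\b.*$")


def normalize_headings(text: str, blank_lines: int = 4) -> str:
    """Make sure each detected heading is preceded by exactly
    `blank_lines` empty lines, however many the source had."""
    out: list[str] = []
    for line in text.split("\n"):
        if HEADING_RE.match(line.strip()):
            # Discard whatever run of blank lines came before the heading...
            while out and out[-1].strip() == "":
                out.pop()
            # ...and emit the canonical run (nothing at the very top of the file).
            if out:
                out.extend([""] * blank_lines)
        out.append(line)
    return "\n".join(out)
```

Once every file follows the same rule, a viewer or converter can find headings by counting blank lines instead of guessing, which is the point made above about a consistent library.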
|   | 
|  11-02-2007, 08:40 PM | #11 | 
| Addict     Posts: 323 Karma: 358 Join Date: May 2007 Device: Tablet PC and Nokia N800 | 
			
			Two existing tools that do most of what you want: Gutenmark and HTML Book Fixer. Both free. They can be used separately or in combination, as they each have unique capabilities.
		 | 
|   | 
|  11-03-2007, 01:58 AM | #12 | 
| Banned    Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles | 
			
jbenny said:
> Two existing tools that do most of what you want: Gutenmark and HTML Book Fixer.

um, no. not even close, really. honorable efforts, but my intention is much wider.

-bowerbird |
|   | 
|  11-03-2007, 02:20 AM | #13 | 
| Delphi-Guy          Posts: 285 Karma: 1151 Join Date: May 2006 Location: Berlin, Germany Device: iLiad, Palm T3 | 
			
			Please, bowerbird has been pestering PG with his ZML for a long time. Distributed Proofreaders recently banned him. Please do not let him promote his inadequate format here.
		 | 
|   | 
|  11-03-2007, 04:04 AM | #14 | 
| Martin Kristiansen            Posts: 1,546 Karma: 8480958 Join Date: Aug 2007 Location: Johannesburg Device: Kindle International Ipad 2 | 
			
			A lot of this stuff is way over my head. The only thing I'm not keen on is the "white space is free so use it liberally". I have some books with so much white space that I need to change pages very frequently. Drains the battery and slows everything down with the slower e ink refresh rates. Just a suggestion.
		 | 
|   | 
|  11-03-2007, 09:36 AM | #15 | 
| Technogeezer            Posts: 7,233 Karma: 1601464 Join Date: Nov 2006 Location: Virginia, USA Device: Sony PRS-500 | 
			
			So his questions were really a set-up for his ZML scheme. Just what we need, another new format. I'll stick to what I use now, thank you. | 
|   | 