Formatting Gutenberg txt Files
AnsgarSerif 10-23-2006, 01:54 AM This may or may not fit here, but I've been working on macros in OpenOffice.org and Microsoft Office to convert .txt files on Gutenberg to something more readable and "bookmarkable."
Since I'm new at macros and hate programming, I had to use both MS Word and OpenOffice. MS Word is able to insert page breaks at chapter and book headings (which OpenOffice can't, apparently) and doesn't randomly erase swaths of text when running the "End-of-line Remover" macro (which OpenOffice does consistently). The documentation (if I may be allowed to disgrace the name) is therefore somewhat complex but I'm hoping that can be fixed - and I think the end result is well worth the effort.
I have it streamlined to five macros. After about 10-20 minutes of fine-tuning, I can have a fully bookmarked, formatted and stylized book ready to read comfortably.
I looked around the forums but haven't found anything like what I can make from txt files, so I'd like to post up the macros, the template and the exported PDF I use for the Sony Reader. I imagine that people who know more about the programming side of macros could streamline this a lot better - or get things working completely in OpenOffice.org, at least.
If this interests anybody, take it and run with it (or start from scratch and do something much better). Note that the page format for the Sony Reader is 3.57" x 4.82" - I don't have mine yet and I picked up these dimensions from somewhere in the past.
Thanks everyone,
Sam
EDIT: New Version 0.3.3
There was a bug in 0.3 that prevented applying the text body style to the entire project before applying heading styles, which made the text font and size inconsistent. Should be fixed now. I know, I know - absolutely no quality control on my part.
Here we go,
Major changes:
One macro to convert text (as opposed to five)
Everything (including page breaks) runs in OpenOffice.org
User prompt for Author, Title
Author, Title inserted in page headers
Separate macros for saving to RTF and PDF
Changes from 0.3
Primary text changed to Garamond 13
Primary text consistency bug fixed
Fixed infinite loop bug
Changes from 0.3.2
Page size changed to 5.24" x 6.69", since more people seemed to prefer that on the forums
Chapter Headings 2 and 3 no longer have a page break inserted before them (Sorry about that)
It's a bit on the slow side - about three minutes for a 600 page book (formatted for the Sony Reader's screen size) and about ten minutes for an 1800 page book.
Anybody got any good ideas I could add to this?
Sam
ChrisAllenFiz 10-23-2006, 02:44 AM Thanks for your work, I will have a go with these macros tonight
BobVA 10-23-2006, 08:34 PM Great idea! Thanks for posting your macros.
FYI, for text or RTF you don't need to format for the Reader's screen size. Just sort out the paragraph ends and the Reader will take care of the rest.
If there's an HTML file, I'll import that into Word, save it as an RTF, "select all", bump up the primary text to 15 points (via the font "size up" button) change it to Arial font and then save it again. That process keeps any other larger or smaller fonts (chapter heads, for example) in a proportional size. I've got to "macro-ize" that process.
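For anyone who would rather script the "sort out the paragraph ends" step than record a macro, here is a minimal Python sketch. It is not the macro from this thread; it assumes the usual Gutenberg layout, where lines are hard-wrapped and paragraphs are separated by blank lines. The function name unwrap_paragraphs is made up for the example.

```python
import re

def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines; blank lines are treated as paragraph breaks."""
    # Normalise line endings first (assumption: the file may use CRLF).
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Split on one or more blank (or whitespace-only) lines.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Collapse every run of whitespace inside a paragraph to a single space.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)

sample = "First line of a paragraph\r\nthat was wrapped by hand.\r\n\r\nSecond paragraph."
print(unwrap_paragraphs(sample))
```

Poetry, tables and anything else that relies on its line breaks would be flattened by this, so those parts still need fixing by hand.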
AnsgarSerif 10-23-2006, 09:41 PM Hey, BobVA
I definitely like your idea for HTML files - five minutes and you've got a classic all set to read!
Formatting for the screen size is due to my choice of PDF as the final format - I was under the impression that you can't make bookmarks in a RTF file. Since I like to jump to favorite chapters/parts of books (I mean, really, isn't "The Inferno" so much more fun to read after Circle 5?), I thought that bookmark support was pretty important, hence the PDF. You could probably set the bookmarks automatically in a .doc file but I didn't know how to do that. On the other hand, it would be much better to save the final document in a few different formats (RTF, DOC, ODT, PDF and BBeB whenever Sony releases the specs) - that way, there can be something for everyone!
Right now, I'm trying to look around for a way to input a dialog box so that the user can input the Author and Title before saving and then use that info to automatically create the file name - any ideas?
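The file-name part of that idea can be sketched in a few lines of Python. This is only an illustration, not OpenOffice.org Basic: the "Author, Title" pattern and the make_filename helper are assumptions, and the author and title would come from whatever prompt the macro ends up using.

```python
import re

def make_filename(author: str, title: str, extension: str = "pdf") -> str:
    """Build 'Author, Title.ext', dropping characters Windows rejects in file names."""
    def clean(part: str) -> str:
        # Remove characters that are not allowed in Windows file names.
        part = re.sub(r'[<>:"/\\|?*]', "", part)
        # Tidy up any doubled or trailing spaces left behind.
        return " ".join(part.split())
    return f"{clean(author)}, {clean(title)}.{extension}"

print(make_filename("Herman Melville", "Moby Dick"))
```

Stripping the forbidden characters rather than replacing them keeps the name readable, at the cost of two different titles occasionally collapsing to the same file name.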
NatCh 10-24-2006, 10:05 AM Nice work, AnsgarSerif! Thanks for sharing the fruits of your labor. :)
Regarding RTF's & Bookmarks -- you can't set links in an RTF, like in a PDF or BBeB, but you can set bookmarks once you get the text into the Reader or Connect Software.
I just wanted to clarify the terminology, as it might confuse new folks. :beam:
BobVA 10-24-2006, 09:18 PM ..., but you can set bookmarks once you get the text into the Reader or Connect Software.
I just wanted to clarify the terminology, as it might confuse new folks.
Here's how I've been doing this for RTF's:
- Insert a page break at the start of the chapter in your word processor
- Put in a one or two-character special flag (I use "||", but you can use anything that's not in the text) before the chapter headings.
- Save the file
- Import the file in Connect and open it.
- Use the search command to locate the flag strings and then click the bookmark button.
Just takes a few seconds to do this as you can repeat the "find/mark" without having to re-enter the flag characters; i.e. it's just a matter of two mouse clicks per bookmark after the first one.
The page break before the chapter starts isn't absolutely necessary, but it makes the bookmark screen on the Reader look a lot tidier.
Cheers,
Bob
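The flagging step in the list above could also be done before the file ever reaches the word processor. A rough Python sketch, assuming headings are lines that begin with "Chapter" or "Book" (an assumption about the text, and one that will also catch an ordinary sentence starting with "Book"); the flag_headings helper is hypothetical:

```python
import re

FLAG = "||"  # the flag from the post; any string that is not in the text works
# Assumption: a heading is a line starting with the word "chapter" or "book".
HEADING = re.compile(r"^(chapter|book)\b", re.IGNORECASE)

def flag_headings(lines):
    """Return a copy of lines with FLAG put in front of each heading line."""
    if any(FLAG in line for line in lines):
        raise ValueError("flag already occurs in the text; pick another one")
    return [FLAG + line if HEADING.match(line.strip()) else line for line in lines]

print(flag_headings(["CHAPTER I", "It was a dark night.", "Chapter 2"]))
```

The check at the top is there because the search in Connect would otherwise stop on false hits.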
NatCh 10-24-2006, 11:22 PM Nice system, a bit labor intensive, but not too burdensome. Thanks, BobVA!
bump up the primary text to 15 points (via the font "size up" button)
I've been looking for how to do this in Word, but can't find it. Any tips would be appreciated. Thanks!
-lint
NatCh 10-25-2006, 12:43 PM You have to add the "Grow Font 1 Pt" button to your toolbar -- right click on the toolbar and choose "Customize," then find the tool under the "Format" category and drag it to one of your existing toolbars. :beam:
NatCh 10-25-2006, 01:27 PM You're welcome!
Bob Russell 10-25-2006, 03:48 PM Another way in MS Word without the button is to select all the text and then hit "Ctrl-Shift >" to increase font size or "Ctrl-Shift <" to reduce it.
NatCh 10-25-2006, 04:01 PM Really?!? Now that's news to me, and very welcome news at that! Now I won't have to dig up the button every time I'm on someone else's computer. Thanks, Bob!
Slava 10-25-2006, 06:28 PM Really?!? Now that's news to me, and very welcome news at that!
That is why, one has to read documentation :p
AnsgarSerif 10-25-2006, 07:04 PM Hey everybody,
Thanks for your posts so far!
Here's where I'm at right now. I've been getting help on the OpenOffice.org forum from JohnV, who is entirely responsible for providing the code that works around the problem of OpenOffice deleting text after 64K worth of characters. There's still an error that pops up (probably from me guessing what the names of search strings are) but as soon as I get that ironed out, I'll post a new macro. Hopefully, the laborious documentation will get a lot shorter soon.
I'm planning to create this macro to output an RTF file as well as a PDF - is there a way you can have the Connect Software search for a specific heading type? That would allow for near-automatic bookmarking with RTFs, I think.
These are the ideas that I'd like to incorporate at some later point:
User Prompt for Title, Author
Automatically save file name as Author, Title
Save in RTF, DOC, HTML, PDF and BBeB (eventually) with bookmarks for all supported documents
Format page based on user prompt (for PDF output to Sony Reader, iREX Iliad, Amazon Kindle or whatever)
Author and Title in Header (Is this already available in Sony Reader?)
User prompt for font, size
If anybody has any more ideas for this lil' project, let me know! If we can get enough features put into macros, maybe somebody could code a utility that calls soffice.bin as a service and does everything in the background.
Thanks all,
Sam
NatCh 10-25-2006, 08:48 PM That is why, one has to read documentation :p
Sacrilege! :laugh4:
AnsgarSerif 10-26-2006, 03:08 AM Here it is, version 0.2!
Aside from everything being in one macro (unless you really wanted chapter page-breaks), the only thing that's really new is that the macro outputs a RTF file and a PDF file onto the desktop. Everything being in one macro is pretty big, though!
I'm indebted to John Vigor, who gave me a great bit of code that made everything possible - everyone give him props if he ever comes over here from the OOo forum! Thanks to him, we've now got a single macro that needs to be run to convert .txt files into something readable and bookmark-able. There are still quite a few things that would make this even better - but I think I'll take a day off first :happy2:
If anybody wants to run this and leave some feedback (or instructions on how to put page breaks into an OOo macro!), I'd love to hear from ya. The documentation is much more fluid now, mostly just explaining what everything does and including a few pictures to help those who don't often use OpenOffice.org.
Thanks, everyone - now I can finally go to bed!
Sam
EDIT: The newest version is available on the first post.
AnsgarSerif 10-26-2006, 01:41 PM A quick note:
I just realized that opening the outputted RTF file in Word, running the Page_Break macro and re-opening it in OpenOffice will change all the headings and prevent the exported PDF file from marking any bookmarks.
You can prevent this by inserting the RTF as a file:
1) Finish the Page_Break macro and close the file
2) Open the Project Gutenberg Conversion Template.ott file to create a new document
3) Go to Insert > File ... and select the RTF
This will convert Word headings to the proper OpenOffice headings and preserve bookmarking capability in RTF. I've uploaded the new template file, since the one from last night was saved as a document, not a template. Sorry about that - I guess the mind starts to dwindle after 2 AM. Everything besides the file type is exactly the same.
Sam
EDIT: The newest version is available on the first post.
I'd like to say that this is a very nice macro you did and it does help quite a bit :D I've been able to format a book in just a few minutes. The thing is, I'm a bit confused on this area
You can set the title to “Title,” the by-line to “Byline,” the author to “Author” – you see where this is going. If there are sub-chapter titles, I like to give them a style like “Subtitle” or “Sony Reader – Chapter Heading 3” and then go to Tools > Outline Numbering and change the drop-down boxes to reflect the applied style. This is so that, when you export the document to PDF, the sub-chapter titles will appear as a sub-bookmark of the chapter.
(sorry for not being able to figure it out, I just don't want to change something and have it end up doing something weird :p.)
Also, I tried the new macro in Word for page breaks and, well, it did "break the page" so to speak. It ended up cutting the first few pages in half and not doing much else for the rest of the book (could you post up an example of something you've converted using your macro and what a final product should look like? This way I can know I'm doing things right.) Thanks again for all the hard work you've put into this. I for one appreciate it :).
AnsgarSerif 10-26-2006, 11:06 PM njt,
Sorry that the Word macro broke the document - I tested it once (on the same document I recorded the macro on - big mistake) at around 2 in the morning. The GOOD news is that there was a really simple way to insert the page breaks in OpenOffice. I just had to format the chapter and book styles properly. I've included author and document headers that are automatically populated and I'm working on cleaning the code right now and then implementing an automatic file name. As soon as I get that done, I'll post up a new template. Everything's in one macro set now!
All of the styles that I use are in the "Styles and Formatting" window under the category "Custom Styles." You can modify the styles as you see fit (changing the font and size and what-not) without doing any damage. I'll attach a PDF with the different styles as they'd look in a document - basically, you can use the styles to fine-tune things like subtitles and sub-chapter headings to make the text look better.
Right now, the PDF bookmarks are exported in this tree:
Sony Reader - Title
Sony Reader - Book Heading
Sony Reader - Chapter Heading 1
Sony Reader - Chapter Heading 2
This will create a primary bookmark using the Title, a sub-bookmark using the Book Headings (like in "Moby Dick," where there are multiple "books") and a sub-bookmark of that using Chapter Heading 1. If you want to change the way the PDF is bookmarked, you can go to Tools > Outline Numbering and select the style you want bookmarked from the drop-down box. Personally, I sometimes like the lines I've styled "Sony Reader - Chapter Heading 3" to show up in the bookmarks as well (like the journal entries in "Dracula" or the sections of Locke's "Two Treatises on Government").
I hope that answered your question and I'll do my best to have the new macro up before I go to bed tonight. I've attached Locke's "Two Treatises on Government" and that should give you an idea of how the PDF bookmark tree looks - by adjusting the styles in Tools > Outline Numbering, you can fine-tune the document you're working on to your liking. Let me know if you had a different meaning than I've understood.
Thanks!
Sam
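To make the bookmark tree above a bit more concrete, here is a small Python sketch of the style-to-level mapping. It is only an illustration of the idea, not how OpenOffice.org builds PDF bookmarks internally. The LEVELS table and the bookmark_outline helper are made up for the example; only the style names come from the post.

```python
# Outline levels in the order the post lists them (assumed to be levels 1 to 4).
LEVELS = {
    "Sony Reader - Title": 1,
    "Sony Reader - Book Heading": 2,
    "Sony Reader - Chapter Heading 1": 3,
    "Sony Reader - Chapter Heading 2": 4,
}

def bookmark_outline(paragraphs):
    """Turn (style, text) pairs into indented bookmark lines; other styles are skipped."""
    lines = []
    for style, text in paragraphs:
        level = LEVELS.get(style)
        if level is not None:
            # Two spaces of indent per level below the title.
            lines.append("  " * (level - 1) + text)
    return lines

doc = [
    ("Sony Reader - Title", "Example Title"),
    ("Sony Reader - Book Heading", "BOOK I"),
    ("Sony Reader - Chapter Heading 1", "CHAPTER I"),
    ("Sony Reader - Text Body", "Body text is never bookmarked."),
    ("Sony Reader - Chapter Heading 3", "Section 1"),
]
print("\n".join(bookmark_outline(doc)))
```

Adding "Sony Reader - Chapter Heading 3" to LEVELS is the sketch's equivalent of picking that style in Tools > Outline Numbering.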
AnsgarSerif 10-27-2006, 01:19 AM Here we go,
Major changes:
Everything (including page breaks) runs in OpenOffice.org
User prompt for Author, Title
Author, Title inserted in page headers
Separate macros for saving to RTF and PDF
The instructions are updated and sample files are included. Hopefully, this is full-featured enough to qualify as a frequently usable macro. It's a bit on the slow side - about three minutes for a 600 page book (formatted for the Sony Reader's screen size) and about ten minutes for an 1800 page book.
I've posted the new attachment up on my edited first post. Let me know what you think!
Sam
I have to say once again I'm very impressed!! This got me thinking, would this macro work for just about any text? I mean a lot of people have been wondering how to convert files into nice looking rtf/PDFs and what not, and well this could solve that problem^^;
The only thing that I'm not fully understanding now is, the whole "Book mark" thing for the pdf. I had to manually change things around for it to export and have the book marks there right. Is that what you had to do as well?
AnsgarSerif 10-27-2006, 11:49 AM njt,
Thanks! I don't think that this macro, in its entirety, would work with a .doc or HTML file, since it'd probably start deleting line breaks that it shouldn't and foul everything up - it should be possible to start the process somewhere later along the line (say the HeadingStyles macro) and have a finished product that looks the same as a .txt conversion. I'll try it out later today.
Could you run a .txt file through the macro and then upload the .txt file, the .odt file and the .pdf file so I can see why the PDF bookmarking isn't working for you? Or do you get the bookmarks to show up on the computer but not the Sony Reader? If that's the case, then please let me know how I can change the program to make it right - I don't have my Reader yet, so I'm pretty much shooting from the hip with my eyes closed as to how it will display.
Note that if you've got a book structure that doesn't use the word "Book" or "Chapter" to separate different parts of the book, you're not likely to see chapter headings automatically applied - a good example is "Treasure Island," which uses "Parts" as books- since "Part" is not uncommon at the beginning of a normal sentence, I didn't include it in the macro - you have to set these to "Book Heading" yourself. To use "Treasure Island" again, none of the chapters have the "Chapter" label - it's just 1, 2, 3 ... you have to go through and manually adjust those, too.
I've uploaded (or will in five minutes' time) a new .zip file, since I erased a function that converted the whole text to "Sony Reader - Text Body" before applying Heading Styles - that's why the text doesn't look consistent sometimes. The template is the only thing I changed, so I'll add it to this post. Sorry about that!
Sam
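The heading detection described in this post boils down to something like the Python sketch below. It is a guess at the logic, not the macro's actual code: the guess_style helper and its extra parameter are hypothetical, and only the style names and the "Book"/"Chapter" rule come from the post.

```python
import re

# Assumption: only lines beginning with "Book" or "Chapter" are picked up.
RULES = [
    (re.compile(r"^book\b", re.IGNORECASE), "Sony Reader - Book Heading"),
    (re.compile(r"^chapter\b", re.IGNORECASE), "Sony Reader - Chapter Heading 1"),
]

def guess_style(line, extra=()):
    """Return the style for a line; 'extra' holds per-book (pattern, style) rules."""
    stripped = line.strip()
    for pattern, style in list(extra) + RULES:
        if pattern.match(stripped):
            return style
    return "Sony Reader - Text Body"

print(guess_style("CHAPTER 1"))   # picked up by the built-in rule
print(guess_style("PART ONE"))    # missed, so it stays body text
# A per-book rule for a text that uses "Part" the way other books use "Book".
part_rule = [(re.compile(r"^part\b", re.IGNORECASE), "Sony Reader - Book Heading")]
print(guess_style("PART ONE", extra=part_rule))
```

Keeping "Part" out of the default rules and passing it in per book mirrors the reasoning above: it is too common at the start of a normal sentence to be safe for every text.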
Thanks for the new version. I'll be sure to try it out later tonight. As for you not having one yet, that makes sense now. The font you have it set to right now is quite small. I changed mine to Arial 11 bold, which works quite well :).
In regards to the html/ doc comment, what I meant was, they could convert those to txt and just convert again using this macro from there. Unless the html/doc had some fancy formatting this would probably produce a better looking book. Just my thought on that :).
As for the bookmarks, I just realized why. The books I was testing this out on were Sherlock and The Time Machine. Neither has chapters in the way you describe, thus the need for manual editing.
Anyways, off to work! I'll try this later tonight.
AnsgarSerif 10-27-2006, 08:13 PM Ahhh!
I don't know why this didn't happen before but apparently there's a loop in the code, so the macro never finishes. Honestly, I tested the dang thing and it finished fine earlier. Lucifer's been touching my stuff again, it seems.
Anyhow, here's the new file (and the whole package has been updated in the first post). I changed the Text Body font to Garamond 13 pt. (I'm sorry, I just really really hate Arial - if you have Calibri, that's the money font).
I guess this is why it takes so long to get a new version of Windows to the public, eh?
Sam
kukafei 10-31-2006, 04:47 PM Sam,
I tried your tool and it works great. The only problem is that the PDF output is blank starting at page 1000 and page turns get really slow (several seconds) later on. The same thing happened with custom PDF's I printed from manybooks.net, but it seems to be a problem with the files and not the reader as someone else made a large PDF file that works fine (see this post http://www.mobileread.com/forums/showthread.php?p=44984#post44984 ).
AnsgarSerif 10-31-2006, 05:35 PM kukafei,
I'll play around with it tonight - thanks for the heads up!
I'll work on Tolstoy's "War and Peace" and upload a couple of different PDFs based on different software configurations (both in OOo and Adobe Acrobat). Let's hope that we don't have to add Acrobat to the necessary software, eh?
Sam
Thanlis 11-01-2006, 12:01 PM Hm. This is erroring out for me. Here's what happens:
I imported the macros as per the instructions. I close OpenOffice. I open up a new document, and run the Begin_Here macro.
It gets to the ChangeEmptyParaStyle macro, and errors out with:
BASIC runtime error.
An exception occured
Type: com.sun.star.lang.IllegalArgumentException
Message: .
This is a new OpenOffice installation, version 2.0.4.
AnsgarSerif 11-01-2006, 12:36 PM Thanlis,
I've recreated the error. Did you open a new document from the template or did you try to open a document by clicking on the "OpenOffice.org Writer" button?
The reason for my question is this: the styles I've set up won't automatically transfer over to the program itself - they're embedded in the template file. If you want to import them into the program, then you have to create a new default template. You can do this by importing the styles - there is a drop-down box on the top-right of the "Styles and Formatting" window named "New Style from Selection." Choose "Load Styles" and select the "Project Gutenberg Conversion Template" file. Then save the document as a template somewhere permanent (My Documents, for example) and then choose that as a default template from "File > Templates > Organize ..." by right-clicking on the "My Templates" folder and importing your new document then right-clicking the filename and choose "Use as my default template."
I like to have a header on every document I make, so that I can insert page numbers or my name without messing up the pagination later down the road; you can change pretty much anything you like on the file before you save it as a template - if you, for instance, want a different style or font as your default, change it before you save the template.
So, in short, the macro won't work without the new styles I created in the template. You either need to use the "Project Gutenberg Conversion Template" (which you can alter anything you need to, page size, font, styles, and re-save as a template file) or import the Sony Reader styles into your default template.
Sorry about the hassle; let me know what features you'd like to see when you get it working!
Sam
Thanlis 11-01-2006, 02:01 PM It works like a charm now. Thanks! And I love the formatting it produces, so I'm happy as a clam.
AnsgarSerif 11-01-2006, 02:05 PM That's the second time I've heard that analogy today - and I can't for the life of me figure out why clams are so inexplicably happy :happy2:
Glad it worked!
Sam
Bob Russell 11-01-2006, 02:17 PM That's the second time I've heard that analogy today - and I can't for the life of me figure out why clams are so inexplicably happy
http://www.worldwidewords.org/qa/qa-hap1.htm
The saying is very definitely American, hardly known elsewhere. The fact is, we’ve lost its second half, which makes everything clear. The full expression is happy as a clam at high tide or happy as a clam at high water. Clam digging has to be done at low tide, when you stand a chance of finding them and extracting them. At high water, clams are comfortably covered in water and so able to feed, comparatively at ease and free of the risk that some hunter will rip them untimely from their sandy berths. I guess that’s a good enough definition of happy.
AnsgarSerif 11-01-2006, 02:55 PM Bob, that's awesome. Simply awesome.
As a former Alaskan clam-hunter (I was three and it was the best fun of my life) and as an Oxford English Dictionary enthusiast (can you believe the word "snarky" dates back to 1906?), I tip my hat to you.
Thanks - now I know the whole story!
Sam
AnsgarSerif 11-01-2006, 04:23 PM kukafei,
I encoded Tolstoy's "War and Peace" into an ODT file, a RTF file, a PDF created using OpenOffice.org and a PDF printed in Adobe Acrobat. The file size is ~13MB, so I'll have to wait a day or so to upload it (I'm one of the few souls left unable to get high-speed Internet - 19.2 kbps!).
I don't have a Reader, so could you please troubleshoot this for me and tell me if the RTF and the PDFs display correctly?
Thanks!
Sam
I've tried out your newest version and no problem so far :D
Works smooth, fast and is efficient :). All I have to do is do some small edits afterwards but nothing too bad. Thanks much!
AnsgarSerif 11-02-2006, 09:02 PM kukafei,
Here you go - the zip file has an ODT file, an RTF file, and a PDF printed on Adobe Acrobat. I can't upload the PDF exported from OpenOffice, since it's 9MB, so you may have to use the ODT to export a PDF on your end. If you've got the time to spare, could you go through and see which ones (if any) work?
Thanks!
Sam
kukafei 11-08-2006, 08:02 PM Sam,
Sorry it took me so long to get back to you- I was out of town.
I tried out the files, and here's the scoop:
RTF files have no problem (I just increase the font size a bit so they're more readable)
PDF files: the sample you gave me with Acrobat worked fine, though it was only partial (I'm assuming this was a portion of the text)
exporting it from OpenOffice either with your macro or the built-in PDF export function caused the same problems (very slow and later pages missing).
I also tried printing the files to PDF manually, and PrimoPDF came up with the same problems, but AcrobatWriter worked fine. So it seems to be the PDF software- apparently only Acrobat works. Don't know if there's any way to fix this except use Acrobat. But the RTF files work great, so I will probably just stick with them.
Thanks again for all your work!
AnsgarSerif 11-12-2006, 12:17 PM kukafei,
Now it's my turn to apologize for the delay in replying! I've been getting kicked around for the last week by my tonsillectomy. It makes fine sense - just when I can't eat anything is when I get a week-long hunger for orange chicken ...
Well, I suppose RTFs will have to do for now, eh? Seems a shame that the longest books are the ones that have to go without automatic bookmarks. I'm going to start playing around with BookDesigner in a bit, though, so maybe we'll be able to mash something up with the two programs. Barring that, I'm sure there will be a BBeB extension to OpenOffice.org when Sony finally releases the specs.
Thanks for testing those for me!
Sam
Dalton 11-18-2006, 11:01 AM I was able to easily format Gutenberg texts using "Autoformat" in Word 2000. This got rid of the hard line breaks inside of paragraphs. The procedure is simple and quick, and the default formatting is fine.
1. Open the Gutenberg text file in Word
2. Select "Autoformat" from the "Format" menu
3. Set the title and author in "Properties" in the "File" menu
4. Save the file as .RTF
5. Import the RTF file in the Connect Reader application
It took only a few seconds to make a couple of long books this way -- "The Golden Bough" came to 1,687 pages, and "Notre Dame de Paris," in French, came to 875 pages.