08-30-2008, 04:49 PM   #18
Patricia
Reader
Posts: 11,504
Karma: 8720163
Join Date: May 2007
Location: South Wales, UK
Device: Sony PRS-500, PRS-505, Asus EEEpc 4G
How I work

A quick summary of how I clean up the files from The Internet Archive or Google Books (or any PDF from an image file):

1. Go to the source file, say the Internet Archive. Download the text file and a PDF.
2. Try saving the PDF as a text file and see whether it is better than the downloaded text file. Usually there's not much difference.
3. Paste the text file into a doc.
4. Remove all headers and footers.
5. Run a heap of 'Find and Replace' commands to overcome common mistakes:
viz:
(a) find ' ?' replace with '?'.
(b) find ' !' replace with '!'
repeat for every punctuation mark--many of these have an unnecessary space in front of them.
(c) find 'hyphen space paragraph mark' replace with 'paragraph mark' to remove unnecessary hyphenation at the hard line breaks. It's best to check each one individually because some hyphens are meant to be there.
(d) find 'paragraph mark " space' replace with 'paragraph mark "' (that is, delete the stray space after a quote mark that opens a paragraph) --this is a common OCR error.
6. Run Stingo's macro.
7. Check every instance of 'space"space' and correct them.
8. Check every instance of 'space'space' and correct them.
9. If you like curly quotes then do a find and replace of " with " and ' with ' (with the word processor's smart-quotes option switched on, this turns the straight quotes into curly ones). Then check each instance where they occur after a dash of any sort. Also check each instance of space' (because of contractions like 'em, 'tis, 'twas, etc).
10. Now open both the PDF and the Doc. Adjust the page sizes so that each takes up half the screen. Read them side by side. You will have to add dashes (OCR often misses them out) and italics. Focus mostly on these, and on obvious spelling errors. Also add any missing accents.
11. Now run a spell-check on the doc. Note all dubious cases and check them.
12. If the source was very poor then repeat step 10. I often check every instance of ' and ", because these often get missed out.
13. Get the Chapter headings centred and in Bold. Ditto the Author and Title.
14. Insert any pictures.
15. Move any footnotes to the end of sections, the end of the book or wherever you want them.
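
For anyone who would rather script the mechanical parts of steps 5, 7 and 8 than run them by hand, here is a rough Python sketch. It is not part of my own workflow, just one way it might be done. The function names are made up for illustration, a 'paragraph mark' is assumed to be a plain newline in the saved text file, and the set of punctuation marks is my guess at "every punctuation mark". The hyphen and quote checks only report line numbers, because those cases need checking one at a time.

```python
import re


def fix_common_ocr_spacing(text: str) -> str:
    """Apply the replacements from step 5 (a), (b) and (d)."""
    # (a)/(b): delete spaces directly in front of a punctuation mark.
    # Assumption: this set of marks. Side effect: a spaced ellipsis
    # (". . .") gets closed up to "..." as well.
    text = re.sub(r" +([?!;:,.])", r"\1", text)
    # (d): delete the stray space after a double quote that opens a paragraph.
    # Assumption: a paragraph mark is a single newline character.
    text = re.sub(r'\n" ', '\n"', text)
    return text


def find_line_end_hyphens(text: str) -> list[int]:
    """Step 5 (c): 1-based numbers of lines ending in 'hyphen space'."""
    lines = text.split("\n")
    # The last piece has no paragraph mark after it, so it is skipped.
    return [n for n, line in enumerate(lines[:-1], start=1)
            if line.endswith("- ")]


def find_spaced_quotes(text: str) -> list[int]:
    """Steps 7 and 8: 1-based numbers of lines with a quote mark
    that has a space on both sides."""
    return [n for n, line in enumerate(text.split("\n"), start=1)
            if ' " ' in line or " ' " in line]


if __name__ == "__main__":
    sample = 'Is it true ?\n" Yes !" he said, laugh- \ning. He said " no " twice.'
    print(fix_common_ocr_spacing(sample))
    print("check hyphens on lines:", find_line_end_hyphens(sample))
    print("check quotes on lines:", find_spaced_quotes(sample))
```

The side-by-side read in step 10 still has to be done by eye; a script like this only saves some of the repetitive find-and-replace work.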

Now the text should be ready for either Book Designer or Calibre, or your favoured conversion program.
The good news is that a conversion in Book Designer can now be done in less than 5 minutes. The bad news is that you will have spent many hours tidying up the original text. (I really daren't count.)

Obviously, you can do this in stages, a few minutes at a time. Take notes, so that you know what you have already done.

I hope this helps.