Quote:
Originally Posted by graycyn
But I feel sort of bad that you typed all that out, as I've already gotten the curly quotes dealt with, finished that several days ago. But I picked up a zillion other small errors in the process.
|
Well, you'll be working on more books in the future.
Now something that took you days will take seconds, and the time you save can be spent hunting down typos or more important issues.
Same with the common patterns of errors. Once you notice one, regex can find them all.
Quote:
Originally Posted by Quoth
Most wordprocessors do ’tis ’90 etc wrong.
|
That's another regex I use:
Find: ‘([0-9])
Replace: ’\1
That finds shortened years like:
- In the ‘90s, ...
- In the ‘70s, ...
and flips each one to the correct RIGHT SINGLE QUOTATION MARK (’).
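If you'd rather run the same Find/Replace outside an editor, here is a rough Python sketch of it. The function name fix_short_years is just something I made up for illustration, and it assumes the text is already a Unicode string:

```python
import re

# U+2018 LEFT SINGLE QUOTATION MARK immediately followed by a digit.
# Same pattern as the Find above; the digit is captured as group 1.
SHORT_YEAR = re.compile("\u2018([0-9])")


def fix_short_years(text: str) -> str:
    """Swap the wrong left quote before a digit for U+2019, keeping the digit."""
    # "\u2019\\1" = RIGHT SINGLE QUOTATION MARK + backreference to the digit.
    return SHORT_YEAR.sub("\u2019\\1", text)


if __name__ == "__main__":
    print(fix_short_years("In the \u201890s, ..."))  # In the ’90s, ...
```

Same caveat as the editor version: it will also flip a real opening quote that happens to start with a number, so skim the matches before doing a blind Replace All.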
Quote:
Originally Posted by graycyn
The Gutenberg text is much more of a trainwreck than I'd initially thought. I've found missing punctuation, including some of the quotes, but also em-dashes, hyphens, and some entire words! And a great deal of paragraph problems! Oh, and accented characters missing as well.
|
Looks like it was one of the very early conversions:
https://www.gutenberg.org/ebooks/3795
Now they're at book ~67k.
The quality of that stuff was not so good back then, but I'm still surprised such italics/typo errors snuck through.
The one frustration I have with Gutenberg books is that they don't offer the original scan (PDF) they worked from. Having it would let you go in and re-correct against the same source and bring the text up to today's standards.
Modern PG books (done with Distributed Proofreaders) go through lots more rounds of proofing. If they redid this book now, it would definitely be much higher quality.