Old 04-01-2008, 10:22 PM   #22
moz
Addict
Posts: 370
Karma: 1553
Join Date: Feb 2008
Location: Melbun
Device: Kobo H2O
Quote:
Originally Posted by wallcraft
I agree with kovidgoyal that paragraph numbers are best for referencing, and should always be available (particularly as they are easy to generate).
One problem I'm finding as I rip more books is that while easy to generate, paragraph numbers are not necessarily accurate. Often there's a combination of variable quoting and paragraphs in the original text that means I spend a bit of time working out exactly how paragraphs should break, and my decisions are obviously not the only ones that could be made. That's fine for a few occurrences, but in "Pushing Ice" (http://en.wikipedia.org/wiki/Pushing_Ice) there's a lot of speech, so even a low proportion of errors can leave the paragraph count off by as much as the error rate. So a 5% error rate might mean that paragraph 1000 is actually paragraph 1050, or 950, depending on which way it goes.
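To put rough numbers on that, here is a minimal sketch of the worst case, assuming every break error pushes the count in the same direction. The function name and parameters are made up for illustration.

```python
def paragraph_number_bounds(nominal, error_rate):
    """Return (lowest, highest) true paragraph number for a nominal one.

    Assumes the worst case: every paragraph-break error before this point
    went the same way (all merges, or all spurious splits).
    """
    max_drift = round(nominal * error_rate)
    return nominal - max_drift, nominal + max_drift


print(paragraph_number_bounds(1000, 0.05))
```

In practice the errors will partly cancel, so the real drift is probably smaller, but you can't know by how much without checking against the original.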

Made up example:
'...and that's the story.
'But which way do we go?' he asked.

Is that one paragraph or two? If the first line is just before a page break and the second just after, it's not necessarily obvious. You have to read through and guess which character(s) spoke, and deal with the missing quote in your own way.

These corrections are easy enough for me to do now, since I have the original images and can check against them. If you got a passage like that off the net you'd have no chance. It's not helped by lax typography, such as a genuine instance of ".' '" in one book I read. I'd normally hit that with a regex to insert a paragraph marker, but when I found it in the book it was unusual enough that I wrote it down as a reminder. After some consideration, I decided that it was not actually a proofing error in the book and should be left as is, or (as I did) have the quotes removed.
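For anyone wondering what that sort of regex looks like, here's a rough sketch in Python. It is not my exact pattern; the names and the choice of a blank line as the paragraph marker are just assumptions for the example.

```python
import re

# Sentence-ending punctuation plus a closing single quote, some spaces,
# then an opening single quote: usually two speeches run together.
ADJACENT_SPEECH = re.compile(r"([.!?]')[ \t]+(')")


def split_adjacent_speech(text, marker="\n\n"):
    """Insert a paragraph marker between a closing and an opening quote."""
    return ADJACENT_SPEECH.sub(lambda m: m.group(1) + marker + m.group(2), text)


print(split_adjacent_speech("'Go left.' 'Why?' he asked."))
```

Note that this would also split the genuine case I mentioned, so the output still needs a human to look over it.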

Also, remember that between Unicode and HTML we have something like 8 or 10 paragraph markers to choose from.
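A sketch of what coping with that looks like: split on a handful of the common markers before counting. The list here is an assumption and certainly not complete, and the function name is made up.

```python
import re

PARAGRAPH_BREAK = re.compile(
    r"\u2029"                              # Unicode PARAGRAPH SEPARATOR
    r"|</p\s*>\s*<p(?:\s[^>]*)?>"          # HTML </p> followed by <p ...>
    r"|(?:<br\s*/?>\s*){2,}"               # two or more HTML <br> in a row
    r"|[\n\r\u0085\u2028]{2,}",            # two or more line terminators
    re.IGNORECASE,
)

# Leading <p ...> and trailing </p> left over around the whole text.
OUTER_P = re.compile(r"^\s*<p(?:\s[^>]*)?>|</p\s*>\s*$", re.IGNORECASE)


def split_paragraphs(text):
    """Split text into paragraphs on several Unicode and HTML markers."""
    text = text.replace("\r\n", "\n")  # treat CRLF as one line terminator
    text = OUTER_P.sub("", text.strip())
    return [p.strip() for p in PARAGRAPH_BREAK.split(text) if p.strip()]


print(split_paragraphs("One\u2029Two\n\nThree<br><br>Four"))
```

Two people using different marker lists will get different paragraph counts from the same file, which is the point.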

That said, I think paragraph markers are the least awful standard.