[QUOTE=Tex2002ans;2868388]...Doing it the manual way, there is no way to "ignore" a word that you already know is correctly hyphenated.
If "long-term" shows up 183 times, you had to Next 183 times.
If "short-term" shows up 111 times, you had to Next 111 times.
If "Irvington-on-Hudson" shows up 34 times, you had to Next 68 times.
Now, since all unique words are shoved into the list ONCE, this really saves the amount of time your eyeballs have to work + how many times you have to ignore/fix mistakes.
Just those three words, you have wittled down 362 clicks into a quick look at a list.[/QUOTE]
You appear to have done me the injustice of assuming that what I stated applied to everything. In fact my original response was to the specific case of very poor OCR, as raised by another poster. I then went on to say, correctly, that that approach was my usual one. But you have extrapolated that into my including complex, 4,400-page books as well, whereas I was talking in general terms. I was also writing in this Calibre Editor forum, so I did not raise what one might do outside of Editor, and I was assuming the simpler, typical use that most users put it to. I expected someone of greater talents to see that this was so and to allow some latitude in interpretation.
In your quote above you give some statistics (meant to be typical? I don't know). In fact, for a typical fiction book one may be lucky to find 10 hyphenated words per chapter, or 10 occurrences of a particular hyphenated word in the whole book. Nor will there be a fleet of the likes of 183 "long-term"s, 111 "short-term"s and so on; one would be hard pressed to find even one word of such frequency (I've been looking through a few fiction books to check).
There will likely be many more hyphens than that, but they will be things such as interruptions to a character's speech (e.g. "I said that we are going to-"), hyphens used instead of em dashes, and the like. These are most easily eradicated, if need be, by means other than spell-check (a targeted search, for instance; see the sketch below), and in the interruption "to-" case each is unlikely to occur at a frequency worth doing a bulk ignore on. So for working with clean text in ordinary circumstances the statistics you give are, in my view, exaggerated. But, as always, there will be exceptions, in which case allow that I may take an alternative approach to those.
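As an illustration of what I mean by "means other than spell-check": a pattern along these lines in Editor's Find & Replace (regex mode) will turn up hyphens sitting immediately before a closing quotation mark, which is where most speech interruptions end. The set of quote characters is my assumption; adjust it to whatever marks the book actually uses.

[CODE]
Find (regex mode):     -(?=["'”’])
Replace:               —
[/CODE]

Because the quote mark is only in a lookahead it is not consumed, so the replacement touches the hyphen alone and leaves the closing quote where it is.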
Some examples, using your hyphenated words, to show the dangers of bulk assumptions:
You give the example of "long-term" as a unique word suitable for bulk ignore; it is indeed correct when used in the text as an adjective, but likely not when used otherwise. For example, should a use along the lines of "he was incarcerated for a long-term" be accepted? A bulk ignore leaves that potential error uncorrected. Exactly the same risk applies to your other example of "short-term" (is "he was incarcerated for only a short-term" correct?) and to the many similar examples one comes across. A crude search can at least put the suspect cases in front of you; see the sketch below.
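One rough way to surface the dodgy noun-ish uses, rather than trusting the ignore list, is a regex search for "long-term" or "short-term" followed immediately by punctuation rather than by a following word. It is only a heuristic of my own devising: it will miss plenty and also flag some perfectly legitimate predicative uses, but it is the sort of check a unique-word list simply cannot do.

[CODE]
Find (regex mode):     \b(long|short)-term(?=\s*[.,;:!?])
[/CODE]

That would catch "for a long-term." and "for only a short-term," while leaving "a long-term plan" alone; each hit still needs a human eye.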
Getting to poor OCR source (which is what I originally responded to): given the many likely permutations of errors of the types "long- term", "long - term", "1ong-term", "long'-term", etc., and errors of the type "*m-ty", bulk ignores become increasingly unhelpful, because the chance of many errors being identical becomes increasingly unlikely the rattier the OCR. One is going to have to rely on much to-and-froing between the text and the reference scans or paper book to do justice to the book. (A regex can at least gather the predictable variants together; see the sketch below.)
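For the predictable variants, a single regex search will at least gather most of them into one pass, which is quicker than hoping spell-check flags each one separately. This is a sketch for the "long-term" family only, on the assumption that the usual OCR culprits are a stray space around the hyphen, "1" or "I" for "l", "0" for "o", and an intruding apostrophe; anything rattier than that still means going back to the scans.

[CODE]
Find (regex mode):     \b[l1I][o0]ng\s*['’]?\s*-\s*term\b
[/CODE]

It matches "long-term", "long- term", "long - term", "1ong-term" and "long'-term" alike. Each hit still wants an eyeball against the reference, of course, since "long- term" across a line break may genuinely be two separate words.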
There is much else that could be said about the risks of bulk ignores or substitutions, but I will restrain myself; I am sure you are as aware of them as anyone. In the end, one is going to have to get one's nose into the scan or paper book, which is the reference.
[QUOTE=Tex2002ans]Side Note: This latest journal I am working on digitizing (~4400 pages, ~2 million words), there were ~18800 hyphens before -> 18428 after fixing (this means ~2% of the hyphens were a mistake.)
That would have taken fracking FOREVER to do one-by-one (it already took me 12 hours to do it the Spell Check List way (including all the time double-checking/fixing the source material, plus doing some code cleanup + other spelling corrections)...[/QUOTE]
Indeed it might have taken forever (remembering that Kingsley happily wrote "for-ever"), but it is rather unfair to assume, for the sake of constructing a criticism, that what you have extrapolated is my Gold Standard approach for everything, including a 4,400-page book.
In the event, the largest books I work with are the likes of official or authoritative histories, which may run to a meagre 1,200 pages or so but are complex and can serve as examples. If my source were plain text there is no way I would be cleaning it up in Editor (remembering this forum is about Editor, and so that is what my earlier posts confined themselves to), nor would I first convert it to HTML. If my source were HTML, as it can be when made available from academic websites, for example, then I would have a serious think about how to go about it, and that would quite likely not be in Editor (nor Sigil, which I no longer use).
In such books one can be led into serious difficulties if correctness matters and one starts relying on bulk ignores and substitutions. A simple example: the book's author may use your "long-term", but within a quotation that "long-term" is incorrect, because the author of the quoted passage (as a check in the reference scan or paper book shows) actually used "longterm", or perhaps "long term" (the latter incorrect as an adjective). If one's final work is to be a correct rendition, one must follow the originally published work unless commissioned to modernise it.
If the book is not a modern one (or is a modern one discussing the past), then the possible alternative hyphenations (together with the various possible spellings of the hyphenated words themselves) become mind-frazzling and well beyond the reach of a simple spell checker such as is found in Editor.
Looking at "long-tongued" as an example: the text may contain "long-tongued" when speaking in modern terms, but also "lang-tongued", which, if bulk-corrected, might be corrected wrongly if it refers to a passage of Walter Scott's. Furthermore, the same book may also include "longtongued" when referring to work from the 16th century, in which case one will have to check whether that is correct for the era or work referred to; so even bulk assumptions about non-hyphenated words that might be hyphenated are potentially dangerous. (As you may guess, I've been peeking in the OED.) So one may see in spell-check that there are "183" instances of "long-tongued", but it may be that there are not, and until one actually looks at each case one does not know. At that point a concordance of every hyphenated form, each with a little context, is about the only honest shortcut; a sketch of one follows.
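For what it is worth, this is the sort of thing I mean: a small Python sketch (run outside Editor, on the book's extracted plain text) that lists every hyphenated form with a little surrounding context, so each instance can be checked against the scan. The file name and context width are placeholders of mine, and it decides nothing; it only puts the cases in front of your eyes.

[CODE]
import re
from collections import defaultdict

# Placeholder path; point this at the book's extracted plain text.
TEXT_FILE = "book.txt"
CONTEXT = 30  # characters of context either side (arbitrary choice)

text = open(TEXT_FILE, encoding="utf-8").read()

# Any run of word characters joined by single hyphens, e.g. "long-tongued",
# "Irvington-on-Hudson". Deliberately loose; it will also catch OCR junk,
# which is rather the point.
pattern = re.compile(r"\b\w+(?:-\w+)+\b")

occurrences = defaultdict(list)
for m in pattern.finditer(text):
    start, end = m.start(), m.end()
    snippet = text[max(0, start - CONTEXT):end + CONTEXT].replace("\n", " ")
    occurrences[m.group(0).lower()].append(snippet)

# Print each distinct form once, with every context line under it,
# so "183 instances" can actually be looked at rather than assumed.
for word in sorted(occurrences):
    snippets = occurrences[word]
    print(f"{word}  ({len(snippets)} occurrence(s))")
    for s in snippets:
        print(f"    ...{s}...")
[/CODE]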
I'll leave it at that and make no further comment on the matter; I've said more than plenty and am well past the point of being boring.