Quote:
Originally Posted by Quoth
Probably wasn't checked at all.
It's not "just copying", but calling it "training" (or machine learning) is a lie, Marketing-Speak. The content for LLMs is copied, just not into a normal, easily accessible replay with sane source identification (which would be more useful, as then you'd have a decent interactively driven search engine)*. The entire LLM part of AI is dishonest and misleading. No computer system learns or is trained in the sense those words are traditionally used. The entire industry segment is like Humpty Dumpty in Through the Looking-Glass: they are using anthropomorphic phrasing and descriptions that are dishonest, or at best misleading.
[* Something I proposed about 25 years ago. I was a professional, qualified programmer and had studied AI for over a decade by then.]
It's not a lie or marketing speak. "Training" has been used in the ML literature for decades, well before LLMs. It's mentioned on page one of the first edition of "Elements of Statistical Learning" from 2001, and I'm sure it was used elsewhere long before that; I just can't be bothered to check. It makes perfect sense in the context it's used in: models are "trained" on data, then we make predictions from the trained model on new data. With LLMs being language models, the training data is a huge corpus of text and the new data is the prompt. There's not really a better one-word summary than "training".
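To make the train/predict vocabulary concrete, here's a toy sketch (my own illustration, not anything from a real LLM): a bigram "language model" that is "trained" by counting which word follows which in a corpus, then "predicts" the next word after a prompt. The corpus, function names, and prompt are all made up for the example.

```python
from collections import Counter, defaultdict

def train(corpus):
    # "Training" here just means counting which word follows which.
    counts = defaultdict(Counter)
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def predict(model, prompt):
    # "Prediction": the most frequent next word after the prompt's last word.
    last = prompt.split()[-1]
    if last not in model:
        return None
    return model[last].most_common(1)[0][0]

model = train("the cat sat on the mat and the cat ran")
print(predict(model, "feed the"))  # "cat" follows "the" most often in the corpus
```

Real LLMs are vastly more complicated, but the workflow is the same shape: fit on a corpus, then predict from a prompt. That's all "training" has ever meant in this literature.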
LLMs can't recreate the entirety of their source texts. Maybe snippets or summaries, but not the whole thing. It's like expecting a line of best fit to recreate the entire data set: maybe some of the original points lie on the line, but probably not all of them.
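The line-of-best-fit analogy can be shown directly. This is a toy sketch with made-up, roughly linear data: an ordinary least-squares fit in plain Python, followed by an attempt to "recreate" the data from the fitted line. The recreated points are close but not exact, because the fit keeps two numbers (slope and intercept), not the data itself.

```python
def fit_line(xs, ys):
    # Ordinary least-squares fit: returns (slope, intercept).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

xs = [0, 1, 2, 3, 4]
ys = [0.1, 0.9, 2.3, 2.8, 4.1]  # made-up, roughly linear data
slope, intercept = fit_line(xs, ys)

# "Recreate" the data from the model: you get the trend, not the points.
reconstructed = [slope * x + intercept for x in xs]
errors = [abs(y - r) for y, r in zip(ys, reconstructed)]
print(max(errors) > 0)  # True: the fit compresses the data, it doesn't copy it
```

Same idea with an LLM: the model captures patterns in the corpus, and while some passages may come back nearly verbatim (points that happened to lie on the line), the whole corpus isn't in there to replay.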