A recent research paper titled ‘The Curse of Recursion: Training on Generated Data Makes Models Forget’ has shed light on a critical issue known as ‘Model Collapse’. This phenomenon, seemingly innocuous but potentially devastating, is characterized by irreversible defects that emerge when models are trained on data produced by earlier models, including their own previous generations. The result? The tails of the original content distribution, the rare and unusual cases, gradually disappear.
The study, a collaboration between Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson, demonstrates Model Collapse in several families of generative models: Variational Autoencoders, Gaussian Mixture Models, and Large Language Models (LLMs).
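The mechanism can be illustrated with a toy simulation. The sketch below is not the paper's experimental setup; it is a minimal illustration, assuming a one-dimensional Gaussian as the "model", and the function name `simulate_collapse` and its parameters are hypothetical. Each generation fits a Gaussian to a small sample and the next generation is trained only on samples drawn from that fit.

```python
import numpy as np


def simulate_collapse(n_samples=10, generations=200, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Returns the fitted standard deviation at each generation.
    """
    rng = np.random.default_rng(seed)
    # Generation 0 is fit to "real" data: samples from a standard normal.
    data = rng.normal(loc=0.0, scale=1.0, size=n_samples)
    history = []
    for _ in range(generations + 1):
        mu, sigma = data.mean(), data.std()  # maximum-likelihood fit
        history.append(float(sigma))
        # The next generation sees only data generated by the current fit.
        data = rng.normal(loc=mu, scale=sigma, size=n_samples)
    return history


if __name__ == "__main__":
    stds = simulate_collapse()
    for g in (0, 50, 100, 200):
        print(f"generation {g:3d}: fitted std = {stds[g]:.3e}")
```

In this toy setting the fitted standard deviation tends to shrink toward zero over the generations, because each finite sample under-represents the tails and the error compounds. The deliberately small sample size exaggerates the effect; larger samples slow the drift rather than remove it.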
“We have discovered a pervasive problem that affects all learned generative models,” said the team. “It’s a ticking time bomb that threatens the efficiency and effectiveness of these models.”
The study argues that the problem must be taken seriously if the benefits of training on large-scale data scraped from the web are to be sustained, and that data from genuine human interactions will only grow in value. The menace in disguise is machine-generated data, such as articles produced by LLMs or images created by AI, which increasingly ends up in that scraped data.
As LLMs like OpenAI’s ChatGPT and Google’s Bard become increasingly common, their output becomes part of the training data for new models. And therein lies the rub: this machine-generated content, while voluminous, lacks the diversity and authenticity of human-generated data.
This problem is particularly acute in models that use continual learning approaches. Unlike traditional models, which learn from static data, these models adapt to a dynamic flow of data. This makes them especially susceptible to ‘data poisoning’, where the training set gets contaminated with unrealistic data, leading to a distorted perception of reality.
To tackle this looming crisis, the researchers propose a two-pronged approach. First, they suggest preserving the authenticity and diversity of training data through additional review by human collaborators. Second, they recommend regulating the use of machine-generated data in training models.
The findings of this study are particularly significant in light of the growing reliance on LLMs in various industries. From life sciences to supply chain management and the content industry, LLMs are being used for a wide range of tasks. As such, it’s crucial to continually refine these models, ensuring they provide realistic and reliable outputs.