In the short time generative AI has been with us, a number of issues have been raised about AI training strategies. The use of public Internet content has fueled debate over whether such material is protected by copyright law. Likewise, there are concerns about the costs and the intensive energy use required to train AI models. But one issue that may not be as well recognized is the potential for AI model collapse as training data becomes constrained. As it turns out, training AI on its own generated content is far from ideal and can lead to all sorts of problems. And as more and more AI content contaminates the Internet, this worry is growing.

(AIs can help out startups–read more in this Bold story.)
At the moment, the risk of AI model collapse from Internet training strategies isn’t an immediate issue. However, as more and more AI-generated material appears, there could be a progressive decline in content quality and diversity. Should this occur, the repercussions could be far-reaching. On the one hand, it could lead to biased and false text outputs. It could also create uniformity as well as glitchy images. And at the extreme, training AI on its own generated content could produce nonsensical results. Major AI companies are aware of these risks, and they will strive to avoid these developments. But to truly prevent AI model collapse, innovative solutions will be required in the years to come.
AI Model Collapse – A Primer
Currently, the vast majority of the content that AI training systems use is human generated. The Internet contains massive amounts of human-created articles, social media posts, reviews, and more. Of course, AI needs such massive data collections to advance in a way that will meet demand. At the same time, AI is now generating an impressive amount of its own content, both text and image, to supplement human content. OpenAI founder Sam Altman recently reported that its AI systems were generating roughly 100 billion words a day. Certainly, much of this reaches the Internet in some form or fashion. This is where things begin to get tricky. Over time, such volumes of AI material will inevitably dilute the content created by actual people. This means that training AI on its own generated content will at some point be unavoidable.
(What’s going on with OpenAI now? Read this Bold story and find out.)
The problem is that training AI on its own generated content undermines its capacity to create. In essence, an AI system consumes data and assembles a statistical distribution of probable results to draw from when posed a query. When human data is used, the distribution of possible outcomes is broad and diverse. But as AI-generated content increases in the training dataset, that diversity narrows. As a result, the potential responses to a query become fewer and fewer as the amount of AI-generated content grows. At some point, the range narrows so much that it causes AI model collapse. Not only do AI outputs become nearly uniform, but they may even degenerate into gibberish and glitches. Basically, it’s the lack of diversity in data that undermines AI system training and eventually causes this AI model collapse.
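This narrowing can be illustrated with a toy simulation (a hypothetical sketch for intuition, not any lab’s actual experiment). Here a “model” is nothing more than the token frequencies of its training corpus; each generation, it is retrained purely on its own samples. Rare tokens that happen to be missed during sampling can never come back, so the distribution’s diversity only shrinks.

```python
import random
from collections import Counter

def train(corpus):
    # "Train" a toy model: estimate token probabilities from the corpus.
    counts = Counter(corpus)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def generate(model, n, rng):
    # Sample a synthetic corpus from the fitted model.
    tokens = list(model)
    weights = [model[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=n)

rng = random.Random(42)
corpus = [f"tok{i}" for i in range(1000)]  # diverse human-written "vocabulary"
history = []
for generation in range(30):
    model = train(corpus)
    history.append(len(model))          # how many distinct tokens survive
    corpus = generate(model, 500, rng)  # next generation sees only model output

print(f"distinct tokens: generation 1 = {history[0]}, generation 30 = {history[-1]}")
```

In a run like this, the number of surviving tokens drops sharply after the first generation and keeps eroding, a crude analogue of the diversity loss real generative models show under the same feedback loop.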

Research Evidence and Concerns
In evaluating AI model collapse, researchers have explored system outputs when training AI on its own generated content. In this research, each training session using a dataset is termed a generation of training. As it turns out, if the dataset contains a significant amount of AI-generated content, negative effects appear progressively. After a few generations, AI responses to various prompts become more and more alike. At the same time, generated images also begin to look increasingly similar. By the time 30 generations are reached, outputs degenerate even further, not only in the diversity of responses but in their quality. Naturally, the proportion of AI content to human content affects how fast this occurs. But in all cases, progressive declines occur if AI content exists in the training dataset.
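The role of that proportion can be sketched with a similar toy setup (again a hypothetical illustration, not the cited research itself): each generation’s training corpus mixes the previous model’s own samples with a share of fresh, guaranteed-human tokens. The fresh data keeps rare tokens in circulation and slows the loss of diversity.

```python
import random
from collections import Counter

def distinct_after(generations, synthetic_fraction, rng):
    # Toy "model": a token-frequency distribution refit each generation on a
    # corpus mixing the previous model's own samples with fresh human tokens.
    human = [f"tok{i}" for i in range(1000)]
    corpus = list(human)
    for _ in range(generations):
        counts = Counter(corpus)
        tokens = list(counts)
        weights = [counts[t] for t in tokens]
        n_synth = int(500 * synthetic_fraction)
        synthetic = rng.choices(tokens, weights=weights, k=n_synth)
        fresh = rng.sample(human, 500 - n_synth)  # distinct human-made tokens
        corpus = synthetic + fresh
    return len(set(corpus))

rng = random.Random(7)
all_synth = distinct_after(30, 1.0, rng)   # training purely on model output
half_synth = distinct_after(30, 0.5, rng)  # half the corpus stays human-made
print(f"100% synthetic: {all_synth} distinct tokens; 50% synthetic: {half_synth}")
```

The fully synthetic run collapses toward a small vocabulary, while the half-human run retains far more diversity, mirroring the finding that the human-to-AI ratio governs how quickly degradation sets in.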
Certainly, the quality and diversity of responses are a primary worry when training AI on its own generated content. But it has also been found that the presence of AI data in training datasets increases the amount of computing power needed for training. Given that AI training is both financially and energy intensive as it is, this poses additional dilemmas. This doesn’t mean that synthetic data created by AI can’t be utilized. For example, training smaller AI systems with synthetic data from larger ones does provide some opportunities. But when pursuing more advanced and larger AI systems, the use of synthetic data is limited. Unless these issues are addressed sooner rather than later, AI model collapse could become a real worry among developers.
Bold Solutions to Consider

As noted, major AI developers such as OpenAI and others are aware of these issues. They understand that solutions will be required to avoid training AI on its own generated content. This is especially true as it relates to content that fails to include diversity and rare results. Some are pursuing curated synthetic data content. In other words, they are hand-selecting which AI-generated content is used to train systems. In doing so, they look to avoid, or at least delay, a progression to AI model collapse. However, these solutions are time-consuming and costly, and they hinder AI system advances. While this may be a key component of a broader AI training strategy, other solutions are needed to complement it.
Other potential solutions are ones that better ensure the datasets provided for AI training contain ample human data. One example would be developing a system where AI developers pay for content that’s guaranteed to be human-generated. Another would be creating more advanced AI digital labeling systems that can identify which content is AI-created and which is not. To date, neither solution has been effectively pursued or developed. And until these or other innovations appear, the risks of training AI on its own generated content will persist.
Outsourcing is essential for business growth–read why in this Bold story.