The race is on. From the major players to startups, companies are investing heavily in artificial intelligence. Generative AI models have already changed the way people tackle projects and the way businesses operate. And it’s been only a little over a year since generative AI was introduced to the public. But advancing AI requires massive amounts of data to train these systems. Thus far, companies pursuing such endeavors have tapped into the Internet for this purpose. However, with so many AI companies jockeying for position, there appears to be a potential data crisis in the AI economy. An AI company data shortage could prevent businesses from competing at a more advanced level… and that includes even established ones like OpenAI.
At this point, a data crisis in the AI economy has not yet materialized. However, according to experts, real issues could develop within the next few years. Businesses involved in AI are aware of these risks, which is why many are pursuing innovative approaches to AI training. Most aren’t publicizing details of these new strategies, since secrecy in itself may offer a competitive advantage. But several conceptual models have been offered that likely represent some of the approaches these businesses are considering. The goal is to avoid an AI company data shortage while finding more efficient ways to train AI systems. Given these market pressures, expect some bold solutions in the years to come.
A Tutorial on AI Training Systems
It’s no secret that generative AI systems require large quantities of data for effective training. However, the actual amount of data may surprise you. In essence, these AI systems are “fed” words or parts of words, referred to as tokens, drawn from large data pools. GPT-4 reportedly required over 12 trillion of these tokens for its training, which is an astounding figure. The next version of this generative AI model, GPT-5, is expected to require between 60 and 100 trillion tokens. Based on current estimates, there might be a gap of 10-20 trillion tokens between what is needed and the quality data available, which could hold back AI advancement. This is concerning, especially for OpenAI, but it is just one potential AI company data shortage among many. This is why some are forecasting a data crisis in the AI economy in the next few years.
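To make the idea of tokens concrete, here is a minimal Python sketch. It is a toy illustration only: the function name `toy_tokenize` and the fixed four-character piece size are assumptions made for this example, and real models typically rely on learned subword vocabularies rather than a fixed-length split.

```python
import re

def toy_tokenize(text, max_piece=4):
    """Split text into words and punctuation, then chop each word
    into pieces of at most `max_piece` characters (a toy stand-in
    for the learned subword splitting real models use)."""
    tokens = []
    for word in re.findall(r"\w+|[^\w\s]", text):
        for start in range(0, len(word), max_piece):
            tokens.append(word[start:start + max_piece])
    return tokens

sample = "Generative AI models need data."
tokens = toy_tokenize(sample)
print(tokens)       # the individual tokens
print(len(tokens))  # how many tokens this one sentence contributes
```

Even a five-word sentence yields more tokens than words, which gives some feel for why training sets are measured in trillions of tokens.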
Part of the problem with using the Internet for these tokens relates to data quality. Believe it or not, only about 10% of the data on the Internet is considered adequate for training AI systems. This greatly contributes to the AI company data shortage, and it’s why these companies constantly seek higher-quality data. The lawsuit filed by The New York Times against OpenAI reflects the dilemma in part. In an effort to best train its AI systems, OpenAI allegedly used New York Times content. According to the suit, permission was not requested, which is why OpenAI has been accused of copyright violations. Thus, the lack of quality data, and of access to it, is contributing to the data crisis in the AI economy.
New Strategies for AI Training
Assuming a data crisis in the AI economy is looming, efforts to squeeze more quality data out of the Internet may be difficult. This is especially true in the short term. However, that doesn’t mean bold businesses aren’t developing some potential solutions to do just that. For example, OpenAI is considering transcribing video and audio from the Internet to acquire additional tokens. Tapping into YouTube videos could provide a new stream of data for generative AI training. By using tools like Whisper, an automatic speech recognition system, such transcriptions could be readily obtained. However, such efforts may once again run into a data quality problem, this time with the videos themselves. Therefore, this may be only a temporary solution to a larger data crisis in the AI economy.
These aren’t the only innovative methods that businesses are pursuing to avoid an AI company data shortage. DatologyAI, for example, has a unique approach that could reduce the AI training data required by 50%. This company, which is categorized as a data selection tool startup, is investigating different curriculum learning systems. These systems provide AI models with specific data in a specific order to enhance learning and concept linkage. The theory is that such an approach speeds up learning and requires fewer token inputs. In other words, DatologyAI is trying to solve the data crisis in the AI economy through more effective training techniques. Rather than trying to acquire greater quantities of data, these strategies are more qualitative in nature.
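The idea of feeding a model specific data in a specific order can be sketched in a few lines of Python. This is a hypothetical illustration, not DatologyAI’s method: the `difficulty` function (word count as a stand-in for how hard an example is) and the `curriculum_stages` helper are assumptions invented for this example.

```python
def difficulty(text):
    """Hypothetical difficulty score: longer examples count as harder.
    A real system would likely use a far richer signal."""
    return len(text.split())

def curriculum_stages(examples, num_stages):
    """Order examples from easiest to hardest, then split them into
    roughly equal stages to be presented one after another."""
    if not examples:
        return []
    ordered = sorted(examples, key=difficulty)
    stage_size = -(-len(ordered) // num_stages)  # ceiling division
    return [ordered[i:i + stage_size]
            for i in range(0, len(ordered), stage_size)]

examples = [
    "Tokens are words or parts of words.",
    "AI needs data.",
    "Curriculum learning presents easier examples before harder ones.",
    "Data quality matters.",
]

# Print each stage in the order a training loop would see it.
for stage_number, stage in enumerate(curriculum_stages(examples, 2), start=1):
    print(stage_number, stage)
```

A training loop would then work through the first stage before moving on to the second. Whether such ordering actually cuts the data needed depends on the model and on the difficulty signal chosen.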
At the same time, other AI companies have considered allowing AI systems to create their own data for training. Termed synthetic data, this could be effective, but it has some potential drawbacks. Specifically, this inbred mode of training, in which models learn from model-generated data, could lead to some nonsensical results. Others are suggesting that a data market should be developed where companies could purchase quality data from one another. Such a data marketplace would alleviate the AI company data shortage for a while. However, it likely wouldn’t avert an eventual data crisis in the AI economy. Regardless, these are some of the approaches being evaluated to address the AI company data shortage.
Cause for Concern or False Fear?
Despite some experts predicting a data crisis in the AI economy, not everyone agrees. Some believe there’s little cause for concern because of their faith in the marketplace. As with crises in other industries, solutions tend to emerge when driven by necessity. Thus, many believe AI startups like DatologyAI will identify ways around a paucity of AI training data. At the same time, AI companies including Microsoft are starting to pursue smaller, more targeted AI models. These AI systems will serve more specific purposes, and they require far fewer data tokens for training. If the generative AI marketplace moves heavily in this direction, a data crisis in the AI economy may never occur. But as things stand now, the Internet alone isn’t going to be enough to support advances in large language models. Innovations and market shifts will be required to avert an AI company data shortage.