When it comes to training AI systems, content is king. Text, images, and videos provide the essential ingredients for AI to advance toward its potential. The Internet offers a rich playground in this regard, but that doesn’t mean there’s no price of admission. While AI companies have enjoyed relative leniency in the early years of AI training, that is rapidly changing. Several major web data sources are changing the rules and, as a result, introducing hidden costs of AI training. In some cases, access to content is strictly prohibited, which could spell a real AI crisis down the road. The bottom line is that AI needs content, and a tremendous amount of it, to thrive. Unless it secures access to such content moving forward, its potential may be far from realized.
On the one hand, a simple solution would be for AI platforms to make payment arrangements to access data. In fact, such discussions are already taking place between major AI companies and publishers and online data sources. But this too threatens the status quo, especially as it relates to other entities that rely on content and data. While AI companies could potentially foot the bill, researchers and academics might not be able to do so. These are additional hidden costs of AI that extend well beyond a real AI crisis in content. The problem is that an equitable and fair pay-to-play system for data access doesn’t yet exist. Until it does, AI will continue to usurp training data at the expense of others.
“Unsurprisingly, we’re seeing blowback from data creators after the text, images and videos they’ve shared online are used to develop commercial systems that sometimes directly threaten their livelihoods.” – Yacine Jernite, AI Researcher, Hugging Face
Recent Trends in Data Protections
When it comes to web-based data, a select few large collections supply significant quantities of training datasets. Three such collections are C4, RefinedWeb, and Dolma. Together, these three collections account for 5% of all data on the Internet, and they represent 25% of all quality data, the kind that AI companies want for training. But for some time, training systems have been imposing significant hidden costs of AI training on the websites behind these collections. Using bots to crawl their pages, many different AI training systems have enjoyed access without paying. Seemingly, however, that is rapidly changing, which is why some believe there will be a real AI crisis in content access.
A recent research study examined how content access is changing across these three major data collections. In looking at 14,000 websites, MIT researchers found the vast majority are now imposing data access restrictions. Many prevent bots from crawling their sites using a specific file. That file, called robots.txt, implements the Robots Exclusion Protocol, which tells automated crawlers which parts of a site they may and may not access. This is a roadblock to AI training access, resulting in hidden costs for AI systems trying to train on free data. In addition, 45% of these websites restricted data access through their terms of service. Combined with crawling restrictions, these could dramatically reduce content access and cause a real AI crisis. These are noteworthy changes that seem to be occurring more regularly in recent months among data collectives.
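The mechanism is simple enough to sketch with Python’s standard-library robots.txt parser. The rules below are hypothetical, with “GPTBot” standing in for an AI crawler’s user-agent token; a compliant crawler checks the file before fetching any page.

```python
from urllib.robotparser import RobotFileParser

# An illustrative robots.txt: under the Robots Exclusion Protocol, a site
# can disallow a named AI crawler site-wide while allowing all other bots.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The named AI crawler is denied everywhere on the site...
print(parser.can_fetch("GPTBot", "https://example.com/articles/"))      # False
# ...while other crawlers remain permitted.
print(parser.can_fetch("ResearchBot", "https://example.com/articles/")) # True
```

Note that robots.txt is purely advisory: it works only when crawlers voluntarily honor it, which is why many sites now pair it with terms-of-service restrictions.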
“Major tech companies already have all of the data. Changing the license on the data doesn’t retroactively revoke that permission, and the primary impact is on later-arriving actors, who are typically either smaller start-ups or researchers.” – Stella Biderman, the executive director of EleutherAI
The Real Victims of Limited Data Access
While many in AI might be concerned about a real AI crisis in content access, in reality, others will potentially suffer more. Presuming a pay-to-play scenario may evolve, companies like OpenAI, Google, Microsoft, and Meta enjoy adequate resources. This is not the case for others, however. For example, academics who rely on data access to conduct scientific and marketing studies would be at a loss. Likewise, smaller AI developers would also be hindered given that they do not have the same access to major funds. These are the hidden costs of AI that extend beyond major players to smaller AI enterprises. Thus, while new systems are being implemented to restrict data access, these restrictions don’t simply affect big AI companies.
The impacts of these new data access changes on smaller AI startups and researchers are worth noting. But that doesn’t mean others aren’t also affected. For example, several publishers have already been affected by unauthorized data access by AI training systems. The New York Times is currently suing OpenAI and Microsoft over such unauthorized access to its content. Others like Reddit and Stack Overflow have similarly recognized such abuses and are changing their policies. The hidden costs of AI and the data access abuses by these systems have created an unpleasant situation, and publishers and data collectives are now pushing back. The problem is that the pushback is indiscriminate. That means any real AI crisis related to content access will have much broader, unintended effects.
“We’re seeing a rapid decline in consent to use data across the web that will have ramifications not just for A.I. companies, but for researchers, academics and noncommercial entities.” – Shayne Longpre, Researcher, Data Provenance Initiative
Creating an Equitable Solution
The debate as to whether major AI companies should be allowed access to public domain Internet data continues. But increasingly, there is a sentiment that such data belongs to publishers and online producers. If that is the case, and if copyright infringement is cited, then AI companies will have to pay for data access for training. (Read up on the impending conflict between AIs and copyright law in this Bold story.) These hidden costs of AI have thus far been the burden of data collectives and publishers. But that will likely change sooner rather than later. However, in order to avoid undesirable effects on smaller enterprises, new technology tools and systems need to be devised. Such tools would be able to discern who is accessing data and apply a payment schedule accordingly. Such systems do not currently exist, which is why blanket solutions to unauthorized data access are being pursued. Unless better systems evolve, the real AI crisis won’t involve just big AI companies but smaller ones as well. And that plays right into the hands of Big AI’s goals to dominate the market.
Can AI be the boost small businesses need? Read this Bold story and find out.