For the last three decades, the world has embraced the Internet and happily contributed to its volumes of information. Whether it’s web pages, blogs, social media posts, or forums, unfathomable amounts of data have been provided. Much of this information consists of valid, insightful facts that facilitate learning and education. But at the same time, much of what has been posted is simply a matter of opinion, and some of it is downright false. While this revelation isn’t new to any of us, the varied integrity of this information now carries new weight. Specifically, this is the same data being used in the education of AI chatbots. And feeding the chatbots both good and bad data could lead to some serious embarrassments down the road.

Part of the problem now is knowing just how much of this bad data accounts for the education of AI chatbots. Suppose it’s as high as 50%. That could lead to some pretty concerning situations as well as some humorous ones. But even if the percentage is rather small, its impact on the future of disinformation could be substantial. If everyone increasingly relies on products like OpenAI’s ChatGPT, falsehoods could gain a growing foothold in shared information. For these reasons, it’s important that we think about exactly what we are feeding the chatbots of tomorrow. If we’re not careful, we might just find ourselves trapped in a quagmire of misinformation.
(A potential downside of AI chatbots? Disinformation. Read more in this Bold story.)
GIGO and the Education of AI Chatbots
There’s an old saying when it comes to the integrity of data outputs: if bad data goes in, bad outputs are inevitable. This garbage in, garbage out (GIGO) principle applies perfectly to how we have used the Internet in the education of AI chatbots today. Every time we’ve written a silly blog or posted a farfetched opinion, we’ve contributed to this information “garbage.” The same is true every time we’ve gone on a tirade in some Internet forum. If such sites are involved in feeding the chatbots, then the results they provide us will be less than accurate. The key question is just how much of this misguided information guides them.
Unfortunately, the makers of major AI chatbots like ChatGPT have not yet revealed which data sources their models are trained on. Legislative efforts to rein in these AI tools, including some in the U.S., are now considering requiring such sources to be disclosed. But the use of low-quality inputs in the education of AI chatbots could be substantial. According to a project by the Washington Post, nearly 4% of the data feeding the chatbots comes from blogs. These are not peer-reviewed articles or publications by respected journalists and sources. Instead, they are written by anyone with an opinion, point of view, or perspective that often isn’t backed by data. Thus, while 4% may not seem like much, in the scheme of things it can lead to some pretty serious mishaps.

The Good, The Bad and The Ugly
It’s pretty easy to see the good when it comes to these new AI tools. Their potential in streamlining data access for a variety of endeavors is noteworthy. This is certainly true regarding AI in media publications and education when managed ethically. But the potential for undesirable effects is also quite high, as history has already shown. The following are just a few of the more humorous events that have resulted from a poor education of AI chatbots.
- In 2017, researchers used an AI language model to generate new fiction involving the Harry Potter characters. The model drew on a variety of sources that went well beyond the Harry Potter series itself. The final result was a rather hot-and-heavy romantic work involving the more notable characters. Whatever inputs were used in feeding the chatbots were certainly not limited to the main themes of the original story.
- In 2018, a Reddit user provided Rick and Morty dialogues in the education of AI chatbots to create new episodes. Rick and Morty is an adult sci-fi cartoon featured on the Adult Swim network. As it turned out, the result was a script that made very little sense at all. It not only included some pretty weird character names like “Pickle Rick Sanchez,” but it was also hard to follow and extremely surreal.
- In 2019, another AI language model was used to create a new TED Talk script. In the final product, the script suggested robots should be thought of as fluffy pillows and given more love. Clearly, the data involved in the education of AI chatbots in this instance favored the rise of the machines. This is why knowing precisely what data is being used in feeding the chatbots is essential. Unless we can validate the inputs, we might just be persuaded to yield to our new digital masters.
- Lastly, in 2020, another AI language model was asked the question, “Why did the chicken cross the road?” In a million years, you would probably never have guessed the answer it offered: “To prove to the armadillo that it was possible.” It isn’t known what data was used in the education of AI chatbots in this case. But it’s pretty clear there was not much discernment in what was feeding the chatbots.

A Measure of Accountability
Up until now, perhaps there has been little to hold all of us bloggers, posters, and forum contributors accountable. But now that AI is not just providing new content but also creating music and images, it’s time to reconsider our role in feeding the chatbots. If we want to avoid the types of comical events listed above, we have to take the education of AI chatbots seriously. That means being more careful about what we post and ensuring greater integrity in our statements. Of course, that will do little to correct the last 30 years of our Internet folly. But it’s likely a step in the right direction if we want AI to be the best informational tool it can be.