AI could choke on its own exhaust as it fills the web
The internet is beginning to fill up with more and more content generated by artificial intelligence rather than human beings, posing weird new dangers both to human society and to the AI programs themselves.
What's happening: Experts estimate that AI-generated content could account for as much as 90% of information on the internet in a few years' time, as ChatGPT, Dall-E and similar programs spill torrents of verbiage and images into online spaces.
- That's happening in a world that hasn't yet figured out how to reliably label AI-generated output and differentiate it from human-created content.
The danger to human society is the now-familiar problem of information overload and degradation.
- AI turbocharges the ability to churn out mountains of new content even as it undermines the ability to check that material for reliability, and it recycles the biases and errors baked into its training data.
- There's also widespread fear that AI could undermine the jobs of people who create content today, from artists and performers to journalists, editors and publishers. The current strike by Hollywood actors and writers underlines this risk.
The danger to AI itself is newer and stranger. A raft of recent research papers has introduced a novel lexicon of potential AI disorders that are just coming into view as the technology is more widely deployed and used.
- "Model collapse" is researchers' name for what happens to generative AI models, like OpenAI's GPT-3 and GPT-4, when they're trained using data produced by other AIs rather than human beings.
- Feed a model enough of this "synthetic" data, and the quality of the AI's answers can rapidly deteriorate, as the systems lock in on the most probable word choices and discard the "tail" choices that keep their output interesting.
- "Model Autophagy Disorder," or MAD, is how one set of researchers at Rice and Stanford universities dubbed the result of AI consuming its own products.
- "Habsburg AI" is what another researcher earlier this year labeled the phenomenon, likening it to inbreeding: "A system that is so heavily trained on the outputs of other generative AIs that it becomes an inbred mutant, likely with exaggerated, grotesque features."
There are multiple nightmares embedded in this scenario.
- Publishers, media companies and other providers of quality information — fearful of having their valuable content scraped by AI companies — could keep more of their content off the web or behind paywalls, further impoverishing the public sphere. That's just one of many risks industry analyst Ray Wang, CEO of Constellation Research, highlights in a sobering essay.
- Understanding the lineage and veracity of content is essential, Wang told Axios. "We need to know the original source and we need to know the derivatives."
- Tech providers have had more luck so far with schemes to label AI-generated images than with efforts to identify AI-written text.
Meanwhile, AI providers are likely to face new challenges trying to keep the data troves used to train AI models unpolluted by AI-generated material.
- "To avoid being affected by model collapse, companies should try to preserve access to pre-2023 bulk stores of data," one industry guide advises.
Yes, but: Some more specialized AI companies say that they aren't that worried about a glut of AI-generated content on the web.
- "Human annotation is still critical to the success and quality of our models, rather than sole reliance on text on public websites," said Saurabh Baji, senior VP of engineering at Cohere. "With our enterprise focus, we also fine-tune based on a customer's internal documents, which greatly improves customization."
- Some observers believe that the glut of AI-generated content could add a new premium to any content that is human-crafted.
State of play: Less than a year into the generative AI revolution, big problems are already cropping up.
- MSN, which relies heavily on AI-generated content after laying off human writers, attracted all kinds of negative press earlier this month when it recommended the Ottawa Food Bank as a good spot for tourists on an empty stomach. (Microsoft, for its part, says unsupervised AI was not to blame.)
- A host of news sites have cropped up that appear to be doing little more than running human-written news stories through an AI engine and posting the results to the web. NewsGuard identified more than 400 such sites in a report last week.
- Those sites not only deprive the original publications of traffic, but could also make it harder for the public to pick out real firsthand journalism from a sea of AI copycats.
What's next: We don't know much yet about how a tide of AI content will change our world, and the best minds in AI aren't willing to predict how any of this will play out.
- As one leading company's researchers told us: "It's a very interesting question."