Apr 5, 2024 - Technology

For AI firms, anything "public" is fair game


Illustration: Sarah Grillo/Axios

Leading AI companies have a favorite phrase when it comes to describing where they get the data to train their models: They say it's "publicly available" on the internet.

Why it matters: "Publicly available" can sound like the company has permission to use the information — but, in many ways, it's more like the legal equivalent of "finders, keepers."

"That phrase confuses people," developer Ed Newton-Rex tells Axios. "It's probably designed to confuse people."

  • Newton-Rex spent years building AI audio systems before resigning from Stability AI, citing concerns about generative systems built with copyrighted material.
  • "Publicly available" doesn't mean anyone has given permission for use in the training of an AI system, Newton-Rex notes.
  • "Essentially all they are saying is, 'We have not illegally hacked into a system,' " says Newton-Rex, who now runs Fairly Trained, an organization that certifies models built on either licensed or public domain data.

Zoom in: The term, perhaps by design, sounds like "public domain" — which refers to information that is not subject to copyright protection, either because that protection has expired or because the work was freely dedicated to public use.

  • Lots of information is "publicly available" but subject to various protections, including copyright.
  • Large collections of pirated content have been made "publicly available" without the permission or consent of the creators.

"Many of the 'publicly available' books they took were from websites known for pirated content," Clarkson Law Firm partner Timothy K. Giordano tells Axios. "The receipt and subsequent commercial misuse of stolen property won't play well before a jury." (Clarkson has brought several suits against AI companies.)

  • Authors, publishers and copyright holders have brought a number of suits arguing that AI companies are engaging in massive copyright infringement, both in the training and operation of their products.

The big picture: The debate over training data comes as AI companies seek even more data sources to train ever-more powerful versions of their large language models.

  • Earlier this week, the Wall Street Journal reported that OpenAI was considering training models on YouTube transcripts as the industry peeks around corners to find high-quality information suitable for the task.
  • Companies are even looking at ways to train models on data that itself was created by AI, also known as synthetic data.

The other side: The AI companies have two chief legal arguments.

  • Many maintain that their broad use of copyrighted material is legal under the doctrine of "fair use," which courts apply using a four-factor test.
  • However, as Giordano notes, "the public status of copyrighted material" is not one of those factors.
  • A decade ago, the Google Books decision held that Google's use of "text snippets" to catalogue published works was an acceptable fair use, and AI companies often point to Google's win to back their argument.
  • The second argument is that copyright is not an issue in AI training because AI systems don't copy material: They just "learn" from it the way a human might.

Reality check: AI companies often refuse to say which "publicly available" data they are using, with OpenAI and others describing their sources as a competitive trade secret.

  • In a recent video interview with the Wall Street Journal, OpenAI's Mira Murati said the data used to train the Sora video-generation model came from publicly available sources — but she declined to be more specific, including refusing to say whether YouTube videos were included.
  • In another recent interview, YouTube CEO Neal Mohan told Bloomberg's Emily Chang that if OpenAI was training Sora on its videos, that would be a "clear violation" of YouTube's terms of service.

Privacy is also in play: Data that is "publicly available" but tucked away on some obscure site could be much more widely circulated by an AI chatbot trained on such information.

  • AI didn't create this problem — the internet kicked it off, as records once confined to a courthouse or clerk's office became digitized and then amalgamated by data brokers.
  • However, AI chatbots add a new, accessible means to unearth and widely share once-obscure information.

What the companies say: OpenAI tells Axios it uses a mix of licensed data in addition to information from the internet, and that it provides a way for site owners to block their data from being used to train future models.

  • As for what constitutes "publicly available" content, OpenAI says, "We only use publicly available information that is freely and openly available on the internet — for example, we do not use information that is password protected or behind paywalls."

Google, a company representative tells Axios, trains its models primarily on publicly available data from the internet, with sources that include blog posts, media transcripts and public conversation forums.

  • It says it also provides a mechanism for publishers to indicate that they don't want models trained on their content.
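Both opt-out mechanisms described above work through a site's robots.txt file, using the crawler tokens the two companies have documented: OpenAI's `GPTBot` and Google's `Google-Extended`. A minimal sketch of what a site owner would publish (the blanket `Disallow: /` is illustrative — owners can scope it to specific paths):

```
# Block OpenAI's training crawler from the entire site
User-agent: GPTBot
Disallow: /

# Tell Google not to use this site's content to train its AI models
# (Google-Extended governs AI training use, not Search indexing)
User-agent: Google-Extended
Disallow: /
```

Note that robots.txt is a voluntary convention: it signals a site owner's wishes to well-behaved crawlers but does not technically prevent access, and it has no effect on copies of the content already collected or hosted elsewhere.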

Meta says, in its responsible use guide for Llama 2, "The training datasets for Llama are sourced from a broad set of diverse, publicly available online data."

Microsoft "uses a variety of data sources, including publicly available information, in a manner consistent with copyright and IP laws," a company representative tells Axios.
