Axios AI+

April 05, 2024
Ina here, wishing a belated happy birthday to Jane Goodall, who turned 90 on Wednesday. Today's AI+ is 1,228 words, a 4.5-minute read.
1 big thing: For AI firms, anything "public" is fair game
Illustration: Sarah Grillo/Axios
Leading AI companies have a favorite phrase when it comes to describing where they get the data to train their models: They say it's "publicly available" on the internet.
Why it matters: "Publicly available" can sound like the company has permission to use the information — but, in many ways, it's more like the legal equivalent of "finders, keepers."
"That phrase confuses people," developer Ed Newton-Rex tells Axios. "It's probably designed to confuse people."
- Newton-Rex spent years building AI audio systems before resigning from Stability AI, citing concerns about generative systems built with copyrighted material.
- "Publicly available" doesn't mean anyone has given permission for use in the training of an AI system, Newton-Rex notes.
- "Essentially all they are saying is, 'We have not illegally hacked into a system,'" says Newton-Rex, who now runs Fairly Trained, an organization that certifies models built on either licensed or public domain data.
Zoom in: The term, perhaps by design, sounds like "public domain" — which refers to information that is no longer subject to copyright protection or otherwise made freely available.
- Lots of information is "publicly available" but subject to various protections, including copyright.
- Large collections of pirated content have been made "publicly available" without the permission or consent of the creators.
"Many of the 'publicly available' books they took were from websites known for pirated content," Clarkson Law Firm partner Timothy K. Giordano tells Axios. "The receipt and subsequent commercial misuse of stolen property won't play well before a jury." (Clarkson has brought several suits against AI companies.)
- Authors, publishers and copyright holders have brought a number of suits arguing that AI companies are engaging in massive copyright infringement, both in the training and operation of their products.
The big picture: The debate over training data comes as AI companies seek even more data sources to train ever-more powerful versions of their large language models.
- Earlier this week, the Wall Street Journal reported that OpenAI was considering training models on YouTube transcripts as the industry peeks around corners to find high-quality information suitable for the task.
- Companies are even looking at ways to train models on data that itself was created by AI, also known as synthetic data.
The other side: The AI companies have two chief legal arguments.
- Many maintain that their broad use of copyrighted material is legal under the doctrine of "fair use," which courts apply using a four-factor test.
- However, as Giordano notes, "the public status of copyrighted material" is not one of those factors.
- A decade ago, the Google Books decision held that Google's use of "text snippets" to catalogue published works was an acceptable fair use, and AI companies often point to Google's win to back their argument.
- The second argument is that copyright is not an issue in AI training because AI systems don't copy material: They just "learn" from it the way a human might.
Reality check: AI companies often refuse to say which "publicly available" data they are using, with OpenAI and others describing their sources as a competitive trade secret.
- In a recent video interview with the Wall Street Journal, OpenAI's Mira Murati said the data used to train the Sora video engine came from publicly available sources — but she declined to be more specific, including refusing to say whether YouTube videos were included.
- In another recent interview, YouTube CEO Neal Mohan told Bloomberg's Emily Chang that if OpenAI was training Sora on its videos, that would be a "clear violation" of YouTube's terms of service.
Privacy is also in play. Data that is "publicly available" but tucked away on some obscure site could be much more widely circulated by an AI chatbot trained on that information.
- AI didn't create this problem — the internet kicked it off, as records once confined to a courthouse or clerk's office became digitized and then amalgamated by data brokers.
- However, AI chatbots add a new, accessible means to unearth and widely share once-obscure information.
What the companies say: OpenAI tells Axios it uses a mix of licensed data in addition to information from the internet, and that it provides a way for site owners to block their data from being used to train future models.
- As for what constitutes "publicly available" content, OpenAI says, "We only use publicly available information that is freely and openly available on the internet — for example, we do not use information that is password protected or behind paywalls."
Google, a company representative tells Axios, trains its models primarily on publicly available data from the internet, with sources that include blog posts, media transcripts and public conversation forums.
- It said it also provides a mechanism for publishers to indicate that they don't want models trained on their content.
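Both companies' opt-outs work through a site's robots.txt file. As a rough sketch — GPTBot and Google-Extended are the crawler tokens OpenAI and Google have publicly documented for this purpose — a publisher wanting to block AI training crawls might add:

```text
# Block OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Tell Google not to use this site's content for AI training
User-agent: Google-Extended
Disallow: /
```

Note that these directives rely on crawlers voluntarily honoring robots.txt, and they only affect future crawls — not data already collected.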
Meta says, in its responsible use guide for Llama 2, "The training datasets for Llama are sourced from a broad set of diverse, publicly available online data."
Microsoft "uses a variety of data sources, including publicly available information, in a manner consistent with copyright and IP laws," a company representative tells Axios.
2. Meta to broaden labeling of AI-created content
Illustration: Sarah Grillo/Axios
Meta will begin labeling a wider range of video, audio and image content as "Made with AI" starting in May, Ryan reports.
Why it matters: Meta admits its current labeling policies are "too narrow" and that a stronger system is needed to deal with today's wider range of AI-generated and otherwise manipulated content, such as a January video that appeared to show President Biden inappropriately touching his granddaughter.
- The labels could be generated through self-disclosure when a user posts content, as a result of advice from fact-checkers, or via Meta detecting invisible markers of AI-made content.
Context: Meta's new policy is a response to feedback from its independent Oversight Board, which urged an update to the current rules.
- The present "manipulated media" policy applies only to "videos that are created or altered by AI to make a person appear to say something they didn't say."
- Starting in February, the company began to add "Imagined with AI" labels to photorealistic images made with the Meta AI feature.
What they're saying: "We'll keep this content on our platforms so we can add labels and context," Monika Bickert, vice president of content policy, writes in a blog post, arguing that additional transparency is better than censoring content.
- But Meta will "remove content, regardless of whether it is created by AI or a person, if it violates our policies against voter interference, bullying and harassment, violence and incitement, or any other policy," Bickert writes.
3. Training data
- A Washington state judge refused to allow an AI-enhanced version of a cellphone video as evidence in a criminal trial, saying the algorithm uses "opaque methods to represent what the AI model 'thinks' should be shown." (NBC News)
- The inclusion of AI-generated electronic texts in Google Books could erode the value of a key dataset used by researchers. (404 Media)
- Trading places: Vimeo named Philip Moyer, a former Google Cloud AI executive, as its new CEO.
- Also, Thomas Zacharia, the former head of Oak Ridge National Laboratory, has joined AMD as senior VP of strategic technology partnerships and public policy, focused on expanding strategic AI relationships.
4. + This
Check out these "cow shoes" used by some Prohibition-era moonshiners seeking to hide their footprints.
Thanks to Scott Rosenberg and Megan Morrone for editing this newsletter and to Caitlin Wolper for copy editing it.