Axios AI+

November 05, 2024
🗳️ Axios' live election maps: Keep this page in your tabs tonight to follow results from key races across the country.
Today's newsletter is 1,166 words, a 4.4-minute read.
1 big thing: What Meta's AI knows about you
As Facebook's parent company has aggressively injected Meta AI into its products, it has fueled its systems with a bounty of customer data.
Why it matters: Meta knows a ton about you and billions of other people. It's counting on that knowledge to build AI that's both powerful and relevant — and it's also limiting customers' ability to say "no" to this use of their information.
Catch up quick: Meta AI has been among the fastest-growing consumer AI systems, with more than 500 million people using the service each month, company leaders said on last week's earnings call.
- Throughout its 20-year history, Meta has vigorously collected users' personal data to personalize content feeds and target ads.
Zoom in: Meta says it can use any data shared publicly on Facebook and Instagram to train its AI systems.
- That means the company will use anything that you share with everyone on Facebook (not just with your friends or friends of friends), and anything posted to a standard, non-private Instagram account.
- Meta also says that "your interactions with AI features can be used to train AI models. Examples include messages to AI chats, questions you ask and images you ask Meta AI to imagine for you."
- That also includes photos taken with Meta Ray-Ban glasses that are used as part of an AI query.
- Notably, Meta does not let customers opt out of having their data used for training, except for those in Brazil and Europe.
Yes, but: The company does allow customers to delete data from their conversations with the Meta AI chatbot.
- Meta says that any content that users delete — either from conversations with Meta AI or from public posts on Facebook and Instagram — will not be used for future training.
The big picture: Meta has a bold vision for generative AI that involves using the technology to create an array of personalized content.
- The most widely used component is the Meta AI chatbot, which is now accessible from within Facebook, Messenger and WhatsApp (as well as on the Ray-Ban glasses).
Meta is also summarizing comments with AI and has said it will start using generative AI to create new content to fill customers' feeds.
- For starters, it's going to offer up AI-generated images based on customers' real photos. Meta says users will be able to further share those images if they like them, or turn them off if they don't.
- The company has tested a number of other ideas, including a since-ended trial that had Meta AI posting comments in existing threads. That test included a well-publicized case in which the AI chatbot appeared to post as if it were the parent of a disabled child.
Between the lines: Because Meta's services are largely free and ad-supported, the company benefits from people spending more time on them.
- In recent months, Meta has been surfacing more content from outside users' social networks that its algorithms predict people will find engaging. The next step, already telegraphed, is using AI to create content it expects you will like.
What we're watching: Meta is likely to be an early test case for just how compelling AI-generated content can be, especially when turbocharged with a ton of personal information.
- Also worth paying attention to is how the ads Meta shows customers evolve in this world — including whether they begin to be personalized using generative AI.
Previously in this series: What AI knows about you
2. Exclusive: AI training depends on premium content, study finds
Leading AI companies such as OpenAI, Google and Meta rely more on content from premium publishers to train their large language models (LLMs) than they publicly admit, according to new research from executives at Ziff Davis, one of the largest publicly traded digital media companies.
Why it matters: Publishers believe that the more they can show that their high-end content has contributed to training LLMs, the more leverage they will have in seeking copyright protection and compensation for their material in the AI era.
Zoom in: While AI firms generally do not say exactly what data they use for training, executives from Ziff Davis say their analysis of publicly available datasets makes it clear that AI firms rely disproportionately on commercial publishers of news and media websites to train their LLMs.
- The paper — authored by Ziff Davis' lead AI attorney, George Wukoson, and its chief technology officer, Joey Fortuna — finds that for some large language models, content from a set of 15 premium publishers made up a significant amount of the data sets used for training.
- For example, when analyzing an open-source replication of the OpenWebText dataset from OpenAI that was used to train GPT-2, the executives found that nearly 10% of the URLs featured came from the set of 15 premium publishers they studied.
Context: Ziff Davis is a member of the News/Media Alliance (NMA), a trade group that represents thousands of premium publishers. The new study's findings resemble those of a research paper submitted by NMA to the U.S. Copyright Office last year.
- That study found that popular curated datasets underlying major LLMs significantly overweight publisher content "by a factor ranging from over 5 to almost 100 as compared to the generic collection of content that the well-known entity Common Crawl has scraped from the web."
Zoom out: Unlike most of its biggest publishing competitors, such as Dotdash Meredith and Condé Nast, Ziff Davis has yet to strike a big data licensing or content sharing deal with a major AI firm.
Between the lines: The report also finds that a few public data sets used to train older LLMs are still being used today to train newer models.
- The paper's authors suggest the disproportionate reliance on premium publisher content to train older large language models extends to newer LLMs.
The big picture: Most news companies striking deals with AI firms aren't focusing on data training deals anymore, since those tend to be one-time windfalls.
- Instead, they are cutting longer-term deals to provide news content for generative AI-powered chatbots to answer real-time queries about current events.
- A high-profile lawsuit brought by the New York Times against OpenAI and Microsoft could help define for the broader industry whether scraping publisher content without permission and using it to train AI models and fuel their outputs is a copyright violation.
3. Training data
- Election officials say bad actors are using AI to target Latino voters with election misinformation. (Axios)
- Sources say OpenAI is in early talks with the California attorney general's office about transforming from a non-profit to a for-profit business. (Bloomberg)
- A University of Washington study found significant racial and gender bias when three large open-source LLMs were used for resume screening. (GeekWire)
- Caitlin Kalinowski, who has been leading Meta's AR glasses efforts, is joining OpenAI to oversee work in robotics and consumer hardware. (X)
- Everything you need to know about election threats today. (Axios)
4. + This
An intriguing idea. And, for those who don't remember (or weren't born yet), this was the ill-fated PowerMac G4 Cube.
Thanks to Megan Morrone and Scott Rosenberg for editing this newsletter and to Anjelica Tan for copy editing it.
Sign up for Axios AI+