Exclusive: Mozilla establishes AI data-sharing collective

Illustration: Annelise Capossela/Axios
Mozilla is leading a new effort to ensure a more representative set of data powers AI models around the globe, the company first tells Axios.
Why it matters: Generative AI models are only as good as the data that underlies them — and many languages and regions are vastly underrepresented.
Driving the news: The Mozilla Data Collective, as the project is known, is formally launching in time for the upcoming Mozilla Festival in Barcelona from Nov. 7-9.
- The collective is starting with a set of voice data Mozilla collected from more than a million people around the world, with more than 30,000 hours of speech in 300 different languages.
- The goal is to expand that to a broad array of other sources.
- Data set owners can set a price for use of their data, or impose other limits on who can use it, under what terms or for what purposes.
- For data sets that do seek compensation, the owners keep 100% of their data fee. Mozilla adds a 5% "platform charge" to the buyer's cost, which is reinvested in the Mozilla Data Collective.
What they're saying: "We want to build a world where there is actually fair value exchange around data and a less extractive set of norms," E.M. Lewis-Jong, founder and VP of the Mozilla Data Collective, told Axios. "We want to build a system in which communities benefit from their data in the ways that they want to benefit."
Between the lines: Getting paid directly, though, isn't always what communities want when they share their data.
- Some want the data only used for projects in their region, some want use limited to research, while still others are fine with their data being commercially used, so long as it's not by a tech giant.
Yes, but: It's one thing to put bespoke terms into a license; it's another to enforce them. Lewis-Jong acknowledges such enforcement mechanisms don't yet exist, "but that is exactly what we're trying to build."
The big picture: The goal of getting these data sets into the world, Lewis-Jong said, is to create models that represent a broader swath of the global population. Today's AI models are frequently powered by data sets that overrepresent English, and American English at that.
- "It's really important for models to be trained on data that is representative," they said. "If the data that goes in is junk and it is not representative, then your model outputs are not going to be solid."
Go deeper: New wallet app lets users store their own training data
