Axios AI+

August 31, 2023
Hi, it's Ryan. Today's AI+ is 1,085 words, a 4-minute read.
1 big thing: Open web revolts against AI giants
Illustration: Annelise Capossela/Axios
Nearly 20% of the top 1,000 websites in the world are blocking crawler bots that gather web data for AI services, according to new data from Originality.AI, an AI content checker, Axios' Sara Fischer reports.
Why it matters: In the absence of clear legal or regulatory rules governing AI's use of copyrighted material, websites big and small are taking matters into their own hands.
Driving the news: OpenAI introduced its GPTBot crawler early in August, declaring that the data gathered "may potentially be used to improve future models," promising that paywalled content would be excluded and instructing websites on how to bar the crawler.
- Soon after, several high-profile news sites, including the New York Times, Reuters and CNN, began blocking GPTBot, and many more have since followed. (Axios is among them.)
By the numbers: Of the 1,000 most visited websites in the world, the share of sites blocking OpenAI's GPTBot increased from 9.1% on Aug. 22 to 12% on Aug. 29, per Originality.AI's data.
- The biggest sites blocking GPTBot are Amazon, Quora and Indeed. Bigger websites are more likely to have already blocked AI bots, the data shows.
- The Common Crawl Bot — another crawler that regularly gathers web data used by some AI services — is blocked by 6.77% of the top 1,000 sites.
How it works: Any page you can access from a web browser can also be "scraped" by a crawler — which operates just like a browser but stores the material in a database instead of displaying it to a user.
- That's how search engines like Google gather their information.
- Site owners have always had the ability to post instructions that tell these crawlers to go away — but cooperation is strictly voluntary, and bad actors can ignore the instructions.
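Those voluntary instructions live in a site's robots.txt file. As a sketch of how they work, the snippet below uses Python's standard-library robots.txt parser against a hypothetical publisher's rules (the example.com URLs and the rules themselves are illustrative; "GPTBot" and "CCBot" are the user-agent names OpenAI and Common Crawl publish for their crawlers):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve: block the AI crawlers
# by name while leaving the site open to everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved crawler checks these rules before fetching a page.
print(parser.can_fetch("GPTBot", "https://example.com/article"))     # False
print(parser.can_fetch("Googlebot", "https://example.com/article"))  # True
```

Note that the parser only reports what the rules say — nothing in the protocol stops a crawler that simply ignores them, which is the "strictly voluntary" problem above.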
The big picture: Google and other web firms see their data crawlers' work as fair use, but many publishers and intellectual property holders have long objected, and these companies have faced multiple lawsuits over the practice.
- The rise of large language models and generative AI has pushed this question back into the spotlight, as AI companies send out their own crawlers to collect data to train their models and provide fodder for their chatbots.
Reality check: Some publishers saw at least some value in letting search crawlers access their sites since Google and other search sites sent users to their ad-supported sites.
- But in the AI era, publishers are more aggressively blocking crawlers because there's no upside, for now, in handing over their data to AI companies.
- Many media companies are currently in talks with AI firms about licensing their data for a fee, but those talks are in early stages.
- In the interim, some websites and intellectual property holders are taking or considering legal action against AI companies that may have used their data without permission.
Our thought bubble: Media outfits that feel they were taken advantage of by Google over the past two decades are eyeing the rapid commercialization of AI services like OpenAI's with hostility and a "we won't get fooled again" attitude.
- OpenAI is reportedly on track to bring in more than $1 billion in revenue over the next year, per The Information.
Zoom in: News companies, specifically, are struggling to find the right balance between embracing AI and resisting it.
- On one hand, the industry is desperate to find innovative ways to improve profit margins in its labor-intensive business.
- On the other, introducing AI into a newsroom's workflow, at a time when trust in news companies is at a historic low, presents challenging ethical questions.
What to watch: If too much of the web blocks AI crawlers, their owners could find it harder to refine and update their AI products — and good data is getting tougher to find.
- Originality.AI found that the rate of blocking the GPTBot among the top 1,000 websites is increasing roughly 5% per week.
Go deeper: Newsrooms grapple with rules for AI
2. Meta's tool to detect computer vision bias
Illustration: Sarah Grillo/Axios
Meta on Thursday released a new tool designed to spot racial and gender bias within computer vision systems, Axios' Ina Fried reports.
Why it matters: Many computer vision models have shown systematic bias against women and people of color. The hope is that improved tools will enable developers to better detect shortcomings and address them.
Details: Meta is offering researchers access to FACET (FAirness in Computer Vision EvaluaTion), a tool that evaluates how well computer vision models perform across various characteristics including perceived gender and skin tone.
- FACET was based on more than 30,000 images containing 50,000 people, all of which were tagged by experts across the different categories.
- Meta says FACET can be used to answer questions like whether a system is better at identifying skateboarders when their perceived gender is male, whether it is better at identifying people with light skin than with dark skin, and whether such problems are magnified when a person has curly rather than straight hair.
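The general technique a benchmark like FACET supports is disaggregated evaluation: scoring a model separately for each demographic group and comparing the results. A minimal sketch, using made-up toy data rather than FACET's actual images or labels:

```python
from collections import defaultdict

# Toy (true_label, predicted_label, group) triples -- hypothetical data
# standing in for expert-annotated examples like FACET's.
predictions = [
    ("skateboarder", "skateboarder", "perceived_male"),
    ("skateboarder", "skateboarder", "perceived_male"),
    ("skateboarder", "pedestrian",   "perceived_female"),
    ("skateboarder", "skateboarder", "perceived_female"),
]

hits = defaultdict(int)
totals = defaultdict(int)
for true, pred, group in predictions:
    totals[group] += 1
    if pred == true:
        hits[group] += 1

# Per-group accuracy; a large gap between groups signals bias.
for group in sorted(totals):
    print(group, hits[group] / totals[group])
```

With this toy data the model scores 1.0 for the "perceived_male" group and 0.5 for "perceived_female" — exactly the kind of gap a fairness benchmark is built to surface.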
What they're saying: "Benchmarking for fairness in computer vision is notoriously hard to do," Meta said in a blog post. "The risk of mislabeling is real, and the people who use these AI systems may have a better or worse experience based not on the complexity of the task itself, but rather on their demographics."
- "We want to continue advancing AI systems while acknowledging and addressing potentially harmful impacts of that technological progress on historically marginalized and underrepresented communities, especially," Meta chief ethicist Chloé Bakalar said in a statement to Axios.
3. Training data
- The U.S. Copyright Office is requesting public comment on the implications of AI for copyright law, in light of copyright claims involving AI authorship and infringement claims relating to AI-generated content. (Federal Register)
- Meanwhile, OpenAI laid out its defense in two suits brought by authors claiming the company infringed on their copyrights by training AI models using their writing. (Ars Technica)
- Google salary data, self-reported by 12,000 employees, estimates median compensation at Google was $279,802 in 2022. (Insider)
- Ernie Bot, a Chinese-language chatbot from tech giant Baidu, is now available to the public after six months of availability only to registered enterprises, the company announced Wednesday on Twitter.
- YouTube policy changes have reduced extremist "rabbit holes," but the site still helps extremist channels build audiences, according to a new academic study. (Science Advances)
4. + This
Photo: Burak Akbulut/Anadolu Agency via Getty Images
I leave you with an image from "Tomatina" — Spain's annual tomato-based street battle, where 15,000 or so people pelted each other with 120 tons of tomatoes dumped in the streets of Buñol.
Thanks to Scott Rosenberg for editing and Bryan McBournie for copy editing this newsletter.