Sep 25, 2025 - Technology

Bots are elbowing out humans in skill at office work

Megan Morrone

A bar chart comparing the performance of various AI models on 220 digital work tasks against human experts, from a study published September 2025. 48% of human graders preferred the deliverables from Claude Opus 4.1 over experts. GPT-5 follows at 39%. Other models include OpenAI o3 (34%), OpenAI o4-mini (28%), Gemini 2.5 Pro (26%), Grok 4 (24%), and GPT-4o (12%). — Data: OpenAI; Chart: Axios Visuals

A new OpenAI tool to evaluate AI model performance on "economically valuable work" shows that the bots are gaining on us when it comes to common job tasks.

Why it matters: We're at an AI reckoning, where leaders are trying to justify investments without effective tools to measure returns.

A recent MIT study showing that most AI projects fail set off a debate about its methodology, but it also exposed how hard it is to measure returns on these massive investments.

Driving the news: On Thursday OpenAI introduced GDPval-v0, a new way to measure how well AI models perform what it calls "authentic work deliverables," like creating legal briefs, engineering blueprints and nursing care plans.

The "GDP" in GDPval stands for Gross Domestic Product, which OpenAI says researchers used as the key economic indicator for the evaluations.
The tasks the company tested came from occupations in the industries that contribute most to GDP.

What they did: Researchers looked at around 1,300 work tasks across 44 occupations, in nine business sectors that each make up more than 5% of U.S. GDP.

Expert graders compared AI and human deliverables using detailed rubrics to decide which was better.
"We finally have a way to measure how our models perform in the real world — not just on academic tests — which is a key way for us to measure progress towards our goal of AGI," OpenAI researcher Tejal Patwardhan told Axios.

Between the lines: OpenAI didn't just look at its own models.

Researchers also looked at how Anthropic's Claude, Google's Gemini, and xAI's Grok compared to human workers.

What they found: Today's leading models are approaching parity with human professionals on many tasks, and the gains are accelerating.

In blind tests of 220 tasks, Claude Opus 4.1 edged out others, with its outputs rated as good as — or better than — human experts 47.6% of the time.
OpenAI's GPT-5 came in a close second, excelling in domain-specific knowledge.
The research found that frontier models can complete the GDPval-v0 tasks roughly a hundred times faster and cheaper than experts.

Yes, but: The speed and cost numbers are based on model inference time and API billing rates, and don't capture the cost of the human oversight required in a real-world setting, per the research.

What they're saying: Just because AI models can complete these tasks better, cheaper and faster doesn't mean they're going to edge all humans out of the workforce anytime soon, OpenAI chief economist Ronnie Chatterji told Axios.

"Your job is going to be different with a different set of tasks, maybe, than it was yesterday," Chatterji says. "It's gonna be hard to track the direct impact on the job market."
"The data shows that AI models are increasingly capable of doing a lot of the work that humans do right now," he added. "So that's where I think the economic value is coming from — as a complement to workers."

Stunning stat: Performance has more than doubled from GPT-4o (released spring 2024) to GPT-5 (released summer 2025).

"We find that the rate of improvement is super-linear; in other words, the gains are accelerating," OpenAI says in the report.
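By the numbers: As a rough check on that stat, the rounded figures from the chart above (12% for GPT-4o, 39% for GPT-5) can be compared directly. This is only a sketch; it assumes the chart percentages are the metric behind the "more than doubled" claim, which the report may define differently, and the function name is illustrative.

```python
# Rounded rates taken from the chart description (percent of comparisons
# in which graders favored the model's deliverable). Assumption: these are
# the same figures the "more than doubled" claim refers to.
chart_rates = {"GPT-4o": 12, "GPT-5": 39}


def improvement_ratio(old: float, new: float) -> float:
    """Return how many times larger `new` is than `old`."""
    if old <= 0:
        raise ValueError("old rate must be positive")
    return new / old


ratio = improvement_ratio(chart_rates["GPT-4o"], chart_rates["GPT-5"])
print(f"GPT-5 vs GPT-4o: {ratio:.2f}x")  # prints "GPT-5 vs GPT-4o: 3.25x"
```

On the rounded chart values the ratio is about 3.25, which is consistent with "more than doubled."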
