When AI-produced code goes bad

Illustration: Annelise Capossela/Axios
The same generative AI tools that are supercharging the work of both skilled and novice coders can also produce flawed, potentially dangerous code.
Why it matters: Multiple studies have shown that more than half of programmers are using generative AI to write or edit the software that runs our world — and that number keeps rising.
Catch up quick: AI coding assistants can do everything from handling developers' drudge work to producing whole codebases from brief prompts.
- In 2022, GitHub found that developers who used its AI coding assistant worked 55% faster than those who didn't.
- According to an April 2024 poll from Gartner, 75% of software engineers will use generative AI code assistants by 2028. That's up from less than 10% of coders who used such tools in early 2023.
- All the tech giants and leading AI providers offer code assistants. OpenAI's ChatGPT can code, and so can Meta's Llama 3. Microsoft offers GitHub Copilot, Google's tool is called Gemini Code Assist and Amazon has AWS' CodeWhisperer.
Yes, but: The productivity gains come with a price.
- One study from Stanford found that programmers who had access to AI assistants "wrote significantly less secure code than those without access to an assistant."
- Another study from researchers at Bilkent University in 2023 found that 30.5% of code generated by AI assistants was incorrect and 23.2% was partially incorrect, although these percentages varied among different code generators.
- Research from code-reviewing tool GitClear found that the rise of AI coding assistants in 2022 and 2023 correlated with a rise in code that had to be fixed within two weeks of being written — and that if the trend continues in 2024, "more than 7% of all code changes will be reverted within two weeks."
- When ZDNet put general-purpose chatbots through a series of coding tests (like "write a WordPress plugin"), Microsoft Copilot failed all of them. Google Gemini Advanced, Meta AI and Meta Code Llama failed most of them. Only ChatGPT passed them all.
Programmers sense there's trouble.
- CodeSignal, a coding-skills assessment and AI learning platform, found that more than half of developers have concerns about the quality of AI-generated code.
Of course, human coders mess up, too.
- Alastair Paterson, CEO of Harmonic Security, told Axios that many of these models have skills equivalent to a junior developer's — but they can make different kinds of mistakes.
- "The large language model approach is fantastic at some tasks and less good at some other things that you'd think it would be really, really good at," Paterson said. "They make strange logical errors in numbers and loops."
- "The one thing that the large language models are very bad at is doing math," said CodeSignal CEO Tigran Sloyan.
- Paterson said that many projects require big, complex architectural decisions that "these systems are just not capable of thinking about at the moment."
- "A lot of the times the reason that they produce not very good code is that what was asked of them was not correct," Sloyan told Axios.
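To make the "strange logical errors in numbers and loops" concrete, here's a hypothetical sketch — not drawn from the article or any specific model's output — of the kind of off-by-one loop bug that AI assistants (like junior developers) commonly produce:

```python
# Hypothetical illustration: an off-by-one error in a loop that's
# supposed to sum the integers 1 through n.

def sum_first_n_buggy(n):
    """Intended to return 1 + 2 + ... + n, but range(1, n) stops at n - 1."""
    total = 0
    for i in range(1, n):  # bug: should be range(1, n + 1)
        total += i
    return total

def sum_first_n_fixed(n):
    """Correct version: range's upper bound is exclusive, so use n + 1."""
    return sum(range(1, n + 1))

print(sum_first_n_buggy(10))  # 45 — silently drops the final term
print(sum_first_n_fixed(10))  # 55 — the intended result
```

Bugs like this compile and run without error, which is exactly why plausible-looking generated code still needs human review and tests.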
AI code generators aren't yet able to generate programs from scratch without input from humans, but as these tools get better, the problems might get bigger.
- Right now, bad AI-generated code that's not caught by a human usually just makes for messy code libraries or minor problems rather than disasters.
- Lee Atchison, former Amazon technical program manager and author of the O'Reilly book "Architecting for Scale," wrote in March that "code complexity and the support costs associated with complex code have increased in recent years in large part due to the proliferation of AI-generated code use."
In other words, generative AI tools might save time and money upfront in code creation and then eat up those savings at the other end.
- That would make them less of a revolutionary breakthrough than the latest in a long line of innovations that help the software industry deploy fast and worry about fixing things later.
The big picture: There haven't yet been any public disasters related to unchecked AI-generated code, but Sloyan said it's only a matter of time.
- Problems might arise when AI programs are directing other AI programs to write code.
- "Looking ahead a few years is probably where I would worry," Paterson said. "If you've got autonomous actions being taken by some of these agents that are under full AI control with limited human input, I think that is where things start to get more interesting."
The other side: "I think we're some way off from some sort of AI apocalypse," Paterson said. "These tools ultimately are still just tools, and we've got a pretty good understanding of their limitations."
Editor's note: This story has been corrected to reflect that only ChatGPT passed all the ZDNet coding tests, while Google Gemini Advanced — like Meta AI and Meta Code Llama — failed most of them, and only Microsoft Copilot failed them all.
