Shedding light on AI's black box

Illustration: Sarah Grillo/Axios
Scientists are opening generative AI's black box and beginning to understand the models' inner workings.
Why it matters: The prospect of harnessing genAI to make decisions and perform tasks is pushing researchers to better understand how AI systems work — and how they might be controlled.
- "We can't base our entire understanding of [large language models] on their inputs and outputs alone," says Marissa Connor, a machine learning researcher at the Software Engineering Institute at Carnegie Mellon University.
- "If you're relying on AI models to work in high-impact situations — like diagnosing medical conditions — then it is important to understand why they have a specific output."
Catch up quick: Unlike conventional computer programs, which follow a set of rules to produce the same output every time they're given the same input, genAI models find patterns in vast amounts of data and can produce multiple possible answers from a single input.
- The internal mechanics of how an AI model arrives at those answers aren't visible, leading many researchers to describe them as "black box" systems.
- It's important to look inside the black box to "understand model bias, understand model decision-making and ensure safe system performance," Connor says.
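That contrast can be sketched with a toy next-token sampler. Everything here is illustrative: the tokens and scores are made up, and real models work over vocabularies of tens of thousands of tokens.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-word scores for the prompt "The sky is ..."
tokens = ["blue", "clear", "falling"]
logits = [3.0, 2.0, 0.5]

# A rule-based program would always pick the top-scoring token.
deterministic = tokens[logits.index(max(logits))]

# A generative model samples from the distribution instead,
# so repeated runs of the same prompt can give different answers.
probs = softmax(logits)
samples = random.choices(tokens, weights=probs, k=5)
```

The sampling step is what makes the input-output relationship probabilistic rather than fixed, which is part of why inputs and outputs alone don't reveal how the model works.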
Zoom in: One way AI researchers are trying to understand how models work is by looking at the combinations of artificial neurons that are activated in an AI model's neural network when a user enters an input.
- These combinations, referred to as "features," relate to different places, people, objects and concepts.
- Researchers at Anthropic used this method to map a layer of the neural network inside the company's Claude Sonnet model and identified different features for people (Albert Einstein, for example) and concepts such as "inner conflict."
- They found that some features are located near related terms: For example, the "inner conflict" feature is near features related to relationship breakups, conflicting allegiances and the notion of a catch-22.
- When the researchers manipulated features, the model's responses changed, opening up the possibility of using features to steer a model's behavior.
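A rough sketch of the manipulation idea: treat a feature as a direction in the model's activation space, measure how strongly it fires, and steer by adding a scaled copy of that direction. The vectors and the "inner conflict" direction below are made up for illustration; real activations have thousands of dimensions and features are found by trained tools, not written by hand.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def feature_activation(hidden, feature_dir):
    """How strongly a hypothetical feature fires on this activation."""
    return dot(hidden, feature_dir)

def steer(hidden, feature_dir, strength):
    """Nudge the activation along the feature direction."""
    return [h + strength * f for h, f in zip(hidden, feature_dir)]

# A made-up hidden activation and a made-up "inner conflict" direction.
hidden = [0.2, -0.5, 0.1, 0.8]
inner_conflict = [0.0, 1.0, 0.0, 0.5]

before = feature_activation(hidden, inner_conflict)
steered = steer(hidden, inner_conflict, 2.0)
after = feature_activation(steered, inner_conflict)
```

Dialing the strength up or down changes how much the concept shows up in the model's behavior, which is the sense in which features could be used to steer a model.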
OpenAI similarly looked at a layer near the end of its GPT-4 network and found 16 million features, which are "akin to the small set of concepts a person might have in mind when reasoning about a situation," the company said in a post about the work.
- They found features related to rhetorical questions, price increases and human imperfection, and developed new metrics for evaluating features.
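Both labs extracted features with tools called sparse autoencoders, which re-express a layer's dense activations as a much larger set of candidate features, only a few of which fire at once. A toy forward pass, with hand-picked weights standing in for what is actually learned from training:

```python
def relu(x):
    """Zero out negatives -- this is what makes the feature code sparse."""
    return [max(0.0, v) for v in x]

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

# Encoder mapping a 3-dim activation into 6 candidate "features"
# (wider than the input, as in a real sparse autoencoder).
W_enc = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [-1.0, 0.0, 0.0],
    [0.0, -1.0, 0.0],
    [0.0, 0.0, -1.0],
]

activation = [0.7, -0.2, 0.0]
features = relu(matvec(W_enc, activation))
active = [i for i, f in enumerate(features) if f > 0]
```

Scaled up to millions of features across a real model's layers, interpreting which inputs make each feature fire is what produces labels like "rhetorical questions" or "price increases."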
Yes, but: The papers from OpenAI and Anthropic acknowledge it is early days for the work, especially for how it might apply to AI safety.
- One issue OpenAI flagged is the difficulty interpreting many features because they have no clear pattern or there are spurious activations of the neurons.
- And while the research looks at larger language models than previous work, it examines just a slice of these massive models and captures a fraction of the concepts represented in a model's billions of neurons activated across a network's many layers.
The latest: Google DeepMind tried to tackle that limitation in its recent release of Gemma Scope, a tool that looks across all of the layers in a version of the company's Gemma model, covering 30 million features.
The big picture: The unknowns about what happens in a large language model between input and output echo observations in other areas of science where there is an "inexplicable middle," says Peter Lee, president of Microsoft Research.
- In biology, there's an understanding of DNA — including the fundamental physics underlying its chemistry — and descriptions of the behaviors of animals, microbes, plants and people. But in between are some of biology's biggest and most complicated questions, including how genetic, molecular and environmental processes shape a cell's development.
"My claim would be generative AI has created for scientists another example of that kind of problem," Lee says.
- "We know with increasing precision some of the very basic mechanisms," he says. And then on the other end, "we are getting more and more experience actually using AI systems."
- But there's something in the middle: "Why is it that at a certain scale the model goes from not understanding what is and isn't a joke to suddenly knowing what is and isn't a joke?"
What to watch: The question of how a model works leads to how it is evaluated, and that itself has become a major research focus.
- With genAI, "we are now, for the first time, allowing ourselves to fantasize about the possibility of computers doing highly skilled knowledge work," Lee says. That sort of work, he adds, isn't about achieving perfection but being effective and trustworthy.
- While there is increased understanding of the underlying mathematics of how AI systems perform against certain benchmarks, evaluating their work "starts to veer more in the direction of how do we evaluate a person that we are employing to do it?"
The bottom line: At the end of the day, Lee says, how to evaluate an AI model is "as mysterious as how to evaluate a human being."
