May 24, 2024 - Technology

Anthropic scientists map a language model's brain

Illustration: Annelise Capossela/Axios

Researchers at Anthropic have mapped portions of the "mind" of one of their AIs, the company reported this week, in what it called "the first ever detailed look inside a modern, production-grade large language model."

Why it matters: Even the scientists who build advanced LLMs like Anthropic's Claude or OpenAI's GPT-4 can't say exactly how they work or why they provide a particular response — they're inscrutable "black boxes."

  • The new work from Anthropic raises the prospect that generative AI programs like ChatGPT might someday be much easier to understand and control — making them both more useful and, with luck, less dangerous.

How it works: Using a technique called dictionary learning, Anthropic's team found a way to isolate recurring patterns of activity among the model's neuron-like units and match those patterns to specific "features" (a rough sketch of the idea follows the list below).

  • The features were places, things, concepts — "a vast range of entities like cities (San Francisco), people (Rosalind Franklin), atomic elements (Lithium), scientific fields (immunology), and programming syntax (function calls)," Anthropic said in a post about the project.
  • Features could be located near related terms and ideas, they found: "Looking near a 'Golden Gate Bridge' feature, we found features for Alcatraz Island, Ghirardelli Square, the Golden State Warriors, California Governor Gavin Newsom, the 1906 earthquake, and the San Francisco-set Alfred Hitchcock film 'Vertigo.'"
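
Anthropic's post doesn't include code, but the dictionary-learning idea it describes is close to training a sparse autoencoder on the model's internal activations: reconstruct each activation from a much larger set of directions while keeping only a few of them active at a time. The sketch below is a minimal, hypothetical illustration; the layer sizes, sparsity penalty, and names are assumptions, not Anthropic's published implementation.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decomposes model activations into a sparse combination of learned 'feature' directions."""

    def __init__(self, d_model: int = 512, n_features: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activations -> feature coefficients
        self.decoder = nn.Linear(n_features, d_model)  # feature coefficients -> reconstruction

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse, non-negative feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coeff: float = 1e-3):
    # Reconstruct the original activations while keeping only a few features active at once.
    reconstruction_error = (reconstruction - activations).pow(2).mean()
    sparsity_penalty = features.abs().mean()
    return reconstruction_error + l1_coeff * sparsity_penalty
```

Each decoder column then becomes a candidate "feature" direction; the ones that reliably fire on, say, Golden Gate Bridge text are the ones that get that interpretation.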

Between the lines: Once particular features had been identified, the researchers could directly manipulate them, "artificially amplifying or suppressing them to see how Claude's responses change," per Anthropic.

  • Instead of having to teach or retrain the model or give it feedback, they could directly adjust its dials (a rough sketch of that follows the list below).
  • Anthropic, which aims "to ensure transformative AI helps people and society flourish," sees this work as a foundation for building safer AI.
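
In code terms, "adjusting a dial" amounts to adding or subtracting a feature's direction from the model's activations during a forward pass. The sketch below is hypothetical, reuses the decoder from the autoencoder sketch above, and its hook placement and scale value are assumptions rather than Anthropic's published settings.

```python
import torch

def steer(activations: torch.Tensor, sae: "SparseAutoencoder",
          feature_index: int, scale: float = 5.0) -> torch.Tensor:
    # Each column of the trained decoder is one feature's direction in activation space.
    direction = sae.decoder.weight[:, feature_index]  # shape: (d_model,)
    # A positive scale amplifies the feature; a negative scale suppresses it.
    return activations + scale * direction

# e.g. steered = steer(layer_activations, sae, golden_gate_index, scale=-5.0)  # suppress the feature
```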

Yes, but: The research is expensive — and each LLM may need to have its features catalogued independently.

  • The Anthropic project identified "millions" of features in the Claude Sonnet model they studied, but the researchers write that's likely only a fraction of all the features the model contains.
  • "We don't have an estimate of how many features there are or how we'd know we got all of them (if that's even the right frame!)," the research paper says — and "getting all of them" might require even more computing power than training the model in the first place, an already costly venture.

The bottom line: As generative AI becomes easier to directly program, its guardrails might also become more reliable.

  • Then again, in the wrong hands, the same dials that make the models safer could be used to amp up their capacity for harm.