Anthropic scientists map a language model's brain

Researchers at Anthropic have mapped portions of the "mind" of one of their AIs, the company reported this week, in what it called "the first ever detailed look inside a modern, production-grade large language model."

Why it matters: Even the scientists who build advanced LLMs like Anthropic's Claude or OpenAI's GPT-4 can't say exactly how they work or why they provide a particular response — they're inscrutable "black boxes."

  • The new work from Anthropic raises the prospect that generative AI programs like ChatGPT might someday be much easier to understand and control, making them both more useful and, with luck, less dangerous.

How it works: Using a technique called dictionary learning, Anthropic's team found a way to identify sets of neuron-like "nodes" in their LLM that the program associated with specific "features."

  • The features were places, things, concepts — "a vast range of entities like cities (San Francisco), people (Rosalind Franklin), atomic elements (Lithium), scientific fields (immunology), and programming syntax (function calls)," Anthropic said in a post about the project.
  • Features could be located near related terms and ideas, they found: "Looking near a 'Golden Gate Bridge' feature, we found features for Alcatraz Island, Ghirardelli Square, the Golden State Warriors, California Governor Gavin Newsom, the 1906 earthquake, and the San Francisco-set Alfred Hitchcock film 'Vertigo.'"
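
To make the idea concrete, here is a toy sketch in Python, not Anthropic's code. It assumes each feature is a direction in the model's activation space, that "nearby" means high cosine similarity between directions, and that an activation can be explained as a weighted sum of a few features. The three feature names, their hand-picked 3-number vectors, and the helper functions are all hypothetical. Real dictionary learning also learns the directions from data; this sketch fixes them by hand and uses greedy matching pursuit, a textbook stand-in, only to show the sparse decomposition step.

```python
import math

# Hypothetical toy dictionary: each feature is a unit-length direction in a
# 3-dimensional activation space. Real models are vastly larger.
FEATURES = {
    "golden_gate_bridge": [1.0, 0.0, 0.0],
    "alcatraz_island": [0.8, 0.6, 0.0],
    "function_calls": [0.0, 0.0, 1.0],
}


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def nearest_features(name, k=1):
    """Rank the other features by cosine similarity to the named one."""
    target = FEATURES[name]
    scored = [
        (cosine(target, vec), other)
        for other, vec in FEATURES.items()
        if other != name
    ]
    return [other for _, other in sorted(scored, reverse=True)[:k]]


def sparse_code(activation, n_active=2):
    """Greedy matching pursuit: explain an activation with a few features."""
    residual = list(activation)
    code = {}
    for _ in range(n_active):
        # Pick the feature that best matches what is still unexplained.
        name = max(FEATURES, key=lambda n: abs(dot(residual, FEATURES[n])))
        strength = dot(residual, FEATURES[name])  # valid for unit-length vectors
        if abs(strength) < 1e-9:
            break
        code[name] = code.get(name, 0.0) + strength
        residual = [r - strength * f for r, f in zip(residual, FEATURES[name])]
    return code


print(nearest_features("golden_gate_bridge"))
print(sparse_code([2.0, 0.0, 0.5]))
```

In this toy setup the bridge feature's nearest neighbor is the Alcatraz feature, and the sample activation breaks down into a strong bridge component plus a weak function-call component.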

Between the lines: Once particular features had been identified, the researchers could directly manipulate them, "artificially amplifying or suppressing them to see how Claude's responses change," per Anthropic.

  • Instead of having to teach or retrain the model or give it feedback, they could directly adjust its dials.
  • Anthropic, which aims "to ensure transformative AI helps people and society flourish," sees this work as a foundation for building safer AI.
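
What "adjusting the dials" could look like can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Anthropic's implementation: it assumes a feature is a unit-length direction in the model's internal activations, and that amplifying or suppressing it means shifting an activation along that direction until the feature's strength hits a chosen value. The `steer` function and the sample numbers are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def steer(activation, direction, target_strength):
    """Shift an activation along a unit-length feature direction so that the
    feature's strength (its projection onto the direction) equals the target.
    Everything orthogonal to the direction is left unchanged."""
    current = dot(activation, direction)
    delta = target_strength - current
    return [a + delta * d for a, d in zip(activation, direction)]


# Hypothetical example: the first coordinate plays the role of one feature.
feature_direction = [1.0, 0.0, 0.0]
activation = [0.5, 1.0, 0.0]

amplified = steer(activation, feature_direction, 10.0)  # turn the dial up
suppressed = steer(activation, feature_direction, 0.0)  # turn the dial off

print(amplified)
print(suppressed)
```

The point of the sketch is that no retraining is involved: the model's weights stay the same, and only the intermediate activation is nudged before the model continues its computation.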

Yes, but: The research is expensive — and each LLM may need to have its features catalogued independently.

  • The Anthropic project identified "millions" of features in the Claude Sonnet model they studied, but the researchers write that's just a fraction of the whole model.
  • "We don't have an estimate of how many features there are or how we'd know we got all of them (if that's even the right frame!)," the research paper says — and "getting all of them" might require even more computing power than training the model in the first place, an already costly venture.

The bottom line: As generative AI becomes easier to directly program, its guardrails might also become more reliable.

  • Then again, in the wrong hands, the same dials that make the models safer could be used to amp up their capacity for harm.
