
Illustration: Eniola Odetunde/Axios
The next generation of a machine learning language algorithm can better understand a person's instructions and intentions, OpenAI reported in a paper posted on its site today.
Why it matters: What's called "aligned" AI aims to be more honest and less toxic, biased or inappropriate — what most people consider required features for any artificial general intelligence that might be created in the future.
- "As AI systems become more powerful and take on more responsibility and make more consequential decisions, it will be increasingly important that it follow the intentions of operators — stated and unstated," says OpenAI co-founder and chief scientist Ilya Sutskever.
How it works: InstructGPT is a "fine-tuned" version of GPT-3, an algorithm that predicts what word comes next in a sentence or phrase.
- But that is "very different than safely and effectively doing a task the user wants to perform," says Jan Leike, who heads alignment research at OpenAI.
- For example, Leike says, if you ask GPT-3 to explain the Moon landing to a 6-year-old, it might explain the theories of gravity, relativity and the Big Bang. "It's trying to guess a pattern, as if this was text on a random webpage."
- But InstructGPT returns: "People went to the Moon, and they took pictures of what they saw, and sent them back to the earth so we could all see them." (A quick way to compare the two models is sketched below.)
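Between the lines: You can reproduce the comparison yourself against OpenAI's API. The sketch below uses the 0.x-era openai Python client; the model names ("davinci" for base GPT-3, "text-davinci-001" for an InstructGPT variant) are assumptions based on OpenAI's naming at the time, not details from the article or the paper.

```python
# Minimal sketch, assuming the 0.x openai Python client and
# era-appropriate model names ("davinci" = base GPT-3,
# "text-davinci-001" = an InstructGPT model). Requires an API key.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

PROMPT = "Explain the moon landing to a 6-year-old in a few sentences."

for model in ("davinci", "text-davinci-001"):
    response = openai.Completion.create(
        model=model,
        prompt=PROMPT,
        max_tokens=80,
        temperature=0.7,
    )
    print(f"--- {model} ---")
    print(response["choices"][0]["text"].strip())
```

The base model tends to continue the prompt as if it were found text; the instruction-tuned model tends to answer it.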
"To get to what humans want, [you] have to get humans in the system."
— Jan Leike, OpenAI
The researchers trained InstructGPT with humans in the loop, who annotated and rated the model's responses to signal whether it was doing something close to what people want (a toy sketch of that rating step follows this list).
- The results suggest InstructGPT is better at following instructions and slightly less toxic but not less biased than GPT-3.
- The model is now the default one for OpenAI's API.
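The details: The paper describes the technique as reinforcement learning from human feedback: labelers' rankings are used to train a "reward model" that scores responses, and that score then steers the language model. The toy sketch below illustrates only the ranking loss at the heart of that step; every name in it is invented for illustration, and it is not OpenAI's code.

```python
# Toy illustration of the pairwise ranking loss used to train a reward
# model from human comparisons (the core signal in RLHF, per the
# InstructGPT paper). All names are hypothetical stand-ins.
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    """Maps a response embedding to a single scalar reward."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

reward_model = TinyRewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Stand-in embeddings for a response the labeler preferred ("chosen")
# and one they ranked lower ("rejected"), in batches of 8.
chosen = torch.randn(8, 16)
rejected = torch.randn(8, 16)

# Train the model to score the preferred response higher:
# loss = -log sigmoid(r_chosen - r_rejected).
loss = -torch.nn.functional.logsigmoid(
    reward_model(chosen) - reward_model(rejected)
).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In the full pipeline, that learned reward signal, not the ranking loss itself, is what fine-tunes the language model via reinforcement learning.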
Yes, but: The model still makes mistakes and is "far from fully aligned."
- It struggles with prompts based on false premises, and it can be overly deferential, giving multiple possible answers, even if one is clear from the context.
- InstructGPT's behavior is influenced by the human judgments it receives, and therefore by the identities, beliefs and experiences of the people who train the models. More research is needed to make the models inclusive.
- And it can still produce toxic or biased outputs if instructed to, making it a potential tool for disinformation and underscoring the need for guardrails, Leike says.
Go deeper: The new version of GPT-3 is much better behaved (and should be less toxic) (MIT Tech Review)