The push to make big AI small
An effort to develop smaller, cheaper AI models could help put the power of machine learning in the hands of more people, products and companies.
Why it matters: Large language models get most of the AI attention, but even those that are open source aren't practical for many AI researchers who want to iterate on them to create their own models for new tools and products.
- Some LLMs use more than 100 billion parameters to generate an output for a prompt and require significant and expensive computing power to train and run.
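A rough back-of-the-envelope sketch of why parameter count drives cost: just holding a model's weights in memory takes space proportional to the number of parameters. The function name and the bytes-per-parameter table below are illustrative assumptions, not figures from the researchers quoted here, and the estimate ignores everything else needed to train or run a model (activations, optimizer state, caches).

```python
# Illustrative only: memory needed just to store model weights.
# The precisions and helper name are assumptions made for this sketch.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gb(num_params, precision="fp16"):
    """Return gigabytes (10^9 bytes) needed to store the weights alone."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

# Compare a 100-billion-parameter model with a 1-billion-parameter one.
for num_params in (100e9, 1e9):
    gb = weight_memory_gb(num_params, "fp16")
    print(f"{num_params:.0e} parameters at fp16: about {gb:.0f} GB of weights")
```

Under these assumptions, a 100-billion-parameter model needs far more memory than a single typical consumer GPU offers, while a model with about 1 billion parameters fits comfortably.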
The intrigue: For the AI neural networks that fueled the latest AI wave, bigger has generally meant better — larger models trained by using more data seem to perform better.
- But, "[i]t's often the case that you can create a model that is a lot smaller that can do one thing really well," says Graham Neubig, a computer science professor at Carnegie Mellon University. "It doesn't need to do everything."
- Using an LLM for some tasks is like "using a supercomputer to play Frogger," writes Matt Casey at Snorkel.
How it works: Researchers are trying to shrink models so they have fewer parameters but still perform well on specialized tasks.
- One approach is "knowledge distillation," which involves using a larger "teacher" model to train a smaller "student" model. Rather than learn from the dataset used to train the teacher, the student mimics the teacher.
- In one experiment, Neubig and his collaborators created a model 700 times smaller than a GPT model and found that the smaller model outperformed the larger one on three natural language processing tasks, he says.
- Microsoft researchers recently reported being able to distill a GPT model down to a smaller one with just over 1 billion parameters. It can perform some tasks on par with larger models, and the researchers are continuing to hone it.
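To make the teacher-student idea concrete, here is a minimal sketch of one common formulation of knowledge distillation: the student is nudged to match the teacher's softened output probabilities rather than the original training labels. It is a toy illustration in plain Python, not the method used by Neubig's group or Microsoft. The function names, the temperature, the learning rate and the example logits are all assumptions made for this sketch, and the "student" here is just a vector of scores instead of a real neural network.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature softens them."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's soft targets to the student's outputs."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_step(teacher_logits, student_logits, temperature=2.0, lr=0.5):
    """One gradient-descent step on the student's logits.

    The gradient of the KL loss with respect to student logit i
    is (q_i - p_i) / temperature.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return [z - lr * (qi - pi) / temperature
            for z, qi, pi in zip(student_logits, q, p)]

# Hypothetical example: the teacher is confident in the first of three classes,
# and the student starts out with no preference.
teacher = [4.0, 1.0, 0.5]
student = [0.0, 0.0, 0.0]

initial_loss = distillation_loss(teacher, student)
for _ in range(200):
    student = distill_step(teacher, student)
final_loss = distillation_loss(teacher, student)

print("loss before distillation:", initial_loss)
print("loss after distillation: ", final_loss)
```

In a real system, the same loss would be averaged over many prompts and used to update the weights of a much smaller network, which is where the savings in parameters come from.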
Yes, but: Across a wider range of tasks, the student may perform, at best, only as well as its teacher (though that isn't always the case).
- Student models can mimic their teachers, but some research shows they don't necessarily match them. "There is a long list of tasks — the more rare tasks — where it is still not as good," says Sara Hooker, who leads Cohere for AI, Cohere's nonprofit AI research lab.
- "There's a lot we don't know — how do we make sure that the data that we get from a large model was diverse enough to cover all of these tasks?" she says.
Between the lines: AI researchers have been focused on models more than data, Hooker says.
- But they've now "come to the conclusion that data matters again."
What to watch: Distillation is something of a legal gray area.
- For example, some terms of service forbid creating a model that competes with the foundation model. And it may be unclear how a competing model is defined.
- Many models that showcase distillation are built in academia but sample from proprietary models, which restricts what they can release, Hooker says.
- The White House executive order on AI issued in October contains a requirement for reporting a model if it passes a threshold for the amount of computing power it requires — a proxy for its size.
- Some AI experts predict that distillation techniques will take on a bigger role in 2024 at companies looking to deploy AI for specific tasks.
The bottom line: "This new wave of research ... is rogue in the sense that it's addressing and kind of threatening a trend, which has been pretty much over the last decade that we just got bigger and bigger and bigger," Hooker says.
- "Can we get away with something smaller? Do we need models to be big? And that's why it's so exciting."