Large language models learn to speak biology
AI systems that have already made strides learning the language of humans are being trained to decipher the language of life encoded in DNA — and to use it to try to design new molecules.
Why it matters: AI that can make sense of biology's information could help scientists to develop new therapeutics and to engineer cells to produce biofuels, materials, medicines and other products.
Background: Scientists have for decades worked to reverse engineer cells in order to design new proteins and improve molecules found in nature, increasingly with the help of computational tools.
- Other researchers have scoured Earth for as-yet-undiscovered compounds made by bacteria, fungi, plants and other organisms that could be useful for particular purposes. Both approaches have yielded new cancer therapeutics and other products.
- "But at some point, we run out of low-hanging fruit to pick," says Kyunghyun Cho, a professor of computer science and data science at New York University and senior director of Frontier Research at Prescient Design, which is part of Genentech.
Now, generative AI models — similar to the large language model (LLM) that powers ChatGPT — are being developed to understand the rules and relationships of DNA, RNA and proteins, and the many functions and properties they produce.
How it works: Humans arrange the 26 letters of the modern English alphabet into roughly — and arguably — 500,000 words.
- LLMs are given text that they then split into characters, words or subwords, known as tokens.
- The AI model then determines the relationships among these tokens and uses that information to generate original text.
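The split-into-tokens step can be sketched with a toy word-level tokenizer. This is a minimal illustration only: real LLMs use subword schemes such as byte-pair encoding, and the function and vocabulary here are hypothetical, not any production model's actual code.

```python
# Toy word-level tokenizer: maps text to the integer ids an LLM actually models.
# Real LLMs split text into subwords, but the id-lookup idea is the same.

def tokenize(text, vocab):
    """Split text on whitespace and assign each new word the next free id."""
    return [vocab.setdefault(word, len(vocab)) for word in text.lower().split()]

vocab = {}
ids = tokenize("the model reads the text", vocab)
print(ids)    # [0, 1, 2, 0, 3] — repeated words share one id
print(vocab)  # {'the': 0, 'model': 1, 'reads': 2, 'text': 3}
```

The model never sees words directly, only these ids and the statistical relationships among them.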
The language of biology contains far fewer letters but produces many more "words" in the form of proteins.
- The genetic information carried in DNA is encoded in four molecules: A (adenine), C (cytosine), T (thymine) and G (guanine).
- Three-letter combinations of these four bases, called codons, specify 20 different amino acids, some or all of which are strung together in different orders to make up proteins.
- There are more than 200 million known proteins. AlphaFold, an AI system developed by DeepMind, can predict the structure of a protein from its amino acid sequence — one of biology's biggest and most time-consuming challenges.
- But many orders of magnitude more proteins are theoretically possible.
- That leaves a vast space to explore for scientists who want to develop new proteins that have the properties they want for a novel drug or to engineer cells to perform different tasks.
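The codon-to-amino-acid mapping described above can be sketched in a few lines of Python. This uses only a small fragment of the standard genetic code (the full table assigns all 64 codons), so it is an illustration, not a working bioinformatics tool:

```python
# Translate a DNA coding sequence into amino acids, one three-letter codon
# at a time. Only 4 of the standard 64 codon assignments are shown.
CODON_TABLE = {
    "ATG": "M",  # methionine, the usual start codon
    "GCT": "A",  # alanine
    "TGG": "W",  # tryptophan
    "TAA": "*",  # stop codon: translation ends here
}

def translate(dna):
    protein = []
    for i in range(0, len(dna) - 2, 3):  # step through the sequence codon by codon
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":                    # stop codon terminates the protein
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGGCTTGGTAA"))  # -> "MAW"
```

Because each position in a protein can hold any of 20 amino acids, even a short protein of 100 amino acids has 20^100 possible sequences — the combinatorial explosion behind the "vast space" scientists want AI to explore.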
What's happening: AI models are being used to map that space to identify changes in DNA or RNA that underpin disease or alter key processes in a cell — and to use that information to design new proteins. But scientists doing that face several hurdles.
- They must figure out the best way to break biology's language down into tokens that the LLM can work with.
- They must ensure the AI is able to see the relationships between genes and elements of genes that affect one another from different places in a long stretch of DNA, says Joshua Dunn, a molecular and computational biologist at Ginkgo Bioworks, which uses AI to drive some of its gene designs. It's like having to pull sentences from different parts of a book to understand its meaning.
- Another consideration is that if you read DNA from different starting points, you can wind up with different proteins — if you start mid-sentence, you get a different story than if you start at the sentence's beginning.
- And while most proteins are encoded in the standard genetic code, others are translated by different "readers" in cells. "That means there are a whole lot of different languages being spoken at the same time," Dunn says.
Dunn says he is "extremely optimistic that large language models are going to figure out some of this because they're actually very good at understanding different scales of meanings spoken in different languages."
- But there are open questions about how to tokenize genetic data to capture other information. For example, a model has to look at a wide enough span of information to capture signals spread across a chromosome — but without losing valuable details about single-letter mutations and the changes they cause. Standard tokenization may not work for this, or may need to be adapted, Dunn says.
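The reading-frame and tokenization problems above can be illustrated together: splitting the same DNA into fixed-length chunks (k-mers, one common tokenization choice for genomic models) from different starting offsets produces entirely different token sequences. A sketch, with a hypothetical helper function:

```python
# Split a DNA string into non-overlapping k-mers starting at a given offset.
# Shifting the offset by a single base changes every token — the reading-frame
# problem that genomic tokenizers have to contend with.
def kmers(dna, k=3, offset=0):
    return [dna[i:i + k] for i in range(offset, len(dna) - k + 1, k)]

seq = "ATGGCTTGG"
print(kmers(seq, offset=0))  # ['ATG', 'GCT', 'TGG']
print(kmers(seq, offset=1))  # ['TGG', 'CTT'] — a completely different "sentence"
```

One-base resolution (treating each letter as its own token, as some genomic foundation models do) sidesteps the frame problem but produces much longer sequences for the model to handle — the trade-off Dunn alludes to.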
Where it stands: It's early days for AI foundation models in biology, but companies such as Profluent Bio and Inceptive, along with academic groups, are developing models to decipher the language of DNA and design new proteins.
- HyenaDNA, a "genomic foundation model" developed by researchers at Stanford University, learns how DNA sequences are distributed, how genes are encoded, and how the regions between amino acid-coding stretches regulate a gene's expression.
Yes, but: As with LLMs, there are concerns that training data may be biased by where biological samples are collected, says Vaneet Aggarwal, a computer scientist and professor at Purdue University who has worked on AI models to understand the language of DNA.
What's next: Spewing out novel molecules from generative models is only a first step — and not necessarily the biggest hurdle, Cho says.
- Candidate molecules have to go through several more phases of development to filter out the most promising ones for experimental testing in the lab, he says.
The bottom line: LLMs that handle human language are "speeding up what we already know how to do," Cho says — but with biology, "we're trying to figure out something we've never figured out ourselves." That means "the burden of validation is ... enormous."