The Hidden “Grammar” Revealed by CodonFM, our Foundation Models for Codons
A conversation with Hani Goodarzi and Laksshman Sundaram
Cells build proteins by reading messenger RNA three nucleotides at a time. Each three-letter codon specifies which amino acid to add next to the growing protein chain. But with 64 possible codons and only 20 amino acids, the code is naturally redundant, with multiple different codons encoding the same amino acid. For decades these synonymous codons were assumed to be interchangeable.
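The redundancy is easy to see by enumerating the code itself. The sketch below hardcodes a small, illustrative subset of the standard genetic code (RNA codons); the mapping shown is factual, but the snippet is only a toy illustration, not anything from CodonFM:

```python
from itertools import product

# A few entries from the standard genetic code, enough to show the redundancy.
CODON_TO_AA = {
    "UUA": "Leu", "UUG": "Leu", "CUU": "Leu",
    "CUC": "Leu", "CUA": "Leu", "CUG": "Leu",
    "AUG": "Met",  # the sole methionine codon, also the start codon
    "UGG": "Trp",  # the sole tryptophan codon
}

# 4 bases taken 3 at a time: 64 possible codons for only 20 amino acids (+ stop).
all_codons = ["".join(p) for p in product("ACGU", repeat=3)]

def synonymous_codons(aa):
    """Return every codon in the table above that encodes the given amino acid."""
    return [c for c, a in CODON_TO_AA.items() if a == aa]

print(len(all_codons))            # 64
print(synonymous_codons("Leu"))   # six interchangeable spellings of leucine
print(synonymous_codons("Met"))   # ['AUG'] -- no synonymous choice at all
```

Leucine alone has six synonymous spellings, while methionine has exactly one, which is why "interchangeable" was the natural assumption for so long.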
Certain synonymous codons appear far more often than others, and the pattern isn't random. Genes with similar functions use similar codons, and different tissues favor different synonymous variants. The challenge has been understanding why this bias exists and whether swapping one codon for another changes cellular behavior.
A collaboration between Arc Institute and NVIDIA has now developed CodonFM, a family of open-source AI models that reveal the intricate grammar underlying codon choice. The work demonstrates that codon usage follows predictable, context-dependent patterns that affect gene regulation, protein abundance, and cellular function. The public release of the models was announced today at NVIDIA GTC in Washington, D.C., with a preprint and an accompanying technical blog.
For CodonFM, the research teams trained two complementary model architectures on over 130 million protein-coding sequences from more than 20,000 species, processing hundreds of billions of tokens. The first model, Encodon, released today, analyzes sequences bidirectionally, capturing both upstream and downstream dependencies within genes. The second model, Decodon, to be released at a later date, generates sequences autoregressively, enabling directed RNA design. Together, these models outperform existing approaches, achieving multi-fold gains across multiple benchmarks and revealing the role of codon choices in clinical genetics and therapeutic design.
In this technical discussion, two of the lead scientists explore the innovations behind CodonFM, what surprised them about codon grammar, and the implications for both basic research and therapeutic development. The conversation includes:
- Hani Goodarzi (X: @genophoria), Core Investigator at Arc Institute and Associate Professor at University of California, San Francisco, whose lab focuses on computational cancer and RNA biology, and;
- Laksshman Sundaram, Director of Applied Research at NVIDIA, whose team focuses on the development of foundation models for various biological systems.
Codon bias has been studied for decades. What made this problem solvable with AI?
Hani Goodarzi: Genome language models are trained on vast amounts of genomic data, but encounter relatively few coding sequences and have limited understanding of codons. If a model can learn that "in this position within this coding sequence, I would expect codon A versus codon B," that itself means there's an underlying grammar, a language to it that the model is learning.
As for timing, it's one of those interesting stories of scaling laws in AI. We first attempted this work four years ago, using just 2 GPUs in the lab and a 70-90 million parameter model built on early versions of the BERT architecture. At the time, performance at predicting synonymous codons was limited: we could do some of it, but the signal wasn't strong enough to show clearly that the model was learning. So we initially thought maybe we were just wrong.
When my lab moved to Arc, we had more resources in terms of compute, and thought we should have another go at this. We tried larger models—600 million parameters, then 1 billion—and we actually saw that, especially with 600 and then certainly with 1 billion, we could do a pretty good job of distinguishing synonymous codons.
We released a version of these models last year, but saw there was a lot more to do. That's when we joined forces with Laksshman and his amazing team. They helped make the models more efficient, retrain them, and train even larger models to see if we've topped out the scaling laws. They just keep sending me more and more impressive results.
Laksshman Sundaram: My team is very interested in finding problems that have enough data where scale matters in biology today. Here we have around 130 million coding RNA sequences from 22,000 different species, comprising hundreds of billions of tokens. That's a rich dataset to explore with AI. I know Hani from my graduate school days when he was like a mentor to me, so naturally we had a lot of engaging scientific discussions. We converged on seeing if we can push the boundaries of understanding rules of codon optimization, both for clinical diagnostics and for better mRNA designs.
Can you explain the two architectures that make up CodonFM and why you built both?
Hani Goodarzi: The CodonFM family uses two approaches suited for different applications, both trained at varying sizes and scaled to billions of parameters:
- The Encodon models use a BERT-style architecture, a bidirectional encoder that processes the entire sequence at once. They were pretrained using a masked language modeling objective, where some codons are randomly masked and the model must predict them. This allows the model to learn the contextual interplay of codons. These models excel at tasks like predicting the functional effects of mutations, where you need to understand the full sequence context. We have two variants of these models based on how we mask the codons, and they lead to different learning outcomes for the model.
- The Decodon models use an autoregressive GPT-style architecture, a unidirectional decoder that predicts the next token in the sequence. They were pretrained with a causal language modeling objective, learning to predict the next codon based on the preceding sequence. This architecture is designed for generating new sequences, such as optimizing codon usage for therapeutics. We are currently planning to release the Decodon models later this year.
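The difference between the two pretraining objectives can be sketched in a few lines. Everything below is an illustrative toy in plain Python, not the CodonFM API; the `[MASK]` token and the exact masking scheme are assumptions for illustration (the preprint describes the actual variants):

```python
# A toy codon sequence: Met-Ala-Lys-Leu-Stop.
tokens = ["AUG", "GCU", "AAG", "CUG", "UAA"]

# Encodon-style (masked / BERT): hide a codon and predict it from BOTH sides.
masked_input = tokens[:2] + ["[MASK]"] + tokens[3:]
masked_target = {2: tokens[2]}  # the label the model is trained to recover

# Decodon-style (causal / GPT): at each position, predict the next codon
# from the prefix only -- the form needed for left-to-right generation.
causal_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

print(masked_input)      # ['AUG', 'GCU', '[MASK]', 'CUG', 'UAA']
print(causal_pairs[0])   # (['AUG'], 'GCU')
```

The bidirectional view suits scoring an existing sequence in full context; the prefix-only view is what lets a model emit new codons one at a time.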
What surprised you most about what the models learned?
Hani Goodarzi: A clear grammar emerging from our ability to predict which synonymous codons will be selected. If codon choice were truly random, the model should not be able to predict specific codons at all. However, the larger models appear to capture long-range dependencies between codons, revealing links between codon choices, pathogenicity, and protein functionality. We're now validating some of these findings experimentally, but it's clear that there is a language to codon composition that we have collectively ignored.
I've been interested in this topic for a long time. As a graduate student, I collaborated with a friend of mine, Hamed Najafabadi, who's now faculty at McGill, and we showed that genes that have similar functions or belong to similar pathways have similar codon usage patterns. That observation motivated my work on the role of tRNA in gene regulation and on codon usage itself. The hypothesis that I and many others in the field have been pursuing is that codon usage itself acts as a regulatory mechanism, one we haven't yet truly understood or uncovered.
Laksshman Sundaram: I was most surprised that the model trained only to learn the underlying grammar of codon usage could infer clinical effects of synonymous variants. Predicting the impact of synonymous variants has traditionally been hard because these mutations don’t change the protein sequence, and their effects on gene translation are often too subtle for models to capture. When we tested CodonFM on this task, we were impressed that the model was highly performant even in zero-shot settings where the model had no additional training for variant prediction.
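One common way masked models are used for zero-shot variant scoring is a log-likelihood ratio: mask the position, compare the probability the model assigns to the alternate codon versus the reference. The sketch below assumes that style of scoring rule; the function name and the probabilities are hypothetical, and the preprint describes CodonFM's actual evaluation:

```python
import math

def score_synonymous_variant(codon_probs, ref_codon, alt_codon):
    """Zero-shot variant score as a log-likelihood ratio: how much less
    'expected' the alternate codon is than the reference at a masked site.
    codon_probs: dict mapping codon -> model probability at that position."""
    return math.log(codon_probs[alt_codon] / codon_probs[ref_codon])

# Hypothetical probabilities a masked model might assign at one position,
# where CUG, CUA, and CUC all encode leucine.
probs = {"CUG": 0.62, "CUA": 0.03, "CUC": 0.20}
llr = score_synonymous_variant(probs, ref_codon="CUG", alt_codon="CUA")
print(llr)  # strongly negative: the swap is disfavored in this context
```

The protein sequence is identical either way; the signal comes entirely from how surprising the synonymous spelling is to the model.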
This is also the first time we are seeing evidence of scaling laws linking model parameters to performance on downstream mRNA sequence tasks. We find that higher-parameter models capture the codon usage rules and grammar much better than lower-parameter models, and we see this improvement quite consistently across multiple benchmarks.
What were the biggest biological and engineering challenges?
Laksshman Sundaram: Too many! At least 80% of the project has been making sure things are engineered correctly and that these large models train efficiently. All the credit really goes to Sajad Darabi and Fan Cao on my team, and to Hani's graduate student Mohsen Naghipourfar from UC Berkeley, co-mentored by Mohammad Mofrad.
On top of that is setting up careful demonstrations and evaluations of your models. The biological data is quite intricate and complex. You can't just track the training loss and be done. You have to do a careful dissection of what the model is really learning and how that knowledge manifests itself in practical applications.
One example is wrangling RNA design datasets. More GC content might be better for RNA properties like stability and translational efficiency, but the models might overfit on some of those features. How do you carefully disentangle both and then also bring those signals back in, because ultimately they are also useful biological signals? Those have been quite challenging in this case.
What are the practical applications for CodonFM?
Hani Goodarzi: This is going to be an important tool in our kit. I can even envision a not-so-distant future in which we use CodonFM alongside protein language models in certain sequence optimization tasks, adding a layer of codon-aware insight that those models miss.
Codon choice has long been known to be important. For example, expressing a human protein in E. coli for purification requires adapting the sequence to match the bacterial system for efficient translation. But there's much more to it than that. The human genome has an overall pattern of codon preferences, but there's significant variation around that mean, and that variation contains information about protein function and tissue type.
What this means is you can optimize codon choice using these models without changing the protein sequence. Protein language models aren’t useful here because they can't distinguish between synonymous codons. Although theoretically DNA models could learn this grammar, in practice they can't either. Only codon-level models can capture the value of different synonymous codon choices in different positions.
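The size of that design space is worth making concrete. For even a tiny peptide, the number of synonymous coding sequences multiplies quickly; the toy below uses real synonymous codon sets from the standard genetic code, but is only an illustration of the search space a codon-level model navigates, not CodonFM's method:

```python
from itertools import product

# Synonymous codon sets for a toy 3-residue peptide (standard genetic code).
SYNONYMS = {
    "Met": ["AUG"],
    "Leu": ["UUA", "UUG", "CUU", "CUC", "CUA", "CUG"],
    "Lys": ["AAA", "AAG"],
}
peptide = ["Met", "Leu", "Lys"]

# Every coding sequence that yields the exact same protein: 1 * 6 * 2 = 12.
recodings = ["".join(codons) for codons in product(*(SYNONYMS[aa] for aa in peptide))]
print(len(recodings))  # 12 synonymous sequences for one tiny peptide
```

For a protein hundreds of residues long, the count is astronomical, which is why a learned, position-aware model of codon preference matters: it scores choices in context rather than enumerating them.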
This matters especially for mRNA therapeutics. RNA is transient: it enters a cell, facilitates protein production, and then degrades. The longer it sticks around and the more effectively it's translated, the longer the protein payload is there. For vaccines, you may achieve stronger or more durable expression, which can enhance vaccine performance.
But you can go even further to optimize for cell type-specific expression. Different cell types have distinct codon preferences driven by their tRNA landscapes. In a 2016 paper in Cell, we showed you could optimize the same protein sequence with two different coding sequences geared towards different cell types, achieving many-fold expression differences based on these tRNA patterns.
We don't have large datasets of tRNAs across every cell type, but codon-level foundation models capture all that information implicitly through learned codon preferences. You can fine-tune the foundation model on specific applications to get coding sequences optimized for your cell type of interest.
Can other labs or organizations start using this?
Laksshman Sundaram: Our team believes very strongly in supporting open science. We're publicly releasing foundation models, free to use for both research and commercial purposes, for use cases across multiple industries. We're excited to see how the research community extends this work.
Hani Goodarzi: Academia is the source of innovation primarily because of researchers' ability to interact and broadly publish findings as quickly as possible. NVIDIA is setting a great example by releasing open models or datasets that could be of value to everyone. This model probably wouldn't exist without NVIDIA, and the fact that they're willing to release it is amazing for researchers.
Laksshman Sundaram: Most of the training data came from public sources, so this model builds on the openness of the ecosystem. Many of the RNA design benchmarks we used originated in academic labs that shared their optimized codon sequences. CodonFM reflects that openness and extends it forward to enable the next generation of open research.
What would a CodonFM 2.0 look like?
Hani Goodarzi: This is all coming right off the press, but I think there are a lot of corners of biology that are becoming more machine-learnable with self-supervised modeling approaches. We focused on coding sequences here, but RNAs also have noncoding parts with regulatory roles. Folks have been working on models to capture those, but there's clearly more room to explore and we’re just beginning to scratch the surface.
Related Links:
- Preprint: "Learning the Language of Codon Translation with CodonFM"
- Technical blog: "Introducing the CodonFM Open Model for RNA Design and Analysis"
- These releases extend collaborations with partners, advancing open biological modeling through efforts like CodonFM and building a unified ecosystem of open models within NVIDIA Clara. See more about these models at: https://github.com/nvidia/clara.