Yusuf Roohani, Arc’s Associate Director of Machine Learning, on designing the architecture behind a virtual cell
Yusuf Roohani spent his PhD making machine learning models to predict how cells respond to genetic perturbations. Long before the term “virtual cell” was in vogue, Roohani was building tools to simulate and steer cellular behavior in industry and then in academia.
Now, as Associate Director of Machine Learning at Arc Institute, Roohani is helping build large-scale models trained on single-cell perturbation data. His group is focused not only on predicting gene expression outcomes, but on devising broadly useful tools that close the gap between computational and wet-lab experiments.
Roohani (X: @yusufroohani) shares his perspective on the current progress towards a virtual cell, the architecture behind Arc’s model State, and what success actually looks like.
***
Your background is in computation, and you later pivoted into biology. Why?
What attracted me to biology is that it forces you to be creative not just in building tools, but in formulating the right problems. In machine learning, the work is often about picking the best architecture for a known problem; in biology, you have to think deeply about how to frame a problem as a computable question in the first place. I find that challenge really interesting.
Of course, there’s a broader motivation, too, to make an impact on diseases or health. But for me, the main draw was that it’s technically compelling and there is a lot of open research territory. Maybe I got lucky, but I moved into this space right as more machine learning researchers began to recognize that biology offers real, unsolved problems.
My first industry job was at a company called Merrimack in Cambridge, where we tried to model cellular networks to predict which patients might be resistant to specific drugs. That is what introduced me to systems biology and transcriptional networks. Later, at GSK, I worked on high-throughput screening, but from an imaging perspective. We’d perturb cells—either genetically or with drugs—and then quantify phenotypic changes. That work helped me think more systematically about perturbation experiments and how to use them to model cellular behavior.
How did your approach to cell modeling evolve during your PhD?
Around the time I began my PhD at Stanford, I saw a paper from the Weissman Lab that had generated combinatorial Perturb-seq data, and it was easy to see the potential this area offered from a computational perspective. I began to wonder how I could use those datasets to make useful computational models that predict the outcome of untested perturbations. So I built a deep learning model, called GEARS, which predicts transcriptional responses to both single and multigene perturbations using single-cell RNA-sequencing data.
I’ve always been most interested, though, in models that don’t only make predictions, but can actively guide experiments. I want to create things that help close the loop between computation and wet-lab work. We were starting to build toward that near the end of my PhD, and then Arc reached out. They were thinking along exactly the same lines: using predictive models to simulate cell behavior and, eventually, to steer real-world experiments. The alignment was almost uncanny.
State predicts how a cell’s gene expression patterns will change in response to a single chemical or genetic perturbation. Do you think this model can eventually learn to generalize to combinations of perturbations?
I don’t think it will happen automatically. Combinatorial perturbations are incredibly valuable datasets, but they’re also much harder to design. You’re dealing with a space of potentially 20,000-by-20,000 gene combinations, and most of those won’t yield interesting interaction phenotypes. It’s something we definitely plan to explore in the future, though.
That said, I do think the system we’re building now will make it easier to incorporate combinatorial data later. State is designed to learn a general understanding of gene regulatory architecture, like how genes relate to one another across different cell types, cell lines, and conditions. We’re already seeing signs that this is working. In our recent paper, for example, we pre-trained the model on the Tahoe 100M drug perturbation dataset and then adapted it to a different dataset composed of genetic perturbations. There was no overlap in the perturbations between the two datasets, but the model was still able to learn gene-gene relationships and broader regulatory patterns that transferred, even though the modalities were entirely different.
That kind of transfer learning is promising. So while the model won’t predict combinatorial effects “for free,” it will hopefully require far less new data to get there later on.
You’ve mentioned that you want to close the loop between computational and wet-lab biology. How will you know if a virtual cell has been able to do that?
There are basically two main layers that I think about with AI models. The first layer is their ability to represent biology: taking large-scale biological data and modeling systems or cell states in a way that helps us understand how biology works.
The second layer—where things are starting to get more interesting—is using those models to actually design experiments. Right now, our model, State, is mostly focused on the first layer: creating a better representation of cellular state and behavior. And in that setting, even if a human is still the one interpreting the model and deciding what to test next, the model is already helping to surface meaningful possibilities.
But going forward, I’d like to see much more investigation into what these AI models have actually learned so we can translate that into experimental plans. That’s actually a non-trivial step. Just because a model can predict gene expression changes doesn’t mean it can tell you what perturbation to do next. There’s still a gap between predictive modeling and experimental design.
The next version of our AI model of cell state is still focused on scaling up along the context axis; we’re training across broader biological conditions, improving architectures, and making the models more robust. But ultimately, the goal is to connect those representations with tools that can reason over them and help scientists actually decide what to do next.
A long-standing challenge with building predictive models in biology is that the underlying data is not standardized. How does State account for that?
There are two types of heterogeneity in biological data. First, there’s variation within a single experiment, or between the cells in a population. When we perform a perturbation experiment, we destroy the cells in the process of measuring them, so we never see the same cell before and after. Instead, we end up with two noisy populations—one perturbed, one unperturbed—and try to infer what happened in between. State is designed to reason over that population-level variability, rather than trying to treat each cell in isolation.
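That "two noisy populations" setup can be made concrete with a toy example. The code below is only an illustration of the inference problem, not Arc's method: it samples independent control and perturbed populations (no cell is observed twice) and recovers a hypothetical per-gene effect by comparing the two distributions at the population level.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: expression of 5 genes in two independently sampled,
# noisy cell populations -- we never observe the same cell before
# and after perturbation, only the two populations.
n_cells, n_genes = 200, 5
control = rng.normal(loc=1.0, scale=0.5, size=(n_cells, n_genes))

true_shift = np.array([0.0, 0.8, -0.5, 0.0, 0.3])  # hypothetical effect
perturbed = rng.normal(loc=1.0 + true_shift, scale=0.5,
                       size=(n_cells, n_genes))

# The simplest population-level estimate of the perturbation effect:
# compare mean expression between the two populations ("pseudobulk").
estimated_shift = perturbed.mean(axis=0) - control.mean(axis=0)
print(np.round(estimated_shift, 2))
```

With enough cells per condition, the population means recover the underlying effect despite per-cell noise; models like State go further by reasoning over the full distributions rather than just their means.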
The second kind of heterogeneity arises across experiments. For example, someone in the United States might use one protocol to collect Perturb-seq data, while a lab in Singapore uses another. Even if they're studying the same cell type, the batch effects and experimental artifacts can be substantial. That makes it hard for models to generalize across datasets. So we wanted an architecture that could address both of these challenges.
So with State, instead of feeding individual cells into the model one at a time, we feed in entire populations. The model sees all the cells from a given condition at once, and it’s free to learn patterns from the full distribution. It uses attention across the population to infer the effects of a perturbation. That sounds obvious in hindsight, but it turns out that most models were still being trained on single cells in isolation.
The second idea was to move away from raw gene expression counts, which are highly sensitive to technical noise, and instead use embeddings, or machine-learned representations of each cell that are generated by a separate foundation model. The idea is that by pretraining on large, diverse datasets, we can build more robust cell representations that retain biological signals while filtering out some of the experimental variability. Together, these two ideas allow State to handle heterogeneity better than earlier approaches.
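The combination of those two ideas can be sketched as self-attention over a set of cell embeddings. This is a toy illustration only, with random rather than learned projections; `population_attention` and its dimensions are hypothetical and not State's actual architecture.

```python
import numpy as np

def population_attention(cell_embeddings, d_k=16):
    """Toy self-attention across a population of cells.

    cell_embeddings: (n_cells, d) array of per-cell embeddings
    (e.g. produced by a separate pretrained foundation model).
    Returns an updated (n_cells, d_k) array in which each cell's
    representation mixes in information from the whole population.
    """
    n, d = cell_embeddings.shape
    rng = np.random.default_rng(0)
    # Hypothetical learned projections; random here for illustration.
    W_q, W_k, W_v = (rng.normal(size=(d, d_k)) / np.sqrt(d) for _ in range(3))
    Q, K, V = cell_embeddings @ W_q, cell_embeddings @ W_k, cell_embeddings @ W_v

    scores = Q @ K.T / np.sqrt(d_k)  # (n, n): cell-to-cell attention scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over the population
    return weights @ V

# A "population" of 64 cells, each with a 16-dimensional embedding.
cells = np.random.default_rng(1).normal(size=(64, 16))
updated = population_attention(cells)
print(updated.shape)  # (64, 16)
```

The key design point the interview describes survives even in this sketch: every cell's output depends on the entire condition's distribution, not on that cell alone, and the inputs are embeddings rather than raw counts.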
Do you think it will take decades until we have a clearly useful virtual cell model?
I’m not sure about the exact timeline, but I do think this is going to be a long process. We’re still at the stage of figuring out what the right problems and metrics are for a virtual cell.
Also, useful models don’t need to have perfect predictive accuracy. We’re not aiming to build a complete simulator of the cell. But if these models can narrow the search space of potential wet-lab experiments even a little, or help scientists decide which genes to perturb or which cell types to study, then that alone can have a big impact.
