Arc Core Investigator, Hani Goodarzi, on his decades-long quest to build a virtual cell

Hani Goodarzi, Arc Institute Core Investigator

Hani Goodarzi, a Core Investigator at Arc Institute since 2023, has been trying to build virtual cells for more than two decades. His fascination with computational biology and AI dates back to his graduate school days, when he first tried (and failed) to build a working virtual cell model by hard-coding individual components, such as metabolic networks and gene regulatory networks.

Today at Arc Institute, and as an Associate Professor at the University of California, San Francisco, Goodarzi is using everything he's learned—alongside advances in machine learning and high-throughput functional genomics—to finally bring this long-held dream to fruition.

Goodarzi (X: @genophoria) shares his group's work on Arc's virtual cell efforts and his own career journey below.

***

Why should cell biologists working experimentally, especially those without a computational background, care about a virtual cell?

Our virtual cell model, State, is all about perturbations. Given the gene expression profile of a single cell, it predicts how that expression will change after a perturbation, such as a drug treatment or a genetic mutation.

Cell biology is ultimately about perturbations, too. My laboratory is constantly manipulating cells; we mutate genes, we overexpress a gene or knock down its expression, we expose cells to drugs, right? That's how you understand function and causality. Without these perturbations, research in biology would just be correlation.

A good virtual cell model, then, ought to be precise enough that it removes the need to go into the lab and run actual experiments, at least until you have identified a strong candidate therapy. And that is why scientists should care.

The utility of a virtual cell might be similar to what we've seen in protein folding models, such as AlphaFold. For those models, you basically put in an amino acid sequence and get a structure back out. Structural biologists can go into the laboratory and solve protein structures manually, of course, but AlphaFold has become so precise that it's easier to just use the model to predict protein structures instead.

This is similar to our goal with the virtual cell. Except instead of predicting the structure of proteins, we're predicting the gene expression profiles of cells.
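To make that input and output concrete, here is a minimal sketch in Python. The `VirtualCell` class, its `predict` method, and the perturbation labels are hypothetical placeholders rather than State's actual API; the point is only the shape of the problem: an expression profile and a perturbation go in, a predicted expression profile comes out.

```python
import numpy as np

# Hypothetical interface, not State's real API.
class VirtualCell:
    """Toy stand-in for a perturbation-response model."""

    def __init__(self, n_genes: int, seed: int = 0):
        self.n_genes = n_genes
        self.rng = np.random.default_rng(seed)
        self._effects = {}  # a real model would hold learned weights instead

    def predict(self, expression: np.ndarray, perturbation: str) -> np.ndarray:
        """Map one cell's expression profile to its predicted post-perturbation profile."""
        if perturbation not in self._effects:
            # Fake a per-gene effect so the data flow is runnable end to end.
            self._effects[perturbation] = self.rng.normal(0.0, 0.5, self.n_genes)
        return expression + self._effects[perturbation]

# One cell's log-normalized expression across 2,000 genes.
cell = np.random.default_rng(1).normal(2.0, 1.0, 2000)
model = VirtualCell(n_genes=2000)
predicted = model.predict(cell, perturbation="knockout:TP53")
print(predicted.shape)  # (2000,)
```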

Why does a cell's gene expression matter? What does that say about a cell's function?

It says a lot about a cell's function. If you asked a neuroscientist twenty years ago: "What is a neuron?" they would answer by describing its function and lineage. They would explain how the neuron fires in response to various signals, and so on. It used to be that scientists would check a cell's function by, say, engrafting it into a mouse and showing that it continues working.

But today, neurons—and all other cell types—are also defined according to their transcriptomic state. A neuroscientist today would identify a neuron simply by looking at its gene expression patterns.

So as cell biologists, we do still care about the function of cells, and how those functions change across development or in the course of disease. But we think that a virtual cell model that is able to understand and predict gene expression dynamics will also say something important about the cell's function.

Proteins, for example, have a structure-function relationship. If two proteins adopt the same structure—even if their sequences are different—then they probably perform the same function. AlphaFold is important because it says something about a protein's function by predicting its structure. Similarly, in cell biology, if two cells have the same gene expression patterns, then they probably have the same function, too.

How are the data being collected to train this virtual cell model?

Arc's virtual cell is trained on single-cell RNA-sequencing datasets of two types: observational and perturbational.

Observational data are just RNA-seq profiles collected from unperturbed cells. These data are not as useful as perturbational data. Imagine you had a library of books from 200 authors, and you wanted to use them to train a large language model. Well, after the model has been trained on thousands of words of Shakespeare, additional words from Shakespeare have diminishing returns. The same applies to a virtual cell: once a model has seen RNA-seq data from a million T cells, seeing RNA-seq data from one more T cell is not so valuable.

Each data point from a perturbed cell, though, says something fundamentally new about biology. Perturbing a cell—using a drug or genetic mutation—moves that cell into a new transcriptional "state." And in biology, perturbations are also how you prove causality. If I want to see whether gene A and gene B are related, then I'd probably need hundreds of data points from observational experiments to fit a line, right? But if I do one perturbation experiment, then in principle, a single data point is sufficient.
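The correlation-versus-causation point can be illustrated with a toy simulation (synthetic numbers, not Arc data): observational samples show that genes A and B move together but not which one drives the other, while interventional experiments settle the direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground truth, hidden from the analyst: gene A drives gene B.
def simulate(n, a_knockout=False, b_knockout=False):
    a = np.zeros(n) if a_knockout else rng.normal(5.0, 1.0, n)
    b = np.zeros(n) if b_knockout else 2.0 * a + rng.normal(0.0, 0.5, n)
    return a, b

# Observational data: A and B are strongly correlated,
# but the direction of causality is ambiguous.
a_obs, b_obs = simulate(500)
print("corr(A, B):", round(np.corrcoef(a_obs, b_obs)[0, 1], 3))

# Interventional data: knocking out A collapses B, but not vice versa.
_, b_after_a_ko = simulate(50, a_knockout=True)
a_after_b_ko, _ = simulate(50, b_knockout=True)
print("mean B after A knockout:", round(b_after_a_ko.mean(), 2))  # ~0
print("mean A after B knockout:", round(a_after_b_ko.mean(), 2))  # ~5
```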

So at Arc Institute, we are really focused on collecting huge perturbation datasets of single cells. That is how we are training the virtual cell model.

The way that we study cells today, though, requires that we first destroy the cells. We're not able to collect data on the same cell both before and after a perturbation. Instead, we perform RNA-seq on an unperturbed population of cells in parallel with the RNA-seq performed on perturbed cells. Each perturbation is done either by exposing cells to a drug or by using a CRISPR system to knock out, repress, or activate a gene inside the cells.
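In the single-cell ecosystem, a perturbation screen like this is commonly stored as an AnnData object: a cell-by-gene count matrix plus a per-cell label recording which perturbation (or control condition) each cell received. The tiny example below is only a sketch of that layout, with made-up labels and gene names rather than Arc's actual data schema.

```python
import anndata as ad
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_cells, n_genes = 6, 4  # tiny for illustration; real screens span millions of cells

# Rows are cells, columns are genes.
counts = rng.poisson(5, size=(n_cells, n_genes)).astype(np.float32)

# Each cell is labeled with the perturbation it received; "control" cells
# were profiled in parallel without any perturbation.
obs = pd.DataFrame(
    {
        "perturbation": [
            "control", "control",
            "CRISPRi:GENE_A", "CRISPRi:GENE_A",
            "drug:compound_X", "drug:compound_X",
        ]
    },
    index=[f"cell_{i}" for i in range(n_cells)],
)
var = pd.DataFrame(index=[f"gene_{j}" for j in range(n_genes)])

adata = ad.AnnData(X=counts, obs=obs, var=var)
print(adata)
print(adata.obs["perturbation"].value_counts())
```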

Separately, other investigators at Arc are working on ways to measure gene expression in cells without destroying them, such as by coaxing cells to export their RNA molecules and then sequencing those molecules over time. Although it's still in development, I'm excited about the promise of that technology as another tool in our toolkit.

How do you know if a virtual cell model is good or not? Presumably there are metrics to benchmark how accurate the model's predictions are…

Yes, at the same time we're building this virtual cell, we are also carefully building the benchmarks and evaluations needed to track our progress towards the goal. And as we make more progress, our benchmarks will evolve and become more nuanced.
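One simple evaluation in this spirit (a generic sketch, not Arc's official benchmark) is to compare the predicted effect of a perturbation with the measured one, where "effect" means the shift in expression relative to control cells. The numbers below are synthetic stand-ins.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_genes = 2000

# Synthetic stand-ins: mean expression in control cells, in measured
# perturbed cells, and in the model's predicted perturbed cells.
control = rng.normal(2.0, 1.0, n_genes)
measured = control + rng.normal(0.0, 0.5, n_genes)
predicted = measured + rng.normal(0.0, 0.2, n_genes)  # pretend model output

# Score the model on the *effect* of the perturbation, i.e. the change vs. control.
measured_effect = measured - control
predicted_effect = predicted - control
r, _ = pearsonr(predicted_effect, measured_effect)
print(f"Correlation of predicted vs. measured effect: {r:.2f}")
```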

Importantly, though, we don't only care about the predictive power of the virtual cell. We also care about what the model is learning. We're building a virtual cell so that we can later dissect it, using in silico methods, to understand what it has learned.

For example, if a cell has a piece of RNA that is regulating the expression of gene A, and the model predicts that the expression of that RNA changes when you remove a transcription factor binding site, then that's useful information! Without needing to go into a cell and physically delete that binding site, the model has learned that this piece of DNA matters.
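This kind of in silico dissection can be run as a virtual screen: perturb each candidate regulator inside the model and rank candidates by the predicted change in a gene of interest. The sketch below fakes the model with random effects just to show the pattern; a real screen would call a trained virtual cell instead, and the gene names are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
genes = [f"gene_{i}" for i in range(100)]
target = "gene_42"
t = genes.index(target)

def predict(expression, perturbation):
    # Placeholder for a trained model's prediction; here, random effects.
    return expression + rng.normal(0.0, 0.3, len(expression))

baseline = rng.normal(2.0, 1.0, len(genes))

# In silico screen: knock out each gene in the model and rank candidates
# by the predicted change in the target gene's expression.
scores = {}
for g in genes:
    if g == target:
        continue
    predicted = predict(baseline, perturbation=f"knockout:{g}")
    scores[g] = abs(predicted[t] - baseline[t])

top_candidates = sorted(scores, key=scores.get, reverse=True)[:5]
print("Top predicted regulators of", target, ":", top_candidates)
```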

A good model should be able to learn the first principles of biology and apply them to problems it has never seen before, generalizing at a level of complexity we can't reach on our own as humans. That's really the true test of a good AI model. The closer you get to that dream, the better.

So at some point, the scientific community will decide that a virtual cell is good or useful by using it consistently, and by trusting its predictions, in the same way we now trust structural predictions made by AlphaFold.

What does V2 or V3 of this virtual cell look like?

We're definitely at a GPT-1 sort of moment. But ultimately, what we want is a foundation model for cell biology. We're training this virtual cell on one type of data, but in the future this virtual cell will act as a foundation that can then be supplemented with other types of data.

A developmental biologist might fine-tune the base model, for example, to study how embryos grow. In general, I think the virtual cell will likely evolve in the same way that language models did. GPT-4 is a foundation model for language, but users are creating all kinds of tools on top of it. You can interact with the base model using chat—that's ChatGPT—or turn it into a reasoning model, or use it as a coding model.
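A fine-tuning workflow like the one described here typically looks something like the PyTorch sketch below: freeze the pre-trained encoder and train only a small task-specific head. The architecture, dimensions, and downstream task are hypothetical stand-ins, since no actual checkpoint is referenced here.

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained virtual cell encoder; in practice you would load
# a released foundation-model checkpoint instead of initializing from scratch.
n_genes, embed_dim, n_stages = 2000, 128, 6
base_encoder = nn.Sequential(
    nn.Linear(n_genes, 512), nn.ReLU(), nn.Linear(512, embed_dim)
)

# Freeze the foundation model; train only a small task-specific head,
# e.g. predicting a developmental stage from a cell's embedding.
for p in base_encoder.parameters():
    p.requires_grad = False
head = nn.Linear(embed_dim, n_stages)

optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One toy training step on random data, just to show the loop.
x = torch.randn(32, n_genes)            # batch of expression profiles
y = torch.randint(0, n_stages, (32,))   # developmental-stage labels
loss = loss_fn(head(base_encoder(x)), y)
loss.backward()
optimizer.step()
print(float(loss))
```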

After we build a foundation model, then, we can start incorporating multimodal data. But to be clear, just building that first foundation model will require tons of data. So for cell biology, we're starting with a type of data that we can collect at scale—this single-cell RNA-seq data—before moving into data modalities that are more difficult or expensive to collect.

When did you become interested in virtual cell models? And why?

I've been interested in virtual cell models for as long as I can remember. I wrote code for a virtual cell model as a graduate student, but it was never published and didn't really work. My approach didn't use neural networks; it was based on building the system from first principles and its known modules. This is despite the fact that I knew about neural networks at the time… I learned about them in the late 1990s, the last time neural networks were considered "cool." But those models had largely failed to deliver on their promise, and we had entered one of the famous "AI winters."

At the same time, when I entered my PhD in 2007, gene expression profiling using microarrays was becoming more popular. So I spent most of my time building statistical and information-theoretic tools for high-dimensional data analysis in cancer, aiming to reveal the underlying regulatory programs that drive pathological gene expression, and I delved into virtual cells a bit on the side.

By the end of my PhD, I was of the mind that machine learning was not getting traction in biology. Progress had plateaued and an AI winter persisted. It was clear, even then, that data would be a huge limitation for training useful models.

When I started my laboratory at UCSF in 2016, though, the circumstances had changed. Deep learning had re-emerged thanks, in part, to advances in compute and larger labeled datasets. PyTorch had just come out, for example, and it seemed like AI was coming back. Functional genomics tools, like CRISPR and single-cell sequencing, were also becoming more popular. I started to realize that collecting the data required to train a virtual cell—together with new deep learning architectures—might finally make this vision feasible. So everything in my career came full circle by that point, and I've been working on AI models that can simulate cellular dynamics ever since.