Arc's Virtual Cell Initiative

Complex diseases are difficult to understand and treat because they involve combinations of factors that interact to influence disease risk. Experimentally, it is highly challenging to capture the full extent of combinatorial possibilities across relevant cell types to fully understand shared mechanisms and pathways, a key step towards systematic target identification for therapeutics development.

Virtual cell models offer a path to accelerate the study and treatment of complex diseases by capturing the dynamic nature of cellular state across cell contexts. By learning how gene expression and cellular behavior shift in response to chemical, genetic, or environmental changes across cell types, virtual cell models can begin to predict how to nudge a cell from a diseased state to a healthy one.

Arc’s Virtual Cell Initiative (VCI) focuses on the full-stack development of an accurate virtual cell model to advance complex disease research. From generating rich perturbational training data at massive scale, to curating publicly available datasets for uniform processing, to evaluating and implementing new model architectures, Arc is building a model that aims to accelerate the identification of causal pathways and the nomination of new treatments.

Flagship Projects

Arc seeks to create momentum for building and evaluating virtual cell models in the scientific community. In addition to sharing our progress through regular releases of data and models, we host an annual competition to catalyze the field and spark a conversation about rigorous standards for assessing how well AI models simulate cellular behavior.

Arc Virtual Cell Atlas

A comprehensive collection of curated, high-quality, open single-cell datasets that form a foundational training dataset for virtual cell models. The Atlas’ foundation is scBaseCount, our agentic framework deployed to collate, curate, and re-process the entirety of publicly available single-cell datasets. Additional contributions include the Tahoe-100M dataset from Tahoe Therapeutics and the Virtual Cell Challenge’s perturbation benchmark datasets.

Latest scBaseCount release
Atlas launch announcement

STATE virtual cell model

Our first-generation virtual cell model, STATE, captures the cell-type-specific effects of genetic, chemical, and cytokine perturbations on the cellular transcriptome.

STATE was trained on observational data from 167 million cells and perturbational data from over 100 million cells across 70 human cell contexts. It combines a State Embedding (SE) module that creates an organized map of cell states, and a State Transition (ST) module that predicts how gene expression shifts in response to drugs or genetic changes.
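The two-stage idea described above (first map a cell to a state, then predict how that state shifts under a perturbation) can be illustrated with a toy sketch. This is not STATE's implementation or API: the `embed` function, the `ToyTransition` class, the linear projection, the additive shift, and all numbers are hypothetical stand-ins chosen only to show the data flow.

```python
from dataclasses import dataclass
from typing import Dict, List

Vector = List[float]


def embed(expression: Vector, projection: List[Vector]) -> Vector:
    """Toy stand-in for an embedding step: a fixed linear projection
    from gene-expression space to a lower-dimensional state space."""
    return [sum(w * x for w, x in zip(row, expression)) for row in projection]


@dataclass
class ToyTransition:
    """Toy stand-in for a transition step: one additive shift in
    state space per named perturbation (hypothetical simplification)."""
    shifts: Dict[str, Vector]

    def predict(self, state: Vector, perturbation: str) -> Vector:
        shift = self.shifts[perturbation]
        return [s + d for s, d in zip(state, shift)]


# Hypothetical 3-gene expression profile embedded into 2 dimensions.
projection = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]]
control_cell = [2.0, 4.0, 6.0]
model = ToyTransition(shifts={"drug_A": [-1.0, 0.5]})

state = embed(control_cell, projection)      # unperturbed cell state
predicted = model.predict(state, "drug_A")   # predicted state after perturbation
print(state, predicted)
```

In a real model both steps would be learned from data rather than fixed by hand; the sketch only separates "where is this cell in state space" from "where does the perturbation move it".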

Latest STATE release: Version 1.0
STATE launch announcement

Virtual Cell Challenge

An annual machine learning competition with a $100,000 grand prize. Participants are invited to build models that predict how gene expression shifts in response to perturbations.

Sponsored by NVIDIA, 10x Genomics, and Ultima Genomics, the competition convenes researchers from around the world with diverse areas of expertise around a shared challenge.

Inspired by the impact of CASP on the field of protein structure prediction, Arc releases new high-quality benchmark perturbation datasets each year and maintains a live leaderboard. The goal is to encourage teams to develop distinct solutions that push the field toward meaningful levels of accuracy on this challenging task over the coming years.

Virtual Cell Challenge website
Virtual Cell Challenge 2025 wrap-up: Winners and reflections (December 2025)
Behind the data of the Virtual Cell Challenge (August 2025)
Virtual Cell Challenge launch announcement (June 2025)
Virtual Cell Challenge: Toward a Turing test for the virtual cell (Cell, June 2025)

Open Roles

Arc is looking for talented individuals to contribute to this ambitious initiative. Learn more on our jobs page or sign up to receive bimonthly job alerts.