Arc Virtual Cell Atlas launches, combining data from over 300 million cells

Arc Institute today launched the Arc Virtual Cell Atlas, a growing resource for computation-ready single-cell measurements, starting with data from over 300 million cells. The initial release of the Atlas is Arc’s first step toward assembling, curating, and generating large-scale cellular data to fuel new insights from AI-driven biological discovery.
The Atlas debuts with two foundational datasets, both of which will be publicly available starting February 25, 2025. The first is Tahoe-100M, a new open-source perturbation dataset created by Vevo Therapeutics that comprises 100 million cells and maps 60,000 drug-cell interactions across 50 cancer cell lines. The second, scBaseCamp, is the first single-cell RNA sequencing dataset to be curated and reprocessed from public data at scale using AI agents. To build it, Arc mined observational data from more than 200 million cells, representing 21 species and sourced from public repositories, and processed the data into a standardized form.
"Single-cell sequencing has transformed our understanding of cellular biology, but until now, integrating and analyzing data at scale has been a major challenge," says Dave Burke, Arc Institute's Chief Technology Officer. "The Arc Virtual Cell Atlas represents a new approach that combines diverse datasets spanning multiple species, tissues, and experimental conditions, making them accessible for advanced computational analysis and discovery. And it’s available to use today."
To build scBaseCamp, Arc scientists developed AI agents that automatically identify and curate single-cell datasets from public repositories and process them through a standardized pipeline to ensure consistency across experimental sources. This uniform approach enables better integration across different types of single-cell data and provides researchers with a dataset tailored for virtual cell model pretraining.
"scBaseCamp represents a fundamental shift in how we approach biological data integration," says Yusuf Roohani, Arc’s Group Lead for Machine Learning. "We've built autonomous AI systems that continuously search, curate, and process public data repositories in real-time. Rather than relying on manual curation, our agents are working in the background, identifying and standardizing new single-cell datasets as they become available."
Both scBaseCamp and Tahoe-100M are publicly available on an open-source basis. Researchers can use the Atlas to study drug responses and disease mechanisms across diverse contexts.
"One of our major goals at Arc is to build virtual cell models that can simulate human biology in health and disease. By standardizing and integrating hundreds of millions of cellular measurements into computation-ready formats, we're removing a critical bottleneck in applying AI to biological discovery,” says Arc Co-Founder and Core Investigator Patrick Hsu. “This resource will enable researchers to train more sophisticated machine learning models that can better capture the complexity of cellular behavior across diverse contexts."
"We look forward to scientists using these datasets and are happy to collaborate with others creating single-cell data, especially perturbative single-cell data, who want to contribute to the Atlas," says Arc Executive Director, Co-Founder, and Core Investigator Silvana Konermann. "We are building a dynamic platform that we hope will accelerate biology research on many fronts."
Looking ahead, Arc plans to expand the Atlas through additional research efforts and partnerships. Future releases are expected to incorporate a variety of single-cell data types, with a focus on data that includes both chemical and genetic perturbations, so that their impact on the transcriptome can be mapped systematically across diverse cellular contexts and experimental conditions. The Institute will also enhance the analysis tools and visualization capabilities of the Atlas portal based on researcher feedback.
The Arc Virtual Cell Atlas is now accessible on this portal: https://arcinstitute.org/tools/virtualcellatlas