Arc Virtual Cell Atlas

The Arc Virtual Cell Atlas is a collection of high-quality, curated, open datasets assembled to accelerate the creation of virtual cell models. The Atlas includes both observational and perturbational data from over 600 million cells (and growing).

First released in February 2025, the Atlas includes Tahoe’s Tahoe-100M, Arc’s AI agent-curated scBaseCount dataset, and the Virtual Cell Challenge 2025 training, validation, and held-out test perturbation datasets.

Datasets included in the Atlas

Tahoe-100M

Tahoe-100M was generated by Tahoe’s Mosaic platform and is the world’s largest single-cell dataset.

It contains 100 million cells from ~60,000 drug perturbation experiments, mapping the responses of 50 cancer models to more than 1,100 drug treatments. The dataset used Parse Biosciences GigaLab for single-cell sample preparation and Ultima Genomics for sequencing.

A preprint describing the dataset is available on bioRxiv.

© 2025. This work is openly licensed via CC0 1.0.

scBaseCount

scBaseCount is a continuously updated single-cell RNA-seq database that employs an AI-driven, hierarchical agent workflow to automate discovery, metadata extraction, and standardized preprocessing of Sequence Read Archive (SRA) data.

scBaseCount comprises over 500 million cells (and expanding), spanning 21 organisms and 72 tissues.

By continually discovering, annotating, and reprocessing raw single-cell RNA-seq data, scBaseCount offers an expansive and harmonized repository that can serve as a foundation for AI-driven modeling and integrative meta-analyses.

A preprint describing the dataset is available on bioRxiv.

© 2025. This work is openly licensed via CC0 1.0.

Virtual Cell Challenge 2025

For Arc’s inaugural Virtual Cell Challenge, we generated a dedicated dataset measuring single-cell responses to perturbations in a human embryonic stem cell line (H1 hESC). This set of perturbations was carefully curated to span a broad range of phenotypic responses, and experimental parameters were optimized to maximize the reproducibility of observed effects.

The Atlas includes the training, validation, and held-out test perturbation datasets used throughout the Challenge.

A commentary describing the dataset was published in Cell.

© 2025. This work is openly licensed via CC0 1.0.