Virtual Cell Atlas | Arc Institute

Arc Virtual Cell Atlas

The Arc Virtual Cell Atlas is a collection of high-quality, curated, open datasets assembled to accelerate the creation of virtual cell models. The Atlas includes both observational and perturbational data from over 600 million cells (and growing).

The first release of the Atlas includes Tahoe’s Tahoe-100M and Arc’s AI agent-curated scBaseCount dataset.

Access on GitHub

Datasets

Tahoe-100M

Tahoe-100M was generated by Tahoe’s Mosaic platform and is the world’s largest single-cell dataset.

It contains 100 million cells from ~60,000 drug perturbation experiments, mapping the responses of 50 cancer models to 1,100+ drug treatments. Single-cell sample preparation used the Parse Biosciences GigaLab, and sequencing was done with Ultima Genomics.
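To give a sense of scale, the figures quoted above can be combined in a quick back-of-the-envelope calculation. This is only a rough sketch: it assumes every cancer model is paired with every drug treatment, which the description does not state, and it uses the rounded counts from the text rather than exact values from the dataset.

```python
# Approximate figures taken from the dataset description above.
total_cells = 100_000_000   # about 100 million cells
experiments = 60_000        # roughly 60,000 drug perturbation experiments
cancer_models = 50          # 50 cancer models
drug_treatments = 1_100     # "1,100+" drug treatments, used here as a lower bound

# Assumption (not stated in the source): each cancer model is exposed
# to each drug treatment, giving a full model-by-treatment grid.
model_treatment_pairs = cancer_models * drug_treatments

# Average number of cells profiled per experiment, using the rounded totals.
cells_per_experiment = total_cells / experiments

print(f"model x treatment pairs (lower bound): {model_treatment_pairs:,}")
print(f"average cells per experiment: {cells_per_experiment:,.0f}")
```

Under these assumptions the grid of 55,000 model-treatment pairs is in line with the quoted ~60,000 experiments, and each experiment averages on the order of 1,700 cells.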

A preprint describing the dataset is available on bioRxiv.

© 2025. This work is openly licensed via CC0 1.0.

scBaseCount

scBaseCount is a continuously updated single-cell RNA-seq database that employs an AI-driven, hierarchical agent workflow to automate discovery, metadata extraction, and standardized preprocessing of Sequence Read Archive (SRA) data.

scBaseCount comprises over 600 million cells (and expanding), spanning 21 organisms and 72 tissues.

By continually discovering, annotating, and reprocessing raw single-cell RNA-seq data, scBaseCount offers an expansive and harmonized repository that can serve as a foundation for AI-driven modeling and integrative meta-analyses.

A preprint describing the dataset is available on bioRxiv.

© 2025. This work is openly licensed via CC0 1.0.

Accessing the data

Data access instructions and scripts are available on GitHub.