The Arc Virtual Cell Atlas is a collection of high-quality, curated, open datasets assembled to accelerate the creation of virtual cell models. The Atlas includes both observational and perturbational data from over 300 million cells (and growing).
The first release of the Atlas includes Tahoe’s Tahoe-100M and Arc’s AI agent-curated scBaseCount dataset.
Tahoe-100M was generated by Tahoe’s Mosaic platform and is the world’s largest single-cell dataset.
It contains 100M cells from ~60,000 drug perturbation experiments, mapping the response of 50 cancer models to 1,100+ drug treatments. The dataset used Parse Biosciences GigaLab for single-cell sample preparation and Ultima Genomics for sequencing.
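As a rough sense of scale, the headline figures above can be combined with simple arithmetic. The sketch below uses only the rounded numbers quoted in the text; the derived values (model-drug pairs, average cells per experiment) are back-of-the-envelope estimates rather than statistics reported for the dataset, and the function and constant names are made up for illustration.

```python
# Rounded figures quoted in the text; "~" and "+" values are treated as given.
TAHOE_CELLS = 100_000_000      # "100M cells"
TAHOE_EXPERIMENTS = 60_000     # "~60,000 drug perturbation experiments"
CANCER_MODELS = 50             # "50 cancer models"
DRUG_TREATMENTS = 1_100        # "1,100+ drug treatments" (lower bound)


def summarize_scale():
    """Return rough, derived scale estimates from the quoted figures."""
    # Lower bound on distinct (cancer model, drug) combinations.
    model_drug_pairs = CANCER_MODELS * DRUG_TREATMENTS
    # Average number of cells per perturbation experiment (estimate only).
    cells_per_experiment = TAHOE_CELLS / TAHOE_EXPERIMENTS
    return {
        "model_drug_pairs": model_drug_pairs,
        "cells_per_experiment": cells_per_experiment,
    }


if __name__ == "__main__":
    for name, value in summarize_scale().items():
        print(f"{name}: {value:,.0f}")
```

Because the inputs are rounded, these outputs only indicate order of magnitude: tens of thousands of model-drug combinations and on the order of a thousand or more cells per experiment.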
A preprint describing the dataset is available.
© 2025. This work is openly licensed via CC0 1.0.
scBaseCount is a continuously updated single-cell RNA-seq database that employs an AI-driven, hierarchical agent workflow to automate discovery, metadata extraction, and standardized preprocessing of Sequence Read Archive (SRA) data.
scBaseCount comprises over 230 million cells (and expanding), spanning 21 organisms and 72 tissues.
By continually discovering, annotating, and reprocessing raw single-cell RNA-seq data, scBaseCount offers an expansive and harmonized repository that can serve as a foundation for AI-driven modeling and integrative meta-analyses.
Technical report, February 2025
© 2025. This work is openly licensed via CC0 1.0.