Vevo Therapeutics Open Sources Tahoe-100M, the World's Largest Single-Cell Dataset, as the Inaugural Contribution to Arc Institute's New Virtual Cell Atlas
300 million-cell atlas now accessible to the scientific community, comprising Vevo’s Tahoe-100M, which maps 60,000 drug-cell interactions, and Arc’s AI-curated scBaseCamp, a 200 million-cell dataset
Generated using Vevo’s Mosaic platform, Tahoe-100M leveraged Parse Biosciences’ GigaLab for single-cell sample preparation and Ultima Genomics for sequencing.

PALO ALTO, Calif. and SOUTH SAN FRANCISCO, Calif. -- In a landmark move to advance AI-driven biological research, Arc Institute and Vevo Therapeutics announced today that they have partnered on the first release of the Arc Virtual Cell Atlas, the largest and most biologically diverse public resource for single-cell transcriptomic data across species, tissues, and experimental and perturbation conditions. The first release starts with data from over 300 million unique cells. This data is open source and freely accessible via Arc’s website as of February 25, 2025.
The atlas currently includes single-cell gene expression data from two massive datasets:
- Vevo’s Tahoe-100M is the world’s largest single-cell dataset, 50x larger than all public drug-perturbed data combined. It includes 100 million cells and maps 60,000 drug-cell interactions, measuring cellular response across 50 cancer cell lines to 1,200 drug perturbations. Tahoe-100M was generated using Vevo’s Mosaic Technology, the first platform to make pan-cancer testing of drugs at single-cell resolution scalable, with support from Parse Biosciences’ GigaLab and its single-cell RNA sequencing capabilities.
- Arc’s scBaseCamp is the first single-cell RNA sequencing data repository from public data to be curated and reprocessed at scale using AI agents. Its gene expression data, covering another 200 million cells from 21 different species, was sourced from public repositories and standardized to ensure interoperability for use by machine learning models.
“What makes the Arc Virtual Cell Atlas particularly powerful is not just its scale, but that now researchers can analyze together both observational natural cell states and cells that have been deliberately perturbed by drugs or chemicals to see how they respond,” says Dave Burke (@davey_burke), Arc Institute’s Chief Technology Officer. “We’re grateful to partner with Vevo on our first release of this resource, leveraging their large-scale Tahoe-100M cell dataset, which is crucial for developing predictive models that can simulate cellular responses to perturbations, potentially reducing years of laboratory work to computational queries that take minutes.”
“Something extraordinary happened in the last few years: emergence of AI models that can predict protein structure and function,” says Nima Alidoust (@nalidoust), Chief Executive Officer and Co-founder of Vevo Therapeutics. “Our mission at Vevo is to go a huge step further: build AI models of human cells to predict how diseased cells interact with potential drug molecules.”
“These models need massive amounts of observational and drug-perturbed single-cell data, leaps beyond what is publicly available today,” says Johnny Yu, Chief Scientific Officer at Vevo. “Our Mosaic platform overcomes this fundamental challenge; it can generate single-cell datasets such as Tahoe-100M at a scale that was not possible before.”
“We are open sourcing Tahoe-100M to help start a new movement in biological modeling that goes beyond us,” says Alidoust. “Releasing it on Arc’s Virtual Cell Atlas is the obvious choice as it aims to precisely do that.”
The Arc Virtual Cell Atlas is now accessible on this portal: https://arcinstitute.org/tools/virtualcellatlas.
Vevo’s Tahoe-100M preprint: https://www.biorxiv.org/content/10.1101/2025.02.20.639398v1
Arc’s scBaseCamp technical report: https://arcinstitute.org/manuscripts/scBaseCamp