Arc Research

dnaHNet: A Scalable and Hierarchical Foundation Model for Genomic Sequence Learning
arXiv
Goodarzi Lab, Hsu Lab

Standard fixed-vocabulary tokenizers fragment biologically meaningful motifs, while nucleotide-level models preserve biological coherence but incur prohibitive computational costs for long contexts. We introduce dnaHNet, a state-of-the-art tokenizer-free autoregressive model that segments and models genomic sequences end-to-end.

Systemic hypoxia suppresses solid tumor growth
bioRxiv
Jain Lab, Goodarzi Lab

Local hypoxia is a hallmark of solid tumors and a negative prognostic factor in the progression and treatment of cancer. Here, we showed that systemic hypoxia, in contrast to localized tumor hypoxia, decreases tumor growth in vivo across multiple cancer types and preclinical models.

Rewriting endogenous human transcripts with dual CRISPR-guided 3′ trans-splicing
Cell Systems
Hsu Lab, Konermann Lab

Here, we report the development of RNA-guided trans-splicing with Cas editor (RESPLICE). RESPLICE uses two orthogonal RNA-targeting CRISPR effectors to co-localize a trans-splicing pre-mRNA and to inhibit the cis-splicing reaction, respectively.

Systematic annotation of orphan RNAs reveals blood-accessible molecular barcodes of cancer identity and cancer-emergent oncogenic drivers
Cell Reports Medicine
Goodarzi Lab

From extrachromosomal DNA to neo-peptides, reprogramming of cancer genomes leads to the emergence of cancer state-specific molecules. Here, we systematically identify and characterize a large repertoire of orphan non-coding RNAs (oncRNAs), a class of cancer-emergent small RNAs, across 32 tumor types.

cyto: ultra high-throughput processing of 10x-flex single cell sequencing
bioRxiv
Computational

Single-cell genomics is rapidly scaling toward billion-cell atlases, but computational analysis has become a critical bottleneck. Here we present cyto, an ultra high-throughput processor for 10x Genomics Flex single-cell sequencing optimized for production-scale analysis.

Stack: In-context learning of single-cell biology
bioRxiv
Computational

Single-cell transcriptomics offers the promise of measuring the diversity of cellular phenotypes across species and diseases. Here, we present Stack, a foundation model trained on 149 million uniformly preprocessed human single cells that leverages tabular attention to generate representations for each cell informed by the cells in its context.

Semantic design of functional de novo genes from a genomic language model
Nature
Hie Lab

Generative genomic models can design increasingly complex biological systems. However, controlling these models to generate novel sequences with desired functions remains challenging. Here, we show that Evo can leverage genomic context to perform function-guided design that accesses novel regions of sequence space.

Site-specific DNA insertion into the human genome with engineered recombinases
Nature Biotechnology
Hsu Lab

Large serine recombinases can mediate direct, site-specific genomic integration of multi-kilobase DNA sequences without a pre-installed landing pad, albeit with low insertion rates and high off-target activity. Here we present an engineering roadmap for jointly optimizing their DNA recombination efficiency and specificity.

Genome-scale CRISPR screens identify PTGES3 as a direct modulator of androgen receptor function in advanced prostate cancer
Nature Genetics
Gilbert Lab

The androgen receptor is a critical driver of prostate cancer. Here, to study regulators of AR protein levels and oncogenic activity, we developed a live-cell quantitative endogenous AR fluorescent reporter. Leveraging this AR reporter, we performed genome-scale CRISPRi screens to systematically identify genes that modulate AR protein levels.

scBaseCount: an AI agent-curated, uniformly processed, and autonomously updated single cell data repository
bioRxiv
Computational

Single-cell RNA sequencing has transformed cell biology by enabling precise transcriptomic measurements of individual cells. Here, we introduce scBaseCount, a single-cell RNA sequencing database that leverages an AI agent to automate discovery and metadata extraction, and standardize data processing.