Alex Dobin, Arc’s Bioinformatics Director, on his journey from theoretical physics to virtual cells
“I’m kind of self-taught. I took a biology course in high school, but not much else.”
Alexander Dobin’s (X: @a_dobin) career journey makes sense only in retrospect. After a PhD in condensed matter physics, Dobin wrote computer simulations to model magnetic materials used in computer hard drives. From there, he moved into biology and taught himself bioinformatics despite never taking a course on the subject.
A lack of formal training clearly wasn’t a barrier. After making the switch, Dobin developed STAR and STARsolo, two of the most widely used tools for analyzing data from RNA-seq experiments. Today, he is Director of Bioinformatics at Arc Institute, where his team analyzes data and builds improved algorithms for Arc’s Virtual Cell and Alzheimer’s Disease Initiatives.
His path is made more compelling, perhaps, by the fact that he took it without a formal biology education. “There is this bible, Molecular Biology of the Cell, that you’re supposed to read,” he says. “And I tried a few times, but I just don’t have time.”
In the late 1990s and early 2000s, Dobin was a Ph.D. student at the University of Minnesota, working on condensed matter physics. What began as a series of “pure theory” projects slowly morphed into computation. Dobin built a series of computational models for his Ph.D. thesis to predict how chromia (Cr2O3) changes under intense pressures, and later ran hundreds of thousands of lines of Fortran code to calculate band structures for LaNiO3 films, often used as electrodes in devices requiring precise voltage actuation.
After finishing his Ph.D. in 2003, Dobin moved to Seagate Technology, one of the world’s leading manufacturers of computer hard drives. There, he built simulations of magnetic materials to help design hard drives that store data more densely and write and erase it more quickly. The simulations were extremely expensive to run, says Dobin, because they involved lots of quantum mechanics equations. At the time, even the most advanced computers could only simulate the behaviors of a few atoms at once.
Five years into his work on hard drives, Dobin grew restless. Hard drive storage capacities were improving by 5 or 10 percent each year, but the entire problem of what he was working on—namely, information storage—felt like “a very limited area,” Dobin says. “Nobody really cares about it. But biology—living healthier, longer lives—just about everyone cares about.”
Around 2007, Dobin began reading news articles about human genome sequencing. The Human Genome Project had wrapped in 2003[1]. At the time, scientists were using that reference genome to make all kinds of discoveries about cancer and other diseases. But there were also better sequencing technologies coming online, and huge projects being launched to use those technologies at scale, such as the 1000 Genomes Project. Dobin thought he might be able to apply his computational skills to this problem of swelling genome data, so he began looking for jobs in biology labs.
And that’s when he encountered Thomas Gingeras, a professor at Cold Spring Harbor Laboratory and a pioneer in RNA sequencing. It was Gingeras who talked Dobin into moving to the East Coast to work on interpreting all the data his group was generating with microarrays. Dobin joined Gingeras’ laboratory in 2008.
Later that year, RNA-sequencing went mainstream. Barbara Wold’s laboratory at the California Institute of Technology published a Nature Methods paper on “Mapping and quantifying mammalian transcriptomes by RNA-Seq,” outlining how scientists could sequence and quantify the full transcriptome of mammalian cells by converting RNA into cDNA and then loading that cDNA onto next-generation sequencers.
“And I remember building my first simple script to analyze these RNA-seq data,” Dobin says. “I ran it in the browser. We could see these exons lighting up really clearly, showing the splicing structure. This was something you couldn’t do with microarrays. It was just an eye-opener for me.”
While in Gingeras’ laboratory, Dobin began working on a new algorithm to process RNA-seq data. At the time, several scientists believed that human cells might contain chimeric RNA molecules—literally two strands of RNA, expressed from distinct parts of the genome, that splice together. Chimeric RNAs had been found in C. elegans and fruit flies, but had not yet been found in humans. (Trans-splicing in vertebrates is rare, and whether or not healthy human cells make chimeric RNAs via trans-splicing is still somewhat debated.)
In 2008, during his search for chimeric RNAs, Dobin began writing the program that would later become STAR. He finished the project quickly but used the program only for his own work for several years before finally releasing it in 2012. STAR has now been cited more than 50,000 times and is one of the most successful bioinformatics tools of all time, even though it was initially created for Dobin’s own use. In 2021, Dobin also released STARsolo, a tool for analyzing single-cell RNA-seq data.
Later, though, Dobin realized that the skills he had learned as a physicist working on hard drives would also be relevant to leading a team at Arc Institute that works on virtual cells. His training in physics, for example, taught Dobin “how to look for patterns and logic in the data, and how to separate important factors from nuisances,” he says. “My later work was heavily computational, teaching me a lot about algorithms and coding, computation efficiency, and data structures.”
All of these skills now come to bear at Arc Institute, where Dobin’s time is mainly spent making new algorithms and tools for two major projects: building a highly predictive and general virtual model of human cells, and devising a model to find possible drug targets for Alzheimer’s disease.
RNA, Dobin’s area of expertise, happens to be the “abstraction layer” of the virtual cell effort. Arc Institute is profiling the transcriptomes of hundreds of millions of cells before and after perturbations to understand how drugs or genetic mutations change their gene expression patterns. These data are then supplied to the Bioinformatics and Machine Learning teams (Dobin leads the former) to train the model. The amount of data required for this effort lies far beyond what a typical academic laboratory can generate, and it also demands better data workflows.
Large-scale projects like these are part of why Dobin moved to Arc Institute. “At some point, I just couldn’t force myself to write another grant application, and I never much liked writing papers, either. I just felt like I was spending so much time doing things that were not related to my science,” he says.
“But I also wanted to work on these massive projects and collect these huge, high-quality datasets that just wouldn’t have been feasible otherwise. The drive at Arc Institute is to do good science—to find the truth and cure diseases—and that resonated with me.”
Footnotes
1. A full, gapless human genome sequence was not finalized until 2022, and was only made possible thanks to developments in nanopore sequencing technologies.
