How Software in the Life Sciences Actually Works (And Doesn’t Work)
By Elliot Hershberg. Published 2022-01-30.
Elliot is a PhD student at Stanford University. Before graduate school, he worked on a range of problems in biotechnology. He has helped design cancer vaccines, built computational tools for advancing imaging technologies, and worked as a software engineer on a modern genome browser. Elliot also writes a weekly newsletter called The Century of Biology.
Genomics is projected to require up to 110 petabytes (PB) of storage a day within the next decade. For reference, if you were storing all of that data on 1 TB hard drives, you’d need 110,000 of them per day. This would make genomics one of the largest data-generating endeavors on the planet, topping other contenders such as YouTube (3-5 PB/day) and astronomy (3 PB/day). Genomes are not the only “-ome” being measured at breakneck speed, as multi-omic studies are now generating enormous catalogs of different cellular measurements. High-resolution microscopes routinely generate terabytes (10^12 bytes) of rich spatial data. In the modern life sciences, quality analytical software is utterly essential for making sense of this enormous amount of new data. Biology is changing.
While biology is changing, our leviathan funding structures have remained the same. It has been estimated that tens of billions of dollars’ worth of global life sciences R&D depends on software tools for analyzing new types of biological data like genome sequences. While the National Institutes of Health (NIH), with a budget of $51.96 billion this year, is uniquely positioned to support scientists developing this crucial infrastructure, direct grant opportunities for this type of work remain practically nonexistent. Because of the lack of direct funding opportunities, many of the most widely used analytical software tools compete for smaller philanthropic support such as the $11.1 million Essential Open Source Software for Science grant opportunity from the Chan Zuckerberg Initiative.
Part of the absence of sustained software funding in the life sciences can be explained by studying history.
The field of genomics was launched by the Human Genome Project (HGP), which was an effort to produce the first map of the DNA bases of all human chromosomes. One of the central aspects of the story of this project is that it was completed in a close race against Celera Genomics, a private company aiming to gain patent protection over large portions of the human genetic code. In a heroic effort, a UC Santa Cruz graduate student and world-class hacker named Jim Kent wrote a software tool called GigAssembler that was required by the HGP to release the first publicly available map of the human genome. Kent’s graduate advisor, David Haussler, reflected on his effort, saying, “Jim in four weeks created the GigAssembler by working night and day. He had to ice his wrists at night because of the fury with which he created this extraordinarily complex piece of code.”
Kent and many other incredibly talented scientists created a large body of tools that enabled genomics to continue to grow. Inspired by the free software movement, the vast majority of these tools were made freely available along with their source code. While this innovative work propelled science forward, it also made it possible for funding agencies to maintain their status quo of largely ignoring software. The computational biologist Adam Siepel summarizes this, saying: “Institutions have not been forced to pay professional programmers competitive salaries; grant agencies have not been compelled to set aside appropriate funds for a software infrastructure; and the line items for professional software engineering have not made it into budget models. Thus, genomics has become accustomed to, even addicted to, abundant free software. In a sense, in our idealistic, anti-establishment zeal, we free software warriors have locked computational genomics into an unsustainable financial model.”
The short-term panacea of abundant free software in the life sciences has allowed funding agencies to carry on with the existing investigator-based grant model, without ever rigorously evaluating how to fund software development directly.
In practice, this means that most life sciences software development happens in academic labs. These labs are led by principal investigators who spend a considerable portion of their effort applying for competitive grants, and the rest of their time teaching and supervising their trainees who carry out the actual research and engineering. Because software development is structured and funded in the same way as basic science, citable peer-reviewed publications are the research outputs that are primarily recognized and rewarded. Operating within this framework, methods developers primarily work on building new standalone tools and writing papers about them, rather than maintaining tools or contributing to existing projects.
As computational tool builders progress through their academic careers, they accrue an ever-increasing maintenance burden as they build and publish new standalone methods. For example, I recently developed a tool for designing probes for microscopy experiments. Although it currently works and supports a good number of researchers, there are several ways that the software could be refactored to make it more durable long-term with less active maintenance. Unfortunately, as is common with young researchers, I have since moved on to work at a different lab and institution. Now that the work is published, there is no incentive, funding, or time allocated to do the important maintenance work. This is not an exception to the rule; it is an illustration of the typical environment in which biomedical software is developed and maintained.
This organizational structure for developing methods and software has resulted in a tsunami of unusable tools. A recent study found that from a sample of recently published computational tools, only half were found to be “easy to install”, and nearly a third were no longer installable from the URL provided in the original publication. Scientists need to learn how to download and install a large number of executable programs, battle with Python environments, and even compile C programs on their local machine if they want to do anything with their data at all. This makes scientists new to programming throw up their hands in confusion, and seasoned programmers tear their hair out with frustration. There is a reason why there is a long-running joke that half of the challenge of bioinformatics is installing software tools correctly, and the rest is just converting between different file formats.
How can we fund the development of better software tools for scientists?
One approach to funding the continued development of software is to sell it to paying users. A successful example of this approach in the life sciences is Benchling, which has produced tools such as their electronic lab notebook and registry system for lab reagents that have the quality and coherence we have come to expect from consumer software products. They have pursued a steep Robin Hood-esque pricing model, with SaaS costs for startups and pharma corporations and a free offering for academics. Encouragingly, new startups such as LatchBio, which aims to create a web-first platform for orchestrating bioinformatics pipelines on the cloud without coding, are beginning to follow in their footsteps. This illustrates one revenue model that could equip life scientists with modern and robust software tools for free without making the NIH responsible for the bill.
While venture-backed startups can deliver quality tooling, there are many important areas in biological software that don’t have large commercial markets to offset costs for academic users and to incentivize capital investment. This is because scientific software development is genuinely a domain of research, requiring close interaction with scientists and new measurement technologies long before there is a clearly identifiable market for the tools being built.
We are not limited exclusively to our existing academic model or the traditional VC model. By studying the history of our field, we can identify the random contingencies that have shaped the institutions that we inhabit today, and envision new systems for the future. In 2021, we saw a Cambrian explosion of new models for funding and organizing science, such as Arcadia Institute, Arc Institute, and New Science (where you are reading this post). New funding models such as Focused Research Organizations (FROs) and Private ARPAs (PARPAs) represent new ways to address gaps present in our current research ecosystem. These new institutions and funding structures have the potential to have an enormous impact on biological software.
To start, we could dramatically reduce the friction associated with establishing teams of professional research software engineers (RSEs), designers, and product managers to build systems far outside the scope of what a single graduate student could build when developing a standalone method.
I learned about both the importance and the difficulty of this type of work while working as an RSE on the JBrowse project. The JBrowse team builds, ships, and maintains the software tools and infrastructure powering a large number of model organism research communities, such as WormBase and FlyBase, and these tools are used by many other researchers in human genetics and cancer biology. Despite the broad user base JBrowse supports, funding is precarious. The project is supported by a patchwork arrangement of government and philanthropic grants subject to the gauntlet of competitive renewal.
In the funding environment I have described, many researchers are surprised to learn that we manage to get the necessary support for the team structure that we have. In response to a Reddit thread detailing some of the woes of software in biology, our project lead Ian Holmes responded to some of this surprise, saying: “It is built using React and other modern web standards, by a team of professional research software engineers (RSEs) working out of an academic lab (mine), and funded by NIH. Just mentioning this because, by the comments on this post, one would think this confluence of events akin to a minor miracle, like a two-headed calf, or a build without warnings. But it is possible.”
What if securing the funding for this type of project from the NIH didn’t feel like pulling teeth, but was instead a respected and standard model in the world of biomedical research? What if, instead of 5 engineers responsible for building a web app, a desktop app, test infrastructure, release infrastructure, documentation, posters, and publications, we could easily grow to a team of 10 or 15? What if we could hire dedicated product designers and infrastructure engineers with experience at top Silicon Valley startups? What if professional RSEs, paid industry-competitive wages, were embedded across labs and departments, providing a new and valued career path for tool builders?
These possibilities represent only a small sample of the total space of new ways to fund and organize software development in the life sciences. We should proceed by studying examples of successful scientific software projects, listening to existing associations and societies of RSEs, and exploring entirely new types of institutes. Charting a path towards a more stable and productive scientific software ecosystem will require a deeper analysis of our current incentive structures in order to identify opportunities for constructive change and realignment.
If the NIH doesn’t adapt to the changing demands of biological research, we need to harness the new institutes and funding models being created to make the world of biological software development more robust. In the tech world, there is increasing concern about the lack of innovation happening in the world of atoms compared to the world of bits. Biotechnology is an extraordinary counterexample to this trend, with innovation in the physical world happening at breakneck speed. Genome sequencing, CRISPR gene editing, and recombinant DNA technology are being mixed and matched in mind-bending ways, offering the promise of an abundant future. My argument is simple: software development is not only a valid way of doing science, it is a foundational part of the modern life sciences. If we want our most talented designers and programmers to build tools to accelerate biotechnology instead of innovating purely in the virtual world, we need to create stable and rewarding career opportunities for this type of work. Choosing to build scientific software tools should not require career sacrifice. A prosperous future in the physical world may depend on it.
Cite this essay:
Hershberg, E. “How Software in the Life Sciences Actually Works (And Doesn’t Work).” newscience.org. 2022 January. https://doi.org/10.56416/912lqs