Engineers regularly use high-fidelity simulations to create robust designs in complex domains such as aeronautics, automobiles, and integrated circuits. In contrast, robust design remains elusive in domains such as synthetic biology, neuro-computation, and polymer chemistry due to the lack of high-fidelity models. DARPA’s Synergistic Discovery and Design (SD2) program aims to develop data-driven methods and tools to accelerate scientific discovery and robust design in domains that lack complete models.
Examples of complex systems where inventors lack complete scientific models to support their design efforts include biological systems that have millions of protein-metabolite interactions, neuro-processes that require computations across billions of neurons, and advanced materials influenced by millions of monomer-protein combinations.
These systems belong to domains with millions of unpredictable, interacting components for which robust models do not exist and whose internal states are often only partially observable. In such domains, small perturbations in the environment can lead to unexpected design failures, and the number of engineering variables required to characterize stable operational envelopes remains unknown.
Although domain experts are geographically dispersed, they collectively analyze hundreds of terabytes of data to build models and refine designs. This manually intensive analysis of small datasets, however, remains inefficient and often yields unreproducible results. In response, researchers have begun to outsource high-throughput experiments to automated labs and to randomly search constrained parameter spaces for robust designs. However, these random search-based approaches work best with small parameter spaces and provide no insight into why some designs succeed and others fail.
SD2 will address the problem of design in domains that lack complete models by developing data-driven methods that discover models and refine designs in parallel, extracting information from data at petabyte scale. To ensure realism, challenge problems drawn from cutting-edge domains will drive the development of SD2 methods. By the end of the program, SD2 plans to deliver new data-driven methods that accelerate discovery and design, and to create a cloud-based open data exchange for research communities in complex domains.
Initially, SD2 will use challenge problems from synthetic biology for program-wide evaluation. Synthetic biology provides a compelling driving application domain for the following reasons:
(1) full characterization of the underlying biology requires complex mechanistic formalisms at a scale that defies manual discovery; (2) advances in the last 20 years have enabled scientists to more efficiently modify organisms in more complex and targeted ways; (3) terabytes of high quality synthetic biology data can be generated in a few days; and (4) design of synthetic biology systems remains a very laborious process that typically involves heuristic knowledge, brute force, and trial and error approaches.
Model discovery is a primary research focus for SD2, as accurate models can accelerate design in domains that lack high fidelity simulations. To facilitate model discovery, SD2 plans to advance computational analysis of raw experimental data at petabyte scale. The program will virtualize experimental workflows in order to increase access to high-throughput experimental resources.
High-throughput experiments will provide large increments of high quality data from multiple labs. This data will flow into an SD2 analysis hub that serves as a data-sharing ecosystem for scalable, automated discovery and design. Novel discovery algorithms generated by the SD2 program will automatically process experimental data with the objective of detecting unexpected findings and refining system models.
SD2-developed design algorithms will use the refined models to generate, test, and validate system designs. SD2 techniques and tools will be evaluated using challenge problems drawn from domains such as synthetic biology, polymer chemistry, and neuro-computation.
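The discovery-design loop described above can be sketched in miniature. The following is a hypothetical illustration, not an SD2 implementation: all function names, the toy model-update rule, and the synthetic "lab" data are assumptions made for clarity.

```python
# Hypothetical sketch of the SD2 closed loop: experimental data refines a
# model, the refined model drives the next design, and that design is sent
# back for experimental validation. All names and values are illustrative.

def refine_model(model, data):
    """Update per-condition performance estimates from new observations."""
    for obs in data:
        key = obs["condition"]
        # Toy update rule: average the prior estimate with the new outcome.
        prior = model.setdefault(key, obs["outcome"])
        model[key] = 0.5 * prior + 0.5 * obs["outcome"]
    return model

def propose_design(model):
    """Pick the candidate the current model predicts performs best."""
    return max(model, key=model.get)

def run_experiment(design):
    """Stand-in for a remote automated lab; returns synthetic observations."""
    ground_truth = {"circuit_a": 0.4, "circuit_b": 0.9, "circuit_c": 0.6}
    return [{"condition": design, "outcome": ground_truth[design]}]

# Start with an uninformative model and iterate the loop a few times.
model = {"circuit_a": 0.5, "circuit_b": 0.5, "circuit_c": 0.5}
for _ in range(3):
    design = propose_design(model)
    data = run_experiment(design)
    model = refine_model(model, data)
```

Even this toy version shows the key property SD2 targets: each pass through the loop both improves the model and redirects experimental effort toward more promising designs.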
At regular intervals, SD2 performers who are subject-matter experts will propose increasingly complex challenge problems based on lessons learned from experimental successes and failures.
Notional design challenge problems may include designing a 30-component circuit for nuclear waste absorption (in synthetic biology) or designing a polymer that constricts in response to contact with a toxin (in polymer chemistry).
The program structure includes five technical areas (TAs) that must work collaboratively to achieve program goals.
TA1 (Data-Centric Scientific Discovery):
TA1 performers must develop computational methods that convert experimental data into scientific knowledge that design algorithms can use for computation. TA1 algorithms must analyze experimental data, detect experimental surprise, diagnose failure, extract patterns, and use those patterns to refine system models. SD2 aims to reduce the amount of human intervention required for discovery tasks by creating methods that automate routine, mundane operations. Such automation will refocus human efforts on interesting scientific questions.
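One way to make the "detect experimental surprise" task concrete is to flag observations that deviate from model predictions by more than some multiple of the measurement noise. The sketch below is an assumption about how such a check might look, not a TA1 algorithm; the function name, threshold rule, and example data are invented.

```python
# Illustrative surprise detector: flag observations that fall more than
# k standard deviations from the model's prediction. All names and the
# thresholding rule are assumptions for illustration only.

def detect_surprise(predictions, observations, noise_sd, k=3.0):
    """Return indices of observations more than k*noise_sd from prediction."""
    flagged = []
    for i, (pred, obs) in enumerate(zip(predictions, observations)):
        if abs(obs - pred) > k * noise_sd:
            flagged.append(i)
    return flagged

# Example: the model predicts steady expression, but one replicate spikes.
preds = [1.0, 1.0, 1.0, 1.0]
obs = [1.05, 0.97, 2.4, 1.02]
print(detect_surprise(preds, obs, noise_sd=0.1))  # index 2 is surprising
```

Automating even simple checks like this one frees experimenters from scanning raw data by hand and routes only the anomalous runs to human attention.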
TA2 (Design in the Context of Uncertainty):
Extend engineering and planning formalisms to enable automated refinement of intermediate designs.
TA2 performers must develop design algorithms that provide novel design capabilities in domains with incomplete models. Design algorithms will be used to address domain-specific design challenges such as: design a 30-component circuit for nuclear waste absorption (in synthetic biology) or design a polymer that constricts in response to a toxin (in polymer chemistry).
The TA2 design algorithms will consist of two classes: (1) algorithms for targeted design of system components, and (2) algorithms for experimental planning.
TA3 (Hypothesis and Design Evaluation):
Experimentally evaluate the designs and hypothesized system knowledge generated by TA1 and TA2 in a manner that tests for reproducibility and robustness.
TA3 performers must generate experimental data throughout all phases of the program to validate TA1 hypotheses and evaluate TA2 designs. TA3 performers must provide computer readable application program interfaces (APIs) that enable both subject matter experts and computer algorithms to submit experiment requests. Ideally, automated workflows will remotely log experiment and contextual data in a manner that facilitates data comparison between geospatially distributed laboratories.
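A computer-readable experiment request of the kind TA3 calls for might look like the following. This schema is entirely hypothetical: the field names, protocol identifier, and sample labels are invented for illustration, and real lab APIs (such as those operated by TA3 performers) will differ.

```python
# Hypothetical experiment-request payload for a TA3-style lab API.
# Every field name here is an assumption, not a real SD2 schema.
import json

def make_experiment_request(protocol, samples, conditions, requester):
    """Assemble a machine-readable experiment request, including the
    contextual metadata needed to compare runs across distributed labs."""
    return {
        "protocol": protocol,       # named, versioned lab protocol
        "samples": samples,         # sample identifiers
        "conditions": conditions,   # environmental parameters to vary
        "requester": requester,     # human expert or design algorithm
        "log_context": True,        # record instrument and context metadata
    }

request = make_experiment_request(
    protocol="growth_curve_v2",
    samples=["strain_001", "strain_002"],
    conditions={"temperature_c": [30, 37], "inducer_um": [0, 10]},
    requester="ta2-design-agent",
)
print(json.dumps(request, indent=2))
```

The point of such a structured payload is that both a subject matter expert and a TA2 design algorithm can submit the same kind of request, and the logged context travels with the data so results from geospatially distributed labs remain comparable.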
TA3 proposals must specify an intended application domain, describe data collection capabilities, and outline at least three relevant challenge areas. For each challenge area, proposers should identify a series of notional design and discovery challenges that could drive TA1 and TA2 development. Proposals should connect challenge problems with notional experiments that would provide relevant data. For each challenge, TA3 proposers should also explain the impact that a novel solution would have within the research community or the Department of Defense (DoD). More information about the types of challenge problems of interest to SD2 can be found in the “Challenge Problems” section (see page 5) and “Demonstration Efforts” section (see page 8).
TA4 (Data and Analysis Hub):
Extend open source tools to virtualize access to computational resources, data, and results in a manner that links multiple labs and research communities.
TA5 (Challenge Problem Integrator):
Work with domain experts throughout the program to establish quarterly challenge problems to drive development of automated, data-driven methods that accelerate complex system design at scale.
Ginkgo Bioworks and Transcriptic selected by DARPA to leverage robotic cloud lab and foundry automation to accelerate biological design with $9.5M award
Design in biology has largely been driven by trial-and-error experimentation. Biological modeling has been hampered by imprecise measurement technologies and a lack of scalable experimental infrastructure, making the exploration of complex cellular systems difficult. Under SD2, the large-scale collaboration between Ginkgo and Transcriptic will work to address this shortcoming by bringing engineering principles to the design of experimentation, so that geographically dispersed teams can collaborate and leverage automation in the lab.
Using machine learning tools to analyze massive amounts of data, Transcriptic and Ginkgo, along with other collaborators, will work to develop new platforms to accelerate discovery and design, as well as produce a cloud-based open data exchange to benefit research and academic communities in other scientific domains. Transcriptic and Ginkgo will each run automated experimental facilities to enable cross-site validation and the development of reproducible experiments.
As part of this ambitious program, the vast, automated experimental capabilities from Ginkgo Bioworks and Transcriptic will be programmatically connected to machine learning-driven design and analysis algorithms. Ginkgo’s foundry generates terabytes of biological data, including genome sequences and transcriptomic, metabolomic, and proteomic data. This data will be analyzed using advanced artificial intelligence to design experiments performed on Transcriptic’s cloud lab capable of 24/7 automated execution.
“Ginkgo’s founding mission is to make biology easier to engineer, and our comprehensive measurement of relatively simple systems — genome, transcript, proteome, and metabolome — provides us with the essential basis for creating predictive models of biology,” said Tom Knight, co-founder of Ginkgo Bioworks. “Only with this depth of knowledge will we have the ability to rationally engineer biology. We’re thrilled to collaborate with Transcriptic in automating high-throughput production and interpretation of this data.”
“The goal of creating a closed loop AI-driven experimentation platform is aligned with our mission and expertise. By computationally controlling Transcriptic’s core automated biological platform, we hope to accelerate science by rapidly and efficiently converting scientific findings into reliable processes and outcomes,” said Yvonne Linney, Transcriptic Chief Executive Officer.