The Defense Sciences Office (DSO) at the Defense Advanced Research Projects Agency (DARPA) is soliciting innovative research proposals in the area of autonomous molecular design to accelerate the discovery, validation and optimization of new, high-performance molecules for Department of Defense (DoD) needs. DARPA's Accelerated Molecular Discovery (AMD) program aims to develop new, AI-based systematic approaches that increase the pace of discovery and optimization of high-performance molecules. A Proposers Day webinar describing the goals of the program is scheduled for Oct. 18, 2018.
The efficient discovery and production of new molecules is essential for a range of military capabilities, from developing safe chemical warfare agent simulants and medicines to counter emerging threats, to coatings, dyes, and specialty fuels for advanced performance. Another example is energetics: energy-dense molecules used for applications such as explosives and propellants. Current approaches to developing molecules for specific applications, however, are intuition-driven, mired in slow iterative design and test cycles, and ultimately limited by the specific molecular expertise of the chemist who has to test each candidate molecule by hand.
DARPA seeks to develop new, systematic approaches that increase the pace of discovery and optimization of high-performance molecules through development of AI-based, closed-loop systems that automatically extract existing chemistry data from databases and text, perform autonomous experimental measurement and optimization, and use computational approaches to develop physics-based representations and predictive tools.
Such methods will ultimately enable AI-based design and discovery of completely new molecules that are optimized across multiple molecular properties for specific DoD applications. Proposed research should investigate innovative approaches that enable revolutionary advances in science, devices, and systems related to small organic molecules.
AMD performers will develop tools, models, and experimental capabilities to rapidly design, validate, and optimize molecules. Government partners will evaluate performer developments and test their ability to identify new molecules with specific combinations of functional properties that may be relevant to specific DoD application requirements.
“The ultimate goal of AMD is to speed the time to design, validate, and optimize new molecules with defined properties from several years to a few months, or even several weeks,” said Anne Fischer, program manager in DARPA’s Defense Sciences Office. “We aim to develop the AI tools, models and experimental systems to enable autonomous design of molecules to quickly meet DoD needs.”
Even in industries that exploit high-throughput screening (HTS) of millions of molecules per day, subsequent design, synthesis and testing of molecular analogs is a months-to-years-long process. Brute-force HTS of a common (albeit large) library has yet to yield productivity benefits over chemical intuition and past precedent, which still boast a success rate well beyond that of HTS methods. Ultimately, then, it is not just the rate of experiments that matters, but the quality and efficiency of those experiments. Given a potential structure space of ~10^60 for pharmacologically relevant molecules alone, with just over 10^9 molecules of any kind reported to date, the possibilities are seemingly endless, but our current approaches are considerably constraining.
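The scale gap described above can be made concrete with a few lines of arithmetic. The sketch below uses only the two figures quoted in the text (10^60 and 10^9) and assumes, purely for illustration, that "millions of molecules per day" means 10^6 per day; the function names are hypothetical.

```python
from fractions import Fraction

# Figures quoted in the text.
CHEMICAL_SPACE = 10**60      # estimated pharmacologically relevant structures
REPORTED_MOLECULES = 10**9   # molecules of any kind reported to date

# Assumption for illustration: "millions of molecules per day" taken as 10**6.
HTS_RATE_PER_DAY = 10**6
DAYS_PER_YEAR = 365


def fraction_explored(reported: int, space: int) -> Fraction:
    """Exact share of the structure space that has been reported so far."""
    return Fraction(reported, space)


def years_to_screen(space: int, rate_per_day: int) -> int:
    """Whole years needed to screen the full space at a constant daily rate."""
    return space // (rate_per_day * DAYS_PER_YEAR)


explored = fraction_explored(REPORTED_MOLECULES, CHEMICAL_SPACE)
years = years_to_screen(CHEMICAL_SPACE, HTS_RATE_PER_DAY)

# Number of zeros in the denominator gives the order of magnitude.
print(f"explored: 1 part in 10^{len(str(explored.denominator)) - 1}")
print(f"years to screen exhaustively at the assumed rate: {years:.2e}")
```

Under these assumptions, exhaustive screening is out of reach by many orders of magnitude, which is why the text emphasizes the quality and efficiency of experiments rather than their raw rate.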
Artificial Intelligence (AI) methods have recently been applied to various aspects of chemistry, including synthetic route design, property prediction, inverse design, and process optimization. Though some successes have been realized, a number of technical challenges continue to limit the potential of AI in the molecular sciences. The first relates to the data provided to the AI systems, in terms of both the content and accessibility of existing chemical data, and how we currently generate new chemical knowledge. The second centers on representation of this data to AI, or determining what features of a molecular system are necessary to embody chemical character and how to describe them in algorithmically accessible form.
As suggested above, the sheer volume of data produced by HTS and other methods in the molecular sciences has not been sufficient to provide comparable insights to an expert chemist. Indeed, the value of new data to an AI model is a function of data already seen by the model, the particular AI model being used, and the task the model is performing.
Therefore, an essential question is how to efficiently acquire chemistry data that will prove most valuable to specific AI models and tasks. Existing data in chemistry, as in many fields, are sparse, noisy and incomplete. For example, while we have access to databases containing reaction schemes for nearly 40 million molecules, fewer than a quarter of those include data on such a basic (and critical) parameter as reaction yield.
Moreover, those databases that do report yield are highly biased toward high-yield reactions, as chemists report primarily optimized processes in relatively algorithm-accessible reaction schemes. A closer look at the literature, however, reveals that valuable information about reaction failures, lower yields and optimization experiments is often included in algorithm-inaccessible form (e.g., text, tables, and figures).
Examples relevant to molecular properties that are the subject of this program abound as well, with critical data and metadata present in manuscripts, lab notebooks, etc. Existing tools such as named entity recognition (NER) and semantic role labeling (SRL) in Natural Language Processing (NLP) can identify mentions of terms of interest and relate them in meaningful ways.
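As a rough illustration of what pulling such data out of prose involves, the sketch below uses plain regular expressions to pair a reported temperature with a reported yield, sentence by sentence. It is a deliberately simplistic stand-in, not NER or SRL; the sample text, the patterns, and the record field names are all invented for this example.

```python
import re
from typing import Dict, List

# Hypothetical experimental prose (invented for illustration).
# "\u00b0" is the degree sign.
TEXT = (
    "The coupling was run at 80 \u00b0C for 12 h and gave the product in 43% yield. "
    "Repeating the reaction at 25 \u00b0C gave only 7% yield."
)

# Toy patterns: a number followed by "% yield", and a number followed by "°C".
YIELD_RE = re.compile(r"(\d+(?:\.\d+)?)\s*%\s*yield")
TEMP_RE = re.compile(r"(\d+(?:\.\d+)?)\s*\u00b0C")


def extract_records(text: str) -> List[Dict[str, float]]:
    """Return one record per sentence that mentions both a temperature and a yield."""
    records = []
    # Naive sentence split: break on whitespace that follows a period.
    for sentence in re.split(r"(?<=\.)\s+", text):
        yield_match = YIELD_RE.search(sentence)
        temp_match = TEMP_RE.search(sentence)
        if yield_match and temp_match:
            records.append(
                {
                    "temperature_c": float(temp_match.group(1)),
                    "yield_pct": float(yield_match.group(1)),
                }
            )
    return records


for record in extract_records(TEXT):
    print(record)
```

Note that the second sentence carries exactly the kind of low-yield result that the text says rarely reaches structured databases. Real literature is far less regular than this sample, which is where trained NLP models become necessary.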
Tools in the computer vision (CV) community exist that could aid in extracting knowledge from molecular structure diagrams, chemical equations, tables, and figures. Yet, tools from these communities have been sparingly applied to chemistry to augment our chemical knowledge and exploit past experiments. Perhaps an even greater challenge, however, is developing a means to rapidly generate new data through methods that more efficiently explore a search space, whether it be a query of structural variants to optimize a specific property or a process-level optimization across a series of experimental conditions.
While not standard in conventional chemistry laboratories, automated approaches are emerging, including semi-automated analytical feedback systems for property optimization and design-of-experiments (DOE)-like approaches for hands-free optimization of experimental processes. Such approaches, if effectively coupled to AI models through active learning, reinforcement learning, or adversarial learning methods, could generate data that is uniquely valuable to the AI model and task at hand, in terms of both training and validation.
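The coupling between a model and an automated platform can be sketched as a loop in which the next experiment is chosen by the algorithm rather than by hand. The example below is a minimal illustration under strong assumptions: `run_experiment` is a hypothetical stand-in for a real measurement with an invented response curve, and the selection rule (pick the candidate farthest from anything already measured) is only a crude proxy for model uncertainty, not a method the program prescribes.

```python
from typing import Callable, Dict, List


def run_experiment(setpoint: int) -> float:
    """Hypothetical stand-in for an automated measurement (invented response)."""
    return -float((setpoint - 62) ** 2)


def nearest_distance(candidate: int, measured: Dict[int, float]) -> int:
    """Distance from a candidate to the closest condition already measured."""
    return min(abs(candidate - m) for m in measured)


def closed_loop(
    pool: List[int],
    experiment: Callable[[int], float],
    seeds: List[int],
    n_rounds: int,
) -> Dict[int, float]:
    """Repeatedly measure the candidate that is least covered by existing data."""
    measured = {x: experiment(x) for x in seeds}
    for _ in range(n_rounds):
        remaining = [x for x in pool if x not in measured]
        if not remaining:
            break
        # Maximin-distance rule: a simple proxy for "most informative next point".
        next_x = max(remaining, key=lambda x: nearest_distance(x, measured))
        measured[next_x] = experiment(next_x)
    return measured


pool = list(range(0, 101, 10))  # candidate setpoints in arbitrary units
results = closed_loop(pool, run_experiment, seeds=[0, 100], n_rounds=4)
best = max(results, key=results.get)
print(f"measured {len(results)} of {len(pool)} conditions; best so far: {best}")
```

A real system would replace the distance rule with an acquisition function derived from the model itself (for example, predictive variance), so that the data collected are the most valuable for that model and task.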
Moreover, the data generated by automated experimental systems would be objective and complete in a way that is possible only with direct experimental observation. The potential of multi-objective optimization, particularly of multiple properties in a single molecule, makes this nascent application of automation and AI to chemistry even more appealing.
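When several properties must be optimized at once, there is usually no single best molecule, only a set of candidates that cannot be improved on one property without giving ground on another. The sketch below computes that set (the Pareto front) for a handful of candidates. The candidate names and scores are invented, and every objective is assumed to be scaled so that larger is better.

```python
from typing import List, Sequence, Tuple

Candidate = Tuple[str, Tuple[float, ...]]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that no other candidate dominates."""
    return [
        (name, scores)
        for name, scores in candidates
        if not any(dominates(other, scores) for _, other in candidates)
    ]


# Hypothetical scores: (solubility score, low-toxicity score), larger is better.
candidates: List[Candidate] = [
    ("mol_A", (0.9, 0.2)),
    ("mol_B", (0.5, 0.5)),
    ("mol_C", (0.4, 0.4)),  # dominated by mol_B
    ("mol_D", (0.1, 0.9)),
    ("mol_E", (0.9, 0.1)),  # dominated by mol_A
]

front = pareto_front(candidates)
print([name for name, _ in front])
```

This pairwise check is quadratic in the number of candidates, which is fine for a sketch; choosing among the surviving candidates still requires application-specific weighting of the properties.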
Chemists have long represented molecules using static, two-dimensional graphs, where the nodes are atoms and the edges are bonds. These same structure-based features are the dominant descriptors used for training statistical machine learning algorithms. Even when hundreds to thousands of structural features are incorporated into a representation of a molecule, such representations fail to capture known, critical property-dependent elements of a molecular system.
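To make the graph picture concrete, the following sketch encodes a molecule as atoms (nodes) and bonds (edges) and derives a few purely structural features from it. The class, the feature names, and the choice of acetic acid (heavy atoms only) as the example are illustrative assumptions, not a representation specified by the program.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class MolecularGraph:
    """Static 2D graph: atoms are nodes, bonds are edges (hydrogens omitted)."""

    atoms: Dict[int, str]               # atom index -> element symbol
    bonds: List[Tuple[int, int, int]]   # (atom i, atom j, bond order)

    def degree(self, index: int) -> int:
        """Number of bonds that touch the given atom."""
        return sum(1 for i, j, _ in self.bonds if index in (i, j))

    def structural_features(self) -> Dict[str, int]:
        """Simple counts of the kind used as structure-based descriptors."""
        features = {f"count_{el}": n for el, n in Counter(self.atoms.values()).items()}
        features["n_bonds"] = len(self.bonds)
        features["n_double_bonds"] = sum(1 for _, _, order in self.bonds if order == 2)
        features["max_degree"] = max(self.degree(i) for i in self.atoms)
        return features


# Acetic acid, heavy atoms only: C(0)-C(1), C(1)=O(2), C(1)-O(3)
acetic_acid = MolecularGraph(
    atoms={0: "C", 1: "C", 2: "O", 3: "O"},
    bonds=[(0, 1, 1), (1, 2, 2), (1, 3, 1)],
)
print(acetic_acid.structural_features())
```

Nothing in this feature vector reflects conformation, dynamics, or environment, which illustrates the limitation the text describes: adding more counts of this kind does not add the missing physics.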
Physics-based data, such as molecular dynamics, quantum mechanics, and mechanistic information, would be valuable, but the computation time traditionally required by simulation platforms to generate these data for new molecules hinders their application to AI in chemistry. Strategies such as machine learning approximations of such physics data could perhaps overcome this limitation. AI-based tools developed to accelerate molecular design and discovery most certainly must incorporate these elements to robustly represent what matters most in molecule-based property prediction. However, the implications of what this means (exactly which physics should be represented, how this should be done, and to what degree it matters in the context of a particular molecular property or domain) are as yet unknown.
Addressing these challenges in a meaningful way to provide practical software and hardware tools for molecular design requires development of a closed-loop system. With parallel development and real-time feedback of components for data extraction, data generation, representations, AI models, and optimization frameworks, components can be built, validated, and improved to evaluate their ultimate potential and fundamental limitations for applications in high-performance small molecule design and discovery.
AMD will increase the pace of molecular discovery and optimization, resulting in tools, techniques and closed-loop systems that design novel molecules with a desired set of properties. Performer teams will develop complete closed-loop systems that exploit, build, and integrate tools for 1) extracting existing data from databases and text; 2) executing autonomous experimental measurement and optimization; and 3) incorporating computational approaches to develop physics-based representations and predictive tools.
Tools, models, data, and systems will be transferred to Government teams throughout the program, potentially beginning as early as midway through the base period, for independent validation and verification (IV&V). AMD will emphasize creating and leveraging open source technologies and architectures, making data sharing and collaboration among performers and with Government IV&V teams key aspects of this program.
Intellectual property rights asserted by proposers are strongly encouraged to be aligned with open source regimes and be clearly defined in the proposal submission. AMD is focused on developing closed-loop systems that can measure, model, and predict among a set of fundamental molecular properties that are relevant to molecular function and performance in a given application. Properties of interest include boiling point, vapor pressure, density (including liquids), solubility (e.g., water/lipid), spectral signature (e.g., infrared/Raman), degradation products, toxicity, and viscosity.
Proposers should select four properties from this set, and may additionally choose to build systems to measure, model, predict and optimize other properties. Proposals should identify one or more potential applications (e.g., medicines, dyes, agrochemicals, coatings, fuels, etc.) for which their selected set of properties is relevant, as this affects aspects such as available data, target molecular classes, etc.
However, performers should focus on developing generalizable systems that enable design of molecules with specific properties that are informed, but not dictated by, an application.
Importantly, in addition to tool, model and system validation, Government IV&V teams will assess the feasibility of extending and applying tools developed for one application area to another. Understanding the challenges inherent in transitioning these models and tools among application areas will be a critical aspect of the program.
AMD performers will develop the approaches, methods, and tools to build closed-loop systems. These systems are divided into three Focus Areas (FAs) that pertain to the technical challenges and development necessary to realize the AMD goals: FA1: Data extraction from existing sources; FA2: Data generation via automated experimental platforms; and FA3: Representations, AI models, and optimization frameworks.
FA1: Data extraction from existing sources
Performers will apply techniques such as natural language processing and computer vision to extract and transform information from sources such as articles, manuscripts, and electronic lab notebooks into formats exploitable by AI models (FA3). Data extraction from both text and imagery sources (including prose, figures, graphs, and tables), as well as directly from analytical instrumentation, is required, as is correlation and fusion of these data across sources.
DARPA encourages performers to exploit existing tools and approaches wherever possible. Development of novel methods is acceptable if existing tools will not meet the performance metrics described in Section E or are not compatible with other components of the proposed AMD closed-loop system. Proposals should fully describe the data sources that will be used during the AMD period of performance and, in particular, comment on existing formats and accessibility, including whether the annotations and labels needed for evaluation as described in Section E are already present or whether such annotation will occur during the program. Performers should not rely on DARPA or Government IV&V teams to provide data for tool development, evaluation, validation, or system design. The data output of FA1 tools should complement that produced by automated experimental platforms (FA2) in developing and optimizing AI models (FA3).
FA2: Development of autonomous experimental platforms
Performers will develop and integrate the control software, reaction and robotics hardware, and instrumentation necessary to enable autonomous experimentation. It is expected that initially, some amount of human intervention may be required to execute experiments, but that the degree to which this is necessary will decline rapidly as the program progresses. Experimental platforms developed should be capable of executing experiments as directed and defined by the AI models (FA3), and in turn generate data of sufficient richness to inform and validate those models. Performers are encouraged to build on existing experimental systems and expertise, and leverage existing equipment where possible to adhere to the aggressive program schedule as well as to mitigate overall system cost.
FA3: Representations, AI models, and optimization frameworks
Performers will develop AI models and optimization frameworks that provide three capabilities: property prediction, process optimization, and inverse molecular design. In achieving these capabilities, it is expected that these models will 1) utilize techniques such as active learning, reinforcement learning, adversarial networks, and Bayesian optimization in order to identify the most informative data to be acquired by FA1 tools or FA2 experimental platforms to further increase the accuracy of the models; and 2) incorporate contextual and physics-based information (e.g., molecular dynamics and others) in their representation of data. It is expected that while FA1 and FA2 will be primary sources of such information, other means (e.g., simulation platforms or approximations thereof) will be utilized to augment and further enrich the data provided by FA1 and FA2. Proposals should fully describe the balance between experimental and virtual generation of data for AI models, and the scale and speed at which those data will be generated and integrated. Approaches that allow for the simultaneous optimization of multiple objectives (i.e., properties or process parameters) are required.
HR001119S0003 ACCELERATED MOLECULAR DISCOVERY (AMD)