Social science provides a variety of research and educational capabilities to the military to address the human dimensions of military organizations and their operational contexts. For instance, within the U.S. Department of Defense (DOD), psychological and human performance criteria are firmly rooted in social science constructs.
In the United States, military information support operations (MISO) and operational research systems analysis (ORSA) functions require the application of social science concepts, albeit through a semi-rigid doctrine filter. A common MISO conundrum lies in confidently separating what is needed for a phenomenon to occur (a measure of performance) from what increases or decreases its effects (a measure of effectiveness). Similarly, ORSA professionals create assessment tools based on data requirements rather than on theory. Regrettably, these tools are rarely designed to inform each other toward a common data picture, and neither rests on a sound theoretical base.
Perhaps no military area or agency is more reliant on social science and data than the intelligence field. Intelligence analysts are intentionally selected for language, regional, or cultural expertise so they can provide context to continuously gathered collections data. While intelligence analysts are arguably the least wedded to doctrine, they are nevertheless expected to use social science methodology to justify findings rather than to explain them. The result, as in MISO, is that measures of performance and effectiveness can become jumbled in intelligence reports.
It is hard to imagine a problem important to national and international security that does not somehow involve understanding human social behavior, and as we move into an era of increasingly complex social systems and interactions, leveraging the social and behavioral sciences seems vital. Knowing what kind of confidence one should have in certain research claims could therefore be critical to making progress on many of these important problems, says DARPA.
The ability to reproduce and replicate results and claims is a hallmark of scientific progress; without it, we struggle to differentiate chance and bias from a true finding. Irreproducibility can be caused by many factors: statistical mistakes, methodological corner-cutting, a lack of standardized reporting of materials, academic fraud, and more.
Assessing the quality of science is tricky, often unsystematic, and filled with potential for bias. It is often not enough to judge a finding by the journal it was published in (prestigious journals certainly endure replication failures and retractions), or by which university it came from, or by whether results were “statistically significant” (an extremely fraught metric for rating the quality of science).
More useful are the endogenous and exogenous signals of research quality. Endogenous signals are inherent to a study or article: evidence of questionable research practices, sample sizes, pre-registration of the research, open data, shared code, and so on. Exogenous signals exist in the wider research ecology: author and institution reputation, impact factors, peer review networks, social media commentary, post-publication review, ‘tighter or looser’ scientific communities and cliques, sources of funding, potential conflicts of interest, rates of retraction in that discipline, and more.
And, as with a credit score, there are many weak signals that can now be aggregated using automated techniques that were not available even five or ten years ago. Now is the time to see whether tools like the r-factor can be built into something more comprehensive.
DARPA believes that a machine-learning-derived computer program (a form of AI) that could rapidly assess and predict the reliability of a scientific finding would be helpful. The idea is that, with machine learning, a computer can look for patterns among papers that failed to replicate and then use those patterns to predict the likelihood of future failures. If scientists saw that a psychological theory was predicted to fail, they might take a harder look at their methods and question their results. They might be inspired to retest the idea more rigorously, or to give up on it completely. Better yet, the program could potentially help them figure out why their results are unlikely to replicate, and use that knowledge to build better methods, writes Brian Resnick, a science reporter at Vox.com.
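The credit-score analogy can be sketched as a weighted aggregation of weak signals passed through a logistic function. This is a minimal illustration only: the feature names and weights below are invented assumptions, not SCORE's actual model, which DARPA's performers would learn from real replication outcomes.

```python
import math

# Hypothetical endogenous signals for a published claim.
# Weights are illustrative assumptions, not values from SCORE.
FEATURE_WEIGHTS = {
    "log_sample_size": 0.8,   # larger samples count in favor
    "preregistered":   1.2,   # pre-registration counts in favor
    "open_data":       0.6,   # shared data counts in favor
    "p_just_below_05": -1.5,  # p-values just under .05 are a red flag
}
BIAS = -2.0

def confidence_score(signals: dict) -> float:
    """Aggregate weak signals into a 0-1 'confidence score'
    via a logistic function, in the spirit of a credit score."""
    z = BIAS + sum(FEATURE_WEIGHTS[k] * v for k, v in signals.items())
    return 1.0 / (1.0 + math.exp(-z))

# A well-reported claim: large N, preregistered, open data.
claim = {
    "log_sample_size": math.log10(5000),
    "preregistered": 1.0,
    "open_data": 1.0,
    "p_just_below_05": 0.0,
}
print(round(confidence_score(claim), 2))  # → 0.94
```

In a real system the weights would be fit against the labeled replication attempts described later in this article, rather than set by hand, and exogenous signals (reputation, funding, retraction rates) would enter as additional features.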
DARPA therefore launched a project called SCORE (Systematizing Confidence in Open Research and Evidence), a collaboration with the Center for Open Science in Virginia with a price tag that may top $7.6 million over three years. The SCORE program aims to develop automated tools that assign “confidence scores” to social and behavioral science research results and claims, says DARPA.
The military is interested in the replicability of social science because it’s interested in people. Soldiers, after all, are people, and lessons from human psychology — lessons in cooperation, conflict, and fatigue — ought to apply to them. And right now it’s hard to assess “what evidence is credible, and how credible it is,” says Brian Nosek, the University of Virginia psychologist who runs the Center for Open Science. “The current research literature isn’t something [the military] can easily use or apply.”
DARPA Aims to Score Social and Behavioral Research
The Pentagon’s innovation incubator has set itself an ambitious task – ranking the reliability of social science research that might apply to national security. For SCORE, reproducibility is being defined as the degree to which other researchers can successfully compute an original result when given access to a study’s underlying data, analyses, and code. Replicability is being defined as the degree to which other researchers can successfully repeat an experiment.
The Defense Advanced Research Projects Agency’s Defense Sciences Office is currently asking for “innovative research proposals” to algorithmically assign a confidence score to social and behavioral research. DARPA has named this program to develop an artificially intelligent quantitative metric Systematizing Confidence in Open Research and Evidence, or SCORE. As DARPA explains in its request for proposals:
These tools will assign explainable confidence scores with a reliability that is equal to, or better than, the best current human expert methods. If successful, SCORE will enable [Department of Defense] personnel to quickly calibrate the level of confidence they should have in the reproducibility and replicability of a given SBS result or claim, and thereby increase the effective use of SBS literature and research to address important human domain challenges, such as enhancing deterrence, enabling stability, and reducing extremism.
Outside observers have identified a wider collateral benefit to the academy from the proposal: a tool to address the so-called replication crisis in social science. A computer program that quickly scores a social science finding for its potential to replicate would be useful in many other areas too, such as science funding agencies deciding what is worth spending money on, policymakers who want the best research to inform their ideas, and members of the public curious to better understand themselves and their minds. Science journalism would benefit as well, gaining an additional sniff test to avoid hyped-up findings. An article by Adam Rogers at Wired, for example, is headlined “Darpa Wants to Solve Science’s Reproducibility Crisis With AI.”
DARPA implies that the replication crisis is itself a national security concern: “Taken in the context of growing numbers of journals, articles, and preprints, this current state of affairs could result in an SBS consumer mistakenly over-relying on weak SBS research or dismissing strong SBS research entirely.”
Last month, DARPA signed the Center for Open Science (COS) to a three-year agreement, worth $7.6 million, to create a database of 30,000 claims made in peer-reviewed and published papers. Alongside partners from the University of Pennsylvania and Syracuse University, COS will extract – automatically and manually – evidence about the claims, which will be merged with more traditional quality indicators like citations and whether the research was preregistered.
Three steps will follow once the database exists:
1. Experts will examine 10 percent of the claims, using surveys, panels, and even prediction markets, to estimate their likelihood of being replicated.
2. Other experts will create algorithms to examine the database’s contents and determine, artificially, the claims’ likelihood of being replicated.
3. Other researchers will attempt to replicate a sample of the database’s claims, allowing both the humans’ and the computers’ efforts to be measured and scored.
Appropriately, COS says its own work needs to be reproducible. “We are committed to transparency of process and outcomes so that we are accountable to the research community to do the best job that we can,” said COS program manager Beatrix Arendt, “and so that all of our work can be scrutinized and reproduced for future research that will build on this work.”
“Whatever the outcome,” according to Brian Nosek, COS’ executive director, “we will learn a ton about the state of science and how we can improve.”
Rogers quotes Microsoft sociologist Duncan Watts on the audacity of creating a scoring mechanism: “It’s such a DARPA thing to do, where they’re like, ‘We’re DARPA, we can just blaze in there and do this super-hard thing that nobody else has even thought about touching.’” Watts then adds, “Good for them, man.” (Further demonstrating its chutzpah, DARPA has specifically excluded from SCORE proposals “research that primarily results in evolutionary improvements to the existing state of practice.”)
Ideally, the scores and how they were determined would be understandable to a non-specialist. In addition, the scores could change based on new information.
As it tries to grade social and behavioral research, DARPA clearly acknowledges the need to fully embrace social science. “Given the accelerating sociotechnical complexity of today’s world—a world that is increasingly connected but often poorly understood—there are growing calls to more effectively leverage Social and Behavioral Sciences (SBS) to help address critical complex national security challenges in the Human Domain,” DARPA wrote in a 41-page document announcing the program in June 2018.
In addition to citing work with obvious applications to security, such as reducing extremism, the document cites other federal projects that have explicitly connected SBS and the Pentagon, such as the National Academies of Sciences’ Decadal Survey of Social and Behavioral Sciences for Applications to National Security and the Minerva Research Initiative (“Supporting social science for a safer world”).