The Department of Defense (DoD) is one of many government agencies that operate globally and are in constant contact with diverse cultures. Communicative understanding, not simply of local languages but also of social customs and cultural backgrounds, lies at the heart of Civil Affairs and Military Information Support Operations activities.
Together, these activities make up the vast majority of U.S. counterinsurgency and stabilization efforts. Within these activities, cross-cultural miscommunication can derail negotiations, incite hostile discourse – even lead to war. The likelihood of communicative failure increases dramatically where significant social, cultural, or ideological differences exist.
Automated systems would be a welcome force-multiplier for DoD interpreters. However, unlike the human cultural interpreters who enable U.S. forces today, current AI-enabled systems are incapable of accurately analyzing cross-cultural communication or providing useful assistance beyond basic machine translation. While there have been significant advances made in machine learning and multimedia analysis, a number of critical deficiencies in these systems still remain.
To assist negotiations and aid critical interactions, DARPA launched the Computational Cultural Understanding (CCU) program. The goal of CCU is to create a cross-cultural language understanding service to improve a DoD operator’s situational awareness and ability to effectively interact with diverse international audiences. The program seeks to develop natural language processing (NLP) technologies that recognize, adapt to, and recommend how to operate within the emotional, social, and cultural norms that differ across societies, languages, and communities.
To support diverse and emergent use cases, CCU technologies will be engineered to require minimal-to-no training data in a local culture, while maximizing operator success during negotiations and other interactions in the field. Instead of relying primarily on annotated training data, systems will leverage qualitative and quantitative findings from fields such as psychology, sociology, or other relevant disciplines, as well as minimally-supervised machine learning techniques, in order to infer the meaning of unlabeled discourse behaviors in context.
“To support users engaged in cross-cultural dialogue, AI-enabled systems need to go beyond providing language translation – they need to leverage deep social and cultural understanding to assist communication,” said Dr. William Corvey, a program manager in DARPA’s Information Innovation Office (I2O). “Moving AI from a tool to a partner in this capacity will require significant advances in our machines’ ability to discover and interpret sociocultural factors, recognize emotions, detect shifts in communication styles, and provide dialogue assistance when miscommunications seem imminent – all in real-time.”
To remedy these deficiencies and advance communication technology toward enabling greater cultural understanding, the CCU program will address the following research topics:
- Automatic discovery of sociocultural norms. Humans acquire inferred knowledge of diverse and varied sociocultural norms through a lifetime of learning and interaction. CCU aims to emulate this learning capability, creating technologies that are capable of automatic discovery of the sociocultural norms that influence discourse, including the social, cultural, and contextual factors that impact effective communication and rapport building. This requirement will be especially challenging to meet because unsupervised techniques will require the development of new features to address this research problem, while comprehensive and varied annotated data sets remain unavailable.
- Generalization of emotion recognition across cultures. CCU also aims to create a capability that can recognize speaker emotions across different languages and cultures. Human interpreters continuously monitor emotional feedback from conversational participants (e.g. facial expression, tone of voice, diction), using this information to gauge how the interaction is progressing and alter the exchange as needed. In order to interpret speaker emotions as influenced by sociocultural context, CCU will focus on developing multimodal human language technologies capable of generalizing their recognition of emotion across different languages and cultures.
However, current systems degrade significantly in cross-cultural test conditions that may mirror DoD emergent use cases, where neither labeled nor unlabeled training data are available. Two major limitations of current approaches are the high cost of training data creation for each new language and a low ability to address low-resource languages at all.
- Detecting impactful changes in communicative practice at multiple timescales. In order to identify shifts in norms and emotions that are indicative of communicative failure or impending conflict, human language technologies must be able to detect important changes in communication. While promising change detection methods exist, current frameworks lack an understanding of which features are most crucial to detecting imminent communicative failure.
- Providing dialogue assistance to cross-cultural interaction. To help promote truly effective cross-cultural interaction, CCU technologies must be able to not only detect potential misunderstandings but also generate alternative socioculturally-appropriate responses. The ability to analyze conversations for evidence of cross-cultural misunderstanding and suggest remediation measures is crucial for effective communication. No existing technologies, however, are able to provide real-time dialogue assistance in cross-cultural settings.
To build these models, DARPA will split the research into three technical areas:
- TA1 – Sociocultural Analysis;
- TA2 – Cross-Cultural Dialogue Assistance; and
- TA3 – Data Creation for Development and Evaluation.
TA1 researchers will concentrate on research and development corresponding to three distinct
research tasks, namely: (1) discovery of sociocultural norms; (2) emotion recognition as a
function of sociocultural norms; and (3) detection of impactful changes in sociocultural norms
and emotions. Proposals must address all three tasks. TA1 work is expected to involve both
unimodal and multimodal inputs, consisting of text, speech, images, and video. The results of
efforts in this research area will be compiled to facilitate TA2 integration.
In CCU, TA2 researchers will leverage
TA1 analytic outputs as components of a dialogue assistance service able to follow ongoing
conversations, detect misunderstandings and discord in real-time, and suggest culturally and
socially-appropriate conversational actions for remediation. For example, negotiation efforts
would benefit greatly from automated assistance in promoting mutual understanding in the face
of initial or even persistent disagreement or opposition among communicants.
TA2 research must confront multiple challenges that remain unaddressed by current dialogue systems, including automated detection of sociocultural context (e.g., communicants’ social roles, relative ages, and genders, as well as specifics of the social setting), automated identification of the need for operator assistance, and dialogue generation, all while incorporating program-external
machine translation components. TA2 algorithms must be capable of detecting sociocultural
settings from language and image inputs, of revising operator utterances to increase interactional
effectiveness, and of incorporating culture-independent techniques that enable generalization to
approximately six cultures/languages by program completion.
The goal of TA3 is to create data for development and evaluation in multiple cultures/languages
that support research work by TA1 and TA2, as well as performance measurement for the
individual components and the end-user application. The first language and culture pair will be Chinese (Mandarin).
TA1 will require approximately 50,000 documents per language for use in development and
evaluation; of this data set, 20% will be annotated and the other 80% unlabeled. A document is
defined as a portion of text (which may include an associated image), audio, or video showing a
conversation involving two or more participants. Each textual document should contain at least
300 words and each video or audio segment should comprise at least five minutes, although
longer documents are preferred. Video (with accompanying audio) should comprise 75% of the
development and evaluation data set with the remaining 25% distributed across audio and text.
TA3 proposals should describe the generality of the data included
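The percentages above translate into concrete per-language document counts. The following is a minimal sketch of that arithmetic in Python, using only the figures stated in the TA3 description; the function and constant names are illustrative assumptions, not part of any DARPA specification. The source does not say how the annotated share is distributed across video, audio, and text, so the sketch treats the two splits independently.

```python
# Figures come from the TA3 description above; all names here are
# illustrative assumptions, not part of any DARPA specification.
DOCS_PER_LANGUAGE = 50_000
ANNOTATED_PERCENT = 20   # the remaining 80% is unlabeled
VIDEO_PERCENT = 75       # the remaining 25% is audio and text combined
MIN_TEXT_WORDS = 300
MIN_AV_MINUTES = 5


def corpus_breakdown(total=DOCS_PER_LANGUAGE):
    """Split a per-language document count by labeling status and by modality."""
    annotated = total * ANNOTATED_PERCENT // 100
    video = total * VIDEO_PERCENT // 100
    return {
        "annotated": annotated,
        "unlabeled": total - annotated,
        "video": video,
        "audio_and_text": total - video,
    }


def meets_minimum(kind, words=0, minutes=0.0):
    """Check a document against the stated minimum length for its type."""
    if kind == "text":
        return words >= MIN_TEXT_WORDS
    if kind in ("audio", "video"):
        return minutes >= MIN_AV_MINUTES
    raise ValueError(f"unknown document kind: {kind!r}")


if __name__ == "__main__":
    print(corpus_breakdown())
```

For the stated 50,000 documents, this works out to 10,000 annotated and 40,000 unlabeled documents, with 37,500 video documents and 12,500 split across audio and text.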
The first two technical areas are expected to be awarded to multiple competing companies, while the third, an effort to collect tens of thousands of documents per language (20 percent annotated and the other 80 percent unlabeled), will go to a single enterprise.
At key milestones, the systems will be evaluated by the National Institute of Standards and Technology, as well as an undisclosed federal research center.
The models are expected to work across different languages and cultures, given the global spread of US military actions. In Afghanistan, for example, the two main languages are Pashto and Dari, but there are more than 40 minor languages, with around 200 dialects. In Iraq, the two main languages are Kurdish and Arabic.
Many other languages and dialects are found in Syria, Yemen, Somalia, Libya, and Niger – all countries the US admitted it was at war with, back in 2018. As of 2019, US troops are officially in combat in at least 14 countries, undertake military exercises in 26 more, and conduct ‘counterterrorism training’ in yet another 65.