Home / Technology / AI & IT / DARPA’s SafeDocs Initiative: Revolutionizing Cybersecurity for Electronic Data Formats

DARPA’s SafeDocs Initiative: Revolutionizing Cybersecurity for Electronic Data Formats


In an era dominated by digital interactions and data-driven operations, safeguarding electronic documents has become a critical facet of national security and individual privacy. The Defense Advanced Research Projects Agency (DARPA) recognizes the critical need for secure electronic data processing and has launched the Safe Documents (SafeDocs) initiative.

Through its Safe Documents (SafeDocs) initiative, DARPA is actively developing innovative cyber tools aimed at enhancing the security of electronic data formats. This groundbreaking program seeks to address the vulnerabilities inherent in current software that handles electronic data, particularly documents, messages, and data streams. This blog explores the significance of this initiative and the potential impact it could have on the future of cybersecurity.


The Imperative for Safe Electronic Data Processing:

As technology evolves, so do the methods employed by cyber threats. Electronic documents, ranging from PDFs to multimedia files, are transmitted and received daily from various sources. The complexity of managing these diverse data formats leaves software susceptible to cyber attacks.

Individuals and organizations must routinely engage with electronic documents received from a variety of unauthenticated or potentially compromised sources, comprising a growing variety of electronic data formats. Even if the immediate provider of the data can be authenticated, the data may derive from an untrusted source. We expect pictures, charts, spreadsheets, maps, audio, video, as well as rich messages potentially including any and all of these, to be received with a click of a button, DARPA researchers point out.

DARPA asserts that the current situation is unsustainable, necessitating innovative solutions to secure the processing of electronic data.

Real-World Cyber Threats:

A notable incident highlighting the vulnerability of electronic data occurred on December 23, 2015, when a Ukrainian electricity distribution company reported service outages. Investigations revealed that the attackers weaponized Microsoft Office documents to deliver malicious payloads, causing significant disruptions.

The study and analysis found that the adversaries weaponized Microsoft Office documents (Excel and Word) by embedding BlackEnergy 3 within the documents. During the cyber intrusion stage of Delivery, Exploit, and Install, the malicious Office documents were delivered via email to individuals in the administrative or IT network of the electricity companies. When these documents were opened, a popup was displayed to users to encourage them to enable the macros in the document. Enabling the macros allowed the malware to Exploit Office macro functionality to install BlackEnergy 3 on the victim system. Upon the Install step, the BlackEnergy 3 malware connected to command and control (C2) IP addresses to enable communication by the adversary with the malware and the infected systems. These pathways allowed the adversary to gather information from the environment and enable access.

This case underscores the urgent need for robust cybersecurity measures in processing electronic data.

Current Challenges in Software Security:

The software that processes electronic data is error-prone and vulnerable to exploitation, with over 80% of reported vulnerabilities linked to code handling input data. The reliance on manual coding for input validation, especially for widely-used electronic data formats like JSON and XML, poses scalability and security challenges.

Exploitation of input-handling vulnerabilities leverages inaccurate programmer assumptions regarding the extent to which input data has been validated by input-handling code. Code that behaves correctly under certain assumptions (and may even be proven correct under these assumptions) will typically not behave correctly if any of these assumptions do not hold. Attackers can induce incorrect behaviors by presenting vulnerable software with maliciously crafted input data that violates unchecked assumptions. The programmer assumes that validated input data contains certain objects in certain relationships, and writes code under these assumptions. However, should any of these assumptions not hold, the code will not behave correctly. A single missing or incorrect check can create a vulnerability, as was the case with the Heartbleed vulnerability (CVE-2014-0160), in which code acting on an unchecked assumption exposed sensitive memory content to remote attackers.

Today, code for input data validation is typically written manually in an ad-hoc manner. Manually writing the code to parse and validate input, and then manually auditing whether that code implements all the necessary checks completely and correctly, does not scale. Moreover, manual parser coding and auditing typically fails even for electronic data formats specifically designed to be easier to perform such tasks, e.g., JSON and XML. A variety of critical vulnerabilities have been found in major parser implementations for these formats.

Common mitigations, such as preventing the flow of untrusted data and software testing through fuzzing, have limitations and don’t offer foolproof security guarantees. Widely deployed mitigations against crafted input attacks include (a) trying to prevent the flow of untrusted data to vulnerable software; and (b) testing software with randomized inputs to find and patch flaws that could be triggered by maliciously created inputs. Unfortunately, neither of these approaches offer security assurance guarantees.

Mitigations for preventing the flow of untrusted data to vulnerable software, which can be implemented via network or host-based measures such as firewalls, application proxies, antivirus scanners, etc., neither remove the underlying vulnerability from the target, nor encode complete knowledge of document or message format internals. Attacker bypasses of such mitigations exploit incompleteness of the mitigations’ understanding of the data format to exploit the still-vulnerable targets.

The effectiveness of fuzzing methods for testing of software with randomized inputs to find and fix flaws depends on whether randomly generated inputs can emulate maliciously crafted inputs closely enough to trigger all relevant code flaws. Although modern fuzzing methods incorporate feedback from tracing the execution of the code as it consumes crafted inputs, they also employ symbolic and concolic execution of code in their exploration of the space of potential crafted inputs. As a result, these methods are still essentially heuristic. There is no guarantee that attackers, who also use fuzzing to locate and develop vulnerabilities, will not cover a more substantial and more productive portion of the input space with a different set of heuristics.

DARPA recognizes the imperative to stay ahead of malicious actors by innovating in the realm of cybersecurity. SafeDocs is a testament to this commitment.

DARPA’s SafeDocs Program:

DARPA is soliciting innovative research proposals in the area of secure processing of untrusted electronic data.

DARPA’s SafeDocs program is a multi-faceted initiative aimed at revolutionizing the secure processing of untrusted electronic data. The program focuses on developing verified programming methodologies for building high assurance parsers, particularly for existing electronic data formats. SafeDocs takes a unique approach by simplifying and reducing complex data formats into secure, unambiguous subsets through a toolkit that facilitates secure encoding.

Proposed research should investigate innovative approaches that radically improve software’s ability to recognize and safely reject invalid and maliciously crafted input data, while preserving essential functionality of legacy electronic data formats. Proposals should build on an existing base of knowledge of electronic document, message, and streaming formats and the nature of security vulnerabilities associated with these formats.

Key Objectives of SafeDocs:

  1. Format-Agnostic Security: SafeDocs aims to develop security measures that transcend specific document formats. Traditional cybersecurity solutions often cater to particular file types, leaving vulnerabilities in others. SafeDocs seeks to create a format-agnostic approach, ensuring robust protection for a wide array of electronic documents.
  2. Dynamic Threat Detection: One of the groundbreaking aspects of SafeDocs is its emphasis on dynamic threat detection. Rather than relying solely on static security protocols, the initiative explores adaptive methods that can identify and neutralize emerging threats in real-time. This proactive approach is essential in an ever-evolving digital landscape.
  3. User-Friendly Implementation: Recognizing that effective cybersecurity measures must be practical and user-friendly, SafeDocs endeavors to develop tools that seamlessly integrate into existing workflows. This ensures that individuals and organizations can adopt these security measures without sacrificing operational efficiency.
  4. Collaborative Security Ecosystem: SafeDocs promotes the creation of a collaborative security ecosystem. By fostering partnerships between government agencies, private entities, and cybersecurity experts, DARPA aims to pool collective expertise for more robust, comprehensive solutions. This collaborative approach enhances the initiative’s effectiveness and adaptability.

Potential Impacts on Cybersecurity:

  1. Enhanced National Security: The SafeDocs initiative has the potential to significantly enhance national security by fortifying the protection of sensitive government documents. As electronic communication becomes more prevalent in official channels, securing these documents is paramount.
  2. Protection of Sensitive Information: With the proliferation of electronic documents in various sectors, from healthcare to finance, SafeDocs can play a pivotal role in safeguarding sensitive information. This is crucial for protecting individuals and organizations from cyber threats ranging from identity theft to corporate espionage.
  3. Mitigating Emerging Threats: The dynamic threat detection capabilities of SafeDocs address the challenge of emerging threats in real-time. This adaptability ensures that cybersecurity measures remain effective even as cyber threats evolve and become more sophisticated.
  4. Global Cyber Resilience: Cybersecurity is a global concern, and innovations like SafeDocs contribute to global cyber resilience. By setting new standards in document security, this initiative can influence international efforts to fortify digital infrastructure and protect against cyber threats.

Safe Documents (SafeDocs) program

The Safe Documents (SafeDocs) program will develop novel verified programming methodologies for building high assurance parsers for extant electronic data formats, and novel methodologies for comprehending, simplifying, and reducing these formats to their safe, unambiguous, verification-friendly subsets (“safe sub-setting”). SafeDocs will address the ambiguity and complexity obstacles that hinder the application of verified programming posed by extant electronic data formats. SafeDocs’ multi-pronged approach will combine:

  • extracting the extant formats’ de facto syntax (including any non-compliant syntax deliberately accepted and substantially used in the wild);
  • identifying a syntactically simpler subset of this syntax that yields itself to use in verified programming while preserving the format’s essential functionality; and
  • creating software construction kits for building secure, verified parsers for this syntactically simpler subset, and high-assurance translators for converting extant instances of the format to this subset.

The parser construction kits developed by SafeDocs will be usable by industry programmers who understand the syntax of electronic data formats but lack the theoretical background in verified programming. These tools will enable developers to construct verifiable parsers for new electronic data formats as well as extant ones. The tools will guide the syntactic design of new formats, by making verification-friendly format syntax easy to express, and vice versa.

Collaboration and Contract Awards:

DARPA is funding a number of research projects as part of the SafeDocs Initiative. These projects are exploring a variety of approaches to developing more secure electronic data formats, including:

  • Formal methods: Formal methods are a mathematical approach to software development that can be used to prove that software is free of vulnerabilities.
  • Static analysis: Static analysis tools can be used to identify potential vulnerabilities in software without actually running the software.
  • Dynamic analysis: Dynamic analysis tools can be used to identify vulnerabilities in software by running the software and observing its behavior.
  • Machine learning: Machine learning can be used to identify patterns in data that can be used to detect vulnerabilities.

DARPA has enlisted the expertise of several institutions and companies in the SafeDocs program. Contract awards have been granted to Galois Inc., Northrop Grumman Corp., BAE Systems, and Penn State. These entities will contribute to the research and development of innovative cyber tools to enhance the security of electronic data formats.

BAE Systems’ Role:

As part of the SafeDocs program, BAE Systems aims to develop two cyber tools. The first tool focuses on simplifying and automatically selecting safe feature subsets within electronic data formats, ensuring safe and unambiguous encoding. The second tool is a toolkit to help software developers avoid vulnerabilities when processing complex electronic data. BAE Systems emphasizes the importance of creating solutions that prioritize security to counter the growing risk of cyber attacks.

Penn State’s Contribution:

Penn State, as part of the collaborative effort, is dedicating research efforts to increase computer security by developing more secure parsers. A parser, the element in a computer system that converts data inputs into an understandable format, is the first line of defense for cybersecurity

The Secure Parser Toolkit for Assurance (SPARTA) project aims to create parsers with provable guarantees about safety, mitigating common vulnerabilities found in current parser implementations. Parsers are considered a critical security piece in many systems because they should be able to identify adversarial elements and warn a system user that the program in question may be taking malicious input. However, a cyber attacker could feed malformed data that would trigger bugs in the parser to take over the system. Tan and his research team aim to create parsers that have provable guarantees about safety and are not susceptible to the many bugs that parsers commonly have now.

“There are tools you can use to manually write those parsers, but, in the end, you don’t get many guarantees,” Tan said. “You just rely on the competence of the programmers, and often, these parsers are very complex. Programmers make mistakes, and as a result those mistakes cause vulnerability in computer systems.” For example, at the time that Tan submitted his research proposal, over 1,000 parser bugs were reported for the popular suite of Mozilla products, impacting the security of many common file formats including PDF, ZIP, PNG and JPG.

Tan said that he hopes that with the creation of the SPARTA system, he will potentially be able to develop the most secure parsers to date with a novel parser language and rigorous formal methods.  “This parser would warn you if this PDF document is not obeying some safe subset of the format. If this parser agrees to open it, you’re guaranteed to be safe. There’s a provable way of saying it’s safe.” While the project’s focus is on parsers for PDF security, the researchers hope their new system can be applied to other formats, including for videos and images.


A suite of tools is available to bolster the security of electronic documents, aiding software developers and cybersecurity researchers in managing various data formats. Specifically designed for Portable Document Format (PDF), tools like Arlington PDF Model, Digital Corpora Project Corpus, and SPARCLUR offer wrappers for parsers and renderers, coupled with analytical capabilities.

Arlington PDF Model Digital Corpora Project Corpus SPARCLUR: A collection of various wrappers for extant PDF parsers and/or renderers along with accompanying tools for comparing and analyzing the outputs from these parsers.

Programmer resources include DaeDaLus and Hammer for secure parser/scanner construction in C, along with Parsley, a modular data description language for formats like MAVLink, PDF, and Executable and Linking Format (ELF).

Document collection understanding is facilitated by tools like File Observatory, Format Analysis Workbench, Dowker tools, and PolyFile, contributing to the visualization, search, and analysis of intricate file format patterns. Additionally, PolyTracker and Graphtage aid in comprehending existing parser code through data-flow and control-flow analysis, and semantic comparison of tree-like structures. Fickling serves as a decompiler and static analyzer for Python pickle object serializations, offering a comprehensive set of tools to enhance electronic data format security.


As electronic data formats continue to evolve, so too must our strategies for protecting them. DARPA’s SafeDocs initiative represents a crucial step forward in addressing the pressing cybersecurity challenges associated with electronic data processing. By fostering collaboration and investing in innovative research, DARPA aims to develop secure, verified programming methodologies that will redefine how electronic data formats are processed.

By developing format-agnostic, dynamically adaptive, and user-friendly cybersecurity tools, SafeDocs has the potential to redefine the landscape of document security. The outcomes of this initiative have the potential to not only enhance national security but also set new standards for global cybersecurity resilience in the digital age.



















About Rajesh Uppal

Check Also

Harnessing the Power of GPUs for Quantum Computing: A Quantum Leap

Quantum computing is a groundbreaking field with the potential to revolutionize various industries, from cryptography …

error: Content is protected !!