IARPA using Big Data Analytics to automatically discover Signals from Web Sources to Predict Cyber Attacks

Cyber crimes and attacks continue to grow, involving ever more advanced and sophisticated techniques to infiltrate corporate networks and enterprise systems. Types of attacks include advanced malware, zero-day attacks, and advanced persistent threats. Flawed software, the root of most program errors and security vulnerabilities, is a critical enabler of cyber-crime. Estimated to cost the global economy $445 billion per year, cyber-crime affects individuals, businesses, and national economies, with devastating consequences for those affected.


Cybersecurity solutions have traditionally been based on signatures, relying on matches to patterns associated with previously identified malware to capture attacks in real time. Intrusion Prevention System (IPS) and Next-generation Firewall (NGFW) perimeter security solutions inspect network traffic for matches with a signature that has been created in response to analysis of specific malware samples. However, minor changes to malware reduce the efficacy of IPS and NGFW solutions, because a modified sample may no longer match the existing signature.
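
To illustrate why signature matching is brittle, here is a minimal Python sketch. It assumes a signature is simply the SHA-256 hash of a known sample, which is a simplification: real IPS and NGFW signatures are richer (byte patterns, protocol rules). The payload bytes and function names are hypothetical.

```python
import hashlib


def make_signature(sample: bytes) -> str:
    """Return a hash-based signature for a known malware sample."""
    return hashlib.sha256(sample).hexdigest()


def matches_known_signature(payload: bytes, signatures: set) -> bool:
    """Check whether the payload's hash appears in the signature set."""
    return hashlib.sha256(payload).hexdigest() in signatures


# Hypothetical payloads, used only for illustration.
known_sample = b"EVIL_PAYLOAD_v1"
signatures = {make_signature(known_sample)}

# The exact sample that was analyzed is caught.
original_detected = matches_known_signature(known_sample, signatures)

# A one-byte change produces a different hash, so the variant is missed.
variant_detected = matches_known_signature(b"EVIL_PAYLOAD_v2", signatures)
```

The variant differs from the analyzed sample by a single byte, yet it no longer matches, which is the weakness that behavior-based detection tries to address.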



Newer methods instead identify malware by observing its abnormal post-infection behavior. Identifying abnormal behavior requires first establishing what is normal and then using rigorous analytical methods, that is, data science, to identify anomalies. The fundamental transition from signatures to behavior for malware identification is the most important enabler of applying data science to cybersecurity.
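
As a rough illustration of "learn what is normal, then flag deviations", the following Python sketch uses a simple z-score test. This is one of many possible anomaly detectors and is chosen here only for brevity; the metric (outbound connections per hour), the numbers, and the threshold of three standard deviations are all assumptions.

```python
from statistics import mean, stdev


def find_anomalies(baseline, observed, threshold=3.0):
    """Return indices of observed values that lie more than `threshold`
    standard deviations away from the mean of the baseline."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any different value is unusual.
        return [i for i, x in enumerate(observed) if x != mu]
    return [i for i, x in enumerate(observed) if abs(x - mu) / sigma > threshold]


# Hypothetical counts of outbound connections per hour for one host.
baseline_counts = [12, 15, 11, 14, 13, 12, 16, 14]  # "normal" behavior
observed_counts = [13, 15, 90, 12]                  # new observations

anomalous_hours = find_anomalies(baseline_counts, observed_counts)
```

Here only the third observation (index 2) is flagged, since 90 connections is far outside the range seen in the baseline.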


Big data analytics has the ability to gather massive amounts of digital information to analyze, visualize and draw insights that can make it possible to predict and stop cyber attacks. Research firm Gartner said that big data analytics will play a crucial role in detecting crime and security infractions.


“Automated cybersecurity is the future,” predicts Mike Walker, a computer scientist at the Defense Advanced Research Projects Agency (DARPA) who specializes in machine learning. “Imagine a future where you have the job of managing a network, and the scenario where an A.I. monitors security of your network.” One day you might receive an urgent text from that A.I. system: “I detected a zero-day flaw used to breach a workstation. I wrote and deployed a patch in 20 seconds, could you please come help?” “It may be hard to believe, but I believe it is coming,” Walker said. Walker presented the results of DARPA’s recent Cyber Grand Challenge, which he launched in 2013. The event offered proof that machines could hunt and patch bugs on their own, a revelation that could prove revolutionary.


IARPA’s CAUSE program aims to develop and test new automated methods that forecast and detect cyber-attacks significantly earlier than existing methods.  Leidos has won a prime contract from the Intelligence Advanced Research Projects Activity (IARPA) to research and develop multi-disciplinary methods that provide accurate and timely cyberattack forecasts under the Cyberattack Automated Unconventional Sensor Environment (CAUSE) program.


IARPA’s Cyber-attack Automated Unconventional Sensor Environment (CAUSE)

The U.S. Intelligence Advanced Research Projects Activity (IARPA) has awarded its $11.4 million Cyber-attack Automated Unconventional Sensor Environment (CAUSE) program to BAE Systems. The CAUSE program seeks multi-disciplinary unconventional sensor technology that will complement existing advanced intrusion detection capabilities.


“Past research, such as IARPA’s Open Source Indicators (OSI) program, shows that combinations of publicly available data sources are useful in the early and accurate detection and forecasting of events, such as disease outbreaks and political crises. In the area of cybersecurity, few have researched methods for a probabilistic warning system that fuses internal sensors (sensors inside the logical boundary of an organization, such as host data) and external sensors (sensors outside the logical and physical boundaries of an organization, such as social media or web search trends),” says IARPA.


The CAUSE program will develop “predictive methods that combine existing advanced intrusion detection capabilities with unconventional publicly available data sources, leveraging sources not usually associated with cybersecurity,” BAE announced. “Researchers will seek to identify leading indicators of an attack from vast, noisy external streams of data and then correlate related data from different sources to generate accurate, actionable warnings.” CAUSE seeks to draw data from ‘noisy’ sources such as Twitter, and it will seek to correlate that data with more reliable sources before drawing predictive conclusions.


Rebecca Cathey, BAE’s Principal Investigator, explained how it will work. “Our system applies human behavioral, cyber attack, and social theories to publicly available information to develop unconventional sensors of activities indicative of the early stages of an attack. The sensors search for signals including emotional language, sentiment, and topics of conversation. The sensor outputs will be fused together using models seeded with expert knowledge to predict the likelihood of cyber attacks against specific targets. This differs from traditional cyber attack detection, which utilizes conventional sensors running with private data, where the focus is on detection of an ongoing event, rather than prediction. Our sensors will use a wide variety of techniques and algorithms to mine a graphical representation of the data.”

EFFECT developed under CAUSE can Predict Cyber attacks

One approach to combating cyber threats is to develop technologies that anticipate them before an actual cyber attack occurs. The intuition behind this forecasting approach is that cyber attacks do not occur in a vacuum. To conduct a cyber attack, hackers first have to choose a target, identify the attack surface (i.e., vulnerabilities in the target’s software and hardware infrastructure), acquire the necessary exploits, malware, and expertise to use them, and potentially recruit other participants. Other actors, such as system administrators, security analysts, and even victims, may discuss vulnerabilities or coordinate a response to exploits. These activities are often conducted online, leaving a variety of digital traces that can be mined to extract signals of pending attacks well before suspicious activity is noted on the target system.


Identifying useful signals of impending cyber attacks poses several research challenges. First, while some of the data relating to activities of cyber actors is openly available, malicious actors often obfuscate their actions using anonymized and encrypted Internet protocols. Second, the behavioral processes generating activities of interest are likely to be weak, sparse, and transient, posing significant challenges to picking them out from among massive quantities of entirely innocuous activity. Finally, translating the signals to generate a warning about a cyber attack presents yet another challenge.


Under the IARPA-funded CAUSE program, the USC Information Sciences Institute has developed an end-to-end prototype, called EFFECT, to forecast emerging cyber threats. The team's paper describes two machine learning methods for time series prediction that EFFECT uses to forecast cyber attacks. The methods take historical data as input to learn a model of cyber attacks. These models capture patterns in the historical data that help forecast new cyber attacks. The authors show that they can improve the predictions of these baseline models by leveraging signals from external Web data sources.



Data Sources

To construct external signals, EFFECT harvests data from a variety of sources, including vulnerability databases, malicious email and malware trackers, but also from sources not conventionally used in security applications, such as social media, blogs and darkweb forums. From these data sources they extract a variety of time series, each representing the number of daily occurrences of cyber security-related terms. The time series are used as external signals in the forecasting task.
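
A minimal Python sketch of how such a time series could be built from timestamped posts is shown below. It is not EFFECT's actual pipeline: the posts, the term list, and the choice to count posts that mention a term (rather than every occurrence of a term) are assumptions made for illustration.

```python
from collections import Counter
from datetime import date


def daily_term_counts(posts, terms):
    """Count, per day, how many posts mention at least one of the terms."""
    lowered = [t.lower() for t in terms]
    counts = Counter()
    for day, text in posts:
        body = text.lower()
        if any(term in body for term in lowered):
            counts[day] += 1
    return counts


def to_series(counts, days):
    """Turn the per-day counts into an ordered list, filling gaps with 0."""
    return [counts.get(day, 0) for day in days]


# Hypothetical posts as (date, text) pairs.
posts = [
    (date(2018, 1, 1), "New ransomware strain spotted in the wild"),
    (date(2018, 1, 1), "Lunch was great today"),
    (date(2018, 1, 2), "Exploit for sale, works on unpatched servers"),
    (date(2018, 1, 2), "Ransomware builder leaked on a forum"),
    (date(2018, 1, 3), "Conference schedule announced"),
]
terms = ["ransomware", "exploit"]
days = [date(2018, 1, 1), date(2018, 1, 2), date(2018, 1, 3)]

signal = to_series(daily_term_counts(posts, terms), days)
```

The resulting list, one value per day, is the kind of external signal that can be fed to a forecasting model alongside the history of observed attacks.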


Dark/deep web: Deep and dark web (D2Web) sites are non-indexed sites on the Internet that are accessed using anonymization protocols (most notably the Tor protocol). These web sites host discussion forums and marketplaces, which are often used for malicious or illicit purposes, for example, to buy and sell drugs, guns, hacked data or exploits. It has been shown that the activity on these websites can signal potential cyber attacks.

Twitter: It has been shown that discussions on social media can be used as signals for detecting cyber threats. The authors therefore collected tweets that were either posted by security experts or contained cybersecurity-related keywords.

Vulnerability Database: Software vulnerabilities are often exploited by malicious actors in cyber attacks. The National Vulnerability Database (NVD) is the largest publicly available repository of information about reported vulnerabilities in software. The authors collected vulnerabilities for different software products to evaluate their role in predicting cyber attacks.

Honeypots: A honeypot is a security resource with the goal of having a system probed and attacked. Traffic reaching honeypots may be malicious and can provide a window into hacker activities. “We collected data from a network of ten honeypots deployed by the EFFECT team. Specifically, the number of queries received daily by each honeypot serves as an external signal in the cyber forecasting task. The honeypots were deployed and data was collected starting in October 2017.”


Forecasting Task

To train the forecasting models and to evaluate their predictions, the authors use ground truth data about cyber attacks provided through the CAUSE program by two companies. The ground truth data comprise attacks intercepted at both organizations, which correspond to three types of events:

  • An endpoint-malware attack is recorded when antivirus software used by the organization finds malware installed on an end user’s system.
  • A malicious-email attack is the receipt of an email that contains a malicious attachment and/or a link to a known malicious destination.
  • A malicious-destination attack is recorded when an end user clicks on a malicious URL.


Given a time series describing observed events in the ground truth data, the authors' goal is to use this information, plus information from external signals, to predict new events occurring during some future forecasting time span. The prediction model is trained on the historical ground truth (GT) data and external signals.
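
To make the task concrete, here is a small Python sketch of one simple way to combine attack history with an external signal: a one-lag autoregressive model with one exogenous input, fitted by ordinary least squares. This is a stand-in, not the model described by the EFFECT authors. The model form, the function names, and the synthetic data (generated from a known rule so the fit can be checked) are all assumptions.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting.
    Raises ZeroDivisionError if the system is singular."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(n):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [v - factor * p for v, p in zip(m[row], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_arx(attacks, signal):
    """Least-squares fit of attacks[t] ~ c0 + c1*attacks[t-1] + c2*signal[t-1]."""
    rows = [[1.0, attacks[t - 1], signal[t - 1]] for t in range(1, len(attacks))]
    targets = attacks[1:]
    k = 3
    # Normal equations: (X^T X) c = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def forecast_next(coeffs, attacks, signal):
    """One-step-ahead forecast from the latest attack count and signal value."""
    c0, c1, c2 = coeffs
    return c0 + c1 * attacks[-1] + c2 * signal[-1]


# Synthetic data: the external signal leads attack activity by one day.
signal = [0, 1, 0, 2, 1, 0, 3, 1]
attacks = [2.0]
for t in range(1, len(signal)):
    attacks.append(1.0 + 0.5 * attacks[t - 1] + 2.0 * signal[t - 1])

coeffs = fit_arx(attacks, signal)
next_day = forecast_next(coeffs, attacks, signal)
```

Because the synthetic data follow the rule exactly, the fit recovers coefficients close to 1.0, 0.5 and 2.0. On real data the external signal would be noisy, and its fitted coefficient indicates how much it helps beyond the attack history alone.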


“We describe machine learning techniques based on deep neural networks and autoregressive time series models that leverage external signals from publicly available Web sources to forecast cyber attacks. Performance of our framework across ground truth data over real-world forecasting tasks shows that our methods yield a significant lift or increase of F1 for the top signals on predicted cyber attacks. Our results suggest that, when deployed, our system will be able to provide an effective line of defense against various types of targeted cyber attacks.”
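
For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R), and "lift" here refers to the gain in F1 when external signals are added. The Python sketch below computes both on made-up data. How the authors matched warnings to ground truth events is not described here, so the day-level matching, the data, and the names are assumptions.

```python
def f1_score(predicted, actual):
    """F1 for a set of predicted attack days against the actual attack days."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Hypothetical day indices, for illustration only.
actual_attack_days = {2, 5, 9, 12}
baseline_warnings = {2, 4, 7, 12}  # model using attack history alone
signal_warnings = {2, 5, 7, 12}    # model that also uses an external signal

baseline_f1 = f1_score(baseline_warnings, actual_attack_days)
signal_f1 = f1_score(signal_warnings, actual_attack_days)
lift = signal_f1 - baseline_f1
```

In this toy case the baseline scores 0.5 and the signal-augmented model scores 0.75, a lift of 0.25.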


Department of Homeland Security (DHS) Cyber Apex initiative

The US Department of Homeland Security (DHS) is fusing traditional cyber-defense methods with the real-time intelligence rendered by robust big data analytics for improving the cybersecurity of federal agencies and key private industries.


Under its Cyber Apex initiative, it aims to detect the presence of a cyberthreat without necessarily relying on a known cyber-signature. “This program seeks to provide the tools and technology to enable critical infrastructure owners and operators to repair and patch compromised networks without having to take the systems offline to do so,” Dr. Schneck said while updating the NSTAC.


Dr. Schneck also elaborated on the Department’s concept referred to as Weather Map. She said that DHS is collecting data from various Government agencies to perform a mathematical trend analysis of cyber events. Weather mapping allows users and potential victims to see where threats might occur and warns users of potentially risky situations so that they can reject incoming traffic in real time.


DHS officials hope “weather map” can do for cyberthreats what weather satellites, meteorologists and data analysts at the National Weather Service have done for years predicting climate threats, said Phyllis Schneck, deputy undersecretary for cybersecurity and communications for DHS’ National Protection and Programs Directorate. “This concept comprises the ability to view the current state of cybersecurity, just as a traditional weather map provides the view of current weather,” Schneck told the committee.


In coming years, the tools of data analysis will evolve further to enable a number of advanced predictive capabilities and automated controls in real time.


About Rajesh Uppal
