Computing performance has steadily increased along the trajectory set by Moore’s Law, and networking performance has accelerated at a similar rate. Despite these connected evolutions in network and server technology, however, the network stack – starting with the network interface card (NIC), the hardware that bridges the network/server boundary – has not kept pace.
Today, network interface hardware is hampering data ingest from the network to processing hardware. At the physical layer, the upper bound for server throughput is imposed by the network interface hardware that connects a machine to a communications network, limiting a processor’s data ingest capability.
The differences in data rates at different points in the path from a remote server to a local server illustrate the need for breakthrough approaches in FastNICs. A single optical fiber can (in aggregate) carry about 100 terabits per second of data traffic. Today’s multicore multiprocessors, graphic processing unit (GPU)-equipped servers, and similar computing nodes can (in aggregate) process data at a roughly similar rate. Both are severely limited by the network interface, which typically operates at rates 100x – 1000x slower.
“The true bottleneck for processor throughput is the network interface used to connect a machine to an external network, such as an Ethernet, therefore severely limiting a processor’s data ingest capability,” said Dr. Jonathan Smith, a program manager in DARPA’s Information Innovation Office (I2O). “Today, network throughput on state-of-the-art technology is about 10^14 bits per second (bps) and data is processed in aggregate at about 10^14 bps. Current stacks deliver only about 10^10 to 10^11 bps application throughputs.”
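The arithmetic behind those figures is worth making explicit, since it is the source of the program's "orders of magnitude" framing. A quick back-of-envelope check, using only the rates quoted above:

```python
# Rates quoted above (approximate, in bits per second).
link_rate_bps = 1e14     # network throughput / aggregate processing rate, ~10^14 bps
app_rate_low = 1e10      # application throughput, lower bound, ~10^10 bps
app_rate_high = 1e11     # application throughput, upper bound, ~10^11 bps

# The gap between raw capacity and delivered application throughput.
gap_best_case = link_rate_bps / app_rate_high   # 3 orders of magnitude
gap_worst_case = link_rate_bps / app_rate_low   # 4 orders of magnitude
print(f"Application throughput lags by {gap_best_case:,.0f}x to {gap_worst_case:,.0f}x")
```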
Additional factors – such as limitations in server memory technologies, memory copying, poor application design, and competition for shared resources – have resulted in network subsystems that create a bottleneck within the network stack and throttle application throughput.
Addressing the bottleneck between multiprocessor servers and the network links that interconnect them is increasingly critical for distributed computing. This class of computing requires significant communication between computation nodes. It is also increasingly relied on for advanced applications such as deep neural network training and image classification.
To accelerate distributed applications and close the yawning performance gap, DARPA initiated the Fast Network Interface Cards (FastNICs) program. FastNICs seeks to improve network stack performance by a factor of 100 through the creation of clean-slate networking approaches. Enabling this significant performance gain will require a rework of the entire network stack – from the application layer through the system software layer, down to the hardware.
“There is a lot of expense and complexity involved in building a network stack – from maximizing connections across hardware and software to reworking the application interfaces. Strong commercial incentives focused on cautious incremental technology advances across multiple, independent market silos have dissuaded anyone from addressing the stack as a whole,” said Smith.
FastNICs will also explore applications that could be enabled by the multiple order of magnitude performance increases provided by the program-generated hardware. Researchers will aim to design and implement at least one application that demonstrates a 100x speedup when executed on the novel hardware/software stack, validating the program’s primary objective.
There are two application areas of particular interest – distributed machine learning and sensors. Machine learning requires the harnessing of clusters – or large numbers of machines – so that all cores are employed for a single purpose, like analyzing imagery to help self-driving cars appropriately identify an obstacle in the road. “Recent research has shown that by speeding up the network support, the entire distributed machine learning system can operate more quickly. With machine learning, the methods typically used involve moving data around, which creates delays. However, if you can move data more quickly between machines with a successful FastNICs result then you should be able to shrink the performance gap,” said Smith.
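Smith's point about moving data between machines can be illustrated with a rough model of a ring all-reduce, the collective commonly used to exchange gradients in data-parallel training. The model size, worker count, and link rates below are illustrative assumptions, not program figures:

```python
# Rough model (illustrative assumptions): time per training step spent moving
# gradients in a ring all-reduce, where each of N workers transmits roughly
# 2 * (N - 1) / N times the gradient size per step.
def allreduce_seconds(model_bytes: float, workers: int, link_bps: float) -> float:
    bytes_on_wire = 2 * (workers - 1) / workers * model_bytes
    return bytes_on_wire * 8 / link_bps

model_bytes = 1e9   # hypothetical 1 GB gradient
slow = allreduce_seconds(model_bytes, workers=8, link_bps=1e11)  # ~100 Gbps stack
fast = allreduce_seconds(model_bytes, workers=8, link_bps=1e13)  # 100x faster stack
print(f"{slow:.3f} s vs {fast:.5f} s per step on the wire ({slow / fast:.0f}x less)")
```

Under this simple model, the time each step spends on the wire shrinks in direct proportion to the network speedup, which is why faster network support accelerates the whole distributed training system.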
FastNICs will also explore sensor data from systems like UAVs and overhead imagers. An example application is change detection, where tagged images are used to train a deep learning system to recognize anomalies in a time series of image captures, such as the appearance of an unfamiliar structure or a sudden spurt of activity at an unexpected location. Change detection requires quick access to current sensor data as well as rapid access to archives of past data. FastNICs will provide a way of accelerating the acquisition of actionable intelligence from a mountain of data.
FastNICs will speed up applications such as the distributed training of machine learning classifiers by 100x through the development, implementation, integration, and validation of novel, clean-slate network subsystems. The program will focus on overcoming the gross mismatches in computing and network subsystem performance. Specifically, computer network interface performance lags the performance of other computer subsystems (RAM, CPU, etc.) by 3 to 4 orders of magnitude.
To help justify the need for this significant overhaul, the FastNICs program will select a challenge application and provide it with the hardware support, operating system software, and application interfaces needed to deliver the overall system acceleration that comes from having faster NICs. Under the program, researchers will work to develop, implement, integrate, and validate novel, clean-slate network subsystems.
Part of FastNICs will focus on developing hardware systems to significantly improve aggregate raw server datapath speed. Within this research area, researchers will design, implement, and demonstrate 10 Tbps network interface hardware using existing or road-mapped hardware interfaces. The hardware solutions must attach to servers via one or more industry-standard interface points, such as I/O buses, multiprocessor interconnection networks, and memory slots, to support the rapid transition of FastNICs technology. “It starts with the hardware; if you cannot get that right, you are stuck. Software can’t make things faster than the physical layer will allow so we have to first change the physical layer,” said Smith.
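To get a feel for why 10 Tbps at the server attachment point is demanding, compare it against the approximate usable bandwidth of common industry-standard interface points. The PCIe figures below are rough published rates, used here only for illustration:

```python
import math

# Approximate usable bandwidth of common I/O bus attachments (rough figures).
pcie4_x16_bps = 31.5e9 * 8   # PCIe 4.0 x16: ~31.5 GB/s, roughly 252 Gbps
pcie5_x16_bps = 63.0e9 * 8   # PCIe 5.0 x16: ~63 GB/s, roughly 504 Gbps
target_bps = 10e12           # FastNICs hardware target: 10 Tbps

# Number of full x16 attachments a 10 Tbps datapath would have to span.
print(math.ceil(target_bps / pcie4_x16_bps), "PCIe 4.0 x16 slots")
print(math.ceil(target_bps / pcie5_x16_bps), "PCIe 5.0 x16 slots")
```

The target exceeds any single conventional slot by more than an order of magnitude, which is why the program anticipates solutions spanning multiple interface points such as I/O buses, multiprocessor interconnects, and memory slots.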
A second research area will focus on developing system software required to manage the FastNICs hardware resources. To realize 100x throughput gains at the application level, system software must enable efficient and parallel transfer of data between the network hardware and other elements of the system. FastNICs researchers will work to generate software libraries – all of which will be open source, and compatible with at least one open source OS – that are usable by various applications.
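As a minimal illustration of the copy-avoidance such system software must generalize, the sketch below shares one buffer between two handles by name rather than copying it. This uses standard Python shared memory purely as an analogy for the efficient data transfer FastNICs software must provide between network hardware and the rest of the system:

```python
from multiprocessing import shared_memory
import numpy as np

# Illustrative sketch: two handles onto one shared buffer, attached by name,
# so data passes from producer to consumer without a per-byte copy.
buf = shared_memory.SharedMemory(create=True, size=8 * 1024)
src = np.ndarray((1024,), dtype=np.float64, buffer=buf.buf)
src[:] = 42.0                                      # "producer" writes once

view = shared_memory.SharedMemory(name=buf.name)   # "consumer" attaches by name
dst = np.ndarray((1024,), dtype=np.float64, buffer=view.buf)
first = float(dst[0])                              # reads the same memory

del src, dst           # drop views before releasing the segment
view.close()
buf.close()
buf.unlink()
print(first)
```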
Technical Areas (TAs)
Layers serve to divide a system into components with logically distinct roles, with interactions between layers via well-defined interfaces.
FastNICs is structured into three technical areas:
(1) TA1 Network Subsystem Hardware and Software,
(2) TA2 Applications, and
(3) TA3 Independent Test and Evaluation.
TA1 and TA2 comprise the layers of the FastNICs network stack. The network subsystem technical area, TA1, is further divided into network interface hardware (TA1.1) and system software support (TA1.2).
TA2, the applications technical area, will explore new applications enabled by the multiple order of magnitude performance increases provided by TA1, and validate the FastNICs objective of increasing application performance.
Applications proposed for TA2 research should be relevant to the DoD and should fit broadly into machine learning training, image processing, or both. An example of relevant training is using tagged images to train a deep learning system to recognize anomalies in a time series of image captures, such as the appearance of an unfamiliar structure or a sudden spurt of activity at an unexpected location.
Algorithmic advances in distributed machine learning training have been achieved under the assumption that the performance limitations inherent in today’s network stacks will persist indefinitely. The scale of computational challenges facing the DoD requires multiple servers, forcing training of machine learning to become a distributed computation to achieve the required performance. As these computations are often very data-intensive, restructuring distributed training to reduce network use, e.g., by data compression or use of more iterations to avoid passing data, is common.
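One of the restructurings mentioned above, data compression, can be sketched with top-k gradient sparsification: each worker transmits only the largest-magnitude gradient entries plus their indices. This is a simplified illustration; production systems add error feedback and other machinery:

```python
import numpy as np

def topk_compress(grad: np.ndarray, k: int):
    """Keep only the k largest-magnitude gradient entries (illustrative)."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

def topk_decompress(idx: np.ndarray, vals: np.ndarray, size: int) -> np.ndarray:
    """Rebuild a dense (mostly zero) gradient from the sparse representation."""
    out = np.zeros(size, dtype=vals.dtype)
    out[idx] = vals
    return out

rng = np.random.default_rng(0)
g = rng.standard_normal(1_000_000).astype(np.float32)
idx, vals = topk_compress(g, k=10_000)           # send ~1% of the entries
ratio = g.nbytes / (idx.nbytes + vals.nbytes)    # reduction in bytes on the wire
print(f"wire-size reduction: ~{ratio:.0f}x")
```

The trade-off is exactly the one the paragraph describes: less data crosses the network, at the cost of extra computation and approximation, a design pressure that a 100x faster network stack would relieve.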
Distributed training is thus an exemplary application of FastNICs. Applications of the results of such training, such as the analysis of multiple large-scale data streams, provide compelling cases for transition partners. Many other large-scale applications of interest exist, examples of which include pattern recognition using multiple concurrent streams of data, data generated from arrays of sensors that must be cohered for situational awareness, and real-time fusion with joint analysis of sensor feeds and stored data retrieved from multiple remote repositories.
TA1.2 Network Subsystem Software
FastNICs TA1.2 will develop the system software required to manage TA1.1 hardware, as well as enable the efficient transit of data to and from processing resources in support of TA2 applications research. Responsive TA1.2 research proposals will incorporate both novel resource management approaches and novel programming models and interfaces.
TA1.2 must also provide programming interfaces enabling TA2 to unleash the system’s potential performance. TA1.2 performers will collaborate closely with TA2 while developing the interfaces, APIs, and programming tools.
TA1.1 Network Subsystem Hardware
FastNICs TA1.1 will focus on developing hardware systems to significantly improve aggregate raw server datapath speed. This TA will design, implement, and demonstrate 10 Tbps network interface hardware using existing or road-mapped hardware interfaces. TA1.1 computing nodes should reflect the architecture and performance of commercially available multiprocessor servers. Such servers comprise a set of multicore CPU chips linked by a non-uniform memory access (NUMA) interconnect.
Compute nodes may also include manycore or GPU computing elements, typically attached via a bus intended for peripherals. TA1.1 solutions must attach to servers via one or more industry-standard interface points such as I/O buses, multiprocessor interconnection networks, and memory slots. Use of industry-standard interface points is essential for rapid technology transition and industry adoption of FastNICs TA1.1 research results. Further, this requirement ensures that the results of FastNICs TA1.1 research will persist across multiple generations of network and server technology.