The world of high-performance computing is particularly challenging, driven by new flavors of sophisticated sensors, complex cameras, and the requirements of advanced computations, which include software-defined radio, cryptography, and other types of arithmetic-intensive algorithms. General-purpose processors (GPPs), also known as “traditional CPUs,” feature ever-improving performance and well-understood, mature software development tools, and they are available in many form factors. While higher transistor counts and frequency increases are brute-force approaches to CPU scaling, those methods are running into increasingly restrictive limits.
The downside of using these GPPs in embedded high-performance applications can be a limited product lifespan, brought about by end-of-life commercial components or changes in platform support, along with latency issues, which are always a concern in real-time applications (particularly vehicle systems or soldier-worn equipment). Meanwhile, environmental conditions can cause thermal issues and reduced quality of service in cold temperatures, and power draw can run high.
A graphics processing unit (GPU) is a specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. GPUs are used in embedded systems, mobile phones, personal computers, workstations, and game consoles. Modern GPUs are very efficient at computer graphics and image processing. Their highly parallel structure makes them more efficient than general-purpose central processing units (CPUs) for algorithms that process large blocks of data in parallel. In a personal computer, a GPU can be present on a video card or embedded on the motherboard. In certain CPUs, the GPU is embedded on the CPU die.
Thus, general-purpose GPU (GPGPU) computing was born. GPGPU architecture involves many parallel computing pipelines, each of which can run a small program called a shader. As a customizable program, a shader can perform a wide variety of tasks. NVIDIA has capitalized on this ability for more than a decade with its Compute Unified Device Architecture (CUDA) software platform. NVIDIA® CUDA® provides an application programming interface (API) that lets programs written in a variety of languages (e.g., C, C++, Fortran, and many more via third-party wrappers) access GPU functions.
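The model described above, in which many pipelines each run the same small program on a different piece of data, can be sketched in plain Python. This is a sequential emulation for illustration only, not CUDA code: the names `saxpy_kernel` and `launch` are hypothetical, and a real GPU would run the per-index calls concurrently rather than in a loop.

```python
from typing import Callable, List


def saxpy_kernel(i: int, a: float, x: List[float], y: List[float],
                 out: List[float]) -> None:
    """Work done by one logical thread: compute a single output element."""
    if i < len(out):  # bounds guard for threads beyond the end of the data
        out[i] = a * x[i] + y[i]


def launch(kernel: Callable[..., None], n_threads: int, *args) -> None:
    """Hypothetical launcher: call the kernel once per thread index.

    A GPU would execute these calls in parallel; this loop only
    emulates the indexing scheme.
    """
    for i in range(n_threads):
        kernel(i, *args)


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(x)

# Launch 8 threads for 4 elements: the extra threads fail the guard and do nothing.
launch(saxpy_kernel, 8, 2.0, x, y, out)
print(out)  # [12.0, 24.0, 36.0, 48.0]
```

The point of the sketch is that the kernel body contains no loop over the data. Each invocation handles one index, which is what lets the hardware spread the work across its pipelines.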
Modern GPUs use most of their transistors to do calculations related to 3D computer graphics. In addition to the 3D hardware, today’s GPUs include basic 2D acceleration and framebuffer capabilities (usually with a VGA compatibility mode). GPUs were initially used to accelerate the memory-intensive work of texture mapping and rendering polygons, later adding units to accelerate geometric calculations such as the rotation and translation of vertices into different coordinate systems. Recent developments in GPUs include support for programmable shaders which can manipulate vertices and textures with many of the same operations supported by CPUs, oversampling and interpolation techniques to reduce aliasing, and very high-precision color spaces. Because most of these computations involve matrix and vector operations, engineers and scientists have increasingly studied the use of GPUs for non-graphical calculations; they are especially suited to other embarrassingly parallel problems.
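As a rough illustration of the vertex rotation and translation mentioned above, the following sketch applies one homogeneous transformation matrix to a handful of 2D vertices. The function names and the sample vertices are assumptions made for this example. The code runs sequentially in pure Python, whereas a GPU would apply the same matrix to many vertices at once.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def rotation_translation(theta: float, tx: float, ty: float) -> Matrix:
    """3x3 homogeneous matrix: rotate by theta radians, then translate by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, tx],
            [s, c, ty],
            [0.0, 0.0, 1.0]]


def transform(m: Matrix, vertex: Tuple[float, float]) -> Tuple[float, float]:
    """Apply the matrix to one vertex; no vertex depends on any other."""
    x, y = vertex
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])


# Rotate 90 degrees counterclockwise, then shift one unit along x.
m = rotation_translation(math.pi / 2, 1.0, 0.0)
vertices = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
moved = [transform(m, v) for v in vertices]
print([(round(x, 6), round(y, 6)) for x, y in moved])
```

Because each vertex is transformed independently, the work splits cleanly across parallel units, which is the property that makes such problems "embarrassingly parallel."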
GPUs were traditionally tasked with compute-intensive, floating-point graphics functions such as 3D rendering and texture mapping. However, some modern GPUs are structured much like parallel-architecture supercomputers and are being used for numerical, signal processing, physics, general scientific, or even statistical applications, all of which might be viable applications on the battlefield. General-purpose GPU (GPGPU) computing is becoming the cornerstone of digital signal processing in aerospace and defense applications such as radar and sonar signal processing, image processing, hyperspectral sensor imaging, signals intelligence, electronic warfare, and persistent surveillance.
Deep neural networks (DNNs), large virtual networks of simple information-processing units that are loosely modeled on the anatomy of the human brain, have been responsible for many exciting advances in artificial intelligence in recent years. Deep learning (DL) algorithms allow high-level abstraction from the data, which is helpful for automatic feature extraction and for pattern analysis and classification.
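To make the idea of "simple information-processing units" concrete, here is a minimal sketch of a forward pass through a tiny two-layer network. The weights, biases, and input values are hand-picked and hypothetical; a real network would learn its weights from data and would be far larger.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs: List[float], weights: List[List[float]],
          biases: List[float]) -> List[float]:
    """One layer: each unit forms a weighted sum of its inputs, then applies sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


# Hypothetical weights chosen by hand for illustration only.
hidden_w = [[0.5, -0.5], [1.0, 1.0]]  # two hidden units, two inputs each
hidden_b = [0.0, -1.0]
output_w = [[1.0, -1.0]]              # one output unit, two inputs
output_b = [0.0]

features = [1.0, 1.0]
hidden = layer(features, hidden_w, hidden_b)
output = layer(hidden, output_w, output_b)
print(hidden, output)
```

Every unit in a layer computes its weighted sum independently of the others, so a layer amounts to a matrix-vector product, a kind of workload that maps naturally onto parallel hardware.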
Both training and execution of large-scale DNNs require vast computing resources, leading to high power requirements and communication overhead. Scott Leishman, a computer scientist at Nervana, notes that another computationally intensive task—bitcoin mining—went from being run on CPUs to GPUs to FPGAs and, finally, on ASICs because of the gains in power efficiency from such customization. “I see the same thing happening for deep learning,” he says. Researchers are also developing neuromorphic chips based on silicon photonics and memristors.
A major factor in the recent success of deep neural networks is the significant leap in the availability of computational processing power. Researchers have been taking advantage of graphics processing units (GPUs), chips designed for high performance in processing the huge amount of visual content needed for video games. With the emergence of deep learning, the importance of GPUs has increased. Research by Indigo found that, when training deep learning neural networks, GPUs can be 250 times faster than CPUs. The explosive growth of deep learning in recent years has been attributed to the emergence of general-purpose GPUs. There has been some competition in this area from ASICs, most prominently the Tensor Processing Unit (TPU) made by Google. However, ASICs require changes to existing code, and GPUs are still very popular.
Programming tools developed for this purpose, essentially extensions of the ubiquitous high-level C and C++ programming languages (and, more recently, Fortran), leverage GPU parallel compute engines to solve complex computational problems. These include largely parallelizable problems, which the GPU can solve in significantly shorter timeframes than a traditional CPU, in some cases 100x faster. This computing paradigm is called general-purpose computing on graphics processing units, or GPGPU.
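How close a workload can get to speedups of this size depends on how much of it can actually run in parallel. One common way to estimate an upper bound is Amdahl's law, a standard result that the text above does not itself invoke. The sketch below assumes a hypothetical machine with 1,000 parallel units and ignores memory transfer and launch overheads.

```python
def amdahl_speedup(parallel_fraction: float, n_units: int) -> float:
    """Upper bound on speedup when only part of a workload runs in parallel.

    parallel_fraction: share of the runtime that can be parallelized (0 to 1).
    n_units: number of parallel processing units (hypothetical here).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_units)


for p in (0.50, 0.90, 0.99, 0.999):
    print(f"{p:.1%} parallel -> at most {amdahl_speedup(p, 1000):.1f}x")
```

Under these assumptions, a workload that is 99 percent parallel tops out at roughly 91x, so gains on the order of 100x are plausible only for problems that are almost entirely parallel.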
Additionally, GPUs are available in extended temperature and rugged packages, making them suitable for deployment on airborne or other environmentally challenging platforms. The projected GPU lifespan can be limited, but with careful material planning, this can be managed. As with GPPs, care must also be used with power management and heat dissipation, particularly with small form factor systems.
Many companies have produced GPUs under a number of brand names. In 2009, Intel, Nvidia and AMD/ATI were the market share leaders, with 49.4%, 27.8% and 20.6% market share respectively. However, those numbers include Intel’s integrated graphics solutions as GPUs. Not counting those, Nvidia and AMD control nearly 100% of the market as of 2018. Their respective market shares are 66% and 33%. In addition, S3 Graphics and Matrox produce GPUs. Modern smartphones also use mostly Adreno GPUs from Qualcomm, PowerVR GPUs from Imagination Technologies and Mali GPUs from ARM.
Nvidia’s new AI brain has eight Pascal GPUs, 7TB of solid state memory, and needs 3,200 watts
The foremost proponent of graphics processing units (GPUs), which can perform many mathematical operations in parallel, is Nvidia. “Nvidia announced a new chip called the Tesla P100 that’s designed to put more power behind a technique called deep learning.” The company invested $2 billion in research and development (R&D) to design a graphics-processing architecture that is, as the company stated, “dedicated to accelerating AI and to accelerating deep learning.”
While large strides have recently been made in the development of high-performance systems for neural networks based on multi-core processors, Bill Jenkins of Intel suggests that significant challenges remain in power, cost, and performance scaling. Field-programmable gate arrays (FPGAs) are a natural choice for implementing neural networks, he believes, because they combine computing, logic, and memory resources in a single device. Microsoft is reportedly also using FPGAs, which provide the benefit of being reconfigurable if the computing requirements change. The Nervana Engine, an ASIC deep-learning accelerator, will go into production in early to mid-2017.
The Tesla P100 is a Pascal-based chip that packs 15.3 billion transistors into a 16-nanometer FinFET chip, resulting in an impressive 5.3 teraflops of double-precision performance. It also offers large memory bandwidth thanks to its use of High Bandwidth Memory 2 (HBM2), and the P100 is the first GPU to feature the technology.
These chips are built specifically for massive deep learning networks and have major implications for everything from cloud networks to social media to self-driving cars and autonomous robots. Yet a massive supercomputing cluster consisting of 140,000 processing units still performs 83 times slower than a cat’s brain, said Wei Lu, a computer engineer at the University of Michigan.
Wu staged a contest between TrueNorth and a high-powered Nvidia computer called the Jetson TX-1, in which the two systems used different implementations of neural-network-based image-processing software to try to distinguish 10 classes of military and civilian vehicles represented in a public data set called MSTAR. Examples included Russian T-72 tanks, armored personnel carriers, and bulldozers. Both systems achieved about 95 percent accuracy, but the IBM chip used between a 20th and a 30th as much power.
New GSC6204 GPU designed for aerospace and defense
Mercury Systems, Inc., a company specializing in secure mission-critical technologies for aerospace and defense, unveiled the new GSC6204 OpenVPX 6U NVIDIA Turing architecture-based graphics processing unit (GPU) co-processing engine, aiming to provide accelerated high-performance computing capabilities to commercial aerospace and defense applications.
Compute-intensive artificial intelligence (AI), radar, electro-optical/infrared imagery, cognitive electronic warfare and sensor fusion applications require high-performance computing capabilities closer to the sensor for effectiveness. To address this need, the GSC6204 module incorporates the NVIDIA Turing GPU architecture aiming to bring the latest advancements in processing and scale to the embedded domain.
Powered by dual NVIDIA Quadro TU104 processors and incorporating NVIDIA’s NVLink high-speed direct GPU-to-GPU interconnect technology, the module is designed to deliver the same massive parallel processing capability found in data centers.
When combined with Mercury’s HDS6605 Intel Xeon Scalable server blade, SCM6010 fast storage, SFM6126 wideband PCIe switches, and streaming IOM-400 I/O modules, these GPU co-processing engines, which are ruggedized to withstand environmental extremes, are intended to be a critical component of a composable high-performance embedded edge compute (HPEEC) environment.
Rugged GPGPU-based embedded computing system for artificial intelligence (AI) uses introduced by Aitech
Aitech Defense Systems in Chatsworth, Calif., is introducing the upgraded and qualified version of the A178 rugged general-purpose GPU (GPGPU) AI embedded supercomputer for intense data processing in extreme environments. The A178 operates reliably in mobile, remote, military, and autonomous systems, and targets applications such as training simulation, situational awareness, artificial intelligence (AI), image and video processing, and moving maps.
One of the smallest of Aitech’s small-form-factor (SFF) embedded computing systems, the A178 uses the NVIDIA Jetson AGX Xavier system-on-module that features the Volta GPU with 512 CUDA cores and 64 Tensor cores to reach 32 TOPS INT8 and 11 TFLOPS FP16. Upgrades help meet the demand for standalone and compact GPGPU-based systems that are rugged and SWaP-C-optimized. The low-power unit offers energy efficiency, while providing all the power necessary for AI-based local processing.
The advanced computation abilities of the system include two dedicated NVIDIA Deep-Learning Accelerator (NVDLA) engines that provide an interface for deep learning applications. The system can accommodate as many as three expansion modules, such as an HD-SDI frame grabber, composite frame grabber, or NVMe solid-state drive. Four high-definition HD-SDI inputs and eight composite inputs handle several streams of video and data simultaneously at full frame rates. Interfaces include Gigabit and 10 Gigabit Ethernet, DisplayPort output handling 4K resolution, USB 3.0 and 2.0, DVI/HDMI output, UART serial, and CANbus.