Deep neural networks (DNNs), large virtual networks of simple information-processing units loosely modeled on the anatomy of the human brain, have been responsible for many exciting advances in artificial intelligence in recent years. Deep learning (DL) algorithms allow high-level abstraction from the data, which is helpful for automatic feature extraction and for pattern analysis/classification.
Both training and execution of large-scale DNNs require vast computing resources, leading to high power requirements and communication overhead. The longest and most resource-intensive phase of most deep learning implementations is the training phase. This phase can be completed in a reasonable amount of time for models with smaller numbers of parameters, but as the parameter count increases, training time grows as well. This has a dual cost: your resources are occupied for longer, and your team is left waiting, wasting valuable time.
What Is a CPU?
A central processing unit, or CPU, is a processor that processes the basic instructions of a computer, such as arithmetic, logical functions, and I/O operations. It’s typically a small but powerful chip integrated into the computer’s motherboard.
A CPU is considered the computer’s brain because it interprets and executes most of the computer’s hardware and software instructions. Standard components of a CPU include one or more cores, cache, memory management unit (MMU), and the CPU clock and control unit. These all work together to enable the computer to run multiple applications at the same time.
GPPs, also known as “traditional CPUs,” feature ever-improving performance and well-understood, mature software development tools – and are available in many form factors. Higher transistor counts and frequency increases are brute-force approaches to CPU scaling, but the limits of those methods are becoming increasingly apparent as Moore’s law slows.
The downsides of using these GPPs in embedded high-performance applications include limited product lifespan, brought about by end-of-life commercial components or changes in platform support, along with latency issues that are always a concern in real-time applications (particularly vehicle systems or soldier-worn equipment). Meanwhile, environmental extremes can cause thermal issues and reduced quality of service at cold temperatures, and power consumption can run high.
While CPUs can perform sequential tasks on complex computations quickly and efficiently, they are less efficient at parallel processing across a wide range of tasks.
Parallel computing is a type of computing architecture in which several processors simultaneously execute multiple, smaller calculations broken down from an overall larger, complex problem.
The core is the central architecture of the CPU, where all the computation and logic occur. If we have multiple cores in our processing unit, we can split our work into multiple smaller tasks and run them at the same time, making full use of the available processing power and completing the work much faster. Traditionally, CPUs were single core, but today’s CPUs are multicore, with two or more cores for enhanced performance. A CPU processes tasks sequentially, with tasks divided among its multiple cores to achieve multitasking.
CPUs generally have four, eight, or sixteen cores, while GPUs can have thousands. From this we can conclude that a GPU is best suited to tasks that can be executed simultaneously. Since parallel computing deals with exactly such tasks, it is easy to see why a GPU would be used in that case.
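The task-splitting idea above can be sketched in pure Python. Here stdlib threads stand in for cores (a real multicore speedup would use processes or a GIL-free workload, so this illustrates only the decomposition, not the performance gain; all names are invented for the sketch):

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each worker handles one independent slice of the overall problem.
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, workers=4):
    # Split the large task into `workers` smaller, independent tasks.
    size = (len(data) + workers - 1) // workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Run the sub-tasks concurrently and combine their partial results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))

print(parallel_sum_of_squares(list(range(1000))))  # same result as a serial sum
```

The key property is that the partial sums never depend on each other, which is exactly the structure a GPU exploits at a much larger scale.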
A graphics processing unit (GPU) is a specialized, electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device.
GPUs are used in embedded systems, mobile phones, personal computers, workstations, and game consoles. Modern GPUs are very efficient at manipulating computer graphics and image processing. Their highly parallel structure makes them more efficient than general-purpose central processing units (CPUs) for algorithms that process large blocks of data in parallel. In a personal computer, a GPU can be present on a video card or embedded on the motherboard. In certain CPUs, they are embedded on the CPU die.
Thus, general-purpose GPU (GPGPU) computing was born. GPGPU architecture involves many parallel computing pipelines, each of which can run a small program called a shader. As a customizable program, a shader can perform a wide variety of tasks. NVIDIA has capitalized deeply on this ability for more than a decade with its Compute Unified Device Architecture (CUDA) software platform. NVIDIA® CUDA® provides an application programming interface (API) that lets programs written in a variety of languages (e.g., C, C++, Fortran, and many more via third-party wrappers) access GPU functions.
Modern GPUs use most of their transistors to do calculations related to 3D computer graphics. In addition to the 3D hardware, today’s GPUs include basic 2D acceleration and framebuffer capabilities (usually with a VGA compatibility mode). GPUs were initially used to accelerate the memory-intensive work of texture mapping and rendering polygons, later adding units to accelerate geometric calculations such as the rotation and translation of vertices into different coordinate systems.
Recent developments in GPUs include support for programmable shaders which can manipulate vertices and textures with many of the same operations supported by CPUs, oversampling and interpolation techniques to reduce aliasing, and very high-precision color spaces. Because most of these computations involve matrix and vector operations, engineers and scientists have increasingly studied the use of GPUs for non-graphical calculations; they are especially suited to other embarrassingly parallel problems.
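To see why such matrix and vector workloads are embarrassingly parallel, consider a matrix-vector product: each output element depends only on one row of the matrix and the input vector, never on other outputs, so every element could in principle be assigned to its own GPU thread. A minimal pure-Python sketch (names are illustrative):

```python
def matvec(A, x):
    # Each output element is an independent dot product -- on a GPU,
    # a separate thread could compute each row with no coordination.
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

A = [[1, 2],
     [3, 4]]
x = [10, 1]
print(matvec(A, x))  # [12, 34]
```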
GPUs were traditionally tasked with compute-intensive, floating-point graphics functions such as 3D rendering and texture mapping. However, some modern GPUs are structured much like parallel-architecture supercomputers and are being used for numerical, signal processing, physics, general scientific, or even statistical applications – all of which might be viable applications on the battlefield.
GPUs are also optimized to perform target tasks, finishing computations faster than non-specialized hardware. These processors enable you to process the same tasks faster and free your CPUs for other tasks. This eliminates bottlenecks created by compute limitations.
Why are GPUs important for Deep Learning?
A major factor in the recent success of deep neural networks is the significant leap in available computational processing power. Researchers have been taking advantage of graphics processing units (GPUs), chips designed for high performance in processing the huge amounts of visual content needed for video games.
Graphics processing units (GPUs) can reduce these training costs, enabling you to run models with massive numbers of parameters quickly and efficiently. This is because GPUs let you parallelize training tasks, distributing them over clusters of processors and performing compute operations simultaneously.
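As an illustration of that idea, here is a toy synchronous data-parallel training step in pure Python: each "device" computes a gradient on its own data shard, and the gradients are averaged before the weight update. The model (a single weight fitting y = w·x), the data, and all function names are invented for this sketch:

```python
def local_gradient(w, shard):
    # Gradient of mean squared error for y = w * x on one data shard.
    return sum(2 * x * (w * x - y) for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr=0.01):
    # Each "device" computes its gradient independently; the results
    # are then averaged, as in synchronous data parallelism.
    grads = [local_gradient(w, s) for s in shards]
    return w - lr * sum(grads) / len(grads)

# Toy data drawn from y = 3x, split across two hypothetical devices.
shards = [[(1, 3), (2, 6)], [(3, 9), (4, 12)]]
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards)
print(round(w, 3))  # prints 3.0
```

In a real framework the per-shard gradients would be computed on separate GPUs and combined with an all-reduce; the arithmetic, however, is the same.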
General Purpose Computing on Graphics Processing Units (GPGPU)
GPUs function similarly to CPUs and contain similar components (e.g., cores and memory). They can be integrated into the CPU, or they can be discrete (i.e., separate from the CPU, with their own RAM).
General-purpose computing on graphics processing units (GPGPU) is becoming the cornerstone of digital signal processing in aerospace and defense applications like radar and sonar signal processing, image processing, hyperspectral sensor imaging, signals intelligence, electronic warfare, and persistent surveillance.
With the emergence of deep learning, the importance of GPUs has increased. In research done by Indigo, it was found that while training deep learning neural networks, GPUs can be 250 times faster than CPUs. The explosive growth of Deep Learning in recent years has been attributed to the emergence of general purpose GPUs.
Neural networks have a highly parallel architecture and are specifically suited to running in parallel. Since they are the basis of deep learning, GPUs are a natural fit for the task. Moreover, many of the computations within a neural network are independent of one another’s results, so everything can run simultaneously without cores waiting on each other. Convolution is an example of such a largely independent computation.
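Convolution's independence is easy to see in code: each output position reads only its own window of the input, so all positions could be evaluated simultaneously on separate cores. A minimal 1-D sketch in pure Python (function and variable names are illustrative):

```python
def conv1d(signal, kernel):
    # Every output position is computed from its own window of the input,
    # independent of all other outputs -- an embarrassingly parallel loop.
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# A simple edge-detecting kernel applied to a ramp signal.
print(conv1d([1, 2, 3, 4, 5], [1, 0, -1]))  # [-2, -2, -2]
```

On a GPU, each iteration of the outer loop would typically map to its own thread.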
Programming tools developed for this purpose – essentially extensions of the ubiquitous high-level C and C++ (and, more recently, Fortran) programming languages – leverage GPU parallel compute engines to solve complex computational problems. These include largely parallelizable problems, which the GPU can solve in significantly shorter timeframes – in some cases 100x faster – than a traditional CPU. This computing paradigm is called General Purpose computing on Graphics Processing Units, or GPGPU.
Additionally, GPUs are available in extended temperature and rugged packages, making them suitable for deployment on airborne or other environmentally challenging platforms. The projected GPU lifespan can be limited, but with careful material planning, this can be managed. As with GPPs, care must also be used with power management and heat dissipation, particularly with small form factor systems.
There has been some level of competition in this area with ASICs, most prominently the Tensor Processing Unit (TPU) made by Google. However, ASICs require changes to existing code and GPUs are still very popular.
TPUs are chip or cloud-based, application-specific integrated circuits (ASIC) for deep learning. These units are specifically designed for use with TensorFlow and are available only on Google Cloud Platform.
Each TPU can provide up to 420 teraflops of performance and 128 GB high bandwidth memory (HBM). There are also pod versions available that can provide over 100 petaflops of performance, 32TB HBM, and a 2D toroidal mesh network.
While large strides have recently been made in the development of high-performance systems for neural networks based on multi-core processors, Bill Jenkins of Intel suggests that significant challenges remain in power, cost, and performance scaling. Field-programmable gate arrays (FPGAs) are a natural choice for implementing neural networks, he believes, because they combine computing, logic, and memory resources in a single device. Reportedly, Microsoft is also using FPGAs, which offer the benefit of being reconfigurable if computing requirements change. The Nervana Engine, an ASIC deep-learning accelerator, was slated to go into production in early to mid-2017.
Scott Leishman, a computer scientist at Nervana, notes that another computationally intensive task—bitcoin mining—went from being run on CPUs to GPUs to FPGAs and, finally, on ASICs because of the gains in power efficiency from such customization. “I see the same thing happening for deep learning,” he says. Researchers are also developing neuromorphic chips based on silicon photonics and memristors.
Graphics Processing Unit (GPU) Market
The graphics processing unit (GPU) market is expected to grow by USD 105.70 billion from 2021 to 2026, at a CAGR of 32.35% during the forecast period, according to Technavio.
Advanced Micro Devices Inc., Apple Inc., Arm Ltd., ASUSTeK Computer Inc., Broadcom Inc., EVGA Corp., Fujitsu Ltd., Galaxy Microsystems Ltd., Gigabyte Technology Co. Ltd., Imagination Technologies Ltd, Intel Corp., International Business Machines Corp., NVIDIA Corp., Qualcomm Inc., Samsung Electronics Co. Ltd., SAPPHIRE Technology Ltd., Zotac Technology Ltd., and Taiwan Semiconductor Manufacturing Co. Ltd. are some of the major market participants.
Among these, Intel and Nvidia are certainly the most powerful and eminent. Intel is unparalleled in terms of integrated graphics, while Nvidia makes discrete graphics cards like none other.
Nvidia’s GeForce 256 is considered the world’s first GPU. Intel snags 62% of GPU sales, thanks to its integrated graphics, while Nvidia’s market share stands at 18%. Graphics cards account for almost 90% of Nvidia’s profits.
The notorious GPU shortage started in late 2020 and is still ongoing; it is also responsible for graphics card scalping. Nvidia has more than 370 partnerships revolving around self-driving cars. The best-performing GPU right now is the Nvidia GeForce RTX 3090 Ti.
AIRI//S™ is modern AI infrastructure architected by Pure Storage® and NVIDIA and powered by the latest NVIDIA DGX systems and Pure Storage FlashBlade//S™. AIRI//S is an out-of-the-box AI solution that simplifies your AI deployment to deliver simple, fast, next-generation, future-proof infrastructure to meet your AI demands at any scale.
Ability to interconnect GPUs
When choosing a GPU, you need to consider which units can be interconnected. Interconnecting GPUs is directly tied to the scalability of your implementation and the ability to use multi-GPU and distributed training strategies.
Typically, consumer GPUs do not support interconnection (NVLink for GPU-to-GPU links within a server, and InfiniBand/RoCE for linking GPUs across servers), and NVIDIA has removed NVLink support on GPUs below the RTX 2080.
NVIDIA GPUs are the best supported in terms of machine learning libraries and integration with common frameworks, such as PyTorch or TensorFlow. The NVIDIA CUDA toolkit includes GPU-accelerated libraries, a C and C++ compiler and runtime, and optimization and debugging tools. It enables you to get started right away without worrying about building custom integrations.
NVIDIA’s New H100 GPU Smashes Artificial Intelligence Benchmarking Records
The NVIDIA H100 is NVIDIA’s ninth generation data center GPU. Compared to NVIDIA’s previous generation, A100 GPU, the H100 delivers orders-of-magnitude better performance for AI and HPC at large.
In the data center category, the NVIDIA H100 Tensor Core GPU delivered the highest per-accelerator performance across each workload for both server and offline tests. Its performance was 4.5x higher than the A100 Tensor Core GPU in the offline scenario and 3.9x higher in the server scenario.
NVIDIA attributes the H100’s superior performance over the BERT NLP model to its Transformer engine. The new engine, combined with NVIDIA Hopper FP8 Tensor Cores, delivers 9x faster AI training and 30x faster AI inference speed on larger language models than prior generations.
Speed is important because huge AI models can contain trillions of parameters, and models that large can take months to train. NVIDIA’s Transformer Engine gains additional speed by using 16-bit floating-point precision together with a new 8-bit floating-point data format that doubles Tensor Core throughput and halves memory requirements compared with 16-bit floating point.
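The memory half of that claim is simple arithmetic: halving the bits per value halves the storage and memory traffic. A small stdlib sketch using Python's struct module, whose "e" format is IEEE 754 half precision (FP8 has no stdlib representation, so the 8-bit case is only noted in a comment; the parameter count is hypothetical):

```python
import struct

# "e" is IEEE 754 half precision (FP16); "f" is single precision (FP32).
fp16_bytes = struct.calcsize("e")  # 2 bytes per value
fp32_bytes = struct.calcsize("f")  # 4 bytes per value

# Weight storage for a hypothetical billion-parameter model:
n_params = 1_000_000_000
print(n_params * fp32_bytes / 1e9, "GB at FP32")  # 4.0 GB at FP32
print(n_params * fp16_bytes / 1e9, "GB at FP16")  # 2.0 GB at FP16
# An FP8 format needs only 1 byte per value, halving memory again.
```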
Those improvements, along with advanced Hopper software algorithms, accelerate AI performance and capabilities, allowing models to be trained in days or hours rather than months. The sooner a model is trained, the sooner it starts returning ROI and the sooner operational improvements can be implemented.
New GSC6204 GPU designed for aerospace and defense
Mercury Systems, Inc., a company specializing in secure mission-critical technologies for aerospace and defense, unveiled the new GSC6204 OpenVPX 6U NVIDIA Turing architecture-based graphics processing unit (GPU) co-processing engine, which aims to provide accelerated high-performance computing capabilities to commercial aerospace and defense applications.
Compute-intensive artificial intelligence (AI), radar, electro-optical/infrared imagery, cognitive electronic warfare and sensor fusion applications require high-performance computing capabilities closer to the sensor for effectiveness. To address this need, the GSC6204 module incorporates the NVIDIA Turing GPU architecture aiming to bring the latest advancements in processing and scale to the embedded domain.
Powered by dual NVIDIA Quadro TU104 processors and incorporating NVIDIA’s NVLink high-speed direct GPU-to-GPU interconnect technology, the module is designed to deliver the same massive parallel processing capability found in data centers.
Combined with Mercury’s HDS6605 Intel Xeon Scalable server blade, SCM6010 fast storage, SFM6126 wideband PCIe switches, streaming IOM-400 I/O modules, and ruggedized to withstand environmental extremes, these GPU co-processing engines are intended to be a critical component of a composable high-performance embedded edge compute (HPEEC) environment.
Rugged GPGPU-based embedded computing system for artificial intelligence (AI) uses introduced by Aitech
Aitech Defense Systems in Chatsworth, Calif., is introducing an upgraded and qualified version of the A178, a rugged general-purpose GPU (GPGPU) AI embedded supercomputer for intense data processing in extreme environments. The A178 operates reliably in mobile, remote, military, and autonomous systems, and targets applications like training simulation, situational awareness, artificial intelligence (AI), image and video processing, and moving maps.
One of the smallest of Aitech’s small-form-factor (SFF) embedded computing systems, the A178 uses the NVIDIA Jetson AGX Xavier system-on-module that features the Volta GPU with 512 CUDA cores and 64 Tensor cores to reach 32 TOPS INT8 and 11 TFLOPS FP16. Upgrades help meet the demand for standalone and compact GPGPU-based systems that are rugged and SWaP-C-optimized. The low-power unit offers energy efficiency, while providing all the power necessary for AI-based local processing.
The advanced computation abilities of the system include two dedicated NVIDIA Deep-Learning Accelerator (NVDLA) engines that provide an interface for deep learning applications. The system can accommodate as many as three expansion modules, such as an HD-SDI frame grabber, composite frame grabber, or NVMe solid-state drive. Four high-definition HD-SDI inputs and eight composite inputs handle several streams of video and data simultaneously at full frame rates. Interfaces include Gigabit and 10 Gigabit Ethernet, DisplayPort output handling 4K resolution, USB 3.0 and 2.0, DVI/HDMI output, UART serial, and CANbus.