Science’s computing needs are growing exponentially. The next big leap in scientific computing is the race to exascale capability: one exaflops, or one million trillion (10^18) floating-point operations per second, equivalent to one thousand petaflops. The fastest systems in the world currently perform between 10 and 93 petaflops, or roughly one to nine percent of exascale speed.
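To keep these prefixes straight, here is a minimal back-of-the-envelope conversion in Python, using only the figures quoted above (1 exaflops, and today's 10-93 petaflops systems):

```python
# Back-of-the-envelope conversion between the performance scales quoted above.
PETA = 10**15   # petaflops: 10^15 floating-point operations per second
EXA = 10**18    # exaflops:  10^18 floating-point operations per second

exascale = 1 * EXA
fastest_today = 93 * PETA     # upper end of today's top systems
slower_top = 10 * PETA        # lower end of today's top systems

print(f"1 exaflops = {exascale / PETA:.0f} petaflops")
print(f"93 petaflops is {100 * fastest_today / exascale:.1f}% of exascale")
print(f"10 petaflops is {100 * slower_top / exascale:.1f}% of exascale")
```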
The increase in the power of computers has long followed Moore’s Law, named after Intel cofounder Gordon Moore, who observed in 1965 that the processing power of computer chips was doubling roughly every two years. Supercomputers shifted from being able to do thousands of operations per second to millions, then billions, then trillions per second, at a cadence of roughly a thousand-fold increase in ability per decade.
China, the United States, Japan, and Europe are in a global race to build the first exascale supercomputer in the 2020-2023 timeframe. The United States aims to have Aurora operational sometime in 2021, when the US Department of Energy’s (DOE) Argonne National Laboratory in Lemont, IL, will power up a calculating machine the size of 10 tennis courts and vault the country into a new age of computing. The $500-million mainframe, called Aurora, could become the world’s first “exascale” supercomputer, running an astounding 10^18, or 1 quintillion, operations per second. Aurora is expected to have more than twice the peak performance of the current supercomputer record holder, Fugaku, at the RIKEN Center for Computational Science in Kobe, Japan.
China has said it would have an exascale machine by the end of 2020, although experts outside the country expressed doubts about this timeframe even before the delays caused by the global severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic. The decision to develop domestically produced processors for these systems, along with the inclusion of new application use cases, appears to be stretching out the timelines. Engineers in Japan and the European Union are not far behind, and France has also recently revealed specific plans. Everyone is racing to exascale, but none of the efforts is expected to produce a sustained exascale machine until at least 2021, where sustained exascale is defined as one exaflop of 64-bit performance on a real application.
Exascale Computing Challenges
However, exascale supercomputing itself faces a number of major challenges, including technology, architecture, power, reliability, programmability, and usability. Others include developing the architecture and interconnects to efficiently weave together hundreds of thousands of processors and memory chips, and devising an operating system and client software that actually scale to one quintillion calculations per second.
These supercomputers are complex beasts, consisting of cabinets containing hundreds of thousands of processors. For these processors to operate as a single entity, a supercomputer needs to pass data back and forth between its various parts, running huge numbers of computations at the same time, all while minimizing power consumption.
For such highly interconnected supercomputers, performance might be pinched by bottlenecks, such as the ability to access memory or store and retrieve data quickly. Newer machines in fact try to avoid shuffling information around as much as possible, sometimes even recomputing a quantity rather than retrieving it from slow memory. Issues with memory and data retrieval are only expected to get worse at exascale. A bottleneck in any link of the chain of computation can cascade into larger problems. This means that a machine’s peak performance, the theoretical highest processing power it can reach, will differ from its real-world, sustained performance. “In the best case, we can get to around 60 or 70 percent efficiency,” says Depei Qian, an emeritus computer scientist at Beihang University in Beijing, China, who helps lead China’s exascale efforts.
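The recompute-rather-than-fetch tactic mentioned above can be sketched as a simple energy comparison. The picojoule figures below are illustrative assumptions, not measured values from the article:

```python
# Toy model of the "recompute instead of re-fetch" tradeoff described above.
ENERGY_PER_FLOP_PJ = 10         # assumed energy per floating-point operation (picojoules)
ENERGY_PER_BYTE_MOVED_PJ = 100  # assumed energy per byte fetched from far/slow memory

def cheaper_to_recompute(flops_to_recompute: int, bytes_to_fetch: int) -> bool:
    """Return True when re-deriving a value costs less energy than re-fetching it."""
    recompute_cost = flops_to_recompute * ENERGY_PER_FLOP_PJ
    fetch_cost = bytes_to_fetch * ENERGY_PER_BYTE_MOVED_PJ
    return recompute_cost < fetch_cost

# Example: re-deriving one 8-byte double with 20 flops vs. pulling it from slow memory.
print(cheaper_to_recompute(flops_to_recompute=20, bytes_to_fetch=8))  # True under these assumptions
```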
“You could build an exascale system today,” says Steve Conway, senior vice president of research at Hyperion. “But it would take well over 100 megawatts, which nobody’s going to supply, because that’s over a 100 million dollar electricity bill. So it has to get the electricity usage under control. Everybody’s trying to get it in the 20 to 30 megawatts range. And it has to be dense. Much denser than any computing today. It’s got to fit inside some kind of building. You don’t want the building to be 10 miles long. And also the denser the machine, the faster the machine is going to be too.”
To meet the requirements of multidisciplinary and multidomain applications, new challenges in architecture, system software, and application technologies must be addressed to develop next-generation exascale supercomputing systems. The Advanced Scientific Computing Advisory Committee (ASCAC) report entitled “Top 10 Exascale Research Challenges” lists ten such challenges:
Energy efficiency: Creating more energy-efficient circuit, power, and cooling technologies. Simply scaling up today’s supercomputers would require gigawatts of power, the equivalent of a nuclear power plant to run a single system. It now costs about $1 million a year to run a 1-megawatt system, and current supercomputers are already in the range of 10 megawatts (a rough power-budget sketch follows this list).
Interconnect technology: Moving data in and out of storage, and even within cores, takes much more energy than the calculations themselves. By some estimates, as much as 90% of the power supplied to a high-performance computer is used for data transport.
In the exascale regime, the cost to move a datum will exceed the cost of a floating point operation, necessitating very energy efficient, low latency, high bandwidth interconnects for fine-grained data exchanges among hundreds of thousands of processors.
Memory Technology: Integrating advanced memory technologies to improve both capacity and bandwidth. DRAM memory is too slow and expensive to support exascale.
Scalable System Software: Developing scalable system software that is power-aware and resilience-aware.
Programming systems: Inventing new programming environments that express massive parallelism, data locality, and resilience.
Data management: Creating data management software that can handle the volume, velocity and diversity of data that is anticipated.
Exascale Algorithms: Reformulating science problems and redesigning, or reinventing, their solution algorithms for exascale systems.
Algorithms for discovery, design, and decision: Facilitating mathematical optimization and uncertainty quantification for exascale discovery, design, and decision making.
Resilience and correctness: Ensuring correct scientific computation in the face of faults, reproducibility, and algorithm verification challenges.
Scientific productivity: Increasing the productivity of computational scientists with new software engineering tools and environments.
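As a rough illustration of the energy-efficiency challenge above, the sketch below combines the targets quoted in this article (roughly 1 exaflops within 20-30 MW, and about $1 million per megawatt-year in electricity):

```python
# Rough power-budget arithmetic behind the energy-efficiency challenge.
EXAFLOPS = 10**18

target_power_mw = 20
required_gflops_per_watt = EXAFLOPS / (target_power_mw * 1e6) / 1e9
print(f"~{required_gflops_per_watt:.0f} GFLOPS/W needed for 1 exaflops at {target_power_mw} MW")

# Operating-cost rule of thumb from the text: ~$1 million per megawatt-year.
for power_mw in (10, 20, 100):
    print(f"a {power_mw} MW system costs roughly ${power_mw} million per year in electricity")
```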
Hardware is not the only challenge; the software comes with its own set of problems. That’s partly because of the GPUs. But even before they came along, programs were parallelized for speed: they were divided into parts that ran at the same time on different CPUs, and the outputs were recombined into cohesive results. The process became even more difficult when some parts of a program had to be executed on a CPU and some on a GPU. Exascale machines will contain on the order of 135,000 GPUs and 50,000 CPUs, and each of those chips will have many individual processing units, requiring engineers to write programs that execute almost a billion instructions simultaneously.
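The divide, compute-in-parallel, and recombine pattern described above can be illustrated with a deliberately small sketch. It uses Python's standard multiprocessing module and four local workers as stand-ins for the tens of thousands of CPUs and GPUs an exascale machine would coordinate; it is not an HPC code, just the shape of the idea:

```python
# Minimal illustration of the parallel divide-and-recombine pattern described above.
from multiprocessing import Pool

def partial_sum_of_squares(chunk):
    """Work done independently by one worker (a stand-in for one CPU's or GPU's share)."""
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    chunks = [data[i::n_workers] for i in range(n_workers)]       # split the problem
    with Pool(n_workers) as pool:
        partials = pool.map(partial_sum_of_squares, chunks)       # compute parts simultaneously
    print(sum(partials))                                          # recombine into one result
```

Real exascale applications express the same pattern with MPI across nodes plus GPU kernels within each node, which is part of what makes the porting effort described next so demanding.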
So running existing scientific simulations on the new exascale computers is not going to be trivial. “It’s not just picking out a [simulation] and putting it on a big computer,” says L. Ruby Leung, an atmospheric scientist at the Pacific Northwest National Laboratory in Richland, WA. Researchers are being forced to reexamine millions of lines of code and optimize them to make use of the unique architectures of exascale computers, so that the programs can reach as close to the theoretical maximum processing power as possible.
Given these challenges, researchers have been planning for exascale computing for more than a decade.
National Strategic Computing Initiative (NSCI)
The ExaSCALE Computing Leadership Act of 2015 would create research partnerships between industry, universities, and the U.S. Department of Energy’s national labs to research and develop at least two exascale supercomputer architectures, with the goal of having a fully operational computer system that has reached “exascale” – a measure of speed beyond any other system in the world – by 2023.
The three lead agencies for the NSCI are the Department of Energy (DOE), the Department of Defense (DOD), and the National Science Foundation (NSF). In particular, the DOE Office of Science and the DOE National Nuclear Security Administration will together form a joint exascale computing program. The program would accelerate research and development in a variety of fields across the government, industry, and academia.
Executive departments, agencies, and offices (agencies) participating in the NSCI shall pursue five strategic objectives:
- Accelerating delivery of a capable exascale computing system that integrates hardware and software capability to deliver approximately 100 times the performance of current 10 petaflop systems across a range of applications representing government needs.
- Increasing coherence between the technology base used for modeling and simulation and that used for data analytic computing.
- Establishing, over the next 15 years, a viable path forward for future HPC systems even after the limits of current semiconductor technology are reached (the “post-Moore’s Law era”).
- Increasing the capacity and capability of an enduring national HPC ecosystem by employing a holistic approach that addresses relevant factors such as networking technology, workflow, downward scaling, foundational algorithms and software, accessibility, and workforce development.
- Developing an enduring public-private collaboration to ensure that the benefits of the research and development advances are, to the greatest extent, shared between the United States Government and industrial and academic sectors.
Enabling Technologies: Exascale to Zettascale Computing
An exascale supercomputer is envisioned to comprise on the order of 100,000 interconnected servers or nodes in a target power envelope of 20 MW, with sufficient memory bandwidth to feed the massive compute throughput, sufficient memory capacity to execute meaningful problem sizes, and with hardware or system faults requiring user intervention no more often than about once a week on average. These system-level requirements imply that each node delivers greater than 10 teraflops at less than 200 W. There is a 7x gap in flops per watt between the current most energy-efficient supercomputer and the exascale target.
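A quick sanity check of those node-level numbers, using only the system targets given above (100,000 nodes, 20 MW, 1 exaflops, and the quoted 7x efficiency gap):

```python
# Node-level arithmetic for a 100,000-node, 20 MW, 1-exaflops system.
n_nodes = 100_000
system_power_w = 20e6
system_flops = 1e18

power_per_node_w = system_power_w / n_nodes                      # -> 200 W per node
flops_per_node_tf = system_flops / n_nodes / 1e12                # -> 10 teraflops per node
target_gflops_per_watt = (system_flops / 1e9) / system_power_w   # -> 50 GFLOPS/W

# The article cites a ~7x gap to today's most energy-efficient machine.
implied_current_best = target_gflops_per_watt / 7                # -> ~7 GFLOPS/W today

print(f"{power_per_node_w:.0f} W/node, {flops_per_node_tf:.0f} TF/node, "
      f"{target_gflops_per_watt:.0f} GFLOPS/W target, ~{implied_current_best:.0f} GFLOPS/W today")
```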
According to Europe’s CRESTA project such systems will likely consist of:
- Large numbers of low-power, many-core microprocessors (possibly millions of cores)
- Numerical accelerators with direct access to the same memory as the microprocessors (almost certainly based on evolved GPGPU designs)
- High-bandwidth, low-latency novel topology networks (almost certainly custom-designed)
- Faster, larger, lower-powered memory modules (perhaps with evolved memory access interfaces)
One proposed building block is a high-performance accelerated processing unit (APU) that integrates high-throughput GPUs, offering the excellent energy efficiency required for exascale levels of computation, tightly coupled with high-performance multicore CPUs for serial or irregular code sections and legacy applications.
Future HPC platforms will demand both increased memory capacity and high-performance bandwidth to operate efficiently. To boost memory performance, high-bandwidth memory stacks are emerging, along with aggressive use of die-stacking capabilities that enable dense component integration to reduce data-movement overheads and enhance power efficiency. Yet this solution comes with another conundrum: high-bandwidth memory offers smaller capacity, while non-volatile memory (NVM) offers large capacity but less bandwidth. Emerging NVM technologies such as the memristor will provide the throughput to compete with DDR, but in a persistent and more energy-efficient way.
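One way to picture that bandwidth/capacity split is a simple data-placement heuristic: bandwidth-hungry arrays go into the small, fast HBM pool, and the rest spill to large, slower NVM. The pool sizes and array figures below are purely illustrative assumptions:

```python
# Toy data-placement heuristic for a two-tier HBM/NVM memory system (illustrative only).
HBM_CAPACITY_GB = 64      # small but high-bandwidth tier (assumed size)
NVM_CAPACITY_GB = 4096    # large but lower-bandwidth tier (assumed size)

def place(arrays):
    """arrays: list of (name, size_gb, accesses_per_step) tuples; returns {name: tier}."""
    placement, hbm_used = {}, 0
    # Most frequently accessed arrays get first claim on the fast-but-small HBM pool.
    for name, size_gb, accesses in sorted(arrays, key=lambda a: -a[2]):
        if hbm_used + size_gb <= HBM_CAPACITY_GB:
            placement[name] = "HBM"
            hbm_used += size_gb
        else:
            placement[name] = "NVM"
    return placement

print(place([("field", 48, 1000), ("checkpoint", 512, 1), ("halo", 8, 5000)]))
# {'halo': 'HBM', 'field': 'HBM', 'checkpoint': 'NVM'}
```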
Photonics is another critical technology; optical paths require less power, reducing the power consumption of these processors, while also providing wide bandwidth. Energy-efficient optical interconnects are required to boost the performance of densely packed supercomputer chips. Silicon photonics employing lasers can provide low-power data links within the system.
HPE is leading an industry-wide approach that aims to revolutionize system architecture. The development of a new and open protocol, temporarily dubbed the next-generation memory interface (NGMI), will increase flexibility when connecting memory devices, processors, accelerators, and other devices, allowing the system architecture to better adapt to any given workload. Emerging NVM technologies will then power high-performance computing in a persistent, more energy-efficient way. Also needed are advanced circuit techniques and active power-management techniques that yield energy reductions with little performance impact, along with hardware and software mechanisms that achieve high resilience and reliability with minimal impact on performance and energy efficiency.
The applications software to tackle these larger computing challenges will often evolve from current codes, but will need substantial work, said Paul Messina, director of the DOE Exascale Computing Project. First, simulating more challenging problems will require some brand-new methods and algorithms. Second, the architectures of these new computers will be different from the ones we have today, so to be able to use existing codes effectively, the codes will have to be modified. This is a daunting task for many of the teams that use scientific supercomputers today. “These are huge, complex applications, often with literally millions of lines of code,” Messina said. “Maybe they took the team 500 person years to write, and now you need to modify it to take advantage of new architectures, or even translate it into a different programming language.”
Writing programs for such parallel computing is not easy, and theorists will need to leverage new tools such as machine learning and artificial intelligence to make scientific breakthroughs. The transition to exascale will not be smooth. “As these machines grow, they become harder and harder to exploit efficiently,” says Danny Perez, a physicist at Los Alamos National Laboratory in NM. “We have to change our computing paradigms, how we write our programs, and how we arrange computation and data management.”
Teams around the world are wrestling with the different tradeoffs of achieving exascale machines. Some groups have focused on figuring out how to add more CPUs for calculations, making these mainframes easier to program but harder to power. The alternative approach has been to sacrifice programmability for energy efficiency, striving to find the best balance of CPUs and GPUs without making it too cumbersome for users to run their applications. Architectures that minimize the transfer of data inside the machine, or use specialized chips to speed up specific algorithms, are also being explored.
Energy Efficiency & Cooling Technology
Such powerful computing required enormous amounts of electricity, and much of this power was being lost as waste heat, a considerable concern in the mid-2000s as researchers grappled with petascale computing capable of 10^15 calculations per second. By 2006, chip designers had partly solved this problem with graphics processing units (GPUs) and GPU-like chips, such as the processor IBM co-designed for the newly released Sony PlayStation 3. GPUs are specialized for rapidly rendering high-resolution images. They divide complex calculations into smaller tasks that run simultaneously, a process known as parallelization, making them quicker and more energy-efficient than generalist central processing units (CPUs). GPUs were a boon for supercomputers.
In 2008, when Los Alamos National Laboratory unveiled Roadrunner, the world’s first petascale supercomputer, it contained 12,960 GPU-inspired chips along with 6,480 CPUs and performed twice as well as the next best system at the time. Besides those chips, Roadrunner included other innovations to save electricity, such as turning on components only when necessary. Such energy efficiency was important because predictions for achieving exascale back then suggested that engineers would need “something like half of a nuclear power plant to power the computer,” says Perez.
The cooling approach makes a large difference to overhead (a rough cost sketch follows this list):
- With air cooling, a data center consumes about 60% of the server power to cool the servers.
- With chilled-water cooling, about 40% of server power.
- With warm-water cooling, as in the SD650, less than 10%.
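Using the overhead percentages above together with the ~$1 million per megawatt-year rule of thumb cited earlier, a rough cost comparison looks like this; the 5 MW compute load is an assumption in the 4-5 MW range mentioned just below:

```python
# Cooling-overhead cost comparison for an assumed 5 MW compute load (illustrative).
it_load_mw = 5.0
cooling_overhead = {"air cooling": 0.60, "chilled water": 0.40, "warm water": 0.10}

for method, fraction in cooling_overhead.items():
    cooling_mw = it_load_mw * fraction
    # ~$1 million per megawatt-year rule of thumb from earlier in the article.
    print(f"{method:>14}: ~{cooling_mw:.1f} MW of cooling power, ~${cooling_mw:.1f}M per year")
```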
For a supercomputer cluster with a power consumption of 4-5 MW, the potential savings can amount to hundreds of thousands of euros per year over 4-5 years. A 250-petaflop supercomputer immersed in liquid coolant is being built in Texas by DownUnder GeoSolutions (DUG); in 2019, it could be the world’s most powerful supercomputer, with 40,000 servers immersed in coolant going into a data center in Houston.
It will perform cutting-edge computer modeling for energy companies and bring new levels of precision to oil and gas exploration. It will be housed in the Skybox Datacenters facility in Houston’s Energy Corridor, where DUG has leased 15 megawatts of capacity. The deal, represented by Bennett Data Center Solutions, is the largest colocation transaction in Houston’s history.
The system comprises 720 enclosures using the DUG Cool liquid cooling system, which fully submerges servers in tanks filled with dielectric fluid. This will reduce the huge system’s energy usage by about 45 percent compared to traditional air cooling. Using the Australian technology, a 60-megawatt power system could be scaled to an exaflop supercomputer in 2020 or 2021.
The cooling system fully submerges standard high-performance computing (HPC) servers into specially-designed tanks that are filled with polyalphaolefin dielectric fluid. The fluid is non-toxic, non-flammable, biodegradable, non-polar, has low viscosity, and most importantly, doesn’t conduct electricity. The unique part of this design is that the heat exchangers are very simple and submerged with the computer equipment, meaning that no dielectric fluid ever leaves the tank. A water loop runs through the rooms and to each heat exchanger. The dielectric fluid is cooled and circulated around the extremely hot components in the compute servers. This innovative oil-cooling solution has high thermal capabilities and a large operating temperature range.
The fluid’s thermal capacity, over 1000x that of air, means that components never get hot, extending their mean time to failure. Fluid-immersed computers fail at a much lower rate, considerably reducing maintenance costs and expensive down-time.
Silicon Photonics
The problems of copper pins and electrical signaling encompass power, data reach, and chip real-estate limitations. As the performance of processors advances, faster data rates are required: more data is needed to feed the chips and more data is generated during processing. Although growth in processor performance has slowed over the past several years due to the erosion of Moore’s Law, electrically driven chip communication has still been unable to keep pace. According to Ayar Labs chief strategy officer and company co-founder Alex Wright-Gladstein, there is a consensus that the highest data rate that will be able to exit a chip package is 100 Gb/sec.
Looking just slightly into the future, a 10-teraflops processor that could be the basis for an exascale supercomputer would need something on the order of 10 Tb/sec of chip I/O to be usable. But that would require around 2,000 copper pins, which together would draw about 100 watts – not for the whole chip, just for the I/O pins. If that sounds problematic, that’s because it is.
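The pin-count and power figures above follow from simple arithmetic; the per-pin rate and per-pin power below are assumptions chosen to be consistent with the article's totals, not numbers from the source:

```python
# Rough arithmetic behind the copper I/O problem for a ~10 Tb/s chip.
chip_io_tbps = 10          # required off-chip bandwidth for a ~10-teraflops processor
per_pin_gbps = 5           # assumed usable signaling rate per copper pin
per_pin_power_w = 0.05     # assumed I/O power per pin at that rate

pins_needed = chip_io_tbps * 1000 / per_pin_gbps
io_power_w = pins_needed * per_pin_power_w
print(f"~{pins_needed:.0f} pins drawing ~{io_power_w:.0f} W just for chip I/O")  # ~2000 pins, ~100 W
```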
Ayar Labs developing one-terabit-per-second electro-optical I/O chip
Ayar Labs, a silicon photonics startup based in Emeryville, California, is getting set to tape out its electro-optical I/O chip, which will become the basis of its first commercial product. Known as TeraPHY, it’s designed to enable chip-to-chip communication at lightning speed. The company is promising bandwidth in excess of one terabit per second, while drawing just a tenth the power of conventional electrically-driven copper pins.
According to the company’s website, the initial TeraPHY device will be available as a 1.6 Tb/sec optical transceiver, comprising four 400 Gb/sec transceivers per module. All the componentry except for the light source (which is supplied by a separate 256-channel laser module, called SuperNova) has been integrated into the device. That includes the electrical interfaces, the optical modulators, the photodetectors, and the dense wavelength division multiplexing (DWDM) wavelength multiplexer/demultiplexer, as well as all the driver and control circuitry.
Wright-Gladstein says the solution not only delivered a high performance electro-optical device, but was able to do so in an area 1/100 the size of a typical long-haul optical transceiver. “Because of that 100X size difference, you’re now crossing that threshold set by electrical SerDes, and you’re making an optical I/O that’s smaller than your electrical I/O,” she says.
Forgoing the more exotic designs of other silicon photonics solutions required some extra tinkering. The design uses optical “micro-ring resonators” implemented in CMOS to achieve the extreme density of the TeraPHY. These resonators can be “finicky” due to thermal issues and size, but according to Wright-Gladstein, they’ve implemented a patented thermal tuning technology that stabilizes the resonators and makes them very reliable.
Physically, TeraPHY takes the form of a “chiplet,” a chunk of silicon that is meant to be integrated into the kind of multi-chip modules that are becoming more commonplace in high-end processor packages. The TeraPHY tape-out is slated for the end of the current quarter (Q1 2019), with the first integrated products from Ayar’s silicon technology partners due to hit the street in 2020.
China’s homegrown technology to exascale
China has claimed it will build the world’s first exascale supercomputer, ready as soon as 2019, according to An Hong, professor of computer science at the University of Science and Technology of China in Hefei.
China’s 13th five-year plan put into motion one of the most ambitious exascale programs in the world. If successful, the program will stand up an exaflops (peak) supercomputer by the end of 2020 within a 35 MW power limit. According to Professor Qian, the number one priority task is the development of an exascale supercomputer, based on a multi-objective optimized architecture that balances performance, energy consumption, programmability, reliability, and cost.
To achieve this goal, China is funding research into novel high performance interconnects with 3-D chip packaging, silicon photonics and on-chip networks. Programming models for heterogeneous computers will emphasize ease in writing programs and exploitation of performance of the heterogeneous architectures.
The program included the development of prototype systems for verification of the exascale computer technologies. The computer scientists explored possible exascale computer architectures, interconnects which can support more than 10,000 nodes, and energy efficiency technologies, as power demand is known to be one of the biggest obstacles toward exascale.
The exascale prototype will comprise about 512 nodes, offering 5-10 teraflops per node, 10-20 Gflops/watt, and point-to-point bandwidth greater than 200 Gbps; MPI latency should be less than 1.5 µs, said Qian. Development will also include system software and three typical applications that will be used to verify effectiveness. From there, work will begin on an energy-efficient computing node and a scheme for high-performance processor/accelerator design.
“Based on those key technology developments, we will finally build the exascale system,” said Qian. “Our goal is not so ambitious – it is to have exaflops in peak. We are looking for a LINPACK efficiency of greater than 60 percent. Memory is rather limited, about 10 petabytes, with exabyte levels of storage. We don’t think we can reach the 20 megawatts system goal in less than five years, so our goal is about 35 megawatts for the system; that means 30 Gflops/watt energy efficiency. The expected interconnect performance is greater than 500 Gbps.”
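Those targets are internally consistent, as a quick check shows (the figures below are taken from Qian's numbers above):

```python
# Sanity-checking the Chinese prototype and full-system targets quoted above.
proto_nodes = 512
proto_peak_pf = [proto_nodes * tf / 1000 for tf in (5, 10)]   # 5-10 teraflops per node
print(f"prototype peak: {proto_peak_pf[0]:.1f}-{proto_peak_pf[1]:.1f} petaflops")

exaflops = 1e18
gflops_per_watt = 30
system_power_mw = exaflops / (gflops_per_watt * 1e9) / 1e6
print(f"1 exaflops at {gflops_per_watt} GFLOPS/W implies ~{system_power_mw:.0f} MW")  # ~33 MW, close to the 35 MW goal

linpack_efficiency = 0.60
print(f"sustained LINPACK at 60% of peak: {linpack_efficiency * exaflops / 1e18:.2f} exaflops")
```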
Europe’s EuroEXA exascale investment
EuroEXA is the latest in a series of exascale investments by the European Union (EU), which will contribute €20 million to the project over the next three and a half years. It consolidates the research efforts of a number of separate projects initiated under the EU’s Horizon 2020 program, including ExaNeSt (exascale interconnects, storage, and cooling), EcoScale (exascale heterogeneous computing) and ExaNoDe (exascale processor and node design).
Commenting on the effort was John Goodacre, Professor of Computer Architectures at the University of Manchester. “To deliver the demands of next generation computing and exascale HPC, it is not possible to simply optimize the components of the existing platform,” he said. “In EuroEXA, we have taken a holistic approach to break-down the inefficiencies of the historic abstractions and bring significant innovation and co-design across the entire computing stack.”
In more concrete terms, the project will develop HPC-capable ARM and Xilinx FPGA designs, which will be incorporated into an operational prototype by 2020, along with new memory and cooling technologies. The hope is that this will be the basis of a European exascale system to be deployed in the 2022-2023 timeframe.
Atos reveals Bull sequana, the world’s most efficient supercomputer that will reach exascale by 2020
An exascale-class system relying on current technologies would consume around 400 megawatts, equivalent to the annual electricity consumption of 60,000 homes. French computer maker Atos’s Bull sequana reduces energy consumption by a factor of 10 compared to the previous generation of supercomputers. Taking compute performance to a whole new level, Bull sequana is intended to reach exascale level by 2020, processing a billion billion operations per second. Bull is targeting an electrical consumption of 20 megawatts for exascale by 2020, thanks to its technologies and its research and development centers.
Designed to integrate the most advanced technologies in the future, Bull sequana is already being implemented at the French Alternative Energies and Atomic Energy Commission (CEA). In its current rendition, the Sequana X1000, the system will support the latest Intel Xeon CPUs and Xeon Phi processors, as well as NVIDIA GPUs. Nodes can be linked together with either InfiniBand or the proprietary Bull eXascale Interconnect (BXI).
The performance of exascale applications requires massive parallelism. Bull sequana integrates the Bull eXascale Interconnect (BXI) network, developed by Bull. Designed for exascale, BXI revolutionizes the treatment of data exchanges by freeing the processors from all communications tasks. In addition, the Bull sequana software environment allows precise management of large resources and provides optimum production efficiency.
Exascale computers will invigorate a variety of fields, researchers say. “It enables the young people to do things that haven’t been done before and that brings a different level of excitement to the table,” Choong-Seock Chang of Princeton University says. And exascale is only the beginning, he adds: “To me an exascale supercomputer is just a signpost along the way. Human creativity will drive you further.”
Zettascale computing
Exascale machines (of at least 1 exaflops peak) are anticipated to arrive around 2021, a few years behind original predictions; and given that extreme-scale performance challenges are not getting any easier, it makes sense that researchers are already looking ahead to the next big 1,000x performance goal post: zettascale computing. A team from the National University of Defense Technology in China, responsible for the Tianhe series of supercomputers, suggests that it will be possible to build a zettascale machine by 2035. Their paper outlines six major challenges with respect to hardware and software, concluding with recommendations to support zettascale computing.
The article “Moving from exascale to zettascale computing: challenges and techniques,” published in Frontiers of Information Technology & Electronic Engineering (as part of a special issue organized by the Chinese Academy of Engineering on post-exascale computing), works as a high-level survey of focus areas for breaching the next big performance horizon. And when might that be? The research team, even while pointing to slowdowns in performance gains, has set an ambitious goal: 2035. For the purposes of having a consistent metric, they define zettascale as a system capable of 10^21 double-precision 64-bit floating-point operations per second of peak performance.
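For a sense of the pace a 2035 zettascale target implies, here is a small calculation under the assumptions of roughly 1 exaflops around 2021 and 10^21 flops by 2035:

```python
# Implied growth rate from exascale (~1e18 flops, ~2021) to zettascale (1e21 flops, 2035).
import math

start_year, start_flops = 2021, 1e18
target_year, target_flops = 2035, 1e21

years = target_year - start_year
annual_growth = (target_flops / start_flops) ** (1 / years)
doubling_time_years = math.log(2) / math.log(annual_growth)

print(f"~{annual_growth:.2f}x per year, doubling roughly every {doubling_time_years:.1f} years")
# ~1.64x per year, i.e. doubling about every 1.4 years (near the historical Moore's-Law cadence).
```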
The potential impact of mixed-precision arithmetic and AI-type algorithms on performance metrics (already in motion) was not a focus topic, but the authors did note, “With the continuous expansion of application types and scales, we expect that the conventional scientific computing and the new intelligent computing will further enrich the application layer. Techniques (such as machine learning) will be used to auto-tune various workloads during runtime (Zhang et al., 2018).”
The likely impact on architectures was also noted: “[S]ince conventional HPC applications and emerging intelligent computing applications (such as deep learning) will both exist in the future, the processor design should take mixed precision arithmetic into consideration to support a large variety of application workloads.”
“To realize these metrics, micro-architectures will evolve to consist of more diverse and heterogeneous components. Many forms of specialized accelerators (including new computing paradigms like quantum computing) are likely to co-exist to boost high performance computing in a joint effort. Enabled by new interconnect materials such as photonic crystals, fully optical-interconnecting systems may come into use, leading to more scalable, high-speed, and low-cost interconnection.
“The storage system will be more hierarchical to increase data access bandwidth and to reduce latency. The 2.5D/3D stack memory and the NVM technology will be more mature. With the development of material science, the memristor may be put into practice to close the gap between storage and computing, and the traditional DRAM may end life. To reduce power consumption, cooling will be achieved at multiple levels, from the cabinet/board level to the chip level.
“The programming model and software stack will also evolve to suit the new hardware models. Except for the MPI+X programming model, new programming models for new computing paradigms and new computing devices will be developed, with the balance of performance, portability, and productivity in mind. Conventional HPC applications and emerging intelligent computing applications will co-exist in the future, and both hardware and software layers need to adapt to this application workload evolution (Asch et al., 2018).”
References and Resources also include:
https://insidehpc.com/2017/01/exascale-computing-race-to-the-future-of-hpc/
https://www.hpcwire.com/2018/12/06/zettascale-by-2035/
https://www.nextplatform.com/2019/01/29/first-silicon-for-photonics-startup-with-darpa-roots/
https://www.pnas.org/content/117/37/22623