This is a story about scale. Billions of dollars are being spent; thousands of the world’s brightest minds are competing across vast bureaucratic systems; and huge corporations are fighting for contracts. All of this is necessary to build unfathomably powerful machines to handle problems beyond our current capabilities.

Most of all, this is a story about nations. For all the talk of Pax Americana, of the ‘end of history,’ of a time of peace, prosperity and cooperation, reality has not been so kind. Competition still reigns - over resources, over science, over the future. And to build that future, superpowers need supercomputers.

“There is a race going on,” Mike Vildibill, VP of HPE's advanced technologies group, told DCD. “I don't think it's as public as the space race was. Perhaps that’s in part because some of the implications really refer to national security. It's not simply who gets there first, in terms of national pride - although there is a lot of national pride - but there's something much more serious going on.”

This race, decades in the making, will bring into existence the first exascale supercomputers, systems capable of at least one exaflops, or a quintillion (that’s a billion billion) calculations per second.
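The unit arithmetic puts the target in context: 'exa' denotes 10^18, a thousand times the 'peta' prefix used to rate today's leading systems.

```latex
% One exaflops expressed in the petaflops used to rate today's machines
1~\text{exaflops} = 10^{18}~\text{FLOP/s} = 1000 \times 10^{15}~\text{FLOP/s} = 1000~\text{petaflops}
```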

In some senses, ‘exascale’ is an arbitrary figure, especially given limitations with the benchmarks used, but “this is not just about bragging rights,” Eric Van Hensbergen, distinguished engineer and acting director of HPC at Arm Research, told DCD.

“Exascale supercomputers are a critical piece of national infrastructure: It's about how we are going to discover the next material, the next type of reactor, the next type of battery. All of that's driven by modeling and simulation on supercomputers.

“And so if you don't have access to that resource, that puts you in a bad position economically and in regards to security, and that's why you see this nationalistic fervor around supercomputers.”

But to understand the future of computing, one must first understand what we have achieved so far. With a peak performance of 200 petaflops, today’s most powerful supercomputer can be found at the US Department of Energy’s Oak Ridge National Laboratory in Tennessee.

Housed in a 9,250 square foot (859 sq m) room, ‘Summit’ features 4,608 compute servers, each with two 22-core IBM Power9 CPUs and six Nvidia Tesla V100 GPUs.

It consumes 13MW of power and has 4,000 gallons (roughly 15,000 liters) of water rushing through its cooling systems every minute.
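A rough back-of-envelope check shows where that headline figure comes from, and how far it sits from the exascale line. Assuming Nvidia's nominal figure of roughly 7.5 teraflops of double-precision peak per V100 (an assumption used here for illustration, not an ORNL number):

```latex
% Summit's GPU contribution, assuming ~7.5 TFLOPS FP64 per V100
4608~\text{nodes} \times 6~\text{GPUs} \times 7.5~\text{TFLOPS} \approx 207~\text{PFLOPS} \approx \tfrac{1}{5}~\text{exaflops}
```

By that yardstick, an exascale machine needs roughly five Summits' worth of compute - which is precisely the scaling exercise IBM describes below.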

Summit, like the Sierra system and the upcoming Perlmutter supercomputer, was created as a stepping stone on the path to exascale, a way to estimate the magnitude of the task that lies ahead.

Summit supercomputer – Oak Ridge National Laboratory

Step one

“They wanted an intermediate step to support software development and allow people to begin exploring the technology and how to scale code up,” IBM’s head of exascale, Dave Turek, told DCD. “They wanted that step so that all the design, manufacturing, other processes could be shaken out.”

The pre-exascale systems, procured under the CORAL program, enabled the DOE to align its exascale ambitions with the reality of the market. “A lot of things have to occur in lockstep, and when they don't, you have to be clever enough to figure out a way around that. With the costs of storage in an era of big data, what technology do you use? When is phase-change memory going to come? How will I utilize that? How much risk am I exposing to myself based on assuming that it will be here?” Turek said.

"We're not sitting around waiting for the exascale systems to show up,” Doug Kothe, director of the US government’s Exascale Computing Project (ECP), explained. “These pre-exascale systems are critical to our success. We're doing a lot of work on these systems today.”

All of these efforts are meant to prepare the nation for the arrival of exascale systems. Lori Diachin, deputy director of ECP, laid out the roadmap: “The first exascale systems are going to be deployed at Argonne and Oak Ridge, first the Aurora and then Frontier systems in the 2021-22 timeframe, and then the El Capitan system probably around 2023.”

Creating these machines will require a huge amount of additional work - one cannot, for example, simply build a bigger Summit.

“So, we could do three [Summits together],” IBM’s Turek said. “We couldn't do five, and we have actually looked into that. The problems are elusive to most people because no one's operated at this level, and I've been doing this for 25 years. You discover things at scale that don't manifest themselves in the non-scale arena. And as much as you plan, as much as you think about it, you discover that you run into problems you didn't anticipate before.

“When we considered five, we looked at the current state of network architecture. We looked at the power consumption. We looked at the cost. And we said, ‘Not a great idea.’”

An initiative run under the ECP umbrella hopes to drive down those costs and ameliorate the technical challenges. The PathForward program provides $258m in funding, shared among AMD, Cray, HPE, IBM, Intel and Nvidia, which must also provide additional capital amounting to at least 40 percent of the total project cost.

“PathForward projects are designed to accelerate technologies to, for example, reduce power envelopes and make sure those exascale systems will be cost-competitive and power-competitive,” Diachin said.

For example, IBM’s PathForward projects span ten separate work packages, with a total of 78 milestones, across hardware and software. But for Turek, it is the software that is the real concern.

“I wouldn't worry about the hardware, the node, pretty much at all. I think it's really communications, it's file systems, it's operating systems that are the big issues going forward," he said.

“A lot of people think of building these supercomputers as an exercise in hardware, but that’s absolutely trivial. Everything else is actually trying to make the software run. And there are subtly complex problems that people don't think about. For example, if you're a systems administrator, how do you manage 10,000 nodes? Do you have a screen that shows 10,000 nodes and what's going on? That's a little hard to deal with. Do you have layers of screens? How do you get alerts? How do you manage the operational behavior of the system when there's so much information out there about the component pieces?”

Lori Diachin, ECP – Sebastian Moss

Each of the upcoming exascale systems in the US is budgeted under the CORAL-2 acquisition process, which is open to the same six US vendors involved with PathForward. “Those are not part of the ECP itself, but rather a separate procurement process,” Diachin explained.

“The ECP is a large initiative that is really being designed to make sure that those exascale platforms are going to be ready to go on day one, with respect to running real science applications.

“Our primary focus is on that software stack. It's really about the application, it's about the infrastructure, and the middleware, the runtime systems, the libraries.”

For a future exascale system, IBM is likely to turn to the same partner it has employed for Summit and Sierra - GPU maker Nvidia.

“You can already see us active in the pre-exascale era,” the company’s VP of accelerated computing and head of data centers, Ian Buck, told DCD. “Traditional Moore's Law has ended, we cannot build our exascale systems, or our supercomputers of the future, with traditional commodity CPU-based solutions,” he said.

“We've got to go towards a future of accelerated computing, and that's not just about having an accelerator that has a lot of flops, but looking at all the different tools that accelerated computing brings to the exascale.

“I think the ingredients are now clearly there on the table. We don't need to scale to millions of nodes to achieve exascale. That was an early concern - is there such a thing as an interconnect that can actually deliver or run a single application across millions of nodes?”

Accelerated computing - a broad term for workloads that run on non-CPU processors, which Nvidia has successfully rebranded to primarily mean its own GPUs - has cut the number of nodes required to reach exascale. “The reality is: with accelerating computing, we can make strong nodes, nodes that have four or eight, or sometimes even 16 GPUs in a single OS image, and that allows for technologies like InfiniBand to connect a modest number of nodes together to achieve amazing exascale-class performance,” Buck said.
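A purely illustrative calculation makes Buck's point (the per-node figures here are assumptions chosen for round numbers, not vendor specifications): a 'strong' node delivering around 100 double-precision teraflops puts exascale within reach of roughly ten thousand nodes, while one-teraflops CPU-only nodes would need a million.

```latex
% Illustrative node counts to reach 10^18 FLOP/s at assumed per-node peaks
\frac{10^{18}~\text{FLOP/s}}{100\times 10^{12}~\text{FLOP/s per accelerated node}} = 10^{4}~\text{nodes}
\qquad\text{vs.}\qquad
\frac{10^{18}~\text{FLOP/s}}{10^{12}~\text{FLOP/s per CPU-only node}} = 10^{6}~\text{nodes}
```

It is the smaller number that makes an existing fabric like InfiniBand, rather than an exotic million-endpoint interconnect, a plausible way to stitch the machine together.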

Intel Inside – Sebastian Moss

The first light

IBM and Nvidia are, together and separately, vying for several exascale projects. But if everything goes to plan, Intel and Cray will be delivering the first US exascale system - thanks to the past not going to plan.

Previously, the US announced that Intel and Cray would build Aurora, a 180 petaflop pre-exascale system, at Argonne National Laboratory for 2018. It was supposed to be based on Intel's third-generation ‘Knights Hill’ Xeon Phi processors, and use Cray’s ‘Shasta’ supercomputing architecture. Instead, by the summer of 2018, Knights Hill was discontinued.

“As we were looking at our investments and we continued to invest into accelerators in the Xeon processor, we found that we were able to continue to win with Xeon in HPC,” Jennifer Huffstetler, VP and GM of data center product management at Intel Data Center Group, told DCD by way of explaining why Phi was canceled.

The decision on how to proceed without Phi came as China was announcing a plan to build an exascale system by 2020 - way ahead of the US’ then-goal of 2023. Hoping to remain competitive, the DOE brought Aurora forward to 2021, increasing its required computing capabilities, with the aim of being the country’s first exascale system.

Details of the system are still shrouded in mystery and, with the initial Aurora system canceled, there will not be a pre-exascale system to trial the technology. In a re-reveal in March, Intel announced that the $500m Aurora will feature an upcoming Xeon Scalable Processor and the 'Xe' chip, of which few details are known.

We do know, however, that Aurora will rely on Cray’s Shasta platform. Coming first to the Perlmutter machine at the National Energy Research Scientific Computing Center (NERSC) in 2020, “the Shasta architecture is designed to be exascale capable,” Cray CTO Steve Scott told DCD.

"Its design is motivated by increasingly heterogeneous data-centric workloads." With Shasta, designers can mix processor architectures, including traditional x86 options, Arm-based chips, GPUs and FPGAs. "And I would expect within a year or two we'll have support for at least one of the emerging crop of deep learning accelerators," Scott added.

"We really do anticipate a Cambrian explosion of processor architectures because of the slowing silicon technology, meaning that people are turning to architectural specialization to get performance. They're diversifying due to the fact that the underlying CMOS silicon technology isn't advancing at the same rate it once did. And so we're seeing different processors that are optimized for different sorts of workloads."

Katie Antypas, division deputy and data department head at NERSC, told DCD: “I think a lot of the exascale systems that are coming online will look to Perlmutter for the kind of technologies that are in our system.”

Perlmutter, which features AMD Epyc CPUs, Nvidia GPUs and Cray's new interconnect, Slingshot, “is also the first large-scale HPC that will have an all-flash file system, which is really important for our workload,” Antypas said. “For our workloads that are doing a lot of data reads, this flash file system will provide a lot of speed up.”

Update: Cray's Shasta architecture and Slingshot interconnect will feature in another exascale supercomputer. Coming in 2021, the 1.5 exaflops Frontier supercomputer will use upcoming AMD Epyc CPUs and Radeon Instinct GPUs.

Frontier will consist of more than 100 Cray Shasta cabinets, with densities of up to 300kW per cabinet, and a 4:1 GPU to CPU ratio. The processors will have AMD Infinity Fabric links and coherent memory between them within the node, while each node will have one Cray Slingshot interconnect network port for every GPU. In total, it will use more than 90 miles (145km) of cabling.
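Taken at face value, those published figures imply a power ceiling and a minimum efficiency for the machine (a rough bound only - actual draw depends on workload and the final cabinet count):

```latex
% Upper bound from ~100 cabinets at 300 kW each, against the 1.5 exaflops target
100 \times 300~\text{kW} = 30~\text{MW}, \qquad
\frac{1.5\times 10^{18}~\text{FLOP/s}}{30\times 10^{6}~\text{W}} = 50~\text{GFLOPS/W}
```

For comparison, Summit delivers its 200 petaflops peak in 13MW - around 15 GFLOPS per watt.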

An enterprising approach

Another contender is HPE, whose PathForward efforts are co-led by Vildibill. “The work is really split between five different general areas, and across those five areas we have 20 teams that are actually doing the R&D. It's quite extensive, ranging from optics, to a Gen-Z based system design, to manufacturing of silicon, to software and storage.”

Gen-Z came out of HPE’s much-hyped research project called ‘The Machine,’ Vildibill explained. “It's a chip-to-chip protocol that is a fundamental building block for almost everything that we're working on here. We've since put the Gen-Z technology into an open industry collaboration framework,” he added, name-checking partners like Google, Cray and AMD.

HPE realized that “data movement consumes, and will consume, over an order of magnitude more energy to move the data than it does to compute the data,” Vildibill said. “The power consumption of just moving the data was exceeding the entire power envelope that we need to get to exascale computing.”

"The toughest of the power problems is not on circuits for computation, but on communication," a mammoth study detailing the challenges on the path to exascale, led by DARPA, warned back in 2008. “There is a real need for the development of architectures and matching programming systems solutions that focus on reducing communications, and thus communication power.”

Vildibill concurred: “A shift in paradigm is needed: we've got to figure out a system where we don't move the data around quite so much before we compute it.”

Gen-Z hopes to bring about this shift, with the open standard enabling high bandwidth, low latency connections between CPU cores and system memory. “It’s a technology that is very vital to our exascale direction,” Vildibill said.

Doug Kothe, ECP – Sebastian Moss

Arming up

Outside of PathForward, there is another approach HPE might take for future exascale systems - Arm chips. The company is behind the world’s largest Arm-based supercomputer, Astra, located at Sandia National Laboratories, with 2,592 dual-processor servers featuring 145,000 Cavium ThunderX2 cores.

Vildibill, whose team was responsible for Astra and the Apollo 70 servers within it, was quick to point out the project is independent of his PathForward work. But he added: "We're interested in both because we have some fundamental challenges and barriers to getting to exascale.

"What Arm offers us is a new approach to what was on our roadmaps a year or two ago. The interesting thing about Arm is that it has a development cycle that spins very quickly. Traditional CPU designs have to satisfy the commercial market, the enterprise market, the laptop market, the exascale market, etc. So it's very difficult, or very disruptive, to make changes to those CPUs that apply to all markets.”

With Arm, he said, “it is much easier for somebody to develop a very capable CPU that might be targeted at a narrow niche, and therefore can innovate more quickly and not be worried too much about disrupting all of their businesses."

Arm may end up in one of the DOE’s future exascale systems, as “one of the requirements of our exascale program is that the systems must be diverse,” the ECP’s Diachin told DCD.

She added that the DOE has also taken pains to ensure that software and exascale applications are designed to work across platforms, “because we're seeing such a wide variety of architectures now.”

But while Arm’s future in exascale projects in the US is not clear, the architecture will definitely make a big splash in Japan. Fujitsu and Japanese research institute Riken aim to develop the nation's very first exascale system, currently known as Post-K, by 2021.

Powering Post-K is A64FX, the first CPU to adopt the Scalable Vector Extension (SVE), an extension of the Armv8-A instruction set architecture designed with HPC in mind. SVE supports vector lengths from 128 to 2,048 bits, in 128-bit increments. Rather than mandating a single fixed vector length, the ISA lets CPU designers choose the length best suited to their application and market - allowing A64FX to be designed with a focus on exascale computing.
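To illustrate what 'vector-length agnostic' means in practice, here is a minimal sketch using the Arm C Language Extensions for SVE - a generic daxpy loop written for illustration, not code from the A64FX or Post-K projects. The source never states a vector width; it asks the CPU at runtime how many doubles fit in a vector, so the same binary runs on 128-bit and 2,048-bit implementations alike.

```c
#include <arm_sve.h>
#include <stdint.h>

/* y[i] += a * x[i], written once for any SVE vector length.
   Compile with an SVE-enabled toolchain, e.g. -march=armv8-a+sve. */
void daxpy(int64_t n, double a, const double *x, double *y) {
    for (int64_t i = 0; i < n; i += (int64_t)svcntd()) {   // svcntd(): doubles per vector on this CPU
        svbool_t pg = svwhilelt_b64_s64(i, n);              // predicate masks off the array tail
        svfloat64_t vx = svld1_f64(pg, &x[i]);              // load active lanes of x
        svfloat64_t vy = svld1_f64(pg, &y[i]);              // load active lanes of y
        vy = svmla_f64_x(pg, vy, vx, svdup_n_f64(a));       // vy += vx * a
        svst1_f64(pg, &y[i], vy);                            // store active lanes back to y
    }
}
```

A64FX implements 512-bit vectors, but nothing in the code above depends on that choice - which is the flexibility Van Hensbergen describes.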

“That was key. If we do things correctly, the ISA and the architecture gives you an envelope in which you can do a lot of diversification, but without hurting the software ecosystem,” Arm’s Van Hensbergen explained.

The adoption of SVE, which has enabled Japan and Fujitsu to target such an ambitious timeframe, was a lucky occurrence. “We were fortunate with SVE in that the origins of that was something very different, and it kind of lined up very nicely; just as we were about to put it on the shelf, Fujitsu and Cray came along and said 'it would be nice if you had this' and we were like 'ah, I was just about to put this away,' and then we re-tailored it and re-tuned it. Otherwise, it may have taken us longer to get it to market, honestly.”

Arm may also find its way into European exascale systems, with the EU still deciding on the specifics of its approach.

“There is no high performance chip design in Europe, not for general purpose processors anyway,” Van Hensbergen said. “It eroded steadily over the years, with [European organizations] buying from other countries, and the local industry disappeared. But now they're spending hundreds of millions of euros to try to rebootstrap that industry, because they don't want to be at a disadvantage if politics does start interfering with supply.”

The European project

The details remain murky, but the level of ambition is clear. Juan Pelegrin, Head of Sector Exascale Computing at the European Commission, said: “HPC is critical, and Europe has to be sovereign, we need to be able to rely on ourselves and our own technology.

“We’re going to buy two pre-exascale, plus two or three petascale systems by 2020, two exascale by 2022/23, one of which will primarily rely on European technology. And then we are looking at post-exascale infrastructure by 2027 - given the advancements in the quantum computing field, we hope that the quantum components will fit into the HPC infrastructure.”

The project has secured around €500m (US$569m) from the EU to invest in 2019-2020, with participating governments set to match that level, along with private businesses providing in-kind contributions to the tune of €422m ($480m).

EuroHPC, the joint undertaking that is meant to run all the European supercomputing efforts, has also made a bid to get a further €2.7bn ($3bn) from the EU Digital Europe program for HPC investment across exascale, quantum computing, and competing architectures like neuromorphic computing.

Pelegrin added: “There’s also the Horizon Europe program. We’ve made a bid for €100bn, but don’t get excited - it’s not all for HPC. That will be for new research only: algorithms, etc.”

For the exascale system made with European technology, the EU has kicked off a huge project within the wider exascale program - the European Processor Initiative.

“If we look back 60 years ago, what they did in the US with the Apollo program, it sounded crazy at the time, and then they made it,” Philippe Notton, the general manager of the EPI and head of processors at Atos, said.

“Now I am glad and proud to introduce the next moonshot in Europe, which is EPI - different kind of budget, same kind of mission. Crazy, but we're going to do it.”

The plan? To build a European microprocessor using RISC-V architecture, embedded FPGA and, perhaps, Arm (“pending the final deal”) components, ready for a pre-exascale system in 2021, and an upgraded version for the exascale supercomputer in 2022/23.

Featuring 23 partners across academia and business, the group also hopes that the low-power chip will be used in the automotive market, presumably for autonomous vehicles - with BMW, one of those partners, expecting to release a test car sporting a variant of the chip in a few years.

“We are also working on PCIe cards, server blades, HPC blades, to hit the target of 2021 - you will have lots of things from EPI,” Notton said, adding “it’s thanks to this,” as he pointed to the bags under his eyes.

Another exciting approach to exascale can be found in China. “They've invested a tremendous amount in building their ecosystem,” Van Hensbergen said. “In some ways they are the poster child for how to catalyze an advantage by just pouring money into it. I think that everyone knows that they're being very aggressive about it, and everyone is trying to react to it, but it's difficult to keep up with the level of investment that they're putting in.”

Professor Qian Depei – Sebastian Moss

A national movement

Professor Qian Depei, chief scientist for the national R&D project on high performance computing in China, explained: “It was quite unusual [for China] to continually support key projects in one area, but this one has been funded for 15 years. That reflects the importance of the high performance program.”

The result was the ‘National Grid,’ which features some 200 petaflops in shared computing power across the country, with roughly 19,000 users. Yet the country wants more.

Ultimately, “the goal is to build exascale computers with self-controllable technology, that's a kind of lesson we learned in the past. We just cannot be completely bound to external technology when we build our own system.”

This lesson, hammered home in the ongoing US-China trade war, was first experienced years ago. In 2015, the US Department of Commerce banned US companies from selling Xeon and Xeon Phi CPUs to China's leading national laboratories, claiming the chips were being used to build systems that simulated "nuclear explosive activities," presenting a national security threat.

The ban backfired - realizing how vulnerable its reliance on foreign technology left it, China pumped money into domestic computing. “I would say it helps the Chinese when the US imposes an embargo on technology, it just forced more funding to go into developing that technology. It accelerated the roadmap,” Professor Jack Dongarra, who curates the twice-yearly Top500 list of the world’s fastest computers, told DCD.

“They have made considerable progress since the embargo was imposed. They have three exascale machines in planning, and I would expect them to deliver something in the 2020-21 timeframe, based on their own technology.”

Before it builds its exascale systems, China is following the same approach as the US and the EU, developing several prototypes. “The three prototypes are Sugon, Tianhe-3, and Sunway,” Qian said. “We hope that they show different approaches towards a future exascale system.”

Sugon adopts “a relatively traditional accelerated architecture, using x86 processors and accelerators, with the purpose of maintaining legacy software assets.” It uses processors from Chinese chip-maker Hygon, which has a complicated licensing agreement allowing it to make chips based on AMD’s Zen microarchitecture. “Sugon uses low-temperature evaporative cooling, and has no need for fans,” Qian said. “It has a PUE below 1.1, and as the system will be cooler, it increases reliability and performance.”

Next up is Tianhe-3, which is “based on a new manycore architecture. Each computing node includes three Matrix-2000+ processors, it uses a butterfly network interconnect, where the maximum number of hops in a communication path is four for the whole system,” Qian explained. “They are working on the next generation interconnect to achieve more than 400Gbps bandwidth.”

Then there’s Sunway, another manycore-based machine. “Currently the system is still implemented using the old processor, the ShenWei 26010. The number of nodes is 512, and the peak performance is three petaflops,” but it will soon be upgraded with a newer processor.

In developing the prototypes, China has identified several important bottlenecks, Qian said. “For example, the processor, including manycore processor and accelerator, the interconnect, the memory - 3D memory is a big issue, we don't have that capability in China - and software, that will probably take a longer time to improve, and is a big bottleneck.”

Considering its attempts to design homegrown processors, “the ecosystem has become a very crucial issue” in China, Qian said. “We need the language, the compilers, the OS, the runtime to support the new processors, and also we need some binary dynamic translation to execute commercial software on our new system. We need the tools to improve the performance and energy efficiency, and we also need the application development support. This is a very long-term job. We need the cooperation of the industry, academia and also the end-users.”

With a rough target of 2020, China may beat the US to the exascale goal, but Qian admitted the first machines will likely be “modest compared with US exascale computers.”

Once each nation has built its exascale systems, the demand will come again: More. This is a race, but there is no end in sight.

“For us, the big question is what is next?” Marcin Ostasz, of the European industry-led think tank ETP4HPC, said. “We have a lot of projects in place, we know there is a lot of funding through EuroHPC.”

“We invited the gurus, the brains of the world, into a workshop in June, and we asked the question: 'Where is the world going?' And we got a lot of good answers that will help us find this vision of the post-exascale systems.”

Like the systems before it, that vision will likely be driven by the needs of a society desperate to find solutions to complicated problems, Ostasz said. "We need to look at the gas emissions, at the human burden of dementia; there are challenges in health, in food, etc. These are the applications we will need to address."

This feature appeared in the February issue of DCD Magazine.