As RIKEN and others in the supercomputing field look to the cloud for ideas, the hyperscalers have equally turned to the HPC field to understand how to deploy massively interconnected systems.
But, as we have seen, the giants' financial resources have enabled them to outflank the traditional supercomputers.
Sudden changes are always possible but, for now, this leaves hyperscalers like Microsoft and Google in the lead - and developing new architectures for their cloud in the process.
Microsoft: Hyperscale to superscale
"My team is responsible for building the infrastructure that made ChatGPT possible," Nidhi Chappell, Microsoft GM for Azure AI, said. "So we work very closely with OpenAI, but we also work on all of our overall AI infrastructure."
Chappell’s division has been responsible for deploying some of the largest compute clusters in the world. “It's a mindset of combining hyperscale and supercomputing together into the superscale generation,” she said.
This has been a multi-year transition at the company, as it brings the two worlds together. Part of that has involved a number of high-profile hires from the traditional HPC sector, including NERSC's Glenn Lockwood, Cray's CTO Steve Scott, and the head of Cray's exascale efforts, Dr. Dan Ernst.
“All of these people that you're talking about are a part of my team,” Chappell said. “When you go to a much higher scale, you're dealing with challenges that are at a completely different scale altogether. Supercomputing is the next wave of hyperscale, in some regard, and you have to completely rethink your processes, whether it's how you procure capacity, how you are going to validate it, how you scale it, and how you are going to repair it.”
Microsoft does not share exactly what that scale is. Its standard public instances run up to 6,000 GPUs in a single cluster, but "some customers do go past the public offerings," Chappell said.
OpenAI is one of those customers: since the $1bn deal between the companies, it has worked with Microsoft on much larger specialized deployments. "But it is the same fundamental blocks that are available for any customer," she said.
Size is not the only challenge her team faces. As we saw earlier, researchers are working with ever-larger models, but are also running them for much longer.
“When you're running one single job nonstop for six months, reliability becomes front and center,” she said. “You really have to rethink design completely.”
At the scale of thousands of GPUs, some will break. Traditionally, “hyperscalers will have a lot of independent jobs and so you can take some fleet out and be okay with it,” she said.
“For AI training, we had to go back and rethink and redesign how we do reliability, because if you're taking some percentage of your fleet out to maintain it, that percentage is literally not available.
“We had to think how we could bring capacity back quickly. That turnaround time had to be reduced to make sure that all the fleet is available, healthy, and reliable all the time. That's almost fighting physics at some point.”
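The arithmetic behind Chappell's point can be sketched with a back-of-the-envelope calculation. The GPU count, MTBF, and job length below are illustrative assumptions, not Microsoft's figures; the point is that at cluster scale, some hardware failure during a long run is a statistical certainty.

```python
import math

def expected_failures(num_gpus: int, mtbf_hours: float, job_hours: float) -> float:
    """Expected number of GPU failures over one job, assuming independent
    failures at a constant rate (1 / MTBF per GPU per hour)."""
    return num_gpus * (job_hours / mtbf_hours)

def p_no_failure(num_gpus: int, mtbf_hours: float, job_hours: float) -> float:
    """Poisson approximation: probability the whole job runs with zero failures."""
    return math.exp(-expected_failures(num_gpus, mtbf_hours, job_hours))

# Assumed numbers: 6,000 GPUs, a hypothetical 50,000-hour MTBF per GPU,
# and a six-month (~4,380-hour) training job.
print(expected_failures(6000, 50_000, 4380))  # hundreds of expected failures
print(p_no_failure(6000, 50_000, 4380))       # effectively zero
```

Under these assumptions, a six-month job sees hundreds of individual GPU failures, which is why repair turnaround, rather than failure avoidance, becomes the design target.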
That scale will only grow as models expand in scope and training time. But just as OpenAI is benefitting from the flywheel of usage data to improve its next generation of models, Microsoft is also learning an important lesson from running ChatGPT’s infrastructure: how to build the next generation of data centers.
“You don't build ChatGPT's infrastructure from scratch,” she said. “We have a history of building supercomputers that allowed us to build the next generation. And there were so many learnings on the infrastructure that we used for ChatGPT, on how you go from a hyperscaler to a supercomputing hyperscaler.”
As the models get bigger and require more time, that “is going to require us to continue on the pace of bigger, more powerful infrastructure,” she said. “So I do think the pivotal moment [of the launch of ChatGPT] is actually the beginning of a journey.”
Google: From search to AI
Google also sees this as the start of something new. “Once you actually have these things in people's hands, you can start to specialize and optimize,” said the head of the search giant’s global systems and services infrastructure team, Amin Vahdat.
“I think that you're gonna see just a ton of refinement on the software, compiler, and the hardware side,” he added. Vahdat compared the moment to the early days of web search, when it would have been unimaginable for anyone to be able to index the contents of the Internet at the scale that we do today. But as soon as search engines grew in popularity, the industry rose to the challenge.
“Over the next few years, you're going to see dramatic improvements, some of it from hardware and a lot of it from software and optimizations. I think that hardware specialization can and will continue, depending on what we learned about the algorithms. But certainly, we're not going to see 10× a year for many more years, there's some fundamental things that will quickly break.”
That growth in cloud compute has come as the industry has learned and borrowed from the traditional supercomputing sector, allowing for a rapid increase in how much the hyperscalers can offer as single clusters.
But now that they have caught up, fielding systems that would rank among the top 10 of the Top500 list of the world's fastest supercomputers, they are having to forge their own path.
“The two sectors are converging, but what we and others are doing is fairly different from [traditional] supercomputing, in that it really brings together the end-to-end data sources in a much more dramatic way,” Vahdat said.
“And then I would also say that the amount of specialization we're bringing to the problem is unprecedented,” he added, echoing Professor Matsuoka’s concerns about diverging HPC types (see part III).
“In other words, a lot of what these models are doing is they're essentially pre-processing just enormous amounts of data. It’s not the totality of human knowledge, but it’s a lot, and it’s becoming increasingly multimodal.” Just preparing the input properly requires data processing pipelines that are “unprecedented.”
Equally, while HPC has coupled general-purpose processors with ultra-low-latency networking, this workload tolerates slightly higher latency envelopes, paired with specialized accelerated compute.
“You don't need that ultra-tight, almost nanosecond latency with tremendous bandwidth at the full scale,” Vahdat said.
“You still need it, but at medium to large scale, not at the extra-large scale. I do see the parallels with supercomputing, but the second and third-order differences are substantial. We are already into uncharted territory.”
The company differentiates its approach from traditional HPC by calling it “purpose-built supercomputing for machine learning,” he said.
At Google, that can mean large clusters of its in-house TPU chip family (it also uses GPUs). For this type of supercomputing, it can couple 4,096 TPUv4s. “It's determined by your topology. We happen to have a 3D torus, and the radix of your chip,” Vahdat said, essentially meaning that it is a question of how many links come out of every chip and how much bandwidth is allocated along every dimension of the topology.
“So 4,096 is really a technology question and chip real estate question, how much did we allocate to SerDes and bandwidth off the chip? And then given that number and the amount of bandwidth that we need between chips, how do we connect the things together?”
Vahdat noted that the company “could have gone to, let's say double the number of chips, but then we would have been restricting the bandwidth. So now you can have more scale, but half the bisection bandwidth, which was a different balance point.”
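The trade-off Vahdat describes can be made concrete with some simple topology arithmetic. The sketch below uses a 16x16x16 torus to reach the 4,096-chip scale mentioned in the text; the shapes and the one-link-per-neighbor assumption are illustrative, not Google's actual link or bandwidth figures.

```python
def bisection_links(x: int, y: int, z: int) -> int:
    """Number of links severed when an (x, y, z) 3D torus is cut in half
    perpendicular to its longest axis. The factor of 2 counts the wraparound
    links, which a torus has and a plain mesh does not."""
    dims = sorted([x, y, z])
    return 2 * dims[0] * dims[1]

def bisection_per_chip(x: int, y: int, z: int) -> float:
    """Bisection links available per chip -- a proxy for per-chip
    bisection bandwidth, assuming equal bandwidth on every link."""
    return bisection_links(x, y, z) / (x * y * z)

# A 16x16x16 torus holds 4,096 chips:
print(bisection_links(16, 16, 16))     # 512 links cross the cut
print(bisection_per_chip(16, 16, 16))  # 0.125

# Doubling the chip count by stretching one axis leaves the cut the same
# size, so per-chip bisection bandwidth halves -- Vahdat's "more scale,
# but half the bisection bandwidth":
print(bisection_per_chip(16, 16, 32))  # 0.0625
```

This is why the 4,096 figure is, as he puts it, a chip real-estate question: the links (SerDes) coming off each chip are fixed at design time, and every extra dimension of scale spends them.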
The sector could become even more specialized, building clusters tuned not just for machine learning broadly but specifically for LLMs - for now, though, the field is moving too fast for that.
However, that pace of change is driving Google to look beyond what a cluster even means, and to stitch clusters together into a single larger system. That could mean combining several clusters within a data center.
But, as these models get larger, it could even mean multiple data centers working in tandem. “The latency requirements are smaller than we might think,” he said. “So I don't think that it's out of the question to be able to couple multiple data centers.”
All of this change means that traditional lines of what constitutes a data center or a supercomputer are beginning to blur. “We are at a super exciting time,” he said. “The way that we do compute is changing, the definition of a supercomputer is changing, the definition of computing is changing.
“We have done a lot in the space over the past couple of decades, such as with TPUv4. We're going to be announcing the next steps in our journey, in the coming months. So the rate of hardware and software innovation is not going to be slowing down in the next couple of years.”