Even with the huge investments made in building out supercomputers in the cloud or in the lab, problems can arise.
“Recently, we saw that due to some issue with the GPUs on our cluster, we actually had to underclock them, because they would just blow past 500 watts a GPU at full throttle, and that would basically burn the GPU and your run would die,” EleutherAI’s Shivanshu Purohit said.
“Even the cloud provider didn't consider it because they thought it shouldn't happen, because it doesn't usually happen. But then it did.”
Similarly, high energy particles “can break through all the redundancies and corrupt your GPU,” he said.
“There might be new problems as we scale beyond where we are right now. There's a limit to how many GPUs you can store in a single data center - currently, the limit is around 32,000, both due to power and challenges in how to actually design the data center.”
Perhaps the answer is not to build ever larger data centers, but instead move away from GPUs.
Computing’s new wave
Over the past half-decade, as Moore’s Law has slowed and other AI applications have proliferated, AI chip companies have sprouted like mushrooms in the rain.
Many have failed, or been acquired and asset-stripped, as a promised AI revolution has been slow to occur. Now, as a new wave of compute again seems poised to flood data centers, they are hopeful that their time has come.
Each company we spoke to believes that its unique approach will be able to solve the challenge posed by ever-growing AI models.
“We believe our tech is uniquely good at where we think models are going to go,” said Matt Mattina, head of AI at chip startup Tenstorrent.
“If you buy into this idea that you can't just natively get to 10 trillion parameters, or however many trillions you want, our architecture has scaling built in.
“So generative AI is fundamentally matrix multiplies [a binary operation that produces a matrix from two matrices], and it’s big models,” he continued. “For that, you need a machine that can do matrix multiply at high throughput and low power, and it needs to be able to scale. You need to be able to connect many, many chips together.
“You need a fundamental building block that's efficient in terms of tops (Tera Operations Per Second) per watt, and can scale in an efficient way, which means that you don't need a rack of switches when you add another node of these things.”
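Mattina's efficiency argument is easy to make concrete. A rough back-of-the-envelope sketch (the matrix dimensions and the TOPS-per-watt figure below are illustrative assumptions, not Tenstorrent's numbers):

```python
# FLOPs for one matrix multiply C = A @ B with A (m x k) and B (k x n):
# each of the m*n outputs needs k multiplies and k adds, so ~2*m*k*n ops.
def matmul_flops(m: int, k: int, n: int) -> int:
    return 2 * m * k * n

# A hypothetical transformer-layer-sized multiply
flops = matmul_flops(12288, 12288, 12288)
print(f"{flops / 1e12:.1f} TFLOPs per multiply")  # ~3.7 TFLOPs

# At an illustrative 1 TOPS/W, the energy cost of that single multiply:
tops_per_watt = 1.0
joules = flops / (tops_per_watt * 1e12)
print(f"~{joules:.1f} J per multiply")
```

Multiply that by thousands of such operations per token and billions of tokens, and the case for maximizing operations per watt - and for scaling cheaply past a single chip - follows directly.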
The company’s chips each have integrated Ethernet, “so the way you scale is you just connect the chips together over standard Ethernet, there's not a labyrinth of switching and stuff as you go to bigger sizes,” and the company claims its software makes scaling easy.
“It is a very promising architecture,” SemiAnalysis’ Dylan Patel said. “It's very interesting from a scaling and memory standpoint and a software programmability standpoint. But none of that is there yet.
“The hardware exists in some capacity and the software is still being worked on. It's a tough problem for them to crack and be usable, and there's a whole lot that still needs to be done.”
Rival Cerebras has a different approach to scaling: Simply make the chip larger.
The Wafer Scale Engine 2 (WSE-2) chip has 2.6 trillion transistors, 850,000 'AI optimized' cores, 40GB of on-chip SRAM memory, 20 petabytes per second of memory bandwidth, and 220 petabits per second of aggregate fabric bandwidth. It is packaged in the Cerebras CS-2, a 15U box that also includes an HPE SuperDome Flex server.
“When these big companies are thinking about training generative AI, they're often thinking of gigaflops of compute,” Cerebras CEO and co-founder, Andrew Feldman, said. “We're more efficient [than the current GPU approach], for sure, but you're still going to use an absurd amount of compute, because we're training in a sort of brute force manner.”
Feldman, too, believes that there will be a limit to the current approach of giant models, “because we can't go bigger and bigger forever, there's some upper bound.” He thinks sparsity approaches will help bring model sizes down.
Still, he agrees that whatever the models, they will require huge compute clusters. “Big clusters of GPUs are incredibly difficult to use,” he said. “Distributed compute is very painful, and distributing AI work - where you have to go tensor model parallel, and then you have to go pipeline model parallel, and so on - is an unbelievably complicated process.”
The company hopes to solve some of that challenge by moving what would be handled by hundreds of GPUs onto one multi-million dollar mega-chip.
“There are two reasons you break up work,” he said. “One is you can't store all the parameters in memory, second reason is that you can't do a calculation that is needed, and that's usually a big matrix multiply in a big layer.”
In the 175bn parameter GPT-3, the largest matrix multiply is about 12,000 by 12,000. “We can support hundreds of times larger, and because we store our parameters off-chip in our MemoryX technology, we have an arbitrarily large parameter store - 100-200 trillion is no problem,” he claimed. “And so we have the ability to store vast numbers of parameters, and we have the ability to do the largest multiplication step.”
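The memory arithmetic behind off-chip parameter stores is straightforward; a quick sketch (byte counts assume common datatypes, and the model sizes are illustrative):

```python
# Memory needed just to hold model weights, ignoring optimizer
# state and activations (which multiply the total further).
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    return num_params * bytes_per_param / 1e9

# GPT-3-scale model in FP16 (2 bytes per parameter):
print(f"{weight_memory_gb(175e9):,.0f} GB")    # 350 GB - far beyond 40 GB of on-chip SRAM

# A hypothetical 100-trillion-parameter model:
print(f"{weight_memory_gb(100e12):,.0f} GB")   # 200,000 GB, i.e. 200 TB
```

Hence the appeal of a dedicated parameter store: no single chip's on-board memory comes close, and training adds optimizer state on top of the raw weights.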
The single huge chip is not big enough for what the biggest models require, however. “And so we built Andromeda, which is 13.5 million cores. It's one and a half times larger than [Oak Ridge’s exascale system] Frontier in core count, and we were able to stand it up in three days. The first customer put on it was Argonne [another US national computing laboratory], and they were doing things they couldn't do on a 2,000 GPU cluster.”
The Andromeda supercomputer, available over the cloud, combines 16 of Cerebras’ CS-2 systems, but Cerebras has the potential ability to scale to 192 such systems as one cluster. “The scaling limitation is about 160 million cores,” said Feldman.
Cerebras is not the only company to offer its specialized hardware as a cloud product.
“We have decided to change our business model from selling hardware to operating an AI cloud,” Simon Knowles, the CTO of British AI chip startup Graphcore, said.
“Is it realistic to set up and operate an AI cloud? Clearly, it's sensible because of the enormous margins that Nvidia is able to harvest. The real question is, is there a market for a specialized AI cloud that a generic cloud like AWS doesn't offer? We believe, yes, there is, and that is with IPUs.”
The company’s IPU (Intelligence Processing Unit) is another parallel processor designed from the ground up for AI workloads.
“IPUs have been designed from day one with a mandate not to look like GPUs,” Knowles said. “I'm amazed how many of the startups have tried to basically be an alternative GPU. The world doesn't need another Nvidia; Nvidia are quite good.”
He believes that “what the world needs is machines of different shapes, which will perform well on things where Nvidia can clearly be beaten.” That’s part of the reason why Graphcore is building its own cloud. While it will still sell some hardware, it found that customers won’t commit to buying hardware, because they want it to be as good as or better than Nvidia GPUs on all workloads.
“They wanted insurance that it would satisfy all their future needs that they didn't know about,” he said. “Whereas, as a cloud service, it's like ‘for this set of functions, we can do it at half the price of them.’”
Equally, he does not want to compete with AWS on every metric. “You'd have to be quite bold to believe that one cloud based on one technology could do everything well,” he said.
Another startup offering specialized hardware on the cloud, on-prem, or as a service, is SambaNova. “As the models grow, we just believe that [SambaNova’s architecture] Dataflow is what you're going to need,” CEO Rodrigo Liang said. “We just believe that over time, as these models grow and expand, the power required, the amount of cost, all those things will just be prohibitive on these legacy architectures.
“So we fundamentally believe that new architecture will allow us to grow with the size of the models in a much more effective and much more efficient way, than the legacy ways of doing it.”
But the incumbent legacy chip designers have also fielded hardware aimed at serving the training and inference needs of the latest AI models.
“Habana Gaudi has already been proven to be like 2× the performance of the A100 GPU on the MLPerf benchmark,” Dr. Walter Riviera, Intel’s AI technical lead EMEA, claimed of the company’s deep learning training processor.
“When it comes to the GPU, we have the Flex series. And, again, depending on the workload, it is competitive. My advice for any customers out there is test and evaluate what's going to be best for them.”
AMD has in recent years clawed CPU market share from Intel. But in the world of GPUs it has the second-best product on the market, SemiAnalysis’ Dylan Patel believes, and has yet to win a significant share.
“If anyone is going to be able to compete, it's the MI300 GPU,” he said. “But it's missing some things too, it's not there in the software, and there are some aspects of the hardware that are going to be more costly. It's not a home run.”
AMD's data center and accelerated processing CVP Brad McCredie pointed to the company’s leadership in HPC as a key advantage. “We’re in the largest supercomputer on three continents,” he said. “Such a big piece of this exploding AI mushroom is scale, and we've demonstrated our scale capability.”
McCredie also believes that AMD’s successes with packing a lot of memory bandwidth onto its chips will prove particularly compelling for generative AI. “When you go into the inferencing of these LLMs, memory capacity and bandwidth comes to the fore. We have eight stacks of high-bandwidth memory on our MI250, which is a leadership position.”
Another key area he highlighted is power efficiency. “When you start getting to this scale, power efficiency is just so important,” he said. “And it's going to keep growing.”
Then there’s the tensor processing unit (TPU), a custom AI chip family developed by Google - the same company that came up with the transformer model that forms the basis of current generative AI approaches.
“I think one of the main advantages of TPUs is the interconnect,” researcher Finbarr Timbers said.
“They have really high networking between chips, and that's incredibly useful for machine learning. For transformers generally, memory bandwidth is the bottleneck. It's all about moving the data from the RAM on the machine onto the on-chip memory, that's the huge bottleneck. TPUs are the best way to do this in the industry, because they have all of this dedicated infrastructure for it.”
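Timbers' bottleneck claim can be framed with a simple roofline-style estimate (the peak-compute and bandwidth numbers here are illustrative, not any specific chip's):

```python
# Roofline model: attainable throughput is capped either by peak compute
# or by memory bandwidth times arithmetic intensity (FLOPs per byte moved).
def attainable_tflops(flops_per_byte: float,
                      peak_tflops: float,
                      mem_bw_tbs: float) -> float:
    return min(peak_tflops, flops_per_byte * mem_bw_tbs)

PEAK, BW = 300.0, 2.0   # illustrative: 300 TFLOPs peak, 2 TB/s bandwidth

# Autoregressive transformer inference reads each 2-byte weight once per
# token for ~2 FLOPs: intensity ~1 FLOP/byte, leaving compute mostly idle.
print(attainable_tflops(1.0, PEAK, BW))     # 2.0 - bandwidth-bound
# Large-batch training reuses each weight across the batch: high intensity.
print(attainable_tflops(500.0, PEAK, BW))   # 300.0 - compute-bound
```

On that arithmetic, doubling memory bandwidth directly speeds up the low-intensity case, while extra raw compute does nothing - which is why dedicated memory and interconnect plumbing matters so much.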
The other advantage of the chip is that it’s used by Google to make its largest models, so the development of the hardware and models can be done in tandem.
“It really comes down to co-design,” Google’s Amin Vahdat said. “Understanding what the model needs from a computational perspective, figuring out how to best specify the model from a language perspective, figuring out how to write the compiler, and then map it to the hardware.”
The company also touts the TPU’s energy efficiency as a major advantage as these models grow. In a research paper, the company said that its TPUv4 DSAs used ~2-6× less energy and produced ~20× less CO2e than contemporary rival chips (not including the H100) - but the major caveat is that it was comparing its hyperscale data center to an on-premise facility.
Amazon also has its own Trainium chip family. It has yet to make as much of a splash, although Stability AI recently announced that it would look at training some of its models on the hardware (likely as part of its cloud deal with AWS).
“One capability that I would like to highlight is hardware-accelerated stochastic rounding,” said AWS’ director of EC2, Chetan Kapoor.
“So stochastic rounding is a capability that we've built in the chip that intelligently says, okay, am I going to round a number down or up?,” he said, with systems normally just rounding down. “It basically means that with stochastic rounding you can actually get the throughput of FP16 datatype and the accuracy of FP32.”
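A minimal sketch of the general stochastic rounding technique (an illustration of the idea, not AWS's hardware implementation): round up with probability equal to the fractional part, so the rounding error cancels out in expectation:

```python
import math
import random

def stochastic_round(x: float) -> int:
    """Round down or up at random, with P(round up) = fractional part of x."""
    floor = math.floor(x)
    return floor + (random.random() < x - floor)

random.seed(0)
samples = [stochastic_round(0.3) for _ in range(100_000)]
mean = sum(samples) / len(samples)
# Always rounding down would give 0 every time, silently losing the value;
# here roughly 30% of samples round up to 1, so the average stays near 0.3.
print(f"{mean:.3f}")
```

In low-precision training the same trick keeps tiny gradient updates from vanishing: individually they may fall below the datatype's resolution, but stochastic rounding preserves them on average.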
Nvidia: The king of generative AI
Nvidia has not been napping - and chip rivals hoping to disrupt its fat margins will find the task as daunting as Microsoft's Bing has found nibbling away at Google's search dominance.
Rather than seeing this moment as a threat to its position - a 'code red' akin to the one at Google - Nvidia says it is the culmination of decades of preparation.
“They've been talking about this for years,” SemiAnalysis’ Patel said. “Sure they were caught off guard with how quickly it took off in the last few months, but they were always targeting this. I think they're very well positioned.”
Outside of Google’s use of TPUs, virtually all the major generative AI models available today were developed on Nvidia’s A100 GPUs. The models of tomorrow will primarily be built with its newly-launched H100s.
Decades of leading the AI space has meant that an entire sector has been built around its products. “Even as an academic user, if I were to be given infinite compute on those other systems, I would have to do a year of software engineering work before I can even make them useful because the entire deep learning stack is on Nvidia and Nvidia Mellanox [the company’s networking platform],” EleutherAI’s Anthony said. “It's all really a unified system.”
Colleague Purohit added: “It’s the whole ecosystem, not just Mellanox. They optimize it end-to-end so they have the greatest hardware. The generational gap between an A100 and H100 from the preliminary tests that we have done is enough that Nvidia will be the compute king for the foreseeable future.”
In his view, Nvidia has perfected the hardware-improves-software-improves-hardware loop, “and the only one that competes is basically Google. Someone could build a better chip, but the software is optimized for Nvidia.”
A key example of Nvidia’s efforts to stay ahead was its launch of the tensor core in late 2017, designed for superior deep learning performance over regular cores based on Nvidia’s CUDA (Compute Unified Device Architecture) parallel platform.
“It changed the game,” Anthony said. “A regular user can just change their code to use mixed precision tensor cores for compute and double their performance.”
Now, Nvidia hopes to push things further with a transformer engine in the H100, for FP8. “It’s a hardware-software combination, actually,” Nvidia’s head of data centers and AI, Ian Buck, said. “We basically added eight-bit floating point capability to our GPU, and did that intelligently while maintaining accuracy.”
A software engine monitors the numerical accuracy of the training or inference job as it runs, and dynamically drops precision to FP8 where it can do so without hurting accuracy.
“Tensor cores killed FP32 training entirely. Before that everything was on FP32,” Anthony said. “I don't know if the move to FP8 will be the same, maybe it is not enough precision. We’re yet to see if deep learning people can still converge their models on that hardware.”
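The scaling idea at the heart of eight-bit training can be sketched in miniature (a software simulation of the general technique with a simplified mantissa model - not Nvidia's transformer engine):

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value in the FP8 e4m3 format

def fp8_sim(x: np.ndarray) -> np.ndarray:
    """Crudely simulate e4m3: clip to range, keep ~4 significant bits."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    mantissa, exponent = np.frexp(x)          # x = mantissa * 2**exponent
    mantissa = np.round(mantissa * 16) / 16   # snap to a 4-bit mantissa grid
    return np.ldexp(mantissa, exponent)

def scaled_cast(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Monitor the tensor's largest value and rescale it to fill the FP8
    range before casting - the trick that preserves accuracy at 8 bits."""
    scale = E4M3_MAX / max(np.abs(x).max(), 1e-12)
    return fp8_sim(x * scale), scale

x = np.array([3e-4, -2e-5, 5e-4, 1e-4])  # small values, e.g. gradients
quantized, scale = scaled_cast(x)
recovered = quantized / scale
rel_err = np.max(np.abs(x - recovered) / np.abs(x))
print(rel_err < 0.07)  # round-trip error stays within FP8's ~4-bit precision
```

The "engine" part is the monitoring: maximum-magnitude statistics are tracked per tensor as the job runs, so each tensor gets its own scale and the narrow FP8 range is spent where that tensor's values actually live.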
But just as the Tesla GPUs in Summit are too old for today’s challenges, H100s won’t be up to the models of the future.
“They're evolving together,” Buck said, pointing out that Nvidia’s GTX 580 cards were used to build AlexNet, one of the most influential convolutional neural networks ever made, way back in 2012.
“Those GPUs are completely impractical today, a data center could not even be built to make them scale for the models of today, it would just fall over,” Buck said.
“So are current GPUs going to get us to 150 trillion parameters? No. But the evolution of our GPUs, the evolution of what goes into the chips, the architecture itself, the memory interconnect, NVLink, and data center designs, will. And then all the software optimizations that are happening on top is how we beat Moore's Law.”
For now, this market remains Nvidia’s to lose. “As everyone's trying to race ahead in building these models they're going to use [Nvidia’s] GPUs,” Patel said. “They're better and easier to use. Generally, actually, they're cheaper too when you don't have to spend as much time and money on optimizing them.”
This may change as models mature. Currently, in a cut-throat space where performance and speed of deployment are at a premium, Nvidia represents the safe and highly capable bet.
As time goes on and that pressure eases, companies may look to alternative architectures and optimize deployments on cheaper gear.