The battle for the heart of the data center is heating up.
Once, the story was a simple one, of a server market dominated by Intel’s x86 CPUs, steadily refreshed in line with Moore’s Law.
But, as is the way of things, a monopoly bred complacency, slowing innovation and technological progress. That changed in 2017, when AMD aggressively returned to the server market with the Epyc processor line.
This feature appeared in Issue 50 of the DCD Magazine.
Now the company faces threats of its own - on the CPU side from a reinvigorated Intel, and a slew of Arm competitors. Over on the GPU side, where it has long played second fiddle to Nvidia, it has watched as its rival exploded in popularity, selling A100s and H100s by the ton.
At the company’s data center summit this summer, we caught up with AMD’s CTO Mark Papermaster to discuss its war on multiple fronts.
There, the chip designer’s big announcement was the Epyc 97X4 processor line, codenamed Bergamo. Using the new Zen 4c architecture, a 'cloud-native' version of Zen 4, Bergamo features a design that has a 35 percent smaller area, and twice as many cores, with the chips optimized for efficiency rather than just performance.
“We had to knock out an incumbent that had over 95 percent market share. You don't do that by saying ‘I have a more efficient solution,’ you have to knock them out by bringing in a more performant solution,” Papermaster said. “But we did that. Now, we are able to add an efficient throughput computing device to our portfolio.”
The company claims that hyperscalers have begun buying the new processor family at scale, drawn in by the cost savings of more energy-efficient chips. “At the end of the day, customers make decisions based on their total cost of ownership - they look at the compute they’re getting, what power they are spending, the floor space that they have to dedicate to their server, and that’s where we believe we have a significant advantage versus competitors,” Papermaster said.
“If we hadn't designed for this point, we would have left that open. But we believe with Bergamo, we have a compelling TCO story versus Arm competitors.”
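The total cost of ownership comparison Papermaster describes can be sketched as simple arithmetic over the three factors he names: compute delivered, power spent, and floor space. The Python below is a minimal illustration under stated assumptions, not AMD's or any customer's model. The class, the function, the four-year lifetime, and every price and server figure are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class ServerOption:
    """Hypothetical server configuration; every field is a placeholder."""
    name: str
    purchase_usd: float   # up-front hardware cost per server
    power_kw: float       # average power draw per server
    rack_units: int       # floor space, expressed in rack units
    throughput: float     # work delivered per server, in arbitrary units


def tco_per_unit_of_work(option, years=4.0, usd_per_kwh=0.10,
                         usd_per_rack_unit_year=300.0):
    """Lifetime cost (hardware + energy + space) divided by throughput.

    The default lifetime and prices are assumptions for illustration only.
    """
    hours = years * 365 * 24
    energy_cost = option.power_kw * hours * usd_per_kwh
    space_cost = option.rack_units * usd_per_rack_unit_year * years
    total = option.purchase_usd + energy_cost + space_cost
    return total / option.throughput


# Two made-up options: a faster-per-core part and a denser, many-core part.
options = [
    ServerOption("performance-tuned", 12000.0, 0.6, 2, 100.0),
    ServerOption("density-tuned", 12000.0, 0.6, 2, 130.0),
]
for option in options:
    print(option.name, round(tco_per_unit_of_work(option), 2))
```

With identical purchase price, power, and space, the option that delivers more throughput comes out cheaper per unit of work, which is the shape of the argument made for a denser, efficiency-tuned part. Real comparisons would also need utilization, cooling overhead, and software costs.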
Papermaster defended x86 against Arm, which is pitched as being more efficient. “People think ‘oh, Arm is inherently more efficient, Arm is always going to have a much, much smaller core,’” he said.
“But the biggest thing is the design point that you optimize for - if you take Arm and you optimize for the high-performance design point that we have in [AMD CPU] Genoa, and you add simultaneous multi-threading, support for instructions like 512 width vectors and neural net support, then you're gonna grow the area significantly.
“We went the other way - we had the high-performance core. And we said for cloud-native, let's optimize at a different point of the voltage and frequency curve, but add more cores.”
He added: “I think this will put a tremendous challenge in front of our Arm competitors.”
Beyond the CPU, AMD has also tried to compete in the accelerator space, operating as a distant second GPU designer.
As generative artificial intelligence has become the biggest story of the year, Nvidia has dominated headlines, wooed investors, and broken sales records.
“Right now, there's no competition for GPU in the data center,” Papermaster admitted. “Our mission in life is to bring competition.”
That mission begins with the hardware, with AMD announcing a generative AI-focused MI300X alongside its more general-purpose AI and HPC version, the MI300A. “Will there be more variants in the future?” Papermaster posited. “I'm sure there will.”
But hardware only gets you so far, with Nvidia’s dominance extending to a broad suite of software used by AI developers, notably parallel computing platform CUDA.
“Our approach is open, and if you run in their stranglehold we can port you right over, because we're a GPU,” Papermaster said. “We have a portability tool that takes you right from CUDA to ROCm.”
ROCm doesn’t support the full CUDA API, and portability mileage might vary based on the workload. Developers still attest that CUDA is superior to a ROCm port, despite Papermaster’s claims.
“You have some tuning you have to do to get the best performance, but it will not be a bottleneck for us,” Papermaster said, noting that most programmers are not writing at the lowest level, and instead primarily use PyTorch.
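To make the porting idea concrete, here is a toy sketch of source-to-source renaming from CUDA runtime calls to their HIP equivalents in ROCm. It is not AMD's portability tool and makes no claim about how that tool is implemented. The mapped HIP names are real runtime API names, but the table is deliberately tiny, and `port_source`, `SAMPLE`, and `cudaVendorOnlyCall` are hypothetical names invented for illustration.

```python
import re

# A handful of CUDA runtime names and their HIP counterparts. Real porting
# tools cover far more of the API; this table is illustrative only.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

# Match the runtime header name, or any identifier that starts with "cuda".
TOKEN = re.compile(r"cuda_runtime\.h|\bcuda\w*\b")


def port_source(source):
    """Rename known CUDA names to HIP; return (ported_source, unmapped_names)."""
    unmapped = []

    def swap(match):
        name = match.group(0)
        if name in CUDA_TO_HIP:
            return CUDA_TO_HIP[name]
        # No entry in the table: leave the call untouched and report it.
        if name not in unmapped:
            unmapped.append(name)
        return name

    return TOKEN.sub(swap, source), unmapped


# cudaVendorOnlyCall is a made-up name standing in for an unsupported call.
SAMPLE = """#include <cuda_runtime.h>
cudaMalloc(&d_x, n);
cudaMemcpy(d_x, x, n, cudaMemcpyHostToDevice);
cudaVendorOnlyCall(d_x);
cudaFree(d_x);
"""

ported, unmapped = port_source(SAMPLE)
print(ported)
print("needs manual attention:", unmapped)
```

Calls with a direct equivalent are renamed mechanically, while anything outside the table is left in place and flagged. That flagged remainder, together with performance tuning, is where the varying portability mileage shows up.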
AMD is also in the early stages of using AI to inform the future of its own chip design. “We have created an AI group within the company, which is identifying applications that could benefit from both predictive AI as well as generative AI. In chip design itself, we're finding that generative AI can speed our design processes in how we place and route the different elements and optimize the physical implementation.
“We're finding that it is speeding our verification on those circuits, and even our test pattern generation because you can run a model, and it'll tell you the fastest way to create accurate test patterns. We're also using it in our manufacturing, because we have analyzed all the yield data, when you test our chips at our manufacturing partners, and are identifying spot areas that might not be at the most optimum productivity point.”
How deeply AMD embraces AI is yet to be seen, and it's equally unclear how long the current AI wave will last. “Our determination is AI is not a fad,” Papermaster said, knocking on wood.
Keeping pace with this computing momentum requires AMD and its competitors to fire on all cylinders. “You have to have a balanced computer,” Papermaster said. “We have to attack all the elements at once. There's not one bottleneck: Every generation, what you see us do is improve the compute engines, improve the bandwidth to memory, the network connectivity, and the I/O connectivity.
“We are big believers in a balanced computer. As soon as you get fixated on one bottleneck, you're screwed.”