Facebook owner Meta has shared an AI platform design at the Open Compute Project (OCP) summit, where hyperscale players offer cost-saving open source hardware designs for all to use.
Meta showed the Grand Teton AI platform, along with a new implementation of OCP's Open Rack v3 standard, and a new HDD storage system.
The OCP also used its summit to announce that it supports sustainability.
"As AI models become increasingly sophisticated, so will their associated workloads," says Meta VP of engineering Alexis Bjorlin, in a blog post announcement. The Grand Teton GPU-based AI hardware platform has four times the bandwidth of its predecessor Zion, and comes in a single integrated chassis, whereas Zion was packaged in multiple subsystems.
"The previous-generation Zion platform consists of three boxes: a CPU head node, a switch sync system, and a GPU system, and requires external cabling to connect everything," says Bjorlin. "Grand Teton integrates this into a single chassis with fully integrated power, control, compute, and fabric interfaces for better overall performance, signal integrity, and thermal performance."
Open Rack V3 - worth the wait?
Open Rack V3 was announced in 2019, allowing DC busbars and liquid cooling, and Meta's latest racks implement and enhance this. The power shelf can be installed anywhere in the rack, with multiple shelves on a single busbar, so power densities can go to 30kW per rack. The 48V power distribution allowed in ORV3 will support power-hungry AI hardware such as Grand Teton.
Meta has upgraded the battery backup unit, so each rack can continue to operate for four minutes if power is interrupted, compared with a previous limit of 90 seconds. The unit can be installed flexibly, and supports 15kW; two can be installed for those 30kW racks.
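To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes a constant load at the quoted 15kW rating for the whole runtime, which is an illustrative simplification rather than anything Meta has published. The function names are hypothetical.

```python
import math


def backup_energy_kwh(load_kw: float, runtime_s: float) -> float:
    """Energy delivered at a constant load over the runtime, in kWh.

    Assumption: the load stays flat at load_kw for the full runtime.
    """
    return load_kw * runtime_s / 3600.0


def units_needed(rack_kw: float, unit_kw: float) -> int:
    """Number of backup units required to cover a rack's power draw."""
    return math.ceil(rack_kw / unit_kw)


if __name__ == "__main__":
    unit_kw = 15.0        # rating quoted for one backup unit
    rack_kw = 30.0        # rack power density quoted for ORV3
    new_runtime_s = 240   # four minutes
    old_runtime_s = 90    # previous limit

    print("Units per 30kW rack:", units_needed(rack_kw, unit_kw))
    print("Energy per unit, four minutes (kWh):",
          backup_energy_kwh(unit_kw, new_runtime_s))
    print("Energy at the same load, 90 seconds (kWh):",
          backup_energy_kwh(unit_kw, old_runtime_s))
```

Under that assumption, holding a 15kW load for four minutes takes about 1kWh, against 0.375kWh for 90 seconds at the same load. The real comparison depends on the power rating of the earlier unit, which is not stated here.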
Rack watchers might feel there's been a lengthy time between the announcement and delivery of some of these features, but Bjorlin says this is inherent to the open source, community-led OCP process, and will be worth the wait.
"Meta chose to develop almost every component of the ORV3 design through OCP from the beginning," Bjorlin explains. "While an ecosystem-led design can result in a lengthier design process than that of a traditional in-house design, the end product is a holistic infrastructure solution that can be deployed at scale with improved flexibility, full supplier interoperability, and a diverse supplier ecosystem."
With power levels increasing, ORV3 allows multiple liquid cooling options, including air-assisted liquid cooling (AALC) and facility water cooling, where the racks plug into a circulation system.
ORV3 now includes an option for a quick, non-drip "blind mate" connector, which has emerged from the ORV3 Blind Mate Interfaces Group that Meta set up in 2020. This allows IT gear to be plugged into the liquid manifold without drips, for easier servicing. The OCP standard specification covers connectors, manifolds, and hose and tubing requirements.
Bjorlin hints that liquid cooling will have to become more widespread at Meta: "You might be asking yourself, why is Meta so focused on all these areas? The power trend increases we are seeing, and the need for liquid cooling advances, are forcing us to think differently about all elements of our platform, rack and power, and data center design."