The modern data center is not just an expansion of an existing one. In many cases, adding AI capabilities to an enterprise requires careful planning and a fresh start. Simply adding a new GPU-optimized server to existing infrastructure will not deliver the results that organizations require. AI factories produce knowledge from existing data, and getting optimal results from them requires new thinking.

Rack-level mindset

In the past, individual servers were added to an existing set of systems in a rack based on a few specifications (CPU speed in GHz, amount of memory, and GPU choice).

Over time, this incremental buildout left racks containing many different systems for different workloads, with each server essentially self-contained. While some applications, notably in HPC, were designed to run across multiple servers, doing so required knowledge of networking protocols, additional software, and tolerance of communication delays between systems.

The new way of thinking, that “the rack is the new server,” enables data center operators to create a scalable solution by designing at the rack level.

Within a rack, an entire AI training solution can be self-contained, with a clear path to expansion when more performance is needed.

A single rack can contain up to eight servers, each with eight interconnected GPUs. Each GPU can then communicate with the other GPUs in the rack, since the switches can also be housed within the rack. The same communication can be extended between racks to scale beyond a single rack, enabling a single application to use thousands of GPUs.
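
To make the scale concrete, here is a minimal back-of-the-envelope sizing sketch in Python, based on the configuration described above (eight servers per rack, eight GPUs per server). The target GPU count is an arbitrary example, not a recommendation.

```python
# Back-of-the-envelope cluster sizing, using the rack configuration
# described above (8 servers per rack, 8 GPUs per server).
# The target GPU count below is an arbitrary example.

SERVERS_PER_RACK = 8
GPUS_PER_SERVER = 8
GPUS_PER_RACK = SERVERS_PER_RACK * GPUS_PER_SERVER  # 64 GPUs per rack

def racks_needed(target_gpus: int) -> int:
    """Smallest number of racks providing at least target_gpus GPUs."""
    return -(-target_gpus // GPUS_PER_RACK)  # ceiling division

if __name__ == "__main__":
    target = 4096  # example: a training job spanning thousands of GPUs
    racks = racks_needed(target)
    print(f"{target} GPUs -> {racks} racks "
          f"({racks * GPUS_PER_RACK} GPUs total)")
```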

Within an AI factory, different GPUs can be used. Not every application, or its agreed-upon SLA, demands the fastest GPUs on the market today; less powerful GPUs may be entirely adequate for many environments and will typically consume less electricity.

In addition, these very dense GPU servers require liquid cooling, which works best when the coolant distribution unit (CDU) is also located within the rack, reducing hose lengths.

Assembling and testing entire clusters before delivery is important to bringing up a new AI factory quickly. When a single vendor tests all of the components that go into an AI factory against the customer's requirements, the chance of issues when the components are first installed at the customer site is greatly reduced.

Figure: Plug-and-play liquid-cooled AI solution – Supermicro

The L12 (cluster-level) integration tests not only the hardware and networking components but also the software environment running across the entire cluster, not just on a single server.
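
As a flavor of what cluster-level testing automates, here is a minimal sanity-check sketch: confirm that every node reports the expected GPU count. The hostnames are hypothetical, and it assumes passwordless SSH and the standard nvidia-smi tool on each node; a real L12 test suite goes much further (burn-in, network fabric tests, full software-stack validation).

```python
# Minimal cluster sanity check: verify each node reports the expected
# number of GPUs by listing them with nvidia-smi. Hostnames are
# hypothetical placeholders; assumes passwordless SSH is configured.
import subprocess

EXPECTED_GPUS = 8
NODES = [f"node{i:02d}" for i in range(1, 9)]  # hypothetical hostnames

def gpu_count(host: str) -> int:
    """Count GPUs on a remote host via `nvidia-smi -L` (one per line)."""
    result = subprocess.run(
        ["ssh", host, "nvidia-smi", "-L"],
        capture_output=True, text=True, check=True,
    )
    return len([ln for ln in result.stdout.splitlines() if ln.strip()])

if __name__ == "__main__":
    for node in NODES:
        n = gpu_count(node)
        status = "OK" if n == EXPECTED_GPUS else "MISMATCH"
        print(f"{node}: {n} GPUs [{status}]")
```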

Learn more about rack-scale integration.

Liquid cooling

The latest generations of CPUs and GPUs are pushing servers toward liquid cooling. Cooling servers that will soon exceed 10 kW with forced air becomes more difficult with each new CPU and GPU generation.

Racks are now approaching configurations that, in total, require close to 100 kW of power, and thus produce that much heat that must be removed to keep the systems running at their designated performance. Enter liquid cooling, which is becoming mainstream, especially in AI and HPC environments where the CPUs and GPUs are expected to run at full (or boost) speed continuously. Liquid can remove hundreds of times more heat than air while also reducing the data center's cooling infrastructure requirements.
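
To put the air-versus-liquid comparison in rough numbers, the sketch below compares how much heat a given volume of water can carry versus the same volume of air, using textbook property values (not Supermicro figures). Per unit volume, water comes out a few thousand times ahead; real end-to-end systems capture less of that advantage, but the gap remains enormous.

```python
# Rough comparison of how much heat a given volume of coolant can carry
# versus the same volume of air, per degree of temperature rise.
# Property values are standard textbook figures near room temperature.

WATER_DENSITY = 997.0   # kg/m^3
WATER_CP = 4186.0       # J/(kg*K), specific heat of water
AIR_DENSITY = 1.18      # kg/m^3
AIR_CP = 1005.0         # J/(kg*K), specific heat of air

water_volumetric = WATER_DENSITY * WATER_CP  # J/(m^3*K)
air_volumetric = AIR_DENSITY * AIR_CP        # J/(m^3*K)

print(f"Water: {water_volumetric / 1e6:.2f} MJ per m^3 per K")
print(f"Air:   {air_volumetric / 1e3:.2f} kJ per m^3 per K")
print(f"Ratio: ~{water_volumetric / air_volumetric:.0f}x")  # ~3500x
```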

Learn more about data center liquid cooling.

Contrary to popular belief, a liquid-cooled data center does not cost more to build than an air-cooled one, and through lower OPEX (PUE is reduced), the savings remain apparent for years after the buildout. The benefits of a liquid-cooled data center can be summarized as follows:

  1. Lower power usage effectiveness (PUE) – less power is consumed outside of the servers, storage, and networking infrastructure
  2. More compute power – with less power lost to cooling overhead (lower PUE), more servers can be installed within the same budget for a given input power to the data center (see the sketch after this list)
  3. Faster computing – liquid cooling keeps the CPUs cooler, allowing them to run at their "boost" rate for longer without throttling
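
A small illustrative calculation shows how items 1 and 2 compound. PUE is total facility power divided by IT power, so for a fixed input power, the IT share grows as PUE falls. The PUE values, per-server power, and 1 MW budget below are assumed example figures, not measured data.

```python
# Illustrative PUE arithmetic. PUE = total facility power / IT power,
# so IT power available = facility power / PUE. The PUE values and the
# 1 MW budget below are example assumptions, not measured figures.

FACILITY_POWER_KW = 1000.0  # fixed input power to the data center
SERVER_POWER_KW = 10.0      # one dense GPU server (example figure)

for label, pue in [("air-cooled", 1.5), ("liquid-cooled", 1.15)]:
    it_power_kw = FACILITY_POWER_KW / pue
    servers = int(it_power_kw // SERVER_POWER_KW)
    print(f"{label:14s} PUE {pue:.2f}: {it_power_kw:.0f} kW for IT "
          f"-> ~{servers} servers")
```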

An entire liquid-cooling solution starts with cold plates that replace the heat sinks on top of the CPUs and GPUs. Hose kits carry the cold liquid to the right hardware and carry the warmed liquid away from it.

Coolant distribution manifolds deliver the cold fluid to the servers and return the hot liquid to the coolant distribution unit (CDU). The CDU then sends the hot liquid to a cooling tower or water tower, bringing the fluid temperature back down to where it can be sent to the servers again.
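
As a worked example of sizing this loop, the required coolant flow follows from Q = ṁ · c_p · ΔT. The 100 kW load echoes the rack-level figure earlier in the article; the 10 °C supply-to-return temperature rise is an assumed design point, not a specification.

```python
# Sizing the coolant loop for one rack: required mass flow follows from
# Q = m_dot * c_p * dT. The 100 kW load echoes the rack figure above;
# the 10 C coolant temperature rise is an assumed design point.

RACK_HEAT_W = 100_000.0  # heat to remove from one rack, in watts
WATER_CP = 4186.0        # J/(kg*K), specific heat of water
DELTA_T = 10.0           # K, assumed supply-to-return temperature rise

mass_flow = RACK_HEAT_W / (WATER_CP * DELTA_T)  # kg/s
liters_per_min = mass_flow / 0.997 * 60         # water is ~0.997 kg/L

print(f"Mass flow: {mass_flow:.2f} kg/s (~{liters_per_min:.0f} L/min)")
```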

Summary

A new AI factory is unlike an existing data center. With high-end servers containing multiple GPUs, the rack becomes the base unit for further expansion. These base units can then be scaled up to entire data centers, with each GPU directly connected to other GPUs to form a massively parallel AI training machine. Liquid cooling is critical for these highly dense servers as the TDP of CPUs and GPUs continues to increase.

Learn more about Supermicro liquid cooling solutions