Researchers at Oak Ridge National Laboratory have published a research paper detailing how they trained a one trillion parameter LLM on the Frontier supercomputer using only 3,072 of its 37,888 GPUs.

The team also detailed how it was able to train a 175 billion parameter LLM using only 1,024 of the supercomputer's GPUs. A one trillion parameter LLM is on the same scale as OpenAI's GPT-4 model.

The Frontier exascale supercomputer – OLCF at ORNL

There are a number of challenges that come with training LLMs with billions of parameters, such as the considerable compute resources and memory required. To overcome these challenges, the researchers investigated data parallel training techniques and their effect on memory footprint, communication latency, and GPU computational efficiency. This allowed the researchers to use hyperparameter tuning to find the most efficient strategies for training large LLMs.
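To make the idea of data parallel training concrete, the sketch below shows a minimal PyTorch setup in which every GPU holds a full model replica, trains on its own shard of the batch, and synchronizes gradients with an all-reduce. This is an illustrative example only, not the ORNL team's actual training stack: the toy model, optimizer, batch size, and launch via torchrun are all assumptions.

```python
# Minimal data-parallel training sketch (PyTorch DistributedDataParallel).
# Illustrative only; model, optimizer, and sizes are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy stand-in for a transformer block; each rank holds a full copy.
    model = torch.nn.Sequential(
        torch.nn.Linear(4096, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 4096),
    ).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        # Each rank trains on its own shard of the global batch;
        # DDP all-reduces gradients so the replicas stay in sync.
        x = torch.randn(8, 4096, device=local_rank)
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with `torchrun --nproc_per_node=<gpus> script.py`, each additional GPU adds another replica, which is what drives the memory footprint and communication costs the researchers studied.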

The team achieved GPU throughput of 31.96 percent for the one trillion parameter model and 36.14 percent for the 175 billion parameter model. Furthermore, for both of these models, the researchers achieved 100 percent weak scaling efficiency, along with strong scaling efficiencies of 89 percent for the 175 billion parameter model and 87 percent for the one trillion parameter model.
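For readers unfamiliar with the two metrics, the snippet below applies the standard definitions: weak scaling grows the problem size with the GPU count (ideal runtime stays flat), while strong scaling keeps the problem fixed (ideal runtime shrinks linearly). The timing numbers in the example are placeholders for illustration, not figures from the paper.

```python
# Standard weak/strong scaling efficiency formulas; timings are hypothetical.

def weak_scaling_efficiency(t_base: float, t_scaled: float) -> float:
    """Problem size grows with GPU count; ideally the runtime stays constant."""
    return t_base / t_scaled


def strong_scaling_efficiency(t_base: float, t_scaled: float,
                              base_gpus: int, scaled_gpus: int) -> float:
    """Problem size is fixed; ideally the runtime shrinks linearly with GPUs."""
    return (t_base * base_gpus) / (t_scaled * scaled_gpus)


# Hypothetical example: a job taking 100 s on 128 GPUs and 14.4 s on
# 1,024 GPUs has a strong scaling efficiency of roughly 87 percent.
print(strong_scaling_efficiency(100.0, 14.4, 128, 1024))  # ~0.87
```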

However, the research paper did not provide any information about how long it took to train the models using this method.

The Frontier supercomputer has an HPL (High-Performance Linpack) benchmark score of 1.194 exaflops, uses AMD Epyc 64C 2GHz processors, and is based on the HPE Cray EX235a architecture. The system has a total of 8,699,904 combined GPU and CPU cores, and uses HPE's Slingshot 11 network for data transfer.

In November 2023, it was awarded the top spot in the Top500 list of the world's fastest supercomputers.