Researchers at the University of Michigan say they can reduce the energy consumption of AI training by up to 75 percent.

Deep learning models, including large language models, can be trained to use less energy without any change of hardware, according to Michigan's ML.Energy group, which presented its Zeus energy optimization framework at the 2023 USENIX Symposium on Networked Systems Design and Implementation (NSDI) in Boston.

[Image: A supercomputer studying dark matter, ukiyo-e art – DCD/DALL·E 2]

Artificial intelligence applications such as OpenAI's GPT-3 and GPT-4 are making ever-increasing demands on data center infrastructure, while their energy usage goes largely undisclosed and unexamined. The ML.Energy group believes the energy use of AI should be more openly disclosed and discussed, to encourage optimization.

Hidden energy use

"At extreme scales, training the GPT-3 model just once consumes 1,287MWh, which is enough to supply an average US household for 120 years," said Mosharaf Chowdhury, an associate professor of electrical engineering and computer science.

Deep learning models are already in widespread use for image generation, as well as for expressive chatbots and recommender systems for services like Netflix, TikTok, and Amazon. DCD's recent examination of the evolution of AI hardware revealed that energy demands are increasing fast.

"Existing work primarily focuses on optimizing deep learning training for faster completion, often without considering the impact on energy efficiency," said Jae-Won Chung, a doctoral student in computer science and engineering and co-first author of the study. "We discovered that the energy we're pouring into GPUs is giving diminishing returns, which allows us to reduce energy consumption significantly, with relatively little slowdown."

Deep learning techniques use multilayered artificial neural networks, also known as deep neural networks (DNNs). These are complex models, fed with massive data sets. Some 70 percent of the energy in AI training is burned within graphics processing units (GPUs).

Zeus gives AI researchers two software "knobs." One sets a GPU power limit, lowering the GPU's power draw and slowing down training until the setting is adjusted again. The other controls the batch size parameter, the amount of data the model works through before its internal parameters are updated. AI researchers often use large batch sizes to reduce training time, but this increases energy consumption.
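
To make the two knobs concrete, here is a minimal sketch in Python. It uses NVIDIA's NVML bindings (pynvml) for the power-limit knob; this is an illustration only, not Zeus's actual API, and the wattage and batch size values are hypothetical. Changing a GPU's power limit typically requires administrator privileges.

```python
# Illustration of the two "knobs" Zeus tunes. Not Zeus's own API:
# the power limit is set here via NVIDIA's NVML bindings (pynvml),
# and the batch size is an ordinary training hyperparameter.
import pynvml

def set_gpu_power_limit(gpu_index: int, watts: int) -> None:
    """Knob 1: cap the GPU's power draw (NVML expects milliwatts)."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    pynvml.nvmlDeviceSetPowerManagementLimit(handle, watts * 1000)
    pynvml.nvmlShutdown()

# Knob 2: larger batches finish training sooner but draw more power per step.
BATCH_SIZE = 256  # hypothetical value; Zeus searches over candidates

set_gpu_power_limit(gpu_index=0, watts=250)  # e.g. cap a 300W card at 250W
```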

Because Zeus can tune each of these settings in real time, researchers can find the best tradeoff point, where energy usage is minimized with as little impact on training time as possible. The software plugs directly into existing workflows, and has been built to work across a range of machine learning tasks and GPUs.

In tests, the ML.Energy team tried every possible combination of the two parameters to find the best setting. In practice, that level of thoroughness won't be needed, they say: Zeus can take advantage of the repetitive nature of machine learning training to come very close.
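
The sketch below shows what such an exhaustive sweep might look like. In the spirit of Zeus, it scores each (power limit, batch size) pair with a weighted sum of energy and time; the candidate values, the weighting, and the measure_run() helper (a toy stand-in for instrumenting a real training run) are all hypothetical.

```python
# Exhaustive sweep over the two knobs, scoring each configuration with a
# weighted energy/time cost. measure_run() is a toy simulator; in practice
# the numbers would come from instrumented training runs.
from itertools import product

POWER_LIMITS = [150, 200, 250, 300]   # watts (hypothetical candidates)
BATCH_SIZES = [32, 64, 128, 256, 512]
ETA = 0.5                             # 1.0 = energy only, 0.0 = time only
MAX_POWER = 300                       # watts, normalizes the time term

def measure_run(power_limit: int, batch_size: int) -> tuple[float, float]:
    """Toy stand-in: returns (energy in joules, time in seconds)."""
    time_s = 3600 * (256 / batch_size) ** 0.3 * (300 / power_limit) ** 0.5
    avg_power_w = 0.9 * power_limit * (batch_size / 512) ** 0.1
    return avg_power_w * time_s, time_s

def cost(energy_j: float, time_s: float) -> float:
    return ETA * energy_j + (1 - ETA) * MAX_POWER * time_s

best = min(
    product(POWER_LIMITS, BATCH_SIZES),
    key=lambda cfg: cost(*measure_run(*cfg)),
)
print(f"best (power_limit, batch_size): {best}")
```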

"Fortunately, companies train the same DNN over and over again on newer data, as often as every hour. We can learn about how the DNN behaves by observing across those recurrences," said Jie You, a recent doctoral graduate in computer science and engineering and co-lead author of the study.

The team has also created Chase, a higher layer of software that adjusts the Zeus parameters according to the carbon intensity of the available energy. When the system is running on low-carbon energy, Chase makes speed the priority. When the carbon intensity is higher, it dials back to higher efficiency at the expense of speed. Chase will be presented on May 4 at a workshop at the International Conference on Learning Representations (ICLR).
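
A sketch of the Chase idea: choose the knob settings based on the grid's current carbon intensity. The threshold and the two configurations here are hypothetical, chosen only to illustrate the speed-versus-efficiency switch.

```python
# Carbon-aware knob selection in the spirit of Chase: prioritize speed when
# the grid is clean, efficiency when it is dirty. Values are hypothetical.
def choose_config(carbon_gco2_per_kwh: float) -> dict:
    if carbon_gco2_per_kwh < 200:  # low-carbon energy: prioritize speed
        return {"power_limit_w": 300, "batch_size": 512}
    return {"power_limit_w": 200, "batch_size": 128}  # save energy instead

print(choose_config(120.0))  # {'power_limit_w': 300, 'batch_size': 512}
```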

"It is not always possible to readily migrate DNN training jobs to other locations due to large dataset sizes or data regulations," said Zhenning Yang, a master's student in computer science and engineering. "Deferring training jobs to greener time frames may not be an option either, since DNNs must be trained with the most up-to-date data and quickly deployed to production to achieve the highest accuracy.

"Our aim is to design and implement solutions that do not conflict with these realistic constraints, while still reducing the carbon footprint of DNN training." 
