There’s a paradox at the heart of the current rush to roll out resources to support AI: on the one hand, the compute-intensive AI applications coming down the line will require radically more powerful GPUs, drawing far more power, in order to run efficiently.

Organizations public and private, cloud providers, and data center operators are therefore turning to high-performance computing (HPC).

On the other hand, data center operators – and, therefore, cloud computing service providers, too – are under orders to reduce power consumption and become even more efficient suppliers of computing resources.

And yet a growing body of evidence indicates that high-performance computing can be more energy efficient than traditional, non-accelerated data center servers.

Power efficiency

Nvidia, whose full-stack accelerated computing platform speeds up HPC workloads, suggests that organizations ought to look at compute energy efficiency in a different way – emphasizing the amount of compute-per-watt they deliver. According to Nvidia, energy efficiency means maximizing the computational work completed for the energy consumed, and is typically measured in ‘tasks per kilowatt-hour’.
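As a back-of-the-envelope illustration (not an Nvidia or AWS tool), the metric is straightforward to compute: divide the tasks completed by the energy the system drew while completing them. The helper below is a hypothetical sketch; the function name and the sample figures are our own assumptions.

```python
def tasks_per_kwh(tasks_completed: int, avg_power_kw: float, hours: float) -> float:
    """Energy efficiency as tasks per kilowatt-hour.

    energy (kWh) = average power draw (kW) x wall-clock time (h)
    """
    energy_kwh = avg_power_kw * hours
    return tasks_completed / energy_kwh

# Hypothetical comparison: the GPU node draws more power but finishes far more work.
cpu_eff = tasks_per_kwh(tasks_completed=1_000, avg_power_kw=0.8, hours=24)   # ~52 tasks/kWh
gpu_eff = tasks_per_kwh(tasks_completed=12_000, avg_power_kw=3.2, hours=24)  # ~156 tasks/kWh
print(f"CPU node: {cpu_eff:.0f} tasks/kWh, GPU node: {gpu_eff:.0f} tasks/kWh")
```

The point the metric captures is that a higher wattage at the socket can still mean far less energy per unit of work done.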

Moreover, whether it’s AI, meteorological forecasting, pharmaceutical research, or another heavyweight application, these workloads run on HPC infrastructure because no other compute resource can match it – and the benefits far outweigh the costs.

HPC is invariably about complex mathematical calculations (as opposed to, say, fetching data from storage). The power of HPC is therefore increasingly driven by GPUs – first used to beef up CAD and engineering software – rather than CPUs.

Today, the Nvidia H100 Tensor Core GPU powers Amazon Elastic Compute Cloud (Amazon EC2) P5 instances, letting AWS users scale generative AI at the click of a button, and data center operators report unprecedented demand from clients for compute power optimized for AI. That means servers overwhelmingly powered by GPUs rather than CPUs, running at higher wattages and, therefore, liquid cooled rather than air cooled.
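As an illustrative sketch rather than an official AWS recipe, requesting such an instance programmatically looks roughly like this with the boto3 SDK. The AMI ID and key pair name are placeholders, and in practice P5 capacity often has to be reserved in advance, so treat this as a sketch of the API shape only.

```python
import boto3

# Assumes AWS credentials are already configured; region and AMI ID are placeholders.
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder: a Deep Learning AMI in your region
    InstanceType="p5.48xlarge",       # EC2 P5: 8x Nvidia H100 Tensor Core GPUs
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",            # placeholder key pair name
)
print(response["Instances"][0]["InstanceId"])
```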

But how does that square with the pressure on the IT industry to ratchet up energy efficiency, as worldwide data center power consumption is forecast to double between now and 2030 to almost 1,300 TWh – almost five percent of global electricity consumption?

Research by the US National Energy Research Scientific Computing Center (NERSC) indicates that shifting towards GPU-powered HPC could save a considerable amount of energy. It measured how fast a number of applications ran on CPU-only and GPU-accelerated nodes on its latest HPC infrastructure powered by Nvidia A100 Tensor Core GPUs.

At the same time, it measured the energy consumed by these applications in order to calculate exactly how energy-efficient – or otherwise – running intensive applications on HPC really is.

It found that its Nvidia-powered HPC was five times more energy efficient on average and, when it came to weather forecasting, recorded an energy efficiency improvement of just under tenfold.

In terms of energy and cost savings, on a server with four Nvidia A100 GPUs, NERSC achieved a 12-fold increase in performance compared with a dual-socket x86 server. For the same level of performance, NERSC researchers concluded that the Nvidia GPU-accelerated system would consume 588 megawatt-hours less energy per month than a CPU-only system, amounting to cost savings of about $4 million for the same workload.
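To make the arithmetic behind figures like these reproducible, here is a hedged sketch; the electricity price and the time horizon are our assumptions, not NERSC figures, and real savings depend heavily on local tariffs and cooling overheads.

```python
# Back-of-the-envelope energy cost savings; all inputs are illustrative assumptions.
monthly_saving_mwh = 588          # from the NERSC comparison above
price_per_mwh = 120.0             # assumed electricity price in $/MWh (varies widely by region)
horizon_months = 48               # assumed accounting horizon

saving = monthly_saving_mwh * price_per_mwh * horizon_months
print(f"~${saving:,.0f} saved over {horizon_months} months")  # ~$3.4m at these assumptions
```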

Overall, if all CPU-only data center servers migrated to Nvidia GPUs for HPC, AI, and data analytics workloads, organizations would save 12 TWh annually – equating to global savings of between $2 billion and $3 billion.

In or out? The case for HPC in the cloud

There are many good reasons to turn to the cloud for HPC.

First, on-premises HPC requires a vast upfront investment – in skilled IT personnel as well as hardware and software. It will need ongoing maintenance, of course, and a power supply capable of handling its demands. And the organization will need to be sure it has enough work to run the hardware round the clock, in order to maximize utilization and return on investment.

Conversely, if demand exceeds availability, users must take a ticket, queue up, and wait their turn – and it could take weeks for their number to come up, if they’re lucky.

Therefore, an organization looking to invest in HPC infrastructure needs to conduct a thorough analysis of its requirements to ensure it will get full usage from its investment.

For example, not all analytics require the sheer compute power of HPC, so a thorough cost-benefit analysis is the obvious first step. This also needs to factor in the cost of upgrades, and even replacement cycles. And are all the costs of running HPC, from staffing to power, being taken fully into account? It’s easy to overlook or underestimate many of these factors, and either under- or over-provisioning will be an expensive mistake.

If the answers to all these questions are positive, the organization will also need to consider ongoing investments in HPC as workloads become more complex and demand for services grows.

For example, as HPC simulations grow in complexity and size, they require increasingly large computational resources that use massively parallel computing, and this growth needs to be accounted for. HPC teams have the challenge of building high-performance networks, with fast storage and large amounts of memory, while maintaining or reducing total power consumption.
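To give a concrete flavor of what “massively parallel” means in practice, here is a minimal MPI sketch in Python (using the mpi4py library) that splits a numerical estimate of pi across however many ranks it is launched with. It is a generic illustration of the programming model, not an AWS- or Nvidia-specific workload.

```python
# Run with, e.g.: mpirun -n 8 python pi_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's ID
size = comm.Get_size()   # total number of parallel ranks

n = 10_000_000           # subintervals for the midpoint rule on 4/(1+x^2)
h = 1.0 / n
# Each rank sums every size-th rectangle, so the work divides evenly across ranks.
local = h * sum(4.0 / (1.0 + ((i + 0.5) * h) ** 2) for i in range(rank, n, size))

pi = comm.reduce(local, op=MPI.SUM, root=0)  # combine partial sums on rank 0
if rank == 0:
    print(f"pi ~= {pi:.10f} computed on {size} ranks")
```

At scale, the cost of that final reduction step is exactly why the high-performance networks mentioned above matter as much as the compute nodes themselves.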

The pay-as-you-go model of HPC-in-the-cloud offers the same benefits as running any other application in the cloud: the overhead of hardware maintenance is someone else’s problem, and you only pay for what you use.

“For those workloads that have sufficiently long run-times with demanding data capacity and movement requirements, a cloud-based variable cost model may result in a higher cost than an equivalent on-premises system.

“Conversely, the pay-as-you-go model may be more cost effective for workloads that exhibit shorter run times and less demanding data requirements than when the on-premises solution isn’t fully utilized.

“On top of its elasticity and non-committal nature, a cloud-based pay-as-you-go structure also enables users to always have access to the most up-to-date technology without sizeable up-front expenditures,” note analysts at Hyperion Research, in a Technology Spotlight entitled, ‘HPC and the Cloud – A Strong and Maturing Relationship.’
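Hyperion’s trade-off can be made concrete with a simple break-even model. The sketch below is our own illustration, not a Hyperion or AWS calculator, and every figure in it is an assumption: it compares the amortized on-premises cost per hour actually used, at different utilization levels, against an assumed cloud on-demand rate for equivalent capacity.

```python
# Toy break-even model for cloud vs on-premises HPC; all figures are illustrative assumptions.
onprem_capex = 2_000_000         # purchase cost ($), amortized over the system's life
onprem_opex_per_year = 400_000   # power, staff, maintenance ($/year), assumed
lifetime_years = 4
cloud_rate_per_hour = 150.0      # assumed effective cloud price for equivalent capacity

hours_in_life = lifetime_years * 365 * 24

for utilization in (0.25, 0.50, 0.90):
    used_hours = hours_in_life * utilization
    onprem_cost_per_used_hour = (
        onprem_capex + onprem_opex_per_year * lifetime_years
    ) / used_hours
    cheaper = "cloud" if cloud_rate_per_hour < onprem_cost_per_used_hour else "on-prem"
    print(f"utilization {utilization:.0%}: on-prem ${onprem_cost_per_used_hour:,.0f}/used-hour -> {cheaper} cheaper")
```

At these (assumed) numbers, the cloud wins at 25 and 50 percent utilization, while the on-premises system wins once it is kept busy around the clock – which is exactly the pattern Hyperion describes.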

That model also benefits HPC users in terms of software licensing and usage. Software that runs in the cloud is updated automatically, and it is easier and less costly to shift from one software package to a more appropriate alternative. Hence, both hardware and software can be kept up to date, with considerably less hassle for users.

The power of two

Amazon Web Services (AWS) has been offering cloud services for nearly two decades. Bringing compute, networking, and storage together in cloud-based data centers has enabled a huge improvement in data center efficiency, and the shift to Nvidia GPU-based HPC represents another leap forward, including in energy efficiency.

AWS works closely with Nvidia to continuously develop both the hardware that powers AWS HPC offerings and the software resources that put it within reach – technically and financially – of ordinary businesses and other organizations, without the upfront costs of outright purchase.

AWS has offered machine learning (ML) solutions for more than 10 years, and it has worked with Nvidia to accelerate compute-intensive workloads using GPU-powered HPC, with EC2 instances running on the latest Nvidia H100 GPUs. These instances can run up to six times faster than previous generations.

The collaboration of AWS and Nvidia goes further than just hardware. On top of running the latest HPC-optimized EC2 instances on Nvidia GPUs, it also encompasses high-performance file systems, high throughput networking, and SDKs.

“Building on their respective areas of expertise, AWS and Nvidia have collaborated to develop the necessary tools to enable users to run their HPC workloads easier and more effectively. These tools allow users to extract performance from the latest Nvidia GPUs and pull together the necessary pieces of the workload cycle to run the jobs at the right scale.

“Both AWS and Nvidia have collaborated to bring software and toolkits to help improve the skills of HPC users across the span of their workflows. Nvidia has developed SDKs that help users tackle the challenges of deploying HPC workloads across GPU technologies, optimizing applications to run on the Nvidia platform,” notes Hyperion Research.

One recent customer, Reezocar, an online hub for buying and selling cars, used HPC and ML powered by Nvidia GPUs on AWS to meticulously detect car dents and imperfections. Its technology can estimate repair costs in milliseconds and helps extend the serviceable life of vehicles, steering them away from premature disposal. By accurately detecting vehicle damage and assessing refurbishment needs, Reezocar's trade-in processes are faster and more efficient.

Running HPC applications in the cloud doesn’t just save on upfront investment costs – and ongoing running, maintenance, and upgrade costs – but also enables data scientists, engineers, and researchers to use the highest-performance tools on the fastest and most efficient infrastructure.

Learn more about how AWS and Nvidia can help accelerate HPC workloads here.