High-performance computing (HPC) is advancing multiple sectors, but integrating and operating these technologies in data centers presents significant challenges, requiring strategic and innovative solutions. This raises the question: What are the real-world challenges in managing advanced compute systems?

June 2024 saw DCD team up with Schneider Electric, a global leader in energy management and automation solutions, to explore the complexities of HPC deployments and provide insights on how to overcome them. Vance Peterson, global solutions architect at Schneider Electric, brings over 20 years of experience in the mission-critical industry. At Schneider, he collaborates with industry leaders to help customers address their most complex challenges, particularly as the data center sector is driven by digitization and the need to enhance sustainability while addressing capacity constraints.

Evaluating readiness for high-density computing

As we set the table for the future of data centers, it’s crucial to prepare every element carefully to support high-density computing workloads. This involves a comprehensive assessment of existing infrastructure, covering everything from utility service connections to the detailed components within the data center, such as electrical, mechanical, UPS, cooling, and power distribution systems.

Just as a well-prepared host ensures every dish is perfectly paired, Schneider provides tools and expertise to guide these evaluations:


“Schneider can offer support and guidance from our agnostic trade-off tools that are available on the public website. We can help clients in a number of ways: conducting detailed power audits, power usage effectiveness (PUE) assessments, load and capacity analysis, thermal imaging, and computational fluid dynamics simulations to understand their current capabilities, and also forecast, within a digital twin environment, what their reality might look like the day after tomorrow.”
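
To make the arithmetic behind such an assessment concrete, the sketch below works through a PUE and capacity-headroom calculation. The facility figures are hypothetical and the code is purely illustrative; it does not represent Schneider's own tools.

    # Minimal sketch of the arithmetic behind a power audit,
    # using hypothetical facility figures.

    total_facility_kw = 1200.0    # metered utility draw for the whole site
    it_load_kw = 850.0            # measured IT equipment load
    design_capacity_kw = 1500.0   # rated electrical capacity of the facility

    # Power usage effectiveness: total facility power divided by IT power
    pue = total_facility_kw / it_load_kw
    headroom_kw = design_capacity_kw - total_facility_kw

    print(f"PUE: {pue:.2f}")                             # ~1.41
    print(f"Remaining capacity: {headroom_kw:.0f} kW")   # 300 kW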

Foundational steps for adopting HPC

Preparing for HPC and AI adoption involves several foundational steps to ensure that organizations are not only equipped with the right technology but also have the necessary skills and infrastructure. Peterson explains:

“Clients really need to consider where exactly their organization is from a systems and technology standpoint. In other words, take a good hard look at the organization's technology infrastructure, its data capabilities and its skill sets.”

Peterson goes on to describe the foundational steps that an enterprise might want to take to prepare for high-performance compute:

  1. Define business objectives: Clearly outline the goals and expected outcomes from the adoption of HPC and AI
  2. Assess data readiness: Ensure data quality, availability, and security are in place to support the applications
  3. Invest in skill development: Identify skill gaps, invest in training, and hire talent with expertise in AI and HPC technologies
  4. Conduct infrastructure planning: Evaluate and plan for the necessary computational infrastructure and resources to support workloads, and address any regulatory or compliance requirements related to AI and HPC implementations
  5. Pilot a small project: As the final step before broad adoption, start with a small-scale project to test and validate the use of AI and HPC in real-world scenarios, and use that pilot to validate the economics, or return on investment, against projections (see the sketch below)
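
As a rough illustration of that final step, the sketch below compares a pilot's measured return on investment against the original projection. All figures are hypothetical and the calculation is a back-of-the-envelope example, not a prescribed methodology.

    # Hypothetical pilot-versus-projection ROI check for an HPC/AI proof of concept.

    pilot_cost = 250_000.0                  # hardware, cooling retrofit, staff time
    projected_annual_benefit = 400_000.0    # value assumed in the business case
    measured_annual_benefit = 310_000.0     # value actually observed during the pilot

    projected_roi = (projected_annual_benefit - pilot_cost) / pilot_cost
    measured_roi = (measured_annual_benefit - pilot_cost) / pilot_cost

    print(f"Projected ROI: {projected_roi:.0%}")   # 60%
    print(f"Measured ROI:  {measured_roi:.0%}")    # 24%
    print("Proceed to broad adoption" if measured_roi > 0 else "Revisit the business case")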

“For example, direct-to-chip liquid cooling or immersion cooling can address increasing densities associated with accelerated compute. This may present some skill gaps in an organization, and as such, operators and maintenance staff will need to scale up to support the needs of HPC.”

Accountability: Many hands make light work

Schneider Electric actively shares its expertise and fosters partnerships to support the design of HPC and AI-ready operations. As Peterson notes, the complexity of HPC presents challenges that the industry is collectively working to address:

“For instance, Schneider has partnered with Nvidia to combine our expertise in energy management and digitization. Together, we are developing optimized data center infrastructure that will pave the way for high-performance computing and accelerated advancements in Edge artificial intelligence and digital twin technologies.”


This collaboration is not just about meeting current needs, but also about setting new industry standards:

“These reference designs will help redefine the benchmarks for AI and operations within the data center, and it marks a significant milestone in our industry's evolution, with AI applications gaining traction across industries.”

Peterson emphasizes the goal of this partnership:

“Through this collaboration, we really want to provide data center owners and operators with the tools and the resources they need to integrate new and evolving AI solutions into their infrastructure, allowing them to enhance deployment and efficiency and ensuring reliable lifecycle operations.”

The proof is in the pudding

As we prepare for the new energy landscape, adopting modular, scalable architectures is crucial for navigating the increasing densification of data center environments. This approach benefits both purpose-built facilities and retrofitted infrastructures, ensuring readiness for future demands. According to Peterson, several best practices are critical in this context:

  • Capacity planning: Thoroughly assess current and future power and cooling requirements to determine the necessary capacity upgrades
  • Advanced cooling solutions: Integrate liquid cooling technologies to efficiently dissipate the heat generated by high-density computing elements
  • Power system upgrades: Install additional power feeds, busways, or upgraded PDUs to ensure reliable power delivery at higher voltages and power densities
  • Energy efficiency measures: Utilize variable speed drives, hot/cold aisle containment, and airflow optimization to enhance overall efficiency and reduce cooling loads
  • Modular architecture: Deploy scalable, modular components that can be easily expanded to meet increasing power and cooling demands
  • Advanced monitoring and management: Implement systems to continuously track power usage, temperatures, and humidity, enabling proactive maintenance and optimization
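
To ground the capacity planning point above, the sketch below estimates the electrical and heat loads for a hypothetical high-density row. The rack count, per-rack density, diversity factor, and liquid-cooled share are all assumptions for illustration, not recommendations.

    # Capacity-planning sketch for a hypothetical high-density HPC row.

    racks = 10
    kw_per_rack = 60.0            # assumed density for accelerated-compute racks
    diversity_factor = 0.9        # assumed: not every rack peaks at the same time

    it_load_kw = racks * kw_per_rack * diversity_factor
    heat_load_kw = it_load_kw     # essentially all IT power is rejected as heat
    liquid_cooled_share = 0.8     # assumed fraction captured by direct-to-chip liquid cooling

    print(f"IT load: {it_load_kw:.0f} kW")                                        # 540 kW
    print(f"Heat to liquid loop: {heat_load_kw * liquid_cooled_share:.0f} kW")    # 432 kW
    print(f"Heat to room air: {heat_load_kw * (1 - liquid_cooled_share):.0f} kW") # 108 kW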

The ability to elevate infrastructure management from a reactive to a proactive approach means that the value of real-time architectural insights cannot be overstated.

As the old adage goes, you can’t manage what you don’t monitor, and the unique – not to mention dense – nature of HPC environments makes advanced monitoring essential to the effective maintenance and management of this complex infrastructure.

By leveraging digital twins and other advanced metering technologies to track everything from power consumption to environmental constraints, operators can accurately predict how best to not only modify, but optimize, their architectures going forward.
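
As a simple illustration of how such monitoring can surface proactive alerts, the sketch below checks telemetry readings against fixed thresholds. The rack names, limits, and data structure are hypothetical; in practice a DCIM or power-monitoring platform would supply the live feed.

    # Minimal sketch of proactive threshold checks on telemetry readings.
    # Thresholds and rack names are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class Reading:
        rack: str
        power_kw: float
        inlet_temp_c: float
        humidity_pct: float

    LIMITS = {"power_kw": 60.0, "inlet_temp_c": 32.0, "humidity_pct": 70.0}

    def check(reading):
        """Return a message for every metric that exceeds its configured limit."""
        alerts = []
        for metric, limit in LIMITS.items():
            value = getattr(reading, metric)
            if value > limit:
                alerts.append(f"{reading.rack}: {metric} = {value} exceeds limit {limit}")
        return alerts

    # Example reading from one rack; only the power draw trips an alert here.
    print(check(Reading(rack="A-07", power_kw=63.2, inlet_temp_c=29.5, humidity_pct=55.0)))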

“The evolution and widespread adoption of HPC will likely make advanced monitoring and management a must-have rather than a nice-to-have. Electrification and decarbonization have increased electrical demand, and this, along with the increased demand driven by digitization, is really setting the table, if you will, for the new energy landscape.”

To find out more about making the transition to AI-ready data centers, check out ‘AI reference designs to enable adoption: A collaboration between Schneider Electric and Nvidia’ and ‘Transitioning to AI-ready data centers’ on the Schneider website.