There’s never a good time to make a mistake in a data center. In fact, working in such a facility pretty much renders ‘not making a mistake’ part of the job description. So, why would I say that a virtual mistake is a good mistake?

– Gerd Altmann, Pixabay

Let’s step back for a moment from those harrowing alerts suggesting you’ve got an outage on your hands, or the frustration when a demand comes in from the business that’s going to push your capacity into a zone you just don’t want to be in. Let’s think again about why we do what we do.

Something I learned early on in my time in the sector was that there were two meaningful ways of doing things. The first is how we’ve always done it. The second is the best way we can possibly do it. The first has its benefits because it avoids rocking the boat, but it undermines the ultimate purpose of the data center, which is to support the business. The second is definitely better for the business. However, discovering ‘the best way’ to do something is fraught with difficulties. And data centers aren’t really ‘fail fast, learn fast’ kind of environments.

But what if they could be?

The cost of downtime. The cost of underperformance

Make no mistake, I really appreciate the importance of avoiding downtime. Most of my professional life has been committed to avoiding it. At present, with many businesses and sectors still fragile after the shock of the pandemic, downtime is more damaging than ever. The 2020 Uptime Institute Global Survey of data center managers found that 40 percent of outages now cost between $100,000 and $1 million, and that’s before you consider the impact on a brand’s reputation.

Not to pick on any one brand – because we’re all at risk of these incidents – but the Microsoft Azure outage at the end of last year is a case in point. A cooling-related incident took the UK facility offline, and among the fallout from this was the very high-profile impact on the UK government’s Covid-19 information portal: not something that could be quietly swept under the carpet amid the pandemic.

For other organisations, whether it’s Black Friday or a big trading day, similar nightmare incidents sit there in the back (or front) of mind. This is likely to become a more common problem, too. The ‘digital transformation and home working revolution’ of the last 18 months has piled on top of the exponential growth in data and proliferation of corporate applications. The data center has to deliver so much for the business, and it’s assumed by almost everyone outside of infrastructure and operations that it can always just do it.

Embrace the chaos

Historically, large margins were left around capacity, thermal, and power limits. This is falling out of fashion quickly. So instead, infrastructure and operations professionals are having to just get things done. And this is where we return to our ‘fail fast’ culture. When changes are flooding in, you need to plan and manage change accurately and quickly. Using a digital twin of your data center enables you to do exactly that, and your mistakes will only ever be virtual. The ‘learn fast’ bit can then be deployed in a real environment.

The digital twin is a virtual representation of your physical facility. It gives you an accurate and realistic simulation of your environment in which you can test scenarios – right down to switching individual servers or sockets. Using Computational Fluid Dynamics (CFD), it simulates airflow through the facility to help you understand thermal properties, hot spots, and other points of failure that could cause downtime.
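To make that idea concrete, here’s a rough, back-of-the-envelope sketch in Python of the kind of sanity check a thermal model performs: estimating a rack’s exhaust temperature from its IT load and airflow. The rack power, airflow, and supply temperature below are invented for illustration, and a real digital twin resolves this with full CFD rather than a single energy-balance formula.

```python
# Back-of-the-envelope rack exhaust temperature check (illustrative only --
# a real digital twin relies on full CFD, not a single energy-balance formula).
# All figures below are assumptions made up for this example.

AIR_DENSITY = 1.2         # kg/m^3, typical for data hall conditions
AIR_SPECIFIC_HEAT = 1005  # J/(kg*K)
CFM_TO_M3S = 0.000471947  # cubic feet per minute -> cubic metres per second

def exhaust_temp_rise(it_load_watts: float, airflow_cfm: float) -> float:
    """Temperature rise across a rack: dT = P / (mass_flow * cp)."""
    mass_flow = airflow_cfm * CFM_TO_M3S * AIR_DENSITY  # kg/s
    return it_load_watts / (mass_flow * AIR_SPECIFIC_HEAT)

# Hypothetical 10 kW rack fed with 1,500 CFM of 24 C supply air
delta_t = exhaust_temp_rise(10_000, 1_500)
print(f"Exhaust temperature: {24 + delta_t:.1f} C (rise of {delta_t:.1f} K)")
```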

It’s this approach that allows firms to learn from some of the tech sector’s bleeding-edge innovators. Take the example of Netflix during their move to AWS in 2011. Using the chaos engineering principle of breaking things on purpose, they scrutinised the reliability of their systems under a vast array of scenarios. They experimented with the outcomes of failing servers and clusters, and of filling up random hard drives. This enabled them to reduce the mean time to resolution (MTTR) for incidents in their critical environments.
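In the same spirit, here’s a minimal Python sketch of that ‘break things on purpose’ principle applied to capacity: knock out random servers in a toy cluster model and see how often the survivors still cover demand. The cluster sizes and demand figure are invented for the example, and this is not Netflix’s actual tooling (that was Chaos Monkey and its successors); it simply illustrates the idea.

```python
import random

# Toy chaos experiment: fail random servers in a simulated cluster and check
# whether the survivors still cover peak demand. Illustration only -- not a
# real chaos engineering tool, and every number here is an assumption.

def survival_rate(capacities_kw, demand_kw, failures, trials=10_000):
    """Fraction of random failure combinations that still meet demand."""
    ok = 0
    for _ in range(trials):
        survivors = random.sample(capacities_kw, len(capacities_kw) - failures)
        if sum(survivors) >= demand_kw:
            ok += 1
    return ok / trials

# Hypothetical cluster of twelve mixed-size servers serving a 60 kW peak load
cluster = [4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8]
for n in range(1, 5):
    rate = survival_rate(cluster, 60, n)
    print(f"{n} failed server(s): demand met in {rate:.0%} of trials")
```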

Fail fast without the failure

Want to know the fallout from a cooling unit failure? Keen to understand what happens when the racks running the passive half of an active-passive redundant application suddenly ramp up? Want to test what happens if you switch a few servers? Digital twin technology lets companies do this sort of bold analysis in a totally risk-free model. And it can be done quickly and transparently.
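As a taste of that kind of what-if, here’s a toy Python check for the cooling question: remove one unit from a simple model of the plant and see whether the remaining nameplate capacity still covers the IT heat load. All the capacities and the load figure are assumptions for illustration; a digital twin answers the same question with CFD and live telemetry rather than a spreadsheet-style subtraction.

```python
# What-if sketch: does the cooling plant still cover the IT heat load if one
# unit fails? A digital twin answers this with CFD and real telemetry; this
# toy check only compares nameplate figures, and every number is an assumption.

def cooling_headroom_kw(unit_capacities_kw, it_load_kw, failed_units=()):
    """Remaining cooling capacity minus heat load after the named units fail."""
    available = sum(cap for name, cap in unit_capacities_kw.items()
                    if name not in failed_units)
    return available - it_load_kw

cracs = {"CRAC-1": 350, "CRAC-2": 350, "CRAC-3": 250, "CRAC-4": 250}
it_load = 900  # kW of IT heat to reject

for unit in cracs:
    headroom = cooling_headroom_kw(cracs, it_load, failed_units={unit})
    status = "OK" if headroom >= 0 else "AT RISK"
    print(f"Lose {unit}: {headroom:+.0f} kW headroom -> {status}")
```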

A digital twin enables you to model your data center in any configuration, and through any complication or disaster event. When you can uncover and resolve weaknesses effectively, a big ask from the business no longer sends chills down your spine.

Unquestionably, it will remain a key part of the job description for infrastructure and operations professionals to avoid data center outages. But how they do this is going to have to change dramatically. Fortunately, the capabilities are already here to make that possible. Deploying a digital twin enables you to prepare for failure scenarios and to implement changes safely.

The rest of the business is going through a game-changing digital transformation. With the digital twin approach, data centers can make a similarly huge step forward. So, if you’re going to have to make a mistake, make sure it’s only a virtual one.