Data center failures are not new. But increased dependency on IT, and increased co-dependency between IT systems, mean the impact of failures reverberates ever more widely. 2016 has seen some high-profile data center incidents that made national news and, moreover, proved highly expensive for the operators. A recent 451 Research paper on 2016 power failures highlights growing complexity in the colocation industry. At least nine data center providers experienced a power failure in 2016, including four in London.

According to the research paper, the way that resiliency is achieved at the data center level is expected to change significantly in the coming years as we move to a more cloud-based, hybrid and distributed environment. In the meantime, the evidence of these incidents suggests that IT managers need to maintain, if not increase, their vigilance, because the interdependencies of real-time systems mean the costs of failures are higher than ever.

In a world where social networks can drive the news agenda, these incidents soon become high profile and have a large monetary and reputational impact.

No single point of failure

The most recent failures share some common themes. Failures in power-distribution equipment have been the root cause of several incidents, and problems with IT recovery have often amplified their severity.

It is important to note that data center power failures are almost never caused by a single problem. Failures arise from multiple processes or small design elements that have been overlooked or viewed as too insignificant to pose a real threat. Constant vigilance should be the mantra, and data center providers need to plan for all eventualities, unforeseen or otherwise.

Legacy data centers and power outages

A significant proportion of data centers in London were built in the 1990s and early 2000s, and a large number of these legacy sites need to go through major overhauls, potentially exposing their customers to more disruption. Customers of these legacy sites are also at greater risk of power outages, as the recent data center failures in London have demonstrated.

It is likely that data centers built in the 1990s and 2000s will go through large overhaul processes in the coming years as we move to cloud-based environments. The larger the required overhaul, the higher the risk of disruption for their customers’ business.

Careful planning is needed to effectively manage the extensive maintenance required during a large overhaul. Furthermore, the data center must have dedicated staff to look after the maintenance before, during and after the overhaul. If the maintenance is not managed effectively at every stage, customers’ data may become inaccessible, with detrimental effects on their business.

If disruptions occur, customers will have to contact their own end customers to warn them of the potential disruption. This is likely to damage their brand and strain the relationship with those end customers.

Organizations using legacy data centers should assess the risks of staying with them. Moving to a more modern colocation provider could spare them the stress of going through major overhauls or the cost of suffering a power outage.

Lessons and Implications

  • Vigilance and investment are essential
    Most data centers have acceptable levels of resiliency – but often, some processes or small design elements have been overlooked. Management needs to pay constant attention, just as IT management uses penetration testing to test security resilience.
  • Failures are no longer binary
    Increasingly, applications are distributed, running across multiple sites, calling in remote services. This means failures are often partial, with some components running well, others badly or not at all. This can cause some systems to fail, others to lack key data. This makes diagnosis and resolution difficult; it can also cause contractual disputes.
  • Failures are likely to be noticed
    Most data centers, especially colocation companies, house many clients and systems, including many operated by service providers that, in turn, have many clients. Failures will be noticed quickly, and social networking will ensure that competitors and press are alerted. Failures are now both an operational matter and a reputational issue.
  • Assess the risks of your data center
    If you are with a legacy data center, it is vital that you weigh the cost of moving your data to a newer, more modern provider against the risk of remaining with your legacy provider.

Jonathan Arnold is managing director at Volta Data Centres