With increasing data center automation, it’s only natural for clients to want assurance that their data will be available as close to 100 percent of the time as possible, and to ask whether enough data center staff are available to achieve a high level of uptime. They also want to know that when a potential outage occurs, there are enough technicians on duty or available to restore services as soon as possible.
Microsoft suffered an outage on 30th August 2023 in its Australia East region in Sydney, lasting 46 hours. The company says it began at 10.30 UTC that day.
Customers experienced issues with accessing or using Azure, Microsoft 365, and Power Platform services. The outage was triggered by a utility power sag at 08.41 UTC and impacted one of the three Availability Zones of the region.
Microsoft explains: “This power sag tripped a subset of the cooling system chiller units offline and, while working to restore cooling, temperatures in the data center increased to levels above operational thresholds. We powered down a small subset of selected compute and storage scale units, both to lower temperatures and to prevent damage to hardware.”
Despite this, the vast majority of services were recovered by 22.40 UTC, but Microsoft was unable to complete a full mitigation until 20.00 UTC on 3rd September 2023. The company says this was because some services experienced a prolonged impact, “predominantly as a result of dependencies on recovering subsets of Storage, SQL Database, and/or Cosmos DB services.”
Voltage sag cause
The utility voltage sag was caused, according to the company, by a lightning strike on electrical infrastructure situated 18 miles from the impacted Availability Zone of the Australia East region. They add: “The voltage sag caused cooling system chillers for multiple data centers to shut down. While some chillers automatically restarted, 13 failed to restart and required manual intervention. To do so, the onsite team accessed the data center rooftop facilities, where the chillers are located, and proceeded to sequentially restart chillers moving from one data center to the next.”
“By the time the team reached the final five chillers requiring a manual restart, the water inside the pump system for these chillers (chilled water loop) had reached temperatures that were too high to allow them to be restarted. In this scenario, the restart is inhibited by a self-protection mechanism that acts to prevent damage to the chiller that would occur by processing water at the elevated temperatures. The five chillers that could not be restarted supported cooling for the two adjacent data halls which were impacted in this incident.”
What was the impact?
Microsoft says the two impacted data halls require at least four chillers to be operational. The cooling capacity before the voltage sag consisted of seven chillers, with five of them in operation and two on standby. The company says that some networking, compute, and storage infrastructure began to shut down automatically as data hall temperatures increased, which impacted service availability. At 11.34 UTC, the onsite data center team then had to begin a remote shutdown of any remaining networking, compute, and storage infrastructure to protect data durability and infrastructure health, and to address the thermal runaway.
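The arithmetic behind that shortfall can be sketched in a few lines of Python. The figures (seven chillers, four required, five that could not be restarted) come from Microsoft's account; the function name is hypothetical, and the sketch assumes that all five of the chillers that failed to restart belong to the seven-chiller pool serving the two halls.

```python
def cooling_shortfall(total: int, unavailable: int, required: int) -> int:
    """Return how many chillers short the halls are (0 if capacity is met)."""
    available = total - unavailable
    return max(0, required - available)


# Figures reported by Microsoft for the two impacted data halls.
total_chillers = 7       # five in operation plus two on standby
required_chillers = 4    # minimum needed to cool both halls
failed_to_restart = 5    # assumption: all five are part of this pool

shortfall = cooling_shortfall(total_chillers, failed_to_restart, required_chillers)
print(f"Available: {total_chillers - failed_to_restart}, shortfall: {shortfall}")
```

Under that assumption, only two chillers remain against a requirement of four, so the two standby units alone could not have covered the gap.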
This shutdown allowed the chilled water loop to return to a safe temperature so that the chillers could be restarted, but it also meant a further reduction in service availability for this Availability Zone. The chillers were successfully brought back online at 12.12 UTC, and the data hall temperatures returned to operational thresholds by 13.30 UTC. Power was then restored to the affected infrastructure, and a phased process to bring the infrastructure back online began.
Microsoft adds that this permitted its team to restore all power to infrastructure by 15.10 UTC, and once the power was restored all compute scale units were returned to operation. This allowed Azure services to recover. However, some services still experienced issues with coming back online.
In the post-incident review, staffing was considered an issue. So, it’s only natural to ask why that was the case, and to consider what could have been done better. It’s not about lambasting the company itself. Even the best-laid plans to prevent outages can go wrong, and across the industry, there is a shortage of data center talent. So, by examining case studies such as this one, there is an opportunity to establish best practices.
Amongst the many mitigations, Microsoft says it increased its technician staffing levels at the data center “to be prepared to execute manual restart procedures of our chillers prior to the change to the Chiller Management System to prevent restart failures.” The night team was temporarily increased from three to seven technicians to enable them to properly understand the underlying issues, so that appropriate mitigations could be put in place. Microsoft nevertheless believes the staffing levels at “the time would have been sufficient to prevent impact if a ‘load based’ chiller restart sequence had been followed, which we have since implemented.”
It adds: “Data center staffing levels published in the Preliminary PIR only accounted for ‘critical environment’ staff onsite. This did not characterize our total data center staffing levels accurately. To alleviate this misconception, we made a change to the preliminary public PIR posted on the Status History page.”
Yet in a deep dive, ‘Azure Incident Retrospective: VVTQ-J98’, Michael Hughes, VP of APAC datacenter operations at Microsoft, responded to comments about more staff being onsite than the company had originally said were present. It was also suggested that the real fix wasn’t necessarily to have more people onsite, but rather a mode-based sequence in the emergency operating procedures (EOPs), which may not change staffing levels.
Hughes explains: “The three that came out in the report just relate to people who are available to reset the chillers. There were people in their operation staff onsite, and there were also people in the operations center. So that information was incorrect, but you’re right.” He asks us to put ourselves in the moment, with 20 chillers all in an erroneous state following three voltage sags. Of those, 13 require a manual restart, which means deploying manpower across a very large site.
“You’ve got to run out onto the roof of the building to go and manually reset the chiller, and you’re on the clock”, he adds. With chillers impacted and temperatures rising, staff are having to scramble across the site to try to reset the chillers. They don’t quite get to the pod in time, leading to the thermal runaway. The answer, in terms of optimization, is to go first to the data centers with the highest thermal load and the highest number of racks operating, and to recover cooling there.
So, the focus was to recover the chillers with the highest thermal load. This amounts to a tweak on how Microsoft’s EOP is deployed, and it’s about what the system is supposed to do, which he says should have been taken care of by the software. The auto-restart should have happened, and Hughes argues that there shouldn’t have had to be any manual intervention. This has now been fixed. He believes that “you never want to deploy humans to fix problems if you get software to do it for you.” This led to an update of the chiller management system to stop the incident from occurring again.
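The idea of a load-based restart sequence can be illustrated with a minimal Python sketch. This is not Microsoft's chiller management logic, which is not described in detail; the `Chiller` type, the chiller and hall names, and the thermal load figures are all hypothetical. It simply orders the chillers awaiting a manual restart so that those serving the halls with the highest thermal load are visited first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chiller:
    name: str                    # hypothetical chiller identifier
    hall: str                    # hypothetical data hall it serves
    hall_thermal_load_kw: float  # hypothetical thermal load of that hall


def load_based_restart_order(chillers: list[Chiller]) -> list[Chiller]:
    """Order chillers so the highest thermal load is restarted first.

    sorted() is stable, so chillers with equal load keep their input order.
    """
    return sorted(chillers, key=lambda c: c.hall_thermal_load_kw, reverse=True)


# Illustrative values only; none of these come from the incident report.
pending = [
    Chiller("CH-07", "Hall C", 450.0),
    Chiller("CH-02", "Hall A", 1200.0),
    Chiller("CH-11", "Hall B", 900.0),
]

for position, chiller in enumerate(load_based_restart_order(pending), start=1):
    print(position, chiller.name, chiller.hall, chiller.hall_thermal_load_kw)
```

A sequence that instead moves from one building to the next visits chillers in physical order, which risks the hottest loops passing the restart-inhibit temperature before anyone reaches them.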
Industry issue and risk
Ron Davis, vice president of digital infrastructure operations at the Uptime Institute, adds that it’s important to point out that these issues and the risks associated with them exist beyond the Microsoft event. “I have been involved in this sort of incident, when a power event occurred and redundant equipment failed to rotate in, and the chilled water temperature quickly increased to a level that prohibited any associated chiller(s) from starting,” he comments before adding:
“This happens. And it can potentially happen to any organization. Data center operations are critical. From a facilities standpoint, uptime and availability is a primary mission for data centers, to keep them up and running.” Then there is the issue of why the industry is experiencing a staffing shortage. He says the industry is maturing from an equipment, systems, and infrastructure perspective. Even remote monitoring and data center automation are getting better. Yet there is still a heavy reliance on the presence and activities of critical operating technicians, especially during an emergency response as outlined in the Microsoft case.
Davis adds: “At Uptime, we have been doing operational assessments for over a decade, including those related to our Management and Operations stamp of approval, and our Tier Certification of Operational Sustainability. During those assessments, we weigh staffing and organization quite highly.”
Optimal staffing levels
As for whether there were sufficient staff onsite during the Microsoft outage, and what the optimal number of staff present should be, John Booth, Managing Director of Carbon3IT Ltd, and Chair of the Energy Efficiency Group of the Data Centre Alliance, says it very much depends on the design and scale of the data center, as well as on the level of automation for monitoring and maintenance. Data centers also often rely on outsourced personnel for specific maintenance and emergency tasks, who offer a 4-hour response. Beyond this, he suggests there is a need for more information to determine whether 7 staff were sufficient but admits that 3 members of staff are usually the norm for a night shift, “with perhaps more during the day depending on the rate of churn of equipment.”
Davis adds that there is no reliable rule of thumb because each and every organization and site is different. However, there are generally accepted staff calculation techniques that can determine the right staffing levels for a particular data center site. As for the Microsoft incident, he’d need to formally do the calculations to decide whether 3 or 7 technicians were sufficient. It’s otherwise just a guess.
He adds: “I am sure Microsoft has gone through this; any well-developed operating programs must perform these calculations. This is something we look for during our assessments: have they done the staff calculations that are necessary? Some of the factors to include in the calculations are shift presence requirements – what is the number of technicians required to be on-site at all times, in order to do system checks and perform emergency response? Another key consideration is site equipment, systems, and infrastructure: what maintenance hours are required for associated planned, corrective, and other maintenance? Any staffing calculation considers all of these factors and more, including in-house resources and contractors as well.”
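As a rough illustration of how the factors Davis lists might combine, here is a minimal Python sketch. It is not Uptime Institute's method, which the article does not describe; the function, its parameters, and every number in the example are hypothetical. It assumes that shift coverage and maintenance hours are staffed separately, which may overstate the headcount at sites where on-shift technicians also carry out maintenance.

```python
import math

HOURS_PER_YEAR = 24 * 365  # continuous coverage, ignoring leap years


def technicians_needed(
    min_on_site: int,
    productive_hours_per_tech: float,
    maintenance_hours: float,
    contractor_hours: float = 0.0,
) -> int:
    """Estimate in-house headcount from shift presence and maintenance load."""
    # Staff needed to keep min_on_site people present around the clock.
    coverage_fte = min_on_site * HOURS_PER_YEAR / productive_hours_per_tech
    # Maintenance hours left for in-house staff after contractors do their share.
    in_house_maintenance = max(0.0, maintenance_hours - contractor_hours)
    maintenance_fte = in_house_maintenance / productive_hours_per_tech
    return math.ceil(coverage_fte + maintenance_fte)


# Hypothetical site: three technicians on site at all times, 1,752 productive
# hours per technician per year, 4,000 maintenance hours, 496 of them outsourced.
print(technicians_needed(3, 1752, 4000, 496))
```

The point of the sketch is only that a night shift of three implies a much larger roster once round-the-clock coverage and maintenance hours are accounted for.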
Microsoft: Advocate of EOPs
“From what I know of Microsoft, they are a big advocate for emergency operating procedures and correlating operational drills. The properly scripted EOP, used during the performance of a well-developed operational drill may have supported the staff in this effort, and/or perhaps identified the need for more staffing in the event of such an incident.”
Microsoft had emergency operating procedures (EOPs) in place, and it has learnt from this incident and amended them. EOPs are where organizations need to start, and they should also examine testing and drill scenarios. A data center’s best protection, says Davis, is a significant EOP library based on the potential incidents that can occur.
He believes that the Microsoft team did their best and suggests that they deserve all the support available as these situations are very stressful. This support should come in the form of all the training, tools, and documentation an organization can provide them. He is confident that Microsoft is considering all of the lessons learned and adjusting their practices accordingly.
As to whether outages could be attributed to staffing levels, it’s entirely possible, but that might not have been the sole cause in Microsoft’s case, as Booth believes there was a basic design flaw. He thinks an electrical power sag should have triggered backup generators to provide power to all services and so prevent the cooling systems from failing. There should therefore be an improved integrated systems test, in which every system is tested under a range of external emergency events. That test program should include the failure of the chillers and any applicable recovery procedures.