An electrical distribution system failure at a Microsoft data center led to a two-hour outage when the company tried to swap from utility power to its backup generators.
In an incident report published this week, the company said a “power issue” impacted a subset of customers in a single Availability Zone within the West Europe region in the Netherlands, between 07:31 and 09:15 UTC on 20 October 2023. Azure services including App Service, Cosmos DB, SQL DB, Storage, and Virtual Machines were affected.
Microsoft said it had detected “instability” from the utility power grid, in the form of voltage sags and swells, affecting one of its data centers within the AZ-01 Availability Zone.
As a result, the company decided to transfer the load from the grid to backup generators, but an issue with the generator startup led to an outage for some racks.
“During this process, a critical failure occurred in a section of the electrical distribution system, preventing 10 percent of our generators from taking load. This failure left the main distribution system offline and the redundant system inaccessible. As a result of this failure, approximately 1 percent of our server racks in this Availability Zone lost power.”
The nature and cause of the distribution failure weren’t detailed.
As the grid had stabilized, the company swapped back from the generators to utility power.
“In total, five Storage scale units were impacted by this incident. Following power restoration, four recovered completely by 09:10 UTC, while the fifth required hardware diagnostics and part replacements on approximately 5 percent of its storage nodes,” the company said in the incident report. “As a result, it took longer to restore availability for the last <1 percent of storage accounts, with downstream impact to customers and services reliant on this final Storage scale unit. By 14:30 UTC, all but a few storage accounts had their availability restored, and by 17:10 UTC, full restoration was complete.”
Microsoft said it will publish a second impact report in the coming weeks with further details and lessons learned, including repair items related to the event and any potential repair items to help downstream services recover more quickly from scenarios like this one.
The West Europe Azure region opened in 2010 and operates with three Availability Zones.
In late August, a utility power sag in Australia led to an outage at a Microsoft data center in Sydney, after chillers were brought offline during a thunderstorm and failed to restart automatically.