Crushed wheels caused a Google server rack to overheat at one of its edge network points-of-presence.
According to a post by Google Cloud solutions architect Steve McGhee, site reliability engineering (SRE) was notified of an “abnormally high number of errors” emanating from an unnamed data center. The issue was fixed before it could lead to any noticeable impact on users, Google said.
Google has given no details about when or where this happened, but it said the malfunctioning machines were removed and SRE migrated the workloads over to redundant “resources.” Engineers then isolated the problem to a single rack and uncovered kernel messages in the rack's base system log.
“Package temperature above threshold, CPU clock throttled (total events = 1596886),” the message read.
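For readers curious how such a message might be picked out of a log, here is a minimal sketch in Python. It assumes only the message format quoted above; the pattern, the `throttle_events` helper, and the sample lines are illustrative and are not taken from Google's tooling.

```python
import re

# Pattern modeled on the message quoted in the article (an assumption).
# Real kernel logs usually add a timestamp and CPU prefix, which the
# search-based match below simply skips over.
THROTTLE_RE = re.compile(
    r"(?P<scope>Package|Core) temperature above threshold, "
    r"cpu clock throttled \(total events = (?P<events>\d+)\)",
    re.IGNORECASE,
)


def throttle_events(lines):
    """Return the highest throttle-event count seen for each scope."""
    counts = {}
    for line in lines:
        match = THROTTLE_RE.search(line)
        if match:
            scope = match.group("scope").capitalize()
            events = int(match.group("events"))
            counts[scope] = max(counts.get(scope, 0), events)
    return counts


# Hypothetical input: the quoted message plus an unrelated line.
sample = [
    "Package temperature above threshold, CPU clock throttled (total events = 1596886)",
    "unrelated log line",
]
print(throttle_events(sample))
```

A count in the millions, as in the quoted message, would suggest sustained throttling rather than a brief spike.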
The leaning tower of Google
The cause turned out to be damage to the rear wheels, which tilted the rack forward, disrupting the flow of liquid coolant and causing some CPUs to heat up to the point of being throttled. Google says the damage was repaired and the rack was returned to proper alignment.
The company then systematically replaced all racks that had the same wheels to ensure the issue would not happen again.