OpenAI has hired Todd Underwood to head a new Site Reliability Engineering (SRE) team focused on research and training workloads.

The generative artificial intelligence company already has an SRE team on the applied side, working on inference and API products, Underwood said.

As their name suggests, SREs are tasked with building and maintaining highly reliable and scalable software systems. The concept originated at Google, but has since spread across the IT industry.

"At Google I created the Machine Learning Site Reliability Engineering (ML SRE) organization," Underwood said on LinkedIn. "We founded it in 2016 (there was already a Cloud ML SRE team; we built one for internal services and then combined them).

"After a reorganization split those teams up, I went off to work on Capital Engineering... Recently, I really wanted to get back to more SRE work but also to move closer to the ML infrastructure, especially the training infrastructure. Hence OpenAI!"

Underwood spent 14 years and nine months at Google and is co-author of the O'Reilly book Reliable Machine Learning.

In his post, Underwood added: "I’m now in a position to build a new team of ML training infrastructure at some interesting scale (even interesting for folks coming from Google, I dare say).

"This is a team that will need to be involved in the infrastructure from the ground up to the model, with opportunities to work on hardware health of accelerators, job orchestration and execution, model dynamics, and of course a special focus on metrics and measurement."

Underwood joined the ChatGPT maker last week amid the chaos of CEO Sam Altman's firing, and was among the employees who signed a letter threatening to quit and join Microsoft if Altman wasn't rehired. Altman was back at OpenAI after five days.

"I will say that this was a slightly more interesting onboarding than I have ever had at any job," Underwood said. "The full story might require a beverage and a relaxed setting."

That week, OpenAI also hired the former lead for Google's TPU AI chip to head a new hardware division.