Amazon Web Services has explained what went wrong at its US-EAST-1 cloud region last week.

On November 25th, the company's real-time streaming data processing service, Amazon Kinesis, stopped working at its Northern Virginia data center campus.

Kinesis is used by other AWS services, which also stopped working, knocking a number of customers offline, including Flickr, iRobot, and Roku, as well as AWS's own Service Health Dashboard.

A Thanksgiving surprise

At 2:44 AM PST, the company added capacity to the Kinesis front-end server fleet. "Kinesis has a large number of 'back-end' cell-clusters that process streams," AWS explains.

"These are the workhorses in Kinesis, providing distribution, access, and scalability for stream processing. Streams are spread across the back-end through a sharding mechanism owned by a “front-end” fleet of servers. A back-end cluster owns many shards and provides a consistent scaling unit and fault-isolation. The front-end’s job is small but important. It handles authentication, throttling, and request-routing to the correct stream-shards on the back-end clusters."

Each server in the front-end fleet maintains a cache of information, including membership details and shard ownership for the back-end clusters, called a shard-map. This information is obtained through calls to a microservice vending the membership information, retrieval of configuration information from DynamoDB, and continuous processing of messages from other Kinesis front-end servers.

"For the latter communication, each front-end server creates operating system threads for each of the other servers in the front-end fleet. Upon any addition of capacity, the servers that are already operating members of the fleet will learn of new servers joining and establish the appropriate threads. It takes up to an hour for any existing front-end fleet member to learn of new participants."

As this was happening, a number of errors began to occur - some related to the new capacity, some apparently unrelated. "At 7:51 AM PST, we had narrowed the root cause to a couple of candidates and determined that any of the most likely sources of the problem would require a full restart of the front-end fleet, which the Kinesis team knew would be a long and careful process," Amazon said.

"The resources within a front-end server that are used to populate the shard-map compete with the resources that are used to process incoming requests. So, bringing front-end servers back online too quickly would create contention between these two needs and result in very few resources being available to handle incoming requests, leading to increased errors and request latencies. As a result, these slow front-end servers could be deemed unhealthy and removed from the fleet, which in turn, would set back the recovery process."

The company believed that the root cause could be an issue creating memory pressure, but knew that if this diagnosis was wrong, recovery time would roughly double, since a second fix would need to be applied and the fleet restarted again.

"At 9:39 AM PST, we were able to confirm a root cause, and it turned out this wasn’t driven by memory pressure. Rather, the new capacity had caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration. As this limit was being exceeded, cache construction was failing to complete and front-end servers were ending up with useless shard-maps that left them unable to route requests to back-end clusters.

"We didn’t want to increase the operating system limit without further testing, and as we had just completed the removal of the additional capacity that triggered the event, we determined that the thread count would no longer exceed the operating system limit and proceeded with the restart.

AWS began bringing back the front-end servers, with the first group taking Kinesis traffic at 10:07 AM PST. "The front-end fleet is composed of many thousands of servers, and for the reasons described earlier, we could only add servers at the rate of a few hundred per hour. We continued to slowly add traffic to the front-end fleet with the Kinesis error rate steadily dropping from noon onward. Kinesis fully returned to normal at 10:23 PM PST," the company said.
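The arithmetic roughly matches the timeline: at numbers of the order AWS gives, bringing the whole fleet back takes well over ten hours. Both figures below are assumptions, not AWS's actual values.

```python
FLEET_SIZE = 4000         # assumed; AWS says only "many thousands" of servers
SERVERS_PER_HOUR = 300    # assumed; AWS says "a few hundred per hour"

hours = FLEET_SIZE / SERVERS_PER_HOUR
print(f"~{hours:.0f} hours to restore the fleet")  # ~13 hours, in line with 10:07 AM to 10:23 PM PST
```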

To avoid the same issue happening again, AWS will move to servers with more CPU and memory, reducing the total number of servers in the fleet and, with it, the number of threads each server needs to communicate across the fleet.
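The reasoning behind fewer, larger servers is roughly quadratic: with a thread per peer, the fleet-wide thread count scales with the square of the number of servers. A toy comparison with made-up fleet sizes:

```python
def peer_thread_cost(fleet_size: int) -> tuple[int, int]:
    per_server = fleet_size - 1            # one thread per other front-end server
    fleet_total = fleet_size * per_server  # grows roughly with the square of fleet size
    return per_server, fleet_total


# Illustrative numbers only: halving the server count (onto larger hosts)
# halves the per-server thread count and quarters the fleet-wide total,
# restoring headroom under the operating system thread limit.
print(peer_thread_cost(4000))  # (3999, 15996000)
print(peer_thread_cost(2000))  # (1999, 3998000)
```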

"We are moving the front-end server cache to a dedicated fleet. We will also move a few large AWS services, like CloudWatch, to a separate, partitioned front-end fleet. In the medium term, we will greatly accelerate the cellularization of the front-end fleet to match what we’ve done with the back-end. Cellularization is an approach we use to isolate the effects of failure within a service, and to keep the components of the service (in this case, the shard-map cache) operating within a previously tested and operated range."

The company also apologized for the issues, noting how critical the service is to its customers and their customers.