In November 2017, AWS changed how it charged for a service. The switch, made suddenly and with little fanfare, was touted as a small improvement - but raised prices and accidentally stymied a government cloud project.
AWS EC2 Spot Instances, launched in 2009, have always been something of a gamble. Available at a significantly lower price than standard EC2 instances, the Spot market allows users to bid for the remaining capacity in an AWS data center. The more bids, the higher the price - or at least, that’s the claim.
While Spot Instances are cheaper, users run the risk of the work being terminated if the Spot price exceeds the maximum price bid by the user, or if the capacity is no longer available.
“What you're looking at there is our attempt to recover the marginal cost of that as-yet unused capacity, capacity that has not yet been sold for demand usage or for reserve instances,” Ian Massingham, AWS director of developer technology and evangelism, told DCD last year.
“So that's essentially what the Spot market is; it is AWS recovering the marginal cost of having large amounts of capacity deployed and unused by customers around the world.” At the time, however, AWS had already changed its algorithm - and Amazon has since declined numerous requests for comment from DCD.
This feature appeared in the November issue of DCD Magazine.
A spot change
In the early years, the potential cost savings from Spot pricing proved enticing for many, including the US National Science Foundation. Rich Wolski, professor of Computer Science at the University of California, Santa Barbara, was part of a team building a federated cloud for several US universities with NSF backing.
The aim of the Aristotle Cloud Federation was for the institutions to share computing resources across their data centers. “But at some point, if all of the institutions get full, what we want to do is burst from the Federation into Amazon,” Wolski told DCD.
The group decided to use the Spot market to maximize cost savings. Jamie Kinney, AWS senior manager scientific computing, said in a press release at the time: “We are excited to work with the Aristotle team to provide cost-effective and scalable infrastructure that helps accelerate the time to science.”
But as it was university-led scientific research, backed with government money, the ‘bursts’ required some level of predictability. “Universities do fixed budget resource allocation, you get ‘this’ many dollars, and it has to last ‘that’ many years,” Wolski said.
So Wolski and his team developed an algorithm to predict Spot price changes, and the likelihood that a workload would be terminated early. “We would be able to say if you bid ‘this’ much, you'll get a day's worth of time, guaranteed with 99 percent probability. It was a great success,” Wolski said. “This went on for a couple of years.”
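To give a feel for what such a prediction involves, here is a minimal sketch in Python. It is not the team's published technique: it assumes a simple empirical approach in which the bid is set at the peak price that a chosen fraction of past time windows stayed under. The function name `bid_for_duration` and the price trace are hypothetical, and a figure derived this way reflects history only, not a guarantee.

```python
import math


def bid_for_duration(prices, window, confidence=0.99):
    """Return the lowest bid that would have survived `window` consecutive
    price samples in at least `confidence` of the historical windows."""
    if window <= 0 or window > len(prices):
        raise ValueError("window must be between 1 and len(prices)")
    # Highest price seen in every run of `window` consecutive samples.
    window_peaks = sorted(
        max(prices[i:i + window]) for i in range(len(prices) - window + 1)
    )
    # Smallest peak that covers the requested fraction of windows.
    index = math.ceil(confidence * len(window_peaks)) - 1
    return window_peaks[max(index, 0)]


# Hypothetical hourly price trace in dollars (made-up numbers).
history = [0.10, 0.11, 0.10, 0.12, 0.30, 0.11, 0.10, 0.10, 0.12, 0.11]
bid = bid_for_duration(history, window=3, confidence=0.9)
print(f"Bid at least ${bid:.2f} to cover 3 hours in 90% of past windows")
```

A single spike in the trace dominates the result at high confidence levels, which is why a method like this depends on prices actually being driven by observable demand.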
Then in late 2017, something happened. “We saw in the press that Amazon had changed the pricing. At first, I was overjoyed - we thought, wow, this is great. If you smooth things, the technique that we had developed should just become much more accurate.
“And we started looking at it, and it didn't look right. From a mathematical perspective, from a data analysis perspective, it just didn't look like what the press was saying was happening.
“Why doesn't this look right? Has something else changed? Is our method wrong?” Wolski’s team scrambled to work out what had happened. “We started digging into it, we read everything we could read, and we started seeing reports from the popular press about companies that had their own internal algorithm for optimizing their use of the Spot market. And those algorithms were breaking.
“We went back in and just did a very careful analysis,” Wolski said. The results, published in the research paper Analyzing AWS Spot Instance Pricing (August 2019), found that prices were higher by an average of between 37 percent and 61 percent.
But price increases were not the real issue for Wolski’s team: “If you're doing fixed budget stuff, that just means you have less work to do,” he said.
The problem was that it became far harder to predict which workloads would be terminated, with the system relying less on auction-like market forces, and instead on a hidden algorithm to decide costs and when to end workloads. “This had an impact,” Wolski said. “It was suddenly unreliable, people who were depending on the fact that we can make this prediction could no longer use it.”
The change fundamentally shifted Spot Instances away from what they used to be, Wolski said. “There's no indication that after the change it's a market at all, it's just retail pricing - it's dynamically changing retail pricing. Amazon has full price control of the Spot Instances.”
Spot takeover
“I felt pessimistic,” Professor John Brevik, who worked on the original NSF technique and the subsequent paper, said. “It's sort of like this door's been shut on what was an interesting thing to figure out - this dynamic pricing mechanism and how to predict it. I'll leave it to the more corporate or economically astute to infer why that kind of change happened.”
But while the 2017 shift was the final nail in the coffin for the market system, Spot Instances have always relied on hidden algorithms and an invisible hand to control pricing, Orna Agmon Ben-Yehuda told DCD. Her team at the Israel Institute of Technology studied Spot Instances from their launch in 2009 until 2013.
“In 2011 we first showed that during the first two years of the operation of the AWS spot instances, 98 percent of the price traces were consistent with being the result of an artificial algorithm. This algorithm computed a reserve price: a price under which AWS were not willing to rent the instance.”
Her work discovered “the existence of several unnatural, artificial characteristics, which had no economic justification.”
She added: “I would like to stress that the problem was never that AWS had a reserve price, or even that they changed it. The problem was that they declared their prices were based on supply and demand... and people believed it, and based their academic work and economic plans on that.”
There are legitimate reasons for some control of the market, Steve Fox, CEO of AWS reseller AutoScalr, said. Users began to realize that if they bid ridiculously high prices, forcing others onto the standard On Demand service, it would clear the marketplace. "I only have to pay that high price for a couple of minutes, and then I'm back down at the cheap price - and I am never interrupted."
Fox told DCD: "So it turned into this game of chicken, where people started bidding higher and higher and higher, just trying to keep from getting interrupted. And it got so bad, where prices were going extraordinarily high, like 1,000 times. So Amazon put a cap on it to say prices could never go above 10 times the On Demand price."
For AutoScalr and its customers, the 2017 change did bring about price stabilization but left the company in the dark as to when workloads would be terminated. "So now you had a very predictable price that didn't change very fast, but you never knew when it was going to go away," Fox said.
Previously, prices would rapidly rise when more users requested Spot Instances, and it was obvious that the chance of being terminated would rise with it. "And so we had algorithms that would diversify away from the risk and go to more stable spot markets, which meant our overall interruption rate was lower," Fox said.
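The idea of steering away from rising prices can be sketched roughly as follows. This is an illustration under stated assumptions, not AutoScalr's algorithm: it assumes that a pool's recent price relative to its long-run median is a usable signal of interruption risk, as it was said to be before the change. The pool names, prices and the `pick_stable_pools` helper are all hypothetical.

```python
from statistics import median


def pick_stable_pools(price_history, count, recent=3):
    """Rank pools by how far recent prices sit above their long-run median
    and return the names of the `count` calmest ones."""
    def pressure(prices):
        # Average of the latest samples divided by the pool's median price.
        recent_avg = sum(prices[-recent:]) / min(recent, len(prices))
        return recent_avg / median(prices)

    ranked = sorted(price_history, key=lambda pool: pressure(price_history[pool]))
    return ranked[:count]


# Hypothetical price traces in dollars per hour (made-up numbers).
pools = {
    "pool-a": [0.10, 0.10, 0.10, 0.10, 0.25, 0.30],  # price climbing fast
    "pool-b": [0.20, 0.20, 0.21, 0.20, 0.20, 0.20],  # flat
    "pool-c": [0.05, 0.05, 0.06, 0.05, 0.06, 0.07],  # drifting up slightly
}
chosen = pick_stable_pools(pools, count=2)
print(chosen)
```

Once prices stop reacting to demand within minutes, a signal like this carries little information, which is the difficulty Fox goes on to describe.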
Now the price change is much slower - "it looks like it's on the order of days or weeks, whereas before, it was minutes.
“So the challenge is, when a lot of people come in and start to use an instance, eventually they run out, and the price doesn't change. But how do they go about picking who to terminate? That's a big mystery."
Fox, whose company is a certified AWS partner, was keen to point out that, despite the change, AWS Spot Instances are still cheap - and added that Amazon regularly makes price cuts across all of its services.
"It's like, yeah, maybe the prices have crept up a little bit. But it's a dramatic way to save on cloud compute. Maybe it wasn't as good a deal as it was a few years ago, but it's still a good deal."
Customers of AutoScalr, he said, continue to use the Spot market. "You just have to lean heavier on diversification, as opposed to prediction."
But for Wolski’s ‘burst’ method for the Aristotle Cloud Federation, the change proved fatal.
Aristotle's tragedy
"I don't want to ascribe to them a nefarious purpose here,” Wolski said. “I think it was more that we were so far underneath their radar that they just missed it." He believes that his technique, which was publicly available, added value to the Spot Market, making it better for other users.
“You have the science community doing something that might make other people use Amazon in a more efficient way, and we're not charging for it,” Wolski said. “I sense that if the Amazon people had thought about that, maybe they would have announced this thing differently, or they would have contacted us.”
His team is currently working on a replacement tool that aims to rank various cloud companies' instances and compare them against internal capacity for a given workload.
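A comparison of that kind might look something like the Python sketch below. It is not the team's tool: the `Offer` type, the `cheapest_plan` function, the prices and the "power" figures are all hypothetical, and it assumes fixed hourly retail prices with every instance billed for the whole deadline.

```python
import math
from dataclasses import dataclass


@dataclass
class Offer:
    name: str
    hourly_price: float  # dollars per instance-hour
    power: float         # work units one instance completes per hour


def cheapest_plan(offers, work_units, deadline_hours):
    """Return (offer name, instance count, total cost) for the cheapest
    way to finish `work_units` within `deadline_hours`."""
    best = None
    for offer in offers:
        # Instances needed to finish the work before the deadline.
        instances = math.ceil(work_units / (offer.power * deadline_hours))
        cost = instances * offer.hourly_price * deadline_hours
        if best is None or cost < best[2]:
            best = (offer.name, instances, cost)
    return best


# Hypothetical retail offers (made-up numbers).
offers = [
    Offer("small", hourly_price=0.50, power=1.0),
    Offer("large", hourly_price=1.75, power=4.0),
]
print(cheapest_plan(offers, work_units=100, deadline_hours=10))
print(cheapest_plan(offers, work_units=100, deadline_hours=20))
```

Because the inputs are fixed retail prices rather than a moving market, the same question asked on different days gives the same answer.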
"So if I'm gonna burst with a retail product, what's my spend going to be? What's say the minimum I have to spend to get the same power? Or if I want twice as much power, how much do I have to spend? Or if I can wait twice as long and I want half the power, what do I have to spend? And because it's retail pricing, that will be stable."
While he is working on a solution, the whole experience has given Wolski pause. “It was an important lesson for the science community. Normally, we buy machines. And when you buy a machine, it's that machine until you throw it away.
“It doesn't morph into something else halfway through its lifetime. If on a Tuesday it's an x86 box, it's going to be an x86 box on a Wednesday. But when you buy a service, that can happen. It was still called Spot Instances, it still had the API.
“It's just that on some day, you went to it, and it behaved completely differently.”