Human Error Is Still Amazon Cloud's Achilles Heel

The Amazon Cloud outage on December 24—the one that knocked Netflix offline for much of Christmas Eve—was due purely to human error. And it was the dumbest sort of human error: an Amazon developer with special privileges mistakenly ran a maintenance process against the production system, wiping out critical state data—and then didn't realize he had crippled the system until hours after it began causing problems for customers, according to the version of events Amazon released on Monday (Dec. 31).

It then took more than 12 hours (including a false start or two) for Amazon's team to re-create the data, and several more hours to slowly get the system working again. Total outage time: possibly the longest 23 hours and 41 minutes in Amazon's history.

According to Amazon's own summary of the outage—beg pardon, "service event"—the problem originated in the load-balancing systems for Amazon's cloud and only affected customers in the Eastern region of the U.S. At 12:24 PM Pacific time (3:24 PM Eastern) on December 24, "a portion of the ELB [Elastic Load Balancing System] state data was logically deleted. This data is used and maintained by the ELB control plane to manage the configuration of the ELB load balancers in the region (for example, tracking all the backend hosts to which traffic should be routed by each load balancer)," according to Amazon.

Translation: Amazon's cloud forgot everything it knew about how to let customers do load balancing.

The data was deleted by "one of a very small number of developers who have access to this production environment," inadvertently running the maintenance process against the production ELB state data, according to the Amazon report.

How was that possible? It turns out that most access to the cloud's production systems goes through a strict change management process, which should have prevented this mistake. But Amazon is in the process of automating some cloud-maintenance processes, and a small number of developers have permission to run those processes manually. It also turns out that after those developers had accessed the processes the first time, they didn't have to go through an access process again. In effect, that got rid of the "Do you really want to bring the Amazon Cloud crashing down? OK/Cancel" message.
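To make the difference between standing access and per-run approval concrete, here is a minimal sketch. It is not Amazon's mechanism, which the report does not describe at this level of detail; the class `ChangeGate`, the exception `ApprovalRequired`, and the developer and ticket names are all hypothetical.

```python
class ApprovalRequired(Exception):
    """Raised when a production run has no fresh approval behind it."""


class ChangeGate:
    """Hypothetical gate: each production run consumes one approval."""

    def __init__(self):
        self._approved = set()  # (developer, ticket) pairs not yet used

    def approve(self, developer, ticket):
        """Record a change-management approval good for a single run."""
        self._approved.add((developer, ticket))

    def run(self, developer, ticket, target, job):
        """Run job() against target; production needs an unused approval."""
        if target == "production":
            key = (developer, ticket)
            if key not in self._approved:
                raise ApprovalRequired(f"{developer} has no approval for {ticket}")
            self._approved.remove(key)  # single use: no standing access
        return job()


gate = ChangeGate()
gate.approve("dev-a", "CM-1")  # illustrative names only
first = gate.run("dev-a", "CM-1", "production", lambda: "maintenance ran")
try:
    # Same developer, same ticket, second attempt: the approval is spent.
    gate.run("dev-a", "CM-1", "production", lambda: "maintenance ran")
    second = "allowed"
except ApprovalRequired:
    second = "blocked"
```

The point of the sketch is the `remove` call: the approval is used up by the run it authorized, so a second run against production has to go back through change management.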

Yes, Amazon has fixed that—now everything goes through change management. Back to the timeline:

At 12:24 PM on December 24, ELB state data was deleted. "The ELB control plane began experiencing high latency and error rates for API calls to manage ELB load balancers," according to Amazon. But the system was still handling basic load-balancing requests to create and manage new load balancers, because it didn't need state data to do that.

Amazon's technical teams spotted the API errors but didn't spot the pattern that new load balancers could be managed while older (pre-12:24 PM) load balancers couldn't be properly managed, because their configuration data was gone.

Meanwhile, some customers began to see performance problems with their cloud applications. It wasn't until the team started digging into the specifics of those performance problems that they spotted the missing state data as the root cause of the problem.

At 5:02 PM on December 24, the Amazon team disabled the ability of the load balancers to scale up or down or be modified. That stopped the spread of the problem, and the team began looking for a way to fix it. "At the peak of the event, 6.8 percent of running ELB load balancers were impacted," Amazon said. That's a bit coy, though: the other 93.2 percent were technically operating correctly but outside the control of customers.

The team manually recovered some of the affected load balancers, but the main plan was to rebuild the deleted state data as of 12:24 PM, then merge in all the API calls after that point to create an uncorrupted configuration for each load balancer and get them working correctly again.
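The recovery plan amounts to "restore a snapshot, then replay the log." A minimal sketch of that idea follows, under loud assumptions: the record shape `ApiCall`, the three action names, and the sample addresses are invented for illustration, and Amazon's report does not describe its actual data model or tooling.

```python
from copy import deepcopy
from dataclasses import dataclass

VALID_ACTIONS = {"register", "deregister", "delete"}  # hypothetical action set


@dataclass(frozen=True)
class ApiCall:
    """One logged control-plane change (hypothetical record shape)."""
    timestamp: int       # minutes after the snapshot was taken
    load_balancer: str
    action: str          # one of VALID_ACTIONS
    backend: str = ""


def rebuild_state(snapshot, api_log):
    """Return a new state: the snapshot plus every later call, in time order."""
    state = deepcopy(snapshot)  # never mutate the recovered snapshot itself
    for call in sorted(api_log, key=lambda c: c.timestamp):
        if call.action not in VALID_ACTIONS:
            raise ValueError(f"unknown action: {call.action}")
        if call.action == "delete":
            state.pop(call.load_balancer, None)
            continue
        backends = state.setdefault(call.load_balancer, set())
        if call.action == "register":
            backends.add(call.backend)
        else:  # "deregister"
            backends.discard(call.backend)
    return state


# State as of the snapshot: one load balancer routing to two backend hosts.
snapshot = {"lb-old": {"10.0.0.1", "10.0.0.2"}}
# Calls logged after the snapshot, deliberately out of order.
api_log = [
    ApiCall(30, "lb-new", "register", "10.0.1.1"),
    ApiCall(10, "lb-old", "deregister", "10.0.0.2"),
    ApiCall(20, "lb-old", "register", "10.0.0.3"),
]
rebuilt = rebuild_state(snapshot, api_log)
```

Sorting by timestamp before replaying matters because the merged result depends on the order of changes; working on a copy keeps the snapshot available if the merge has to be retried.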

The first try at doing that took several hours. It failed.

At 2:45 AM on December 25, a different approach finally made it possible to restore a snapshot of the ELB state data to what it was more than 12 hours earlier.

At 5:40 AM on December 25, 15 hours' worth of API calls and state changes were finally merged in and verified, and the team began to slowly re-enable the APIs and to recover the load balancers.

At 8:15 AM on December 25, most of the APIs and workflows had been re-enabled.

At 10:30 AM on December 25, almost all load balancers were working correctly.

At 12:05 PM on December 25, Amazon announced that its U.S.-East cloud was operating normally again.
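As a quick check on the arithmetic, the timestamps in the timeline can be subtracted directly. This sketch reads every time as Pacific and assumes the year 2012, since the article gives only month and day; the year does not affect the result. The event labels are paraphrases, and `elapsed_hm` is a helper written for this example.

```python
from datetime import datetime

# Timeline entries as reported above, all read as Pacific time.
# The year (2012) is an assumption; the article gives only month and day.
TIMELINE = [
    ("ELB state data deleted", datetime(2012, 12, 24, 12, 24)),
    ("scaling and changes disabled", datetime(2012, 12, 24, 17, 2)),
    ("state snapshot restored", datetime(2012, 12, 25, 2, 45)),
    ("API calls merged and verified", datetime(2012, 12, 25, 5, 40)),
    ("most APIs and workflows re-enabled", datetime(2012, 12, 25, 8, 15)),
    ("almost all load balancers working", datetime(2012, 12, 25, 10, 30)),
    ("region declared normal", datetime(2012, 12, 25, 12, 5)),
]


def elapsed_hm(start, end):
    """Return the gap between two datetimes as (hours, minutes)."""
    total_minutes = int((end - start).total_seconds()) // 60
    return divmod(total_minutes, 60)


start = TIMELINE[0][1]
for label, when in TIMELINE[1:]:
    hours, minutes = elapsed_hm(start, when)
    print(f"{hours:2d}h {minutes:02d}m  {label}")
```

The final entry works out to 23 hours 41 minutes, matching the total outage time quoted earlier, and the gap between the deletion and the 5:02 PM lockdown is 4 hours 38 minutes.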

Yes, Amazon has learned from the experience—changing access controls so a programmer can't do that again, adding checks for the health of state data to its data-recovery process and starting to work up ways for load balancers to heal themselves in the future. And to be fair, this incident wasn't as lengthy as the April 2011 incident in which a change in network configuration (another human error) paralyzed much of Amazon's cloud storage for three days.

But coming on Christmas Eve, the timing was really lousy. Fortunately for most big chains, they're not using Amazon's cloud—yet.

But these high-profile Amazon problems are actually a good sign, at least if you're not Netflix or one of the smaller E-tailers who were slammed by the load-balancing failure. The catastrophes are shorter, fewer and further between, and Amazon is getting better at dealing with them.

The problem that remains: Amazon is still learning. And all that learning is still happening in datacenters that never shut down for maintenance.

There's an irony here: We know from our own experience that everything in a cloud can be moved out of a particular datacenter, so that datacenter doesn't have to run 24/7 forever. When Hurricane Sandy was heading up the East Coast of the U.S., StorefrontBacktalk's cloud provider (not Amazon) moved our virtual production server from Virginia to Chicago, apparently without a glitch, and presumably did the same with all the other cloud servers in that threatened Virginia datacenter. We have no doubt Amazon can do the same thing (and quite possibly did).

That dodges some maintenance-related problems. But it wouldn't have helped in this case. Things are getting better. But the cloud still may not be mature enough for a big chain's production E-Commerce system, with or without humans.