The cascading problems were the result of Amazon's efforts to promise continuous availability of its cloud storage. That meant no downtime for maintenance windows—Amazon's network techs had to work without a net, and this time they were unlucky. But a dive into the details of the outage suggests that a cloud like Amazon's may not be worth the risk, or even offer an advantage, for big retailers—even though Amazon itself is one of the biggest.
The question comes down to whether retailers need the constant availability that Amazon's cloud storage offers. There's a time of day when brick-and-mortar stores are closed and e-commerce traffic slows to a trickle. Every retail IT shop knows when that is, and that's the time for a maintenance window—shutting down all or part of a datacenter to make significant changes. That window provides a safety buffer, when changes can be tested and a single error can't turn into a runaway problem.
Amazon's cloud operation doesn't have that option; it has to promise customers that the cloud will be available all the time. That need to provide the 24/7 uptime retailers don't need could be the greatest source of risk for any retailer considering the cloud.
According to Amazon, its cloud outage began with exactly the sort of change that a conventional datacenter would use a maintenance window for. At 12:47 AM Los Angeles time on April 21, as part of a procedure to upgrade the network for one of Amazon's "availability zones," network techs shifted traffic off one high-capacity router to clear the path for upgrading it.
Traffic was supposed to be shifted to another high-capacity router. Instead, it was mistakenly redirected to a low-capacity network that couldn't handle the load of storage nodes that are constantly making copies of themselves—a fundamental process in the way Amazon's cloud storage works.
With the sudden loss of a usable network, many storage nodes lost contact with their replicas. Amazon's system is set up so that when that happens, the storage node assumes the replica has gone bad and immediately begins searching for a place to create a new replica. Normally, that would happen in milliseconds. But it wasn't until techs identified and corrected the network mistake that those storage nodes could try to mirror themselves.
When the network was restored, it was a catastrophe. A large number of nodes simultaneously went looking for places to replicate. The available free storage was quickly exhausted, leaving many nodes stuck in a loop, searching for free space—what Amazon called a "re-mirroring storm" that prevented 13 percent of the storage volumes in the affected availability zone from doing anything other than looking for space that wasn't there.
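The arithmetic behind the storm can be sketched in a toy model: when more replica-seeking nodes than free slots hit the cluster at once, the overflow is left stuck in a retry loop. All the names and numbers below are illustrative, not Amazon's actual capacity figures.

```python
# Toy model of a "re-mirroring storm": nodes that lost their replicas all
# search for free space at the same time. Counts are illustrative only.

TOTAL_FREE_SLOTS = 1000    # replica slots available in the zone
SEEKING_NODES = 1300       # more seekers than slots

def run_storm(free_slots, seekers):
    placed = 0
    stuck = []
    for node in range(seekers):
        if free_slots > 0:
            free_slots -= 1     # node claims a slot and re-mirrors normally
            placed += 1
        else:
            stuck.append(node)  # no space left: node loops, retrying forever
    return placed, stuck

placed, stuck = run_storm(TOTAL_FREE_SLOTS, SEEKING_NODES)
print(f"{placed} nodes re-mirrored; {len(stuck)} stuck searching for space")
```

Once free space runs out, every additional seeker lands in the stuck list, and each retry adds load without any chance of success.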
All those requests for more space were hammering on a software control plane that did the work of creating new storage volumes. Because the control plane was configured with a long time-out period, requests for space began to back up. That used up all of the control plane's processor threads, locking it up. Result: The problems spread from a single availability zone to other cloud availability zones in the Virginia datacenter.
At 2:40 AM Los Angeles time—two hours after the original network mistake—techs disabled the capability of nodes in the original availability zone to ask for new space. By 2:50 AM, the control plane began to stabilize.
But by 5:30 AM, as the number of stuck storage nodes increased, the control plane began to fail again—and this time, it was knocked out entirely. At 8:20 AM, techs began disabling all communication between storage nodes in the original availability zone and the control plane. Once again, everything outside that zone began returning to normal.
By 11:30 AM, techs figured out a way to block the servers in the problem zone from asking each other for storage space that none of the other servers had, either. By 12:00 PM, error rates had returned to near normal—but the number of stuck volumes was back up to 13 percent.
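That fix amounts to a kind of kill switch: a zone-wide flag that makes space requests fail fast locally instead of generating useless traffic to peers that have no free space either. The sketch below is a hypothetical illustration of the idea, not Amazon's actual mechanism; all names are invented.

```python
# Hypothetical kill-switch sketch: when the flag is flipped, stuck nodes'
# space requests short-circuit locally and generate no peer traffic.

class Zone:
    def __init__(self):
        self.space_requests_enabled = True
        self.peer_requests_seen = 0

    def request_space(self):
        if not self.space_requests_enabled:
            return None              # fail fast; no load on peers or network
        self.peer_requests_seen += 1 # request reaches peers (who have no space)
        return None

zone = Zone()
for _ in range(1000):                # stuck nodes retrying in a loop
    zone.request_space()
zone.space_requests_enabled = False  # techs flip the switch
for _ in range(1000):                # retries continue, but go nowhere
    zone.request_space()
print(f"peer requests after the switch: still {zone.peer_requests_seen}")
```

The stuck nodes keep retrying either way; the switch just keeps their retries from hammering the rest of the zone while techs bring in new capacity.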
And the only way to get them unstuck was to physically bring in lots more storage. There was no way to kill off the many stuck data replicas until working replicas were created, nor was there space to create the working replicas without new hardware. Amazon couldn't even use its own cloud services for that storage—its "regions" are kept isolated from each other to keep problems from spreading.
Techs weren't able to start adding new storage until 2:00 AM on April 22—more than a day after the start of the outage. By 12:30 PM, all but 2.2 percent of the volumes were restored, although not all of them were completely unstuck. It took until 11:30 AM on April 23 to work out how to reconnect the stuck volumes to the control plane without overloading it again and to test the process. By 6:15 PM, most nodes were communicating again.
Then came the process of manually trying to fix the remaining 2.2 percent of the nodes that were still stuck. By 12:30 PM on April 24—three and a half days after the original outage—all but 1.04 percent of the affected volumes were recovered. In the end, 0.07 percent of the volumes could never be restored. (Amazon sent snapshots of that data to the customers it belonged to, advising them, "If you have no need for this snapshot, please delete it to avoid incurring storage charges.")
And Amazon's cloud database service? That was affected, too, and the results were even more catastrophic. The database service runs on top of the cloud storage system, so the damage was amplified: for customers whose databases lived entirely in the crippled availability zone, at the peak of the problem 45 percent of those databases were crippled by stuck volumes, even though at worst only 13 percent of the storage volumes were stuck.
The final tally for the outage: Exactly half a week during which a significant number of Amazon cloud customers suffered from crippled or nonexistent IT functionality.
In a conventional datacenter, with a conventional approach to maintenance windows, that would have been almost impossible (although American Eagle Outfitters might beg to differ). The initial network configuration error would probably have been caught as soon as testing of the changes began. The cascade of stuck storage nodes, the control plane thread starvation, the exhausted storage space and the crippled databases—they never would have happened.
But all that technology dedicated to supporting Amazon's high priority for availability ultimately produced 1 percent of a year as downtime in a single stretch.
Amazon has outlined changes it plans to make, and that should make the next incident less painful. But when an IT shop has to work without a net, there will be a next incident. There's no way to avoid it.