PayPal's outage again spotlights the problem of backup strategies that simply don't. It's painfully reminiscent of recent datacenter fiascos at American Eagle Outfitters and Wal-Mart. And while some major retailers were kept apprised of the progress of PayPal's outage and disabled PayPal payment functionality on their E-Commerce sites to minimize problems, most of PayPal's customers got the word late or not at all. Apparently there was no effective plan for dealing with that side of the outage, either.
PayPal isn't saying much about the outage beyond its official statement by Scott Guilfoyle, the company's senior VP for platform services: "At around 8:07 AM [San Francisco time Friday], a network hardware failure in one of our datacenters resulted in a service interruption for all PayPal users worldwide. Everyone in our organization was immediately engaged to identify the issue and get PayPal back up and running. We were not able to switch over to our backup systems as quickly as planned. We partially restored service by approximately 8:45 AM and the issue was fully resolved by 9:24 AM. A second service interruption started at around 11:30 AM and was partially resolved at 11:55 AM with full recovery at 12:21 PM."
But the company's "Live Site Status" blog tells a more detailed story. According to the technical blog, the incident (and PayPal) went down like this:
Notice that along with PayPal's two big technical glitches (the networking hardware meltdown and the failover that didn't work), there was a third, non-technical failure: It took more than an hour for PayPal to announce the first outage to its users. Indeed, that outage was actually resolved by the time the company's corporate communications department announced that PayPal was down. The second outage and its resolution weren't announced until Friday evening.
That meant it was up to major E-tailers to contact PayPal on their own to find out exactly what was happening. Even for them, it took hours after the outages began to get the necessary information and cut off PayPal functionality.
It's understandable that many E-Commerce players are still coming to grips with how crucial it is to keep everything running. Five- and 10-minute outages still aren't unusual, and it's tempting to assume that every outage will be fixed in just another minute.
But that's a dangerous way of thinking. In PayPal's case, it meant that big customers—who in this case were also big retailers—remained in the dark while IT people in PayPal's datacenter assumed that the problem was about to be solved.
Like American Eagle, PayPal had a fallback plan, but it didn't work the way it was supposed to. And though PayPal had a technical plan for dealing with the outage (one that didn't work), it resembled Wal-Mart in having no plan at all for quickly notifying the people most affected (Wal-Mart's store personnel, PayPal's biggest E-Commerce partners).
The lesson about failed backup plans just keeps getting bigger. Yes, improbable failures can happen. When they do, failover plans can fail. And when that happens, you need a plan already in place to warn those affected in real time.
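For IT teams that want to turn that last point into something concrete, here is a minimal sketch of the idea in Python. It is not PayPal's process or anyone's real tooling: the `OutageWatch` class, the `notify` hook and the five-minute `failover_deadline_s` threshold are all hypothetical names and values chosen for illustration. The point is simply that the warning to partners fires automatically once failover has missed its deadline, rather than waiting for someone to decide the problem won't be fixed in another minute.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class OutageWatch:
    """Hypothetical watchdog: warn partners once if failover misses its deadline."""

    notify: Callable[[str], None]        # assumed delivery hook (email, SMS, status page)
    failover_deadline_s: float = 300.0   # assumed threshold, not a PayPal figure
    warned: bool = False
    log: List[str] = field(default_factory=list)

    def check(self, seconds_down: float, recovered: bool) -> bool:
        """Return True if this check sent the partner warning."""
        if recovered:
            # Service is back (primary restored or failover completed);
            # re-arm so a later outage triggers a fresh warning.
            self.warned = False
            return False
        if self.warned or seconds_down < self.failover_deadline_s:
            # Either partners already know, or failover still has time to work.
            return False
        message = (
            f"Service down {seconds_down / 60:.0f} min and failover has not "
            "completed; consider disabling this payment option."
        )
        self.notify(message)
        self.log.append(message)
        self.warned = True
        return True


if __name__ == "__main__":
    # Simulated polling during one outage; print stands in for a real alert channel.
    watch = OutageWatch(notify=print)
    for seconds_down in (60, 240, 360, 600):
        watch.check(seconds_down, recovered=False)
```

In this sketch the warning goes out exactly once per outage, and the watchdog re-arms itself when service recovers, so a second interruption later the same day would produce its own notice.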