Recovery Disaster: PayPal Crash Strands Merchants

Two major technology glitches in a row knocked PayPal offline on Friday (Oct. 29), preventing the alternative payment giant from processing any E-tailer transactions for 80 minutes. First a network hardware failure shut down all PayPal payments. Then the backup plan failed when a handoff to a secondary datacenter didn't go smoothly. The result was a worldwide shutdown of PayPal's $40 billion merchant-services business that left E-tailers scrambling to limit damage from the outage.

PayPal's outage again spotlights the problem of backup strategies that simply don't. It's painfully reminiscent of recent datacenter fiascos at American Eagle Outfitters and Wal-Mart. And while some major retailers were kept apprised of the progress of PayPal's outage and disabled PayPal payment functionality on their E-Commerce sites to minimize problems, most of PayPal's customers got the word late or not at all. Apparently there was no effective plan for dealing with that side of the outage, either.

PayPal isn't saying much about the outage beyond its official statement by Scott Guilfoyle, the company's senior VP for platform services: "At around 8:07 AM [San Francisco time Friday], a network hardware failure in one of our datacenters resulted in a service interruption for all PayPal users worldwide. Everyone in our organization was immediately engaged to identify the issue and get PayPal back up and running. We were not able to switch over to our backup systems as quickly as planned. We partially restored service by approximately 8:45 AM and the issue was fully resolved by 9:24 AM. A second service interruption started at around 11:30 AM and was partially resolved at 11:55 AM with full recovery at 12:21 PM."

But the company's "Live Site Status" blog tells a more detailed story. According to the technical blog, the incident (and PayPal) went down like this:

  • 8:06 AM (San Francisco time): Networking hardware failed in a PayPal datacenter, cutting off service for all PayPal users worldwide. Ordinary customers received a "Sorry—your last action could not be completed" message. E-tailers using the PayPal APIs got timeouts. PayPal won't say exactly what happened (a backhoe cable cut? a datacenter fire? someone playing tip-the-cow with a rack of switches?).

  • 8:07-8:44 AM: Merchants and ordinary PayPal customers remained completely cut off as PayPal attempted to switch over to its Denver datacenter. PayPal won't explain why the handoff failed for so long.

  • 8:45 AM: The PayPal Web site partially recovered, so some consumers could make payments. Merchant APIs remained down.

  • 9:24 AM: Merchant APIs began to recover, running out of the Denver datacenter.

  • 9:25 AM: Both merchant APIs and Web site payments were fully recovered and running out of Denver. Some merchant API users still experienced timeouts.

  • 9:32 AM: PayPal's corporate communications department announced the ongoing outage via the company blog and Twitter.

  • 10:00 AM: Merchant APIs and Web site payments were switched back to the original datacenter from Denver.

  • 10:43 AM: PayPal announced on its blog and Twitter that the outage was resolved.

  • 11:32 AM: A second outage began. PayPal hasn't given any details on this outage except that payments were unavailable for most (but apparently not all) merchants and ordinary customers.

  • 11:55 AM: Payments for most merchants and ordinary customers were working again.

  • 12:21 PM: The second outage was officially declared resolved.
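The gaps in that timeline are easy to check by hand, but a few lines of Python make the arithmetic explicit. This is only an illustrative sketch: the clock times come from the timeline above, while the helper name `minutes_between` and the variable names are made up for this example.

```python
from datetime import datetime

def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day clock times such as '8:06 AM'."""
    fmt = "%I:%M %p"  # 12-hour clock with an AM/PM marker
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

# Times taken from the "Live Site Status" timeline (San Francisco time).
first_outage_minutes = minutes_between("8:06 AM", "9:25 AM")   # failure to full recovery
announcement_delay = minutes_between("8:06 AM", "9:32 AM")     # failure to first public notice
second_outage_minutes = minutes_between("11:32 AM", "12:21 PM")

print(first_outage_minutes, announcement_delay, second_outage_minutes)
```

By this count, the first public notice went out several minutes after service had already been fully restored.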

    Notice that along with PayPal's two big technical glitches (the networking hardware meltdown and the failover that didn't work), there was a third, non-technical failure: It took nearly an hour and a half, from 8:06 AM to 9:32 AM, for PayPal to announce the first outage to its users. Indeed, service had already been fully restored by the time the company's corporate communications department announced that PayPal was down. The second outage and its resolution weren't announced until Friday evening.

    That meant it was up to major E-tailers to contact PayPal on their own to find out exactly what was happening. Even for them, it took hours after the outages began to get the necessary information and cut off PayPal functionality.
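A merchant site does not have to wait for word from its payment provider before switching off a failing payment option. One common pattern for this is a circuit breaker that counts consecutive timeouts and hides the option for a cooldown period. The sketch below is a generic illustration, not anything PayPal or the retailers mentioned here are known to use; the class name, thresholds and method names are all assumptions.

```python
import time

class PaymentCircuitBreaker:
    """Hypothetical merchant-side breaker for one payment provider."""

    def __init__(self, failure_threshold=3, cooldown_seconds=300.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock          # injectable so the breaker can be tested
        self.failures = 0           # consecutive failures seen so far
        self.opened_at = None       # when the breaker last tripped, if ever

    def record_success(self):
        # Any successful call closes the breaker and clears the count.
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        # Trip (or re-trip) the breaker once the threshold is reached.
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()

    def is_available(self):
        # Closed breaker: offer the payment option as usual.
        if self.opened_at is None:
            return True
        # Open breaker: allow a trial call only after the cooldown.
        return self.clock() - self.opened_at >= self.cooldown_seconds
```

In use, checkout code would call `is_available()` before showing the payment button, then `record_failure()` on each API timeout and `record_success()` on each completed call. A trial call that fails after the cooldown trips the breaker again, so the option stays hidden until the provider actually recovers.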

    It's understandable that many E-Commerce players are still trying to get a solid understanding of how crucial it is to keep everything running. Five- and 10-minute outages still aren't unusual, and it's tempting to assume that every outage will be fixed in just another minute.

    But that's a dangerous way of thinking. In PayPal's case, it meant that big customers—who in this case were also big retailers—remained in the dark while IT people in PayPal's datacenter assumed that the problem was about to be solved.

    Like American Eagle, PayPal had a fallback plan. But it didn't work the way it was supposed to. And though it had a technical plan (that didn't work) for dealing with the outage, PayPal, like Wal-Mart, had no plan at all for quickly notifying the people most affected (for Wal-Mart, its store personnel; for PayPal, its biggest E-Commerce partners).

    The lesson about failed backup plans just keeps getting bigger. Yes, improbable failures can happen. When they do, failover plans can fail. And when that happens, you need a plan already in place to warn those affected in real time.