Wal-Mart's Painful Common-Point-Of-Failure Lesson

Thanks to a single circuit breaker in a datacenter, Wal-Mart lost its capability to accept credit and debit cards across its entire U.S. chain for as long as five hours last Thursday (Sept. 23).

That common point of failure reduced the world's largest retailer to taking cash and checks at its 4,300 U.S. Wal-Mart and Sam's Club stores, with some stores actually closing until the problem was resolved. It also pointed out a fatal flaw in how Wal-Mart's systems were set up: One breaker mishap was able to bring down the entire system.

Wal-Mart won't explain exactly what happened. That's partially because the chain itself hasn't completely determined what happened, but an internal investigation is underway. Whatever precisely caused the glitch, it had two elements. The first was an electronic system problem. But that wouldn't have shut everything down had it not been compounded by an employee's ill-fated—albeit well-intentioned—attempt to manually fix the issue.

What's clear, however, is that the retailer believed it couldn't happen. No fallback plan existed. When the in-store card-swiping machines couldn't talk to the datacenter, the stores simply couldn't accept payment cards. As with other recent retail IT problems, Wal-Mart couldn't see it coming.

Sears didn't think to test its new anti-cookie Web site policy. American Eagle Outfitters trusted its partners' backup procedures too much. And Wal-Mart failed to spot the single point of failure that could bring down its whole chain.

The retailer's official explanation of the IT glitch was brief and vague: "While doing some required maintenance on our data system, we had a breaker fail on the back-up system, which disrupted our ability to process some credit and debit card transactions this morning."

"Within a matter of minutes, we began making some changes to try and minimize the impact to our customers and allow most credit transactions to be processed," the Wal-Mart statement continued. "We were able to restore the system back to normal in approximately 90 minutes." (Note: Multiple Wal-Mart stores have reported that their payment-card processing capabilities were dead for much longer, sometimes as much as five hours.)

Even though all POS units in every one of the chain's more than 4,300 U.S. stores stopped being capable of handling payment cards, Walmart.com was still able to process credit and debit transactions. That's because online payment processing is on a different system.

Had Wal-Mart realized that its dot-com arm was the only part of the company still capable of processing payment card transactions, it could have pointed the frustrated customers who visited its stores that morning to the Web site.

The results in the brick-and-mortar stores were chaotic. Effects of the outage stretched from coast to coast. Within minutes after the outage started at about 9 A.M. Chicago time, some Wal-Mart stores closed to avoid dealing with long lines of customers who couldn't pay for their items. Other stores remained open but had greeters inform customers of the sudden cash-or-checks-only situation, which turned many customers away. Still other stores didn't inform customers until they arrived at the checkouts.

But which way each store went was up to the individual store's manager. No chain-wide policy existed for this situation. Wal-Mart simply didn't believe it could happen.

By 11 A.M. Chicago time, many stores were able to handle payment cards. But some stores reported they were still having problems more than five hours after the outage began.

Wal-Mart wouldn't estimate exactly how much the outage cost. And how could it? Customers were turned away. Still other customers went elsewhere as word spread about the outage. Shopping carts were abandoned at checkouts when customers couldn't pay—which resulted in fresh groceries that had to be thrown away because the food couldn't be restocked. The final tab for the failure may not be known for months.

That was the result outside the datacenter. What happened inside is a lot less clear.

Wal-Mart's statement that "While doing some required maintenance on our data system, we had a breaker fail on the back-up system" leaves more questions than answers.

Wal-Mart's IT people were doing "required maintenance"? It certainly wasn't the kind of maintenance you'd schedule in advance. No one schedules maintenance on a critical system at 9 A.M. on a Thursday. That's what the graveyard shift is for.

During that maintenance, Wal-Mart "had a breaker fail on the back-up system"? Was that a data backup? That wouldn't disrupt a system unless data was being restored when the power went out. And why would Wal-Mart IT be restoring data to a critical system at 9 A.M.?

More likely is that the "back-up system" in Wal-Mart's fuzzily worded statement was a back-up power system—an uninterruptible power supply (UPS) that should have been able to fail without any impact at all, because it's just there in case the main power supply fails. How could that have caused a major failure?

Unfortunately, far too easily. Think of all those racks of equipment in your datacenter—servers and data arrays, each equipped with two power supplies, so if one fails the other will keep going. Those power supplies are supposed to be plugged into two different UPSs. That way, if one power source goes south, the other will keep things running.

But if both power supplies on a piece of equipment are plugged into the same UPS, then a single UPS outage, such as one caused by a failed circuit breaker, will take that equipment down hard. Yes, that happens. Racks make it easy to swap equipment in and out. But when that's done too quickly, it's easy for IT operations staff to lose track of where the juice for each power supply is coming from. That creates a single point of failure, one that can go for months or even years without being spotted.
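That kind of hidden dependency is the sort of thing a simple audit can surface. What follows is only a minimal sketch in Python, assuming an IT shop keeps a record of which UPS feeds each power supply. The device names, UPS labels, and function names are all hypothetical and have nothing to do with Wal-Mart's actual setup, which the chain hasn't described.

```python
# Hypothetical inventory: device name -> the UPS feeding each of its power supplies.
# All names here are made up for illustration.
POWER_MAP = {
    "card-auth-db-01": ["UPS-A", "UPS-B"],   # properly split across two UPSs
    "card-auth-app-01": ["UPS-A", "UPS-A"],  # both supplies on the same UPS
    "card-auth-app-02": ["UPS-B"],           # only one supply on record
}


def find_single_feed_devices(power_map):
    """Return devices whose power supplies do not span at least two UPSs."""
    return sorted(
        device
        for device, feeds in power_map.items()
        if len(set(feeds)) < 2
    )


def devices_lost_if_ups_fails(power_map, failed_ups):
    """Return devices that would lose all power if failed_ups went down."""
    return sorted(
        device
        for device, feeds in power_map.items()
        if feeds and set(feeds) <= {failed_ups}
    )


if __name__ == "__main__":
    print("No power redundancy:", find_single_feed_devices(POWER_MAP))
    for ups in ("UPS-A", "UPS-B"):
        print(f"Down if {ups} fails:", devices_lost_if_ups_fails(POWER_MAP, ups))
```

A check like this is only as good as the inventory behind it. If the record isn't updated every time equipment is swapped in a rack, the audit will report redundancy that no longer exists.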

Still, that's all speculation. The simple reality for Wal-Mart is that each store's capability to handle payment cards depended on a connection from every card-swiping device to the single card-processing system.

As a result, one breaker gone bad was able to cripple thousands of stores. That's a catastrophic IT operations failure. It shouldn't have happened. What's worse, Wal-Mart clearly couldn't see that it ever could happen.

Yes, it can happen. That's why you have redundant power supplies and do data backups. It's why you set up plans to deal with the unthinkable and design systems so that when things do go wrong—even things that shouldn't ever go wrong—there's a fallback plan, not just in IT but in the stores, too.

Wal-Mart didn't just have a failure in IT operations or a problem with one bad breaker. It had a fundamental weakness in its IT systems—and a blind spot that prevented anyone from seeing it in time.