Oracle Backup Failure Major Factor In American Eagle 8-Day Crash

A failure in an Oracle backup utility, coupled with the failure of IBM hosting managers to detect it and to verify that a disaster recovery site was operational, appears to be what turned a standard site outage at American Eagle Outfitters into an 8-day-long disaster, according to an IT source involved in the probe.

The initial problem was pretty much along the lines of what StorefrontBacktalk reported on Thursday (July 29), which was a series of server failures. But the problems with two of the biggest names in retail tech--IBM and Oracle--are what made this situation balloon into a nightmare.

"The storage drive went down at IBM hosting and, immediately after that, the secondary drive went down. Probably a one-in-a-million possibility, but it happened," said an IT source involved in the probe. "Once replaced, they tried to do a restore, and backups would not restore with the Oracle backup utility. They had 400 gigabytes (of data) and they were only getting 1 gigabyte per hour restoring. They got it up to 5 gigabytes per hour, but the restores kept failing. I don't know if there was data corruption or a faulty process."
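To put the source's throughput figures in perspective, the arithmetic is brutal. A minimal Python sketch, using only the numbers quoted above:

```python
def restore_hours(total_gb: float, rate_gb_per_hour: float) -> float:
    """Hours needed to restore total_gb at a constant throughput."""
    return total_gb / rate_gb_per_hour

# 400 GB at the initial 1 GB/hour: 400 hours, i.e. more than 16 days.
print(restore_hours(400, 1))   # 400.0

# Even at the improved 5 GB/hour: 80 hours, i.e. more than 3 days --
# and that assumes every restore attempt succeeds, which, per the
# source, they did not.
print(restore_hours(400, 5))   # 80.0
```

In other words, even the "improved" restore rate could not have brought the site back inside a normal outage window.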

Thus far, that's pretty bad. It's a statistically unlikely problem, but site management had insisted on state-of-the-art backup and restore packages so there shouldn't be a huge problem, right? Not quite.

"The final straw was the disaster recovery site, which was not ready to go," the source said. "They apparently could not get the active logs rolling in the disaster recovery site. I know they were supposed to have completed it with Oracle Data Guard, but apparently it must have fallen off the priority list in the past few months and it was not there when needed."
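For context on what "getting the active logs rolling" means: in an Oracle Data Guard setup, the primary ships redo logs to the standby, and the standby's transport and apply lag are exposed through the V$DATAGUARD_STATS view, so a hosting team can poll it routinely and catch a dead standby long before it is needed. The sketch below is a hypothetical illustration of such a poll, not American Eagle's actual setup; the view and its 'transport lag' / 'apply lag' rows are standard Oracle Data Guard monitoring, while the cursor object is whatever your Oracle driver (e.g. python-oracledb) provides:

```python
# Standard Data Guard monitoring query; run against the standby database.
LAG_QUERY = (
    "SELECT name, value FROM V$DATAGUARD_STATS "
    "WHERE name IN ('transport lag', 'apply lag')"
)

def standby_lags(cursor):
    """Return {'transport lag': ..., 'apply lag': ...} from a standby.

    An empty dict means the standby reported nothing at all -- the
    silent failure mode described in the article, where log shipping
    had quietly stopped and nobody noticed until the DR site was needed.
    """
    cursor.execute(LAG_QUERY)
    return {name: value for name, value in cursor.fetchall()}
```

A scheduled job that alerts whenever this dict is empty, or whenever either lag exceeds a threshold, is the kind of check that would have caught the problem "in the past few months" rather than during the outage.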

The source added that these situations--as bad as they are--are simply part of the risk of using managed-service arrangements at hosting firms, as opposed to managing the site remotely, with your own salaried people, at a colocation facility.

Some IT problems are hard to assign blame for, such as a direct lightning strike that overpowers power management systems. But having a multi-billion-dollar E-Commerce site completely down for several days--and crippled, functionality-wise, for eight days--because of backups and a disaster recovery site that weren't being maintained? That's borderline criminal. Actually, that's not fair. We shouldn't have said borderline.

Consider this line: "I know they were supposed to have completed it with Oracle Data Guard, but apparently it must have fallen off the priority list in the past few months and it was not there when needed." Fallen off the priority list in the past few months? IBM's job is to protect huge E-Commerce sites. After the initial setup, there's not much to do beyond monitor and make sure that backups happen and are functional.

IBM isn't a low-cost vendor, so it will be interesting to see whether those hosting fees are justified. As our source put it: "I am sure there will be a big issue with IBM about getting payback."

What lesson should CIOs and E-Commerce directors take from this incident? They are paying for backup and for a high-end vendor to make sure that backup is working. What more should be required? Does a vendor like IBM require babysitting, where staff is periodically dispatched to the server farm for a surprise inspection of backups?

Perhaps that should be part of an expanded Service-Level Agreement, but that SLA had better include a huge and immediate financial penalty if those inspections find anything naughty. If this American Eagle incident doesn't get the attention of hosting firms, maybe those penalties will.