"One thing that is true about Amazon's site is that it is very complex, utilizing numerous backend databases, proxy servers, distributed application and Web servers, lots of dynamic images, etc.," said Shawn White, director of external operations at Web site performance tracking firm Keynote. "Even accessing the homepage involves complex multi-step interactions between the Web browser and a number of backend systems within Amazon."
"To do what Amazon does, in providing a highly personalized user experience with the visual richness users have come to expect, this complexity shouldn't be a surprise," White said. "However, the challenge in the IT world is that the more complex something is, the more likely it is to break or be broken. That is what I believe may be going on here."
The site started crashing at 10:16 AM (PDT), displaying a 503 Service Unavailable message. Within five minutes, White said, "the Web site was completely offline: 100 percent unavailable."
Over the next 2 hours and 44 minutes, the site slowly regained availability, climbing to about 10 percent. "Users who are able to access the homepage are experiencing very slow download times," roughly four times slower than normal, White said, "to the point where browsing the site is almost impractical."
By 1:00 PM (PDT), Amazon said, the site was back to normal.
Internet chatter during the incident pointed to several possible causes, including releases of some very popular products. But White dismissed such theories, pointing to Amazon's strong history of anticipating and handling such traffic peaks.
Amazon is "very used to high loads and user demands. Amazon was one of the few sites that performed very well during Keynote's coverage of Black Friday and Cyber Monday last year," White said. "Keynote has a hard time believing their site is succumbing to some type of peak-load issues. They know how to handle load."
That led White to suspect the sophistication of Amazon's server architecture: the same attributes that allow deep customization could also trigger a crash. White also explained why he was inclined to rule out other rumored causes.
"We do not know what exactly is not working correctly or if this is something within the control of Amazon or from an outsourced vendor they may be using. It is possible that a simple typo or misconfiguration is involved. In either case, most maintenance and changes are done during non-peak periods, not at 10:16 in the morning (Pacific time)," White said. "It is surprising to me that if maintenance or configuration changes were the cause of this current issue, why were they being done during a presumably busy period? The other thought would be some type of deliberate or malicious cause. While possible, given the technical sophistication, experience and high-profile nature of Amazon, I have to assume this is not likely, as they have very strong security policies in place."