Amazon.com Inc. issued an apology Friday for a glitch that shut down its servers and knocked out access to a slew of websites for several days.
The outages, which struck April 21 and ran through Sunday, affected widely used websites and Web-based services such as Foursquare, HootSuite, Reddit and Quora. In addition to its retail business, Amazon provides Web services for companies.
Amazon’s apology came at the end of a 5,679-word letter that explained what caused the temporary failure and said affected customers would have a 10-day service credit automatically added to their accounts.
“Last, but certainly not least, we want to apologize,” Amazon’s Web Services unit said to conclude its letter.
“We know how critical our services are to our customers’ businesses and we will do everything we can to learn from this event and use it to drive improvement across our services,” Amazon said. “As with any significant operational issue, we will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make changes to improve our services and processes.”
At the root of the outage was an incorrectly performed network change at a data center in northern Virginia, Amazon said.
“The configuration change was to upgrade the capacity of the primary network,” Amazon said in the letter. “During the change, one of the standard steps is to shift traffic off of one of the redundant routers” to allow the upgrade to take place.
But the “traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant” network.
That move not only brought down the primary network but also took out the secondary server network, Amazon said.
“Traffic was purposely shifted away from the primary network and the secondary network couldn’t handle the traffic level it was receiving,” Amazon said.
Amazon said that, in addition to making technical changes, it will improve the way it communicates with its customers.
“We would like our communications to be more frequent and contain more information,” the letter said. “We understand that during an outage, customers want to know as many details as possible about what’s going on, how long it will take to fix, and what we are doing so that it doesn’t happen again.”
Amazon said that during the service outages it was focused on fixing the problem as quickly as possible and identifying its cause. The company said it provided updates online to customers when it had new information to offer.
“That said, we think we can improve in this area,” Amazon said.