Amazon Web Services – 48 Hours of Downtime and Counting

I’m really surprised at just how big a failure in design and engineering this was from an organization apparently so flush with talent. This incident has exposed the complexity of cloud infrastructure, or more specifically of server and storage virtualization technologies. We are now in the 49th hour of the Northern Virginia data center outage that took down multiple availability zones in the US East region, affecting thousands of customers and well-known web sites, and there is still no formal ETA for when services will return to normal. The AWS status console at http://status.aws.amazon.com continues to add “we are working on it” remarks and claims that many, if not most, customers have been returned to normal; however, the forums are still being inundated with requests for status updates and for power operations on “stuck” EC2 virtual machines. I can confirm that several of my customers have remained disconnected from their EC2 instances since the event began on Thursday morning, 4/21, at approximately 3:55 AM EST, the time stamp on my CloudWatch alert e-mails. None of those systems are mission critical (the cloud is too immature for that right now, in my opinion), but the outage is nonetheless costing them thousands of dollars in lost revenue, along with reputation damage with their web customers.

The real problem here is that Amazon either did not fully understand the complexity of its own infrastructure, or at least its sales and marketing group wasn’t fully communicating with its engineering and operations teams. Amazon’s availability zones were supposed to be independent of each other, and thus able to tolerate and isolate a component failure in any single zone without crossover effects on the others. The Elastic Block Store (EBS) service, the persistent storage (disk) layer in the AWS stack, was not so autonomous, however. Its failure, or more specifically its capacity congestion, was likely triggered by a loss of heartbeat traffic between storage nodes, which caused those nodes to begin establishing HA pairings with other nodes by re-mirroring their volumes. The result was complete starvation of the backplane and of raw storage capacity. To the rest of us, the effect has been no response from EC2 and RDS instances for two-plus days now.
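To make that suspected re-mirroring scenario a little more concrete, here is a toy back-of-the-envelope model in Python. It is only a sketch of my guess at the failure mode, not Amazon's actual design: the function name, the parameters, and every number in it are hypothetical, chosen purely for illustration.

```python
def simulate_remirror_storm(num_nodes, volume_gb, free_gb_per_node, percent_lost):
    """Toy model (hypothetical): nodes that lose their mirror all ask for new space at once.

    num_nodes        -- storage nodes in the cluster (made-up figure)
    volume_gb        -- data each node must re-mirror (made-up figure)
    free_gb_per_node -- spare capacity each node holds in reserve (made-up figure)
    percent_lost     -- whole-number percentage of nodes that lost heartbeat to their mirror
    """
    # Nodes that believe their replica is gone and start looking for a new partner.
    nodes_seeking = num_nodes * percent_lost // 100

    # Everyone asks for a full copy of their volume at the same time.
    demand_gb = nodes_seeking * volume_gb

    # Spare capacity is shared across the whole cluster.
    spare_gb = num_nodes * free_gb_per_node

    # Only as many re-mirrors succeed as the spare capacity can hold.
    remirrored = min(nodes_seeking, spare_gb // volume_gb)

    return {
        "nodes_seeking_mirror": nodes_seeking,
        "demand_gb": demand_gb,
        "spare_gb": spare_gb,
        "remirrored": remirrored,
        "stuck": nodes_seeking - remirrored,
    }


if __name__ == "__main__":
    # A small blip: 10% of nodes lose their mirror and the spare capacity absorbs it.
    print(simulate_remirror_storm(100, 500, 100, 10))
    # A large partition: 50% of nodes re-mirror at once and most of them get stuck.
    print(simulate_remirror_storm(100, 500, 100, 50))
```

The point of the sketch is simply that spare capacity sized for the occasional single-node failure is exhausted almost immediately when a large share of nodes try to re-mirror at the same moment, which would leave the remainder "stuck" much like the volumes we are all waiting on.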

How will this affect the rush to the cloud? In the short term it will slow the pace of new migrations as business managers weigh the risks against the fiscal costs of moving from proven private datacenter models to the budget-conscious cloud model. Amazon is sure to feel fallout from this event as customers migrate from AWS to other platforms such as Rackspace, one of the few other enterprise-class providers currently in the market, if not the only one (there are dozens of other cloud providers, but Rackspace is the closest service competitor to AWS). Then the question becomes how Rackspace or others will hold up under the short-term pressure of this new customer base, a problem that AWS itself might have been facing due to the significant growth in its customer base over the past four months.

In the long term this event will become just a faint memory. Cloud models make too much sense for business cost control: they shift technical responsibilities to technology companies, and maintaining small silos of “IT” staff, often undertrained and overworked, along with the high costs of small-scale datacenter operations, can’t hold up over time. There is too much to gain from computing as a utility. Time will tell.

There certainly is a lot to ponder here as to the net effect of this “Disaster in the Cloud,” but I guess we all have some time to do that while we wait for Amazon to bring our systems and data back to us.
