Post-mortem of the Amazon Cloud Disruption
Last month, Amazon Web Services had a major outage that caused downtime for a number of companies that use AWS as their infrastructure provider. The outage has raised a host of concerns for everyone interested in cloud computing. It is important to understand why it happened, what its long-term implications are, if any, and, most important of all, what changes users of cloud infrastructure should make to their architecture and processes so that they are less affected by such problems.
Suhas Kelkar, Director of the Innovation Team at BMC Software India, has done this post-mortem of the incident.
A couple of years back, Suhas wrote an article for PuneTech titled "Musings on Why Cloud Computing will Prevail", which is also interesting reading in this context.
How to prevent such outages from affecting your own infrastructure?
A few days after the outage, Dhananjay Nene, Chief Architect at Vayana and also a consulting software architect, wrote an article arguing that the cloud just got stronger as a result of the AWS outage.
Here are his recommendations:
AWS has multiple availability zones. An application should ideally leverage at least two. If you read the Netflix presentation I referred to, Netflix apparently uses three. Do not assume the servers will not go down. Assume it is possible that at least one availability zone could go down. Make sure you have the systems in place to quickly activate systems in the alternative availability zone. For that, you will need to find ways to keep data current across availability zones. Also find ways to ensure you have the ability to quickly switch to and fro between availability zones. More advanced options could include concurrently active systems across availability zones, or those spread across AWS regions, or even between AWS and other vendors.
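The "quickly switch between availability zones" advice can be illustrated with a minimal sketch of the decision logic only. It is not taken from the article and does not use any AWS API. The `Endpoint` class, the `pick_active` function, the zone labels and the URLs are all hypothetical names chosen for illustration. The health check is passed in as a function, so a real deployment would supply its own probe, and keeping data current across zones is a separate problem this sketch does not address.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Endpoint:
    zone: str  # hypothetical availability zone label
    url: str   # hypothetical address of the deployment in that zone


def pick_active(
    endpoints: Sequence[Endpoint],
    is_healthy: Callable[[Endpoint], bool],
    current: Optional[Endpoint] = None,
) -> Optional[Endpoint]:
    """Return the endpoint that should serve traffic.

    Stay on the current endpoint while it is healthy, to avoid needless
    switching; otherwise fail over to the first healthy endpoint in
    priority order. Returns None when every zone is down.
    """
    if current is not None and is_healthy(current):
        return current
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


if __name__ == "__main__":
    primary = Endpoint("zone-a", "https://a.example.invalid")
    standby = Endpoint("zone-b", "https://b.example.invalid")

    down = {"zone-a"}  # simulate an outage in the primary zone
    active = pick_active(
        [primary, standby],
        lambda e: e.zone not in down,
        current=primary,
    )
    print(active.zone if active else "no healthy zone")
```

Because the function prefers the current endpoint whenever it is healthy, traffic does not automatically jump back to the primary zone the moment it recovers. Switching back is a deliberate step, taken by calling the function again with `current=None`.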