
Top 5 things to worry about when designing a Cloud Based SaaS

(This article on things you need to be careful about when designing the architecture of a cloud-based Software-as-a-Service offering is a guest post by Mukul Kumar, who, as SVP of Engineering at PubMatic, has a lot of hands-on experience designing, building and maintaining a very high performance, highly scalable cloud-based service.)

Designing a SaaS software stack poses challenges that are very different from the considerations for host-based software design. The performance, scalability and reliability concerns of a SaaS offering with lots of servers and lots of data are very different from, and in many ways more interesting than, those of software that is installed on a host and used only by that host.

Here I list the top 5 design elements for Cloud Based SaaS.

High availability

A SaaS software stack is built on top of several disparate elements. Most of the time these elements are hosted by different vendors, such as Rackspace, Amazon, Akamai, etc. The stack consists of several layers, such as the application server, database server, data-mining server, DNS, CDN, ISP, load-balancer, firewall and router. High availability of SaaS actually means thinking about the high availability of all or most of these components. Designing high availability for each of these components is a non-trivial exercise, and the cost shoots up as you keep adding layers of HA. Such design requires thinking deeply about the software architecture and each component of the architecture. Two years back I wrote an article on Cloud High Availability, where I described some of these issues; you can read it here.
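To see why every layer matters, here is a minimal sketch of the standard availability arithmetic. It assumes that layers fail independently and that the service is up only when every layer is up; the per-layer figures are hypothetical and chosen only for illustration.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain in which every layer must be up (independence assumed)."""
    return prod(availabilities)


def redundant_availability(availability, copies=2):
    """Availability of one layer backed by `copies` independent redundant instances."""
    return 1 - (1 - availability) ** copies


# Hypothetical per-layer availabilities, not measurements of any real vendor.
layers = {
    "dns": 0.9999,
    "cdn": 0.999,
    "load_balancer": 0.9995,
    "app_server": 0.999,
    "database": 0.999,
}

stack = serial_availability(layers.values())
hours_down_per_year = (1 - stack) * 365 * 24
print(f"stack availability: {stack:.4f}, about {hours_down_per_year:.0f} hours down per year")
```

The whole stack is always less available than its weakest layer, which is why HA has to be considered for all or most of the components rather than for the application server alone.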

Centralized Manageability

As you keep adding more and more servers to your application cluster, manageability gets hugely complex. This means:

  • you have to employ more people to do the management,
  • human errors would increase, and
  • the rate at which you can deploy more servers goes down.

And don’t just think of managing the OS on these servers or virtual machines. You have to manage the entire application and all the services that the application depends on. The only way to get around this problem is to have centralized management of your cluster. Centralized management is not easy to do: every application is different, so generalized management software oversimplifies the problem and is not a full solution.

Online Upgradability

This is probably the most complex problem after high availability. When you have a cluster of thousands of hosts, live upgradability is a key requirement. When you release a new software revision, you need to be able to upgrade it across the servers in a controlled way, with the ability to roll it back whenever you want, at the instant that you want, across the exact number of servers that you want. You would also need to control database and cache coherency and invalidation across the cluster in a controlled way. Again, this cannot be solved in a very generic way; every software stack has its own specifics, which need to be solved in their own ways.
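As an illustration only, a controlled upgrade with rollback could be sketched as follows. The `deploy` and `healthy` callbacks are hypothetical stand-ins for whatever your own stack uses to install and verify a build, and the sketch assumes `deploy` returns the version the host was running before. Database and cache coherency are deliberately left out.

```python
def rolling_upgrade(hosts, new_version, batch_size, deploy, healthy):
    """Upgrade hosts batch by batch; roll back every touched host on failure.

    deploy(host, version) installs a version and returns the previous one
    (an assumption of this sketch). healthy(host) reports whether the host
    is serving correctly. Returns True if all hosts run new_version.
    """
    previous = {}  # host -> version it ran before this upgrade touched it
    for start in range(0, len(hosts), batch_size):
        batch = hosts[start:start + batch_size]
        for host in batch:
            previous[host] = deploy(host, new_version)
        if not all(healthy(host) for host in batch):
            # Undo exactly the hosts that were changed, nothing more.
            for host, old_version in previous.items():
                deploy(host, old_version)
            return False
    return True
```

The batch size is the "exact number of servers" knob: a batch of one gives a canary-style rollout, while a larger batch trades safety for speed.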

Live testability

Testing your application in a controlled way with real traffic and data is another key aspect of SaaS design. You should be able to sample real traffic and use it for testing your application without compromising on user experience or data integrity. Lab testing has severe limitations, especially when you are testing performance and scalability of your application. Real traffic patterns and seasonality of data can only be tested with real traffic. Don’t start your beta until you have tested on real traffic.
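One possible way to sample real traffic without affecting users is to mirror a small, deterministic fraction of requests to the build under test and throw its response away. The sketch below assumes each request has a string id; `live` and `candidate` are hypothetical handlers, and the candidate is assumed to run against a copy of the data so that it cannot touch production state.

```python
import hashlib


def in_sample(request_id, percent):
    """Deterministically select roughly `percent`% of requests by hashing the id."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10000
    return bucket < percent * 100


def handle(request_id, payload, live, candidate, percent=1.0):
    """Serve from `live`; mirror sampled requests to `candidate` and discard its result."""
    response = live(payload)
    if in_sample(request_id, percent):
        try:
            candidate(payload)  # shadow call against the build under test
        except Exception:
            pass  # a failing candidate must never affect the user
    return response
```

Hashing the id rather than picking at random means the same request is always in or out of the sample, which makes a failure seen in the candidate easier to reproduce.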

Monitor-ability

The more servers and applications you add to your cluster, the more things can fail, and in very different ways: the network (NIC), memory, disk and many other things. It is extremely important to monitor each of these, and many more, constantly, with alarms sent over different communication channels (email, SMS, etc.). There are many online services that can be used for monitoring; they provide a host of different services and have widely varying pricing. Amazon too recently introduced CloudWatch, which can monitor various aspects of a host such as CPU utilization, disk I/O and network I/O.

As you grow your cluster of servers you will need to think of these design aspects and keep on tuning your system. And, like the guys at YouTube said:
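A minimal sketch of threshold-based alarms with several notification channels might look like this. The metric names, limits and channel functions are all hypothetical; a real deployment would plug in its own collectors and its own email or SMS senders.

```python
def check_metrics(host, metrics, thresholds):
    """Return an alarm message for every metric that is missing or over its limit."""
    alarms = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is None:
            # A metric that stopped reporting is itself a failure signal.
            alarms.append(f"{host}: no data for {name}")
        elif value > limit:
            alarms.append(f"{host}: {name}={value} exceeds {limit}")
    return alarms


def notify(alarms, channels):
    """Send each alarm through every channel (for example email and SMS senders)."""
    for alarm in alarms:
        for send in channels:
            send(alarm)
```

For example, `notify(check_metrics("web1", {"cpu": 95}, {"cpu": 90}), [print])` prints one alarm for the CPU reading.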

Recipe for handling rapid growth

    while (true)
    {
        identify_and_fix_bottlenecks();
        drink();
        sleep();
        notice_new_bottleneck();
    }

About the Author – Mukul Kumar

Mukul Kumar is the Co-Founder & Senior Vice President Engineering at PubMatic. PubMatic, an online advertising company that helps premium publishers maximize their revenue and protect their brands online, has its Research & Development center in Pune.

Mukul is responsible for PubMatic’s Engineering team and resides in Pune, India. Mukul was previously the Director of Engineering at PANTA Systems, a high-performance computing startup. Before that he was at VERITAS India, where he joined as the 13th employee and helped it grow to over 2,000 individuals. Mukul has filed for 14 patents in systems software, storage software, and application software. Mukul is a graduate of IIT Kharagpur with a degree in Electrical Engineering.

Mukul is very passionate about technology, and building world-class teams. His interests include architecting scalable and high-performance web-applications, handling and mining massive amounts of data and system & storage architecture.

Mukul’s email address is mukul at pubmatic.com.

TechWeekend #3: Website Performance, Scalability and Availability: Sept 5

[Image: Scalability (Source: Domas Mituzas, Wikipedia)]
What: TechWeekend featuring “Website Scalability and Performance” by Mukul Kumar, VP Engineering at Pubmatic, and “Website Availability and Recovering from Failures and Disasters” by Sameer Anja, Associate Director at KPMG
When: Saturday, 5th Sept, 4pm
Where: Symbiosis Institute of Computer Studies and Research, Atur Centre, Model Colony. Map.
Registration and Fees: This event is free for all to attend. Please register here.

Website Scalability and Performance – Mukul Kumar

Mukul will talk about the various aspects of what it takes to run a very high traffic website – something that he has a lot of experience with at Pubmatic, the ad optimization service for web publishers, where they serve over a billion requests per month.

Mukul Kumar (mukul.kumar [at] pubmatic [dot] com) is a Co-Founder and VP of Engineering at PubMatic. He is responsible for PubMatic’s engineering team and resides in Pune, India. Mukul was previously the Director of Engineering at PANTA Systems, a high performance computing startup. Before that he joined VERITAS India as the 13th employee and helped it grow to over 2,000 individuals as Director of Engineering for the NetBackup group, Veritas’ main product. He has filed for 14 patents in systems software, storage software, and application software, proudly proclaims his love of π, and can recite it to 60 digits. Mukul is a graduate of IIT Kharagpur with a degree in electrical engineering.

Website Availability and Recovery from Disasters – Sameer Anja

While everyone looks at security and focuses on confidentiality, privacy and integrity, an oft-neglected parameter is availability. While “neglected” may seem like a strong term, the truth is that we overlook basic data on availability and do not even implement simple to-dos that would help remedy the situation. The session aims to identify such simple remedies, look at impacts and the assessment model, and put forward various scenarios and the possible solutions available. The session does not focus on specific products; instead it looks at how existing technologies used for web site development can be used to ensure availability. Some principles of disaster recovery will also be covered.

Sameer is a Senior Manager in the IT Advisory practice and has been working with KPMG since January 2007. He has 12+ years of work experience in the areas of information security, product design and development, and system and network administration. He has worked on the process and technology areas of information security, and on governance and compliance areas like SOX, Basel II, ISO 15048, SSE-CMM and data privacy, apart from ISO 27001, identity management, and business continuity design and testing. He has experience working with startups and established setups, has spoken at various conferences and seminars within India and abroad, and is trained for Six Sigma Green Belt.

Cloud Computing and High Availability

This article, discussing strategies for achieving high availability of applications based on cloud computing services, is reprinted with permission from the blog of Mukul Kumar of Pune-based ad optimization startup PubMatic.

Cloud computing has become very widespread, with startups as well as divisions of banks, pharmaceutical companies and other large corporations using it for computing and storage. Amazon Web Services has led the pack with its innovation and execution, with services such as the S3 storage service, the EC2 compute cloud, and the SimpleDB online database.

Many options exist today for cloud services, for hosting, storage and application hosting. Some examples are below:

Hosting             Storage           Applications
Amazon EC2          Amazon S3         opSource
MOSSO               Nirvanix          Google Apps
GoGrid              Microsoft Mesh    Salesforce.com
AppNexus            EMC Mozy
Google AppEngine    MOSSO CloudFS
flexiscale

[A good compilation of cloud computing is here, with a nice list of providers here. Also worth checking out is this post.]

The high availability of these cloud services becomes more important with some of these companies relying on these services for their critical infrastructure. Recent outages of Amazon S3 (here and here) have raised some important questions such as this – S3 Outage Highlights Fragility of Web Services and this.

[A simple search on search.twitter.com can tell you things that you won’t find on web pages. Check it out with this search, this and this.]

There has been some discussion on the high availability of cloud services and some possible solutions. For example the following posts – “Strategy: Front S3 with a Caching Proxy” and “Responding to Amazon’s S3 outage“.

Here I write down some thoughts on how these cloud services can be made highly available by following the traditional path of redundancy.

[Image: Basic cloud computing architectures config #1 to #3]

The traditional way of using AWS S3 is to use it with AWS EC2 (config #0). Configurations such as those on the left make your computing and storage independent of a single service provider. Config #1, config #2 and config #3 mix and match some of the more flexible computing services with storage services. In theory the compute and the storage can each be replaced separately by a colo service.

[Image: Cloud computing HA configuration #4]

The configurations on the right are examples of providing high availability by adding a “hot-standby”. Config #4 makes the storage service hot-standby, and config #5 separates the web-service layer from the application layer and makes the whole application+storage layer hot-standby.

A hot-standby requires three things to be configured: rsync, monitoring and switchover. rsync needs to be configured between the online server and its hot-standby, to make sure that most of the application and data components are up to date on the standby. So, for example, in config #4 one has to rsync ‘Amazon S3’ to ‘Nirvanix’, which is pretty easy to set up. In fact, if we add more automation, we can “turn off” a standby server after making sure that the data source is synced up. That assumes, though, that the server provisioning time is an acceptable downtime, i.e. that the RTO (recovery time objective) is within acceptable limits.
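The syncing step can be pictured as a comparison of two object listings. This is only a sketch of the idea, not of rsync itself or of any provider's API: it assumes each store can be listed as a mapping from object key to checksum, and the keys and checksums in the usage note are made up.

```python
def plan_sync(primary, standby):
    """Compare two {object_key: checksum} listings and plan the standby update.

    Returns (to_copy, to_delete): keys that are missing or stale on the
    standby, and keys that exist only on the standby.
    """
    to_copy = sorted(key for key, checksum in primary.items()
                     if standby.get(key) != checksum)
    to_delete = sorted(key for key in standby if key not in primary)
    return to_copy, to_delete


def in_sync(primary, standby):
    """True when the standby needs no changes, e.g. before turning it off."""
    return plan_sync(primary, standby) == ([], [])
```

With `primary = {"a": "1", "b": "2"}` and `standby = {"a": "1", "b": "x"}`, the plan is to copy `b` and delete nothing; `in_sync` is the check one would make before powering down a standby.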

[Image: Cloud computing Hot Standby Config #5]
This also requires that you monitor each of the web services. One might have to do service heartbeating, which has to be designed for the application: monitoring Tomcat, MySQL, Apache or their sub-components each calls for a different design. In theory it would be nice if a cloud computing service exported status APIs, for example an API for http://status.aws.amazon.com/ , http://status.mosso.com/ or http://heartbeat.skype.com/. However, most of the time the status page is updated long after the service goes down, so that wouldn’t help much.
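A service heartbeat can be sketched as a probe plus a counter of consecutive misses. The probe is a hypothetical callable that you would write per service (an HTTP request for Apache or Tomcat, a trivial query for MySQL); the threshold of three misses is an arbitrary choice for the example.

```python
class Heartbeat:
    """Declare a service down after `max_misses` consecutive failed probes."""

    def __init__(self, probe, max_misses=3):
        self.probe = probe          # callable returning a truthy value when healthy
        self.max_misses = max_misses
        self.misses = 0

    def beat(self):
        """Run one probe; return True while the service is still considered up."""
        try:
            ok = bool(self.probe())
        except Exception:
            ok = False              # a probe that raises counts as a miss
        self.misses = 0 if ok else self.misses + 1
        return self.misses < self.max_misses
```

Requiring several consecutive misses keeps a single dropped probe from triggering a switchover, at the cost of detecting a real outage a few probe intervals later.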

Switchover from the online server/service to the hot-standby would probably have to be done by hand. This requires a handshake with the upper layer, so that requests stop going to the old service and start going to the new one when you trigger the switchover. This might become interesting with stateful services, and also where you cannot drop any packets, so quiescing of requests may have to be done before the switchover takes place.
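The handshake and quiescing can be sketched inside a single process as follows. This is an assumption-laden toy: `active` and `standby` are hypothetical callables standing in for the two services, new requests are held (not dropped) while in-flight requests drain, and state transfer for stateful services is not addressed.

```python
import threading


class Switch:
    """Route requests to the active backend and support a drained switchover."""

    def __init__(self, active, standby):
        self._active, self._standby = active, standby
        self._in_flight = 0
        self._switching = False
        self._cond = threading.Condition()

    def send(self, request):
        with self._cond:
            # New requests wait here while a switchover is draining old ones.
            self._cond.wait_for(lambda: not self._switching)
            backend = self._active
            self._in_flight += 1
        try:
            return backend(request)
        finally:
            with self._cond:
                self._in_flight -= 1
                self._cond.notify_all()

    def switchover(self):
        """Quiesce: stop admitting requests, drain in-flight ones, then swap."""
        with self._cond:
            self._switching = True
            self._cond.wait_for(lambda: self._in_flight == 0)
            self._active, self._standby = self._standby, self._active
            self._switching = False
            self._cond.notify_all()
```

An operator would call `switchover()` by hand once monitoring shows the online service failing; requests that arrive during the drain are delayed rather than lost.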

[Image: Cloud computing multi-tier config #6]
Above are two configurations of multi-tiered web-services, where each service is built on a different cloud service. This is a theoretical configuration, since there are only a few good cloud services that I know of. But this may represent a possible future, where the space becomes fragmented, with many service providers.

[Image: Multi-tier cloud computing with HA]
Config #7 is config #6 with hot-standby for each of the service layers. Again this is a theoretical configuration.

Cost Impact
Any of the hot-standby configurations would have a cost impact: adding any extra layer of high availability immediately adds to the cost, at least doubling the cost of the infrastructure. This cost increase can be reduced by making highly available only those parts of your infrastructure that affect your business the most. It depends on how much business impact a downtime causes, and therefore how much money can be spent on the infrastructure.
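The trade-off can be put as a back-of-the-envelope comparison. The function and all figures here are hypothetical; the tier names and numbers exist only to show the shape of the calculation.

```python
def standby_pays_off(standby_cost_per_year, outage_hours_avoided_per_year,
                     loss_per_outage_hour):
    """Return True when the expected loss avoided exceeds the standby's yearly cost."""
    avoided_loss = outage_hours_avoided_per_year * loss_per_outage_hour
    return avoided_loss > standby_cost_per_year


# Hypothetical tiers: (yearly standby cost, outage hours avoided, loss per outage hour)
tiers = {
    "ad_serving": (50_000, 20, 10_000),
    "reporting": (30_000, 20, 500),
}
worth_protecting = [name for name, args in tiers.items() if standby_pays_off(*args)]
print(worth_protecting)
```

In this made-up case only the revenue-carrying tier justifies a standby, which is the point of making only the business-critical parts highly available.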

One way to make the configurations more cost effective is to make them active-active, also called load-balanced: these configurations make use of all the allocated resources and send traffic to both servers. This configuration is much more difficult to design. For example, if you put the hot-standby storage in an active-active configuration, then every “write” (DB insert) must go to both storage servers, and a write must not be reported complete until it has completed on all replicas (also called mirrored write consistency).
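A minimal sketch of a mirrored write, assuming insert-only data and hypothetical replica objects with `put` and `delete` methods (the in-memory class stands in for a real storage service):

```python
class MemoryReplica:
    """Hypothetical in-memory stand-in for one storage service."""

    def __init__(self, fail=False):
        self.data = {}
        self.fail = fail

    def put(self, key, value):
        if self.fail:
            raise IOError("replica unavailable")
        self.data[key] = value

    def delete(self, key):
        self.data.pop(key, None)


def mirrored_write(replicas, key, value):
    """Insert into every replica; report success only if all of them accepted it."""
    written = []
    try:
        for replica in replicas:
            replica.put(key, value)
            written.append(replica)
    except Exception:
        # Undo the partial insert so the replicas do not diverge.
        for replica in written:
            replica.delete(key)
        return False
    return True
```

Undoing by delete is only safe for fresh inserts; updates to existing keys would need the old value to be restored instead, which is part of what makes active-active storage hard to design.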

Cloud Computing becoming mainstream
As cloud computing becomes more mainstream, larger web companies may start using these services and may put a part of their infrastructure on a compute cloud. For example, I can imagine a cloud dedicated to “data mining” being used by several companies; it may have servers with large HDDs and memory, and may specialize in cluster software such as Hadoop.

Lastly I would like to cover my favorite topic: why would I still use services that cost more for my core services instead of using cloud computing?

  1. The most important reason would be 24×7 support. Hosting providers such as ServePath and Rackspace provide support. When I call support at 2PM India time, they have a support guy picking up my call, and that’s a great thing. Believe me, 24×7 support is a very difficult thing to do.
  2. These hosting providers give me more configurability for RAM/disk/CPU
  3. I can have more control over the network and storage topology of my infrastructure
  4. Point #2 above can give me consistent throughput and latency for I/O access, and network access
  5. These services give me better SLAs
  6. Security

About the Author

Mukul Kumar is a founding engineer and VP of Engineering at PubMatic. He is based in Pune and responsible for PubMatic’s engineering team. Mukul was previously the Director of Engineering at PANTA Systems, a high performance computing startup. Before that he joined Veritas India as the 13th employee and was Director of Engineering for the NetBackup group, one of Veritas’ main products. He has filed for 14 patents in systems software, storage software, and application software, proudly proclaims his love of π, and can recite it to 60 digits. Mukul is a graduate of IIT Kharagpur with a degree in electrical engineering.

Mukul blogs at http://mukulblog.blogspot.com/, and this article is cross posted from there.
