(This article on things you need to be careful when designing the architecture of a cloud based Software-as-a-Service offering is a guest post by Mukul Kumar, who, as SVP of Engineering at Pubmatic has a lot of hands-on experience with having designing, building and maintaining a very high performance, high scalability cloud-based service.)
Designing a SaaS software stack poses challenges that are very different from the considerations for host-based software design. The design aspects for performance, scalability, reliability of SaaS with lots of servers and lots of data is very different and interesting from designing a software that is installed on a host and is used by that host.
Here I list the top 5 design elements for Cloud Based SaaS.
High availability
SaaS software stack is built on top of several disparate elements. Most of the times these elements are hosted by different software vendors, such as Rackspace, Amazon, Akamai, etc. The software stack consists of several layers, such as – application server, database server, data-mining server, DNS, CDN, ISP, load-balancer, firewall, router, etc. Highly availability of SaaS actually means thinking about the high availability of all or most of these components. Designing high availability of each of these components is a non-trivial exercise and the cost shoots up as you keep on adding layers of HA. Such design requires thinking deeply about the software architecture and each component of the architecture. Two years back I wrote an article on Cloud High Availability, where I described some of these issues, you can read it here.
Centralized Manageability
As you keep on adding more and more servers to your application cluster the manageability gets hugely complex. This means:
- you have to employ more people to do the management,
- human errors would increase, and
- the rate at which you can deploy more servers goes down.
And, don’t just think of managing the OS on these servers, or these virtual machines. You have to manage the entire application and all the services that the application depends on. The only way to get around this problem is to have centralized management of your cluster. Centralized management is not an easy thing to do, since every application is different, making a generalized management software is oversimplifying the problem and is not a full solution.
Online Upgradability
This is probably the most complex problem after high availability. When you have a cluster of thousands of hosts, live upgradability is a key requirements. When you release a new software revision, you need to be able to upgrade is across the servers in a controlled way, with the ability of rolling it back whenever you want – at the instant that you want, across the exact number of servers that you want. You would also need to control database and cache coherency and invalidation across the cluster is a controlled way. Again, this cannot be solved in a very generic way; every software stack has its own specificity, which needs to be solved in its own specific ways.
Live testability
Testing your application in a controlled way with real traffic and data is another key aspect of SaaS design. You should be able to sample real traffic and use it for testing your application without compromising on user experience or data integrity. Lab testing has severe limitations, especially when you are testing performance and scalability of your application. Real traffic patterns and seasonality of data can only be tested with real traffic. Don’t start your beta until you have tested on real traffic.
Monitor-ability
The more servers and applications that you add to your cluster the more things can fail and in very different ways. For example – network (NIC), memory, disk and many other things. It is extremely important to monitor each of these, and many more, constantly, with alarms using different communication formats (email, SMS, etc.). There are many online services that can be used for monitoring services, and they provide a host of difference services and have widely varying pricing. Amazon too recently introduced CloudWatch, which can monitor various aspects of a host such as CPU Utilization, Disk I/O, Network I/O etc.
As you grown your cluster of server you will need to think of these design aspects and keep on tuning your system. And, like the guys at YouTube said:
Recipe for handling rapid growth
while (true)
{
identify_and_fix_bottlenecks();
drink();
sleep();
notice_new_bottleneck();
}
About the Author – Mukul Kumar
Mukul Kumar is the Co-Founder & Senior Vice President Engineering at PubMatic. PubMatic, an online advertising company that helps premium publishers maximize their revenue and protect their brands online, has its Research & Development center in Pune.
Mukul is responsible for PubMatic’s Engineering team and resides in Pune, India. Mukul was previously the Director of Engineering at PANTA Systems, a high-performance computing startup. Before that he was at VERITAS India, where he joined as the 13th employee and helped it grow to over 2,000 individuals. Mukul has filed for 14 patents in systems software, storage software, and application software. Mukul is a graduate of IIT Kharagpur with a degree in Electrical Engineering.
Mukul is very passionate about technology, and building world-class teams. His interests include architecting scalable and high-performance web-applications, handling and mining massive amounts of data and system & storage architecture.
Mukul’s email address is mukul at pubmatic.com.