Category Archives: Live Blogging

TechWeekend LiveBlog: NoSQL + Database in the Cloud #tw6

This is a quick-and-dirty live-blog of TechWeekend 6 on NoSQL and Databases in the Cloud.

First, I (Navin Kabra) gave an overview of NoSQL systems. Since I was talking, I wasn’t able to live-blog it.

When not to use NoSQL

Next, Dhananjay Nene talked about when to not use NoSQL. Main points:

  • People know SQL. They can be productive with it much faster than with the non-standard interfaces of these new-fangled systems.
  • When reporting is very important, having SQL is much better. Reporting systems support SQL. Re-doing that with NoSQL will be more difficult.
  • Consistency and transactions are often important. Going to NoSQL usually involves giving them up, and unless you are really, really sure you don’t need them, this issue might come back and bite you.
  • If you’re considering using NoSQL, you had better know what the CAP theorem is, and really understand what the C, A, and P in it mean; don’t even consider NoSQL until you’re very well versed in these concepts.
  • An RDBMS can really scale quite a lot – especially if you optimize it well. So 90% of the time, it is very likely that an RDBMS is good enough for your situation and you don’t need NoSQL. Don’t go for NoSQL unless you are really sure that your RDBMS won’t scale.

MongoDB the Infinitely Scalable

Next up is BG, talking about MongoDB, the Infinitely Scalable. They are using MongoDB in production for http://paisa.com (Infinitely Beta). The main points he made:

  • Based on the idea that JSON is a well understood format for data, and it is possible to build a database based on JSON as the primary data structuring format.
  • The data is stored on disk using BSON, a binary format for storing JSON
  • Obviously, JavaScript is the natural language for working with MongoDB. So you can use JavaScript to query the database, and also for “stored procedures”
  • MongoDB does not really allow joins; but with proper structuring of your data, you will not need joins
  • You can do very rich querying, deeply nested, in MongoDB
  • MongoDB has native support for ‘sharding’ (i.e. breaking up your data into chunks to be spread across multiple servers). This is really difficult to do.
  • MongoDB is screaming fast.
  • It is free and open source, but it is also backed by a commercial company, so you can get paid support if you want. There are hosting solutions (including free plans) where you can host your MongoDB instances (e.g. http://mongohq.com)
  • You store “documents” in MongoDB. Since you can’t really do joins, the solution is to de-normalize your data: everything you need should be in the one document, so you don’t need joins to fetch related data. For example, if you were storing a blog post in MongoDB, you’d store the post, all its meta-data, and all the comments in a single document (a rough sketch follows below).
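
To make the de-normalization idea concrete, here is a minimal sketch in Python using pymongo. The connection string, database, collection, and field names are all invented for illustration; they are not from the talk.

```python
# A minimal sketch of a de-normalized MongoDB "document" (hypothetical names).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
posts = client.blog.posts

# One document holds the post, its meta-data, and its comments -- no joins needed.
posts.insert_one({
    "title": "TechWeekend 6 live-blog",
    "tags": ["nosql", "mongodb"],
    "author": {"name": "BG", "site": "http://paisa.com"},
    "comments": [
        {"user": "reader1", "text": "Nice summary"},
        {"user": "reader2", "text": "What about joins?"},
    ],
})

# Rich, deeply nested querying: find posts that a given user has commented on.
for post in posts.find({"comments.user": "reader1"}, {"title": 1}):
    print(post["title"])
```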

MongoDB Use Cases:

  • Great for “web stuff”
  • High Speed Logging (because MongoDB has extremely fast writes)
  • Data Warehousing – great because of the schema flexibility
  • Storing binary files via GridFS – which are queryable!

MongoDB is used in production by a number of popular services.

FourSquare recently had a major unplanned downtime – because they did not really understand how to operate MongoDB. That underscores the importance of understanding the guarantees given by your NoSQL system; otherwise you could run into major problems, including downtime or even data loss. See this blog post for more on the FourSquare outage.

Some stats about use of MongoDB at paisa.com. 54 million documents. 80GB of data. 6GB of indexes. All of this on 2 nodes (master-slave setup).

Redis

Gautam Rege now talking about his experiences with Redis. Main points made:

  • Redis is a key-value database with an attitude. Nothing more.
  • Important feature: in (key, value), the value can be a list, hash, set.
  • 1 million key lookups in 40ms. Because it keeps data in memory.
  • Persistence is lazy – save to disk every x seconds. So you can lose data in case of a crash. So you need to be sure that your app can handle this.
  • Redis is a “main memory database” (which can handle virtual memory – so your database does not really have to fit in memory)
  • All get and set operations on Redis are atomic. A lot of concurrency problems and race conditions disappear because of this atomicity.
  • Sets in Redis allow union, intersection, and difference. Accessed like a hash.
  • Sorted sets combine hashes and arrays. Can lookup by key, but can also scan sequentially.
  • Redis allows real-time publish-subscribe.
  • Redis is simple. Redis is for specific small applications. Not intended for being the general purpose database for your app. Use where it makes sense. For example:
    • Lots of small updates
    • Vote up, vote down
    • Counters
    • Tagging. Implementing a tagging solution is usually a pain – it becomes easy with Redis (see the sketch after this list)
    • Cross-referencing small data
  • Don’t use Redis for ORM (object-relational mapping)
  • Don’t use Redis if memory is limited
  • Sites like digg use Redis for tagging
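
To make the counter and tagging points concrete, here is a rough sketch using the redis-py client. The key names and values are invented for illustration and are not from the talk.

```python
# Counters and tagging with Redis (hypothetical keys), using redis-py.
import redis

r = redis.Redis(host="localhost", port=6379)

# Atomic counters: vote up / vote down without explicit locking.
r.incr("post:42:votes")   # vote up
r.decr("post:42:votes")   # vote down

# Tagging: keep a set of post ids per tag, then use set algebra to query.
r.sadd("tag:nosql", 42, 43)
r.sadd("tag:redis", 42, 99)

print(r.sinter("tag:nosql", "tag:redis"))   # posts tagged with both tags
print(r.sunion("tag:nosql", "tag:redis"))   # posts tagged with either tag
```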

SQL Azure

Saranya Sriram talking about SQL Azure and data in the cloud. SQL Azure is pretty much SQL Server in the cloud, retrofitted for the cloud:

  • Exposes a RESTful interface
  • Has language bindings for python, rails, java, etc.
  • Gives full SQL / Relational database in the cloud
  • The standard tools used to access SQLServer locally can also be used to access SQL Azure from the cloud
  • For Azure you get a cloud simulation on your local machine to develop and test your application. For SQL Azure, you simply test with your local SQL Server edition. If you don’t have a SQL Server license, you can download SQL Server Express, which is free.
  • You can develop applications in Microsoft Visual Studio. You can incorporate PHP also in this.
  • You can also use Eclipse for developing applications.
  • SQL Azure has a maximum size limit of 50GB. (Started with 1 GB last year)
  • There is no free plan for Azure. You have to pay. “Enthusiasts” can use it free for 180 days. If you sign up for the BizSpark program (for small startups, for the first 3 years) it is free. Similarly, students can use it for free by signing up for the DreamSpark program. (Actually, the BizSpark and DreamSpark programs give you free access to lots of Microsoft software.)

LiveBlog #tw5: Intro to Functional Programming & Why it’s important

This is a live-blog of TechWeekend 5 on Functional Programming. Please keep checking regularly, this will be updated once every 15 minutes until 1pm.

Why Functional Programming Matters by Dhananjay Nene

Dhananjay Nene started off with an introductory talk on FP – what it is, and why it is important.

A functional programming language is one in which functions have no side-effects, i.e. the result of a function is purely dependent on its inputs. There is no state maintained.

Effects/Implications of “no side effects”

  • Side-effects are necessary: FP doesn’t mean completely side-effect free. If you have no side-effects, you can’t do IO. So, FP really means “largely side-effect free”. Specifically, there are very few parts of the code that have side-effects, and you know exactly which those are.
  • Testability: Unit Testing becomes much easier. There are no “bizarre interactions” between different parts of the code. “Integration” testing becomes much easier, because there are no hidden effects.
  • Immutability: There are no “variables”. Once a value has been assigned to a ‘name’, that value is ‘final’. You can’t change the value of that ‘name’ since that would be ‘state’ and need ‘side-effects’ to change it.
  • Lazy Evaluation: Since a function always produces the same result, the compiler is free to decide when to execute the function. Thus, it might decide to not execute a function until that value is really needed. This gives rise to lazy evaluation.
  • Concurrency control is not so much of a problem. Concurrency control and locks are really needed because you’re afraid that your data might be modified by someone else while you’re accessing it. This issue disappears if your data is immutable.
  • Easier parallelization: The biggest problem with parallelizing programs is handling all the concurrency control issues correctly. This becomes a much smaller problem with FP.
  • Good for multi-core: As the world moves to multi-core architectures, more and more parallelism will be needed. And humans are terrible at writing parallel programs. FP can help, because FP programs are intrinsically, automatically parallelizable.

Another important feature of functional programming languages is the existence of higher order functions. Basically in FP, functions can be treated just like data structures. They can be passed in as parameters to other functions, and they can be returned as the results of functions. This makes much more powerful abstractions possible. (If you know dependency injection, then higher-order functions are dependency injection on steroids.)
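
As a small illustration of higher-order functions, here is a sketch in Python (used here only for familiarity; it is not a functional language) where functions are passed in as parameters and returned as results. All names are hypothetical.

```python
# Functions as values: passed in as parameters and returned as results.
def compose(f, g):
    """Return a new function that applies g first, then f."""
    return lambda x: f(g(x))

def retry(times):
    """Return a decorator that retries a function up to `times` attempts."""
    def wrap(fn):
        def wrapped(*args, **kwargs):
            for attempt in range(times):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == times - 1:
                        raise
        return wrapped
    return wrap

add_one_then_double = compose(lambda x: 2 * x, lambda x: x + 1)
print(add_one_then_double(3))   # prints 8

@retry(3)
def flaky_fetch():
    # Stand-in for an operation that sometimes fails.
    return "ok"

print(flaky_fetch())
```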

FP gives brevity. Programs written in FP will typically be much shorter than comparable imperative programs. This is probably because of higher-order functions and closures. Compare the size of the quicksort code in Haskell vs. Java at this page.
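
The Haskell quicksort referenced above is famously only a couple of lines long. As a rough flavour of that declarative style (and not the code from the linked page), here is a Python approximation built from recursion and comprehensions:

```python
# Quicksort in a declarative, comprehension-based style (illustrative only).
def qsort(xs):
    if not xs:
        return []
    pivot, rest = xs[0], xs[1:]
    return (qsort([x for x in rest if x < pivot])
            + [pivot]
            + qsort([x for x in rest if x >= pivot]))

print(qsort([3, 1, 4, 1, 5, 9, 2, 6]))   # [1, 1, 2, 3, 4, 5, 6, 9]
```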

You need to think differently when you start doing functional programming.

Think different:

  • Use recursion or comprehensions instead of loops
  • Use pattern matching instead of if conditions (see the sketch after this list)
  • Use pattern matching instead of state machines
  • Information transformation instead of sequence of tasks
  • Software Transactional Memory FTW!
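
As a hedged illustration of “pattern matching instead of if conditions”, here is a small sketch using Python’s structural pattern matching (available from Python 3.10 onwards); the event shapes are hypothetical:

```python
# Dispatch on the *shape* of the data instead of chains of if/elif conditions.
def handle(event):
    match event:
        case {"type": "click", "x": x, "y": y}:
            return f"click at ({x}, {y})"
        case {"type": "key", "char": c}:
            return f"key press: {c}"
        case []:
            return "empty batch"
        case [first, *rest]:
            return f"{handle(first)} (+{len(rest)} more queued)"
        case _:
            return "unknown event"

print(handle({"type": "click", "x": 10, "y": 20}))
print(handle([{"type": "key", "char": "a"}, {"type": "click", "x": 1, "y": 2}]))
```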

Advantages of FP:

  • After initial ramp-up issues, development will be faster in FP
  • Code is shorter (easier to read, understand)
  • Clearer expression of intention of developer
  • Big ball of mud is harder to achieve with pure functions. You will not really see comments like “I don’t know why this piece of code works, but it works. Please don’t change it.”
  • Once you get used to FP, it is much more enjoyable.
  • Faster, better, cheaper and more enjoyable. What’s not to like?

The cost of doing FP:

  • Re-training the developers’ brains (this is a fixed cost), because of having to think differently. You can’t just get this from books; you must actually do some FP programming.
  • You can suffer from a lack of third-party libraries(?), but if you pick a language like Clojure which sits on the JVM, then you can easily access java libraries for the things that don’t exist natively in your language.

Should a company do its next project in a functional programming language? Dhananjay’s recommendation: start with small projects, and check whether you have the organizational capacity for FP. Then move on to larger and larger projects. If you’re sure that you have good programmers, and there happens to be a 6-month project for which you’re OK if it actually becomes a 12-month project, then definitely do it in FP. BG’s correction (based on his own experience): the 6-month project will only become an 8-month project.

Some things to know about Erlang by Bhasker Kode

Bhasker is the CEO of http://hover.in. They use Erlang in production for their web service.

Erlang was created in 1986 by developers at Ericsson for their telecom stack. This was later open-sourced and is now a widely used language.

Erlang is made up of many “processes”. These are programming language constructs – not real operating system processes. But otherwise, they are similar to OS processes. Each process executes independently of other processes. Processes do not share any data. Only message passing is allowed between processes. There are a number of schedulers which schedule processes to run. Normally, you will have as many schedulers as you have cores on your machine. Erlang processes are very lightweight.

Garbage collection is very easy, because as soon as a process dies, all its private data can be garbage collected, since it is not shared with anyone else.

Another interesting thing about Erlang is that the pattern matching (which is used in all functional programming languages) can actually match binary strings also. This makes it much easier to deal with binary data packets.

Erlang has inbuilt support and language features for handling failures of processes – supervisor processes, deciding which process takes over the failed one’s job, and so on.

Erlang allows you to think beyond for loops. Create processes which sit around waiting for instructions from you. And then the primary paradigm of programming is to send a bunch of tasks to a bunch of processes in parallel, and wait for results to come back.
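
Erlang’s lightweight processes have no direct equivalent in most other languages, but the paradigm described above (send a bunch of tasks to a bunch of workers in parallel, then wait for the results) can be sketched with Python’s standard multiprocessing module. This is only an analogy, not Erlang:

```python
# Farm tasks out to worker processes and collect the results.
from multiprocessing import Pool

def crawl(url):
    # Stand-in for real work done inside a worker process.
    return (url, len(url))

if __name__ == "__main__":
    urls = ["http://hover.in", "http://punetech.com", "http://paisa.com"]
    with Pool(processes=3) as pool:
        # map() distributes the tasks across the pool and blocks until
        # every worker has sent its result back.
        for url, size in pool.map(crawl, urls):
            print(url, size)
```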

Some erlang applications for developers:

  • Webservers built in erlang: Yaws, mochiweb, nitrogen, misultin
  • Databases built in erlang: amazon simpledb, riak, couch, dynomite, hibari, scalaris
  • Testing frameworks: distil, eunit, quickcheck, tsung

Who is using erlang? Amazon (SimpleDB), Facebook (Facebook chat), Microsoft, GitHub, Nokia (disco crawler), EA (the games company), RabbitMQ (a messaging application), ejabberd (the chat server, which has not crashed in 10 years). Indian companies using erlang: Geodesic, http://hover.in.

How Clojure handles the Expression Problem by Baishampayan Ghose

If you’ve gone deep into any programming language, you will find a reference to lisp somewhere. So, every programmer must be interested in lisp. To quote Eric Raymond:

LISP is worth learning for the profound enlightenment experience you will have when you finally get it. That experience will make you a better programmer for the rest of your days, even if you never actually use LISP itself a lot.

BG had conducted a 2 day Clojure tutorial in Pune a few months back, and he will happily do that again if there is enough interest. This talk is not about the basics of Clojure. It is talking about a specific problem, and how it is solved in Clojure, in the hope that it gives some interesting insights into Clojure.

Clojure is a dialect of lisp. And the first thing that anybody notices about lisp is all the parentheses. Don’t be afraid of the parentheses. After a few days of coding in lisp, you will stop noticing them.

Clojure has:

  • first-class regular expressions. A # followed by a string is a regular expression.
  • arbitrary precision integers and doubles. So don’t worry about the data-type of your numbers. (It internally uses the appropriately sized data types.)
  • code as data and data as code. Clojure (and lisp) is homoiconic, so lisp code is just lists, and hence can be manipulated by your program to create new program constructs. This is the most ‘difficult’ and most powerful part of all lisp-based languages. Google for “macros in lisp” to learn more. Most people don’t “get” this for a long time, and when they do “get” lisp macros, they suddenly become very productive in lisp.
  • has a nice way to attach metadata to functions. For example, type hints attached to functions can help improve performance
  • possibility of speed. With proper type-hints, Clojure can be as fast as Java

(Sorry: had to leave the talk early because of some other commitments. Will try to update this article later (in a day or two) based on inputs from other people.)

Cloud computing – Report of IndicThreads Conference on Cloud Computing 2010

(This is a live-blog of the Indic Threads Conference on Cloud Computing, that’s being held in Pune. Since it’s being typed in a hurry, it is not necessarily as coherent and complete as we would like it to be, and also links might be missing. Also, this has notes only on selected talks. The less interesting ones have not been captured; nor have the ones I missed because I had to leave early on day 1.)

This is a live-report of the IndicThreads conference on cloud computing. Click on the logo to see other PuneTech posts about IndicThreads events.

This is the first instance of what will become IndicThreads’ annual conference on Upcoming Technology – and the theme this time is Cloud Computing. It will be a two day conference and you can look at the schedule here, and the profiles of the speakers here.

Choosing your Cloud Computing Service

The first talk was by Kalpak Shah, the founder and CEO of Clogeny, a company that does consulting & services in cloud computing. He gave a talk about the various choices available in cloud computing today, and how to go about picking the one that’s right for you. He separated out Infrastructure as a Service (IaaS) which gives you the hardware and basic OS in the cloud (e.g. Amazon EC2), then Platform as a Service (PaaS) which gives you an application framework on top of the cloud infrastructure (e.g. Google AppEngine), and finally Software as a Service (SaaS) which also gives you business logic on top of the framework (e.g. SalesForce). He gave the important considerations you need to take into account before choosing the right provider, and the gotchas that will bite you. Finally he talked about the business issues that you need to worry about before you choose to be on the cloud or not. Overall, this was an excellent talk. Nice broad overview, lots of interesting, practical and useful information.

[Image: diagram showing an overview of cloud computing. Everybody is jumping on the cloud bandwagon. Do you know how to find your way around in the maze? Image via Wikipedia]

Java EE 6 for the cloud

This talk is more fully captured in a separate PuneTech article.

The next talk is by Arun Gupta about JavaEE in the cloud. Specifically Java EE 6, which is an extreme makeover from previous versions. It makes it significantly easier to deploy applications in the cloud. It is well integrated with Eclipse, NetBeans and IntelliJ, so overall it is much easier on the developer than the previous versions.

Challenges in moving a desktop software to the cloud

Prabodh Navare of SAS is talking about their experiences with trying to move some of their software products to the cloud. While the idea of a cloud is appealing, there are challenges in moving an existing product to the cloud.

Here are the challenges in moving to a cloud based business model:

  • Customers are not going to switch unless the cost saving is exceptional. Minor savings are not good enough.
  • Deployment has to be exceptionally fast
  • High performance is an expectation. Customers somehow expect that the cloud has unlimited resources. So, if they’re paying for a cloud app, they expect that they can get whatever performance they demand. Hence, auto-scaling is a minimum requirement.
  • Linear scaling is an expectation. But this is much easier said than done. Parallelization of tasks is a big pain. Must do lots of in-memory execution. Lots of caching. All of this is difficult.
  • Latency must be low. Google and Facebook respond in a fraction of a second. So users expect that you will too.
  • If you’re using Linux (i.e. the LAMP stack), then, for achieving some of these things, you’ll need to use Memcache, Hadoop, etc.
  • You must code for failure. Failures are common in the cloud (at those scales), and your system needs to be designed to seamlessly recover from them.
  • Is customer lock-in good or bad? General consensus in cloud computing market is that data lock-in is bad. Hence you need to design for data portability.
  • Pricing: Deciding the price of your cloud based offering is really difficult.
    • Cost of the service per customer is difficult to judge (shared memory used, support cost, CPU consumed, bandwidth consumed)
    • In Kalpak’s talk, he pointed this out as one of the inhibitors of cloud computing for businesses
    • Customers expect pay-as-you-go. This needs a full-fledged effort to build an appropriate accounting and billing system, and it needs to be grafted into your application
    • To support pay-as-you-go effectively, you need to design different flavors of the service (platinum, gold, silver). It is possible that this might not be easy to do with your product.

Multi-cloud programming with jCloud

This talk is by Vikas Hazrati, co-founder and “software craftsman” at Inphina.

Lots of people are interested in using the cloud, but one of the things holding them back is cloud vendor lock-in. If one cloud doesn’t work out, they would like to be able to shift to another. This is difficult.

To fix this problem, a bunch of multi-cloud libraries have been created which abstract out the clouds. Basically they export an API that you can program to, and they have implementations of their API on a bunch of major cloud providers. Examples of such multi-cloud libraries/frameworks are: Fog, Delta, LibCloud, Dasein, jCloud.

These are the things that are different from cloud to cloud:

  • Key-value store (i.e. the database)
  • File sizes
  • Resumability (can you stop and restart an application)
  • CDN (content delivery network)
  • Replication (some clouds have it and some don’t)
  • SLA
  • Consistency Model (nobody gives transaction semantics; everybody gives slightly different eventual consistency semantics)
  • Authorization
  • API complexity

APIs like jCloud try to shield you from all the differences in these.

jCloud allows a common API that will work on Amazon, Rackspace, VMWare and a bunch of other cloud vendors. It’s open source and performant, is based on Clojure, and most importantly, it allows unit testability across clouds. The testability is good because you can test without having to deploy on the cloud.

The abstractions provided by jCloud:

  • BlobStore (abstracts out key-value storage for: atmos, azure, rackspace, s3)
  • Compute (abstracts out vcloud, ec2, gogrid, ibmdev, rackspace, rimu)
  • Provisioning – adding/removing machines, turning them on and off

jCloud does not give 100% portability. It gives “pragmatic” portability. The abstraction works most of the time, but once in a while you can access the underlying provider’s API and do things which are not possible to do using jCloud.
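
jclouds itself is a Java library; for a flavour of what such a provider-neutral abstraction looks like, here is a hedged sketch using Apache Libcloud, the Python library mentioned above. The credentials are placeholders, and picking the first image and size is purely for brevity (real code would filter for a specific OS image and instance type).

```python
# Booting a server through a provider-neutral compute abstraction (Apache Libcloud).
from libcloud.compute.types import Provider
from libcloud.compute.providers import get_driver

def boot_node(provider, key, secret, name):
    """Start a server without writing provider-specific code."""
    driver = get_driver(provider)(key, secret)
    size = driver.list_sizes()[0]     # first (typically smallest) instance type
    image = driver.list_images()[0]   # first available OS image
    return driver.create_node(name=name, size=size, image=image)

# The same call is meant to work against EC2, Rackspace, and other supported clouds.
node = boot_node(Provider.EC2, "ACCESS_KEY", "SECRET_KEY", "demo-node")
print(node.public_ips)
```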

A Lap Around Windows Azure

[Image: diagram explaining the Windows Azure structure. Windows Azure is Microsoft’s entry in the PaaS arena. Image via Wikipedia]

This talk is by Janakiram, a Technical Architect (Cloud) at Microsoft.

Microsoft is the only company that plays in all three layers of the cloud:

  • IaaS – Microsoft System Center (Windows HyperV). Reliance is setting up a public cloud based on this in India.
  • PaaS – Windows Azure Platform (AppFabric, SQLAzure)
  • SaaS – Microsoft Online Services (MSOffice Web Application, MSExchange Online, MSOffice Communications Online, SharePoint Online)

The focus of this talk is the PaaS layer – Azure. Data is stored in SQLAzure, the application is hosted in Windows Azure and AppFabric allows you to connect/synchronize your local applications and data with the cloud. These together form the web operating system known as Azure.

The cloud completely hides the hardware, the scalability, and other details of the implementation from the developer. The only things the cloud exposes are: 1) Compute, 2) Storage, and 3) Management.

The compute service can have two flavors. The “Web Role” is essentially the UI – it shows webpages and interacts with the user – and is based on IIS7. The “Worker Role” does not have a UI; it is expected to run “background” processes, often long-running, and to operate on the storage directly. You can have Java Tomcat, Perl, Python, or whatever you want to run inside of a worker role. They demonstrated WordPress working on Azure – by porting MySQL, PHP, and WordPress to the platform. Bottom line: you can put anything you want in a worker role.

Azure storage exposes a Blob (very much like S3, or any other cloud storage engine). This allows you to dump your data, serialized, to disk. This can be combined with a CDN service to improve availability and performance. In addition, you can use tables for fast, read-mostly access. And it gives you persistent queues. And finally, you get “Azure Drive”, a way to share raw storage across your apps. All of this is available via a REST interface (which means that any app, anywhere on the web, can access the data – not just .NET apps).

Building an Azure application is no different from designing, developing, debugging, and testing an ASP.NET application. There is a local, simulated cloud interface that allows you to try everything out locally before deploying it to the cloud.

Links for azure: http://www.microsoft.com/windowsazure, and http://msdn.microsoft.com/azure.

The slides will be uploaded at Janakiram’s blog: http://www.janakiram.net.

He is on Twitter as @janakiramm.

Amazon EC2

Simone Brunozzi, a Technology Evangelist at Amazon Web Services, is talking about Amazon’s EC2.

[Image: Amazon Web Services logo. AWS has the broadest spectrum of services on offer in the cloud computing space, and the best partner/developer/tools ecosystem. Image via Wikipedia]

Overview of Amazon Web Services: Compute (EC2, Elastic MapReduce, Auto Scaling), Messaging (SQS, Simple Notification Service), Storage (S3, EBS, Import/Export), Content Delivery (CloudFront), Monitoring (CloudWatch), Support, Database (SimpleDB, RDS), Networking (Virtual Private Cloud, Elastic Load Balancing), Payments & Billing (FPS – Flexible Payments Service), e-Commerce (FWS – Fulfillment Web Service, Amazon DevPay), Web Traffic (Alexa Web Information, Alexa Top Sites), Workflow (Amazon Mechanical Turk)!! See this link for more.

AWS exists in US-West (2 locations), US-East (4 locations), Europe (2 locations), Asia-Pacific (2 locations). It’s architected for redundancy, so you get availability and failover for free.

EC2 essentially gives you virtual servers in the cloud that can be booted from a disk image. You can choose your instance type from small to extra-large (i.e. how much memory and CPU speed), and install an image on that machine. You can choose from a lot of pre-configured images (Linux, Solaris, Windows). These are basic OS installs, or more customized versions created by Amazon or the community. You can further customize this as you want, because you obviously get root/administrator access on this computer. Then you can attach a “disk” to this “computer” – basically get an EBS volume, which is 1GB to 1TB in size. An EBS device is persistent, and is automatically replicated. If you want even better durability, then snapshot the EBS and store it to S3.

Scaling with EC2: Put an ELB (Elastic Load Balancer) in front of your EC2 instances, and it will automatically load balance across those (and give you a single URL to expose to your users). In addition, ELB does health-checks on the worker instances and removes the ones who are not performing up to the mark. If you use the CloudWatch monitoring service, you can do things like: “if average CPU usage across all my instances is above 80%, then add a new instance, and remove it once average CPU usage drops below 20%.” After this point, adding and removing instances will be fully automated.
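
As a hedged sketch of that auto-scaling rule, here is roughly what it looks like with the modern boto3 SDK (which did not exist at the time of this talk). The group name, policy name, and thresholds are placeholders, and the Auto Scaling group is assumed to already exist.

```python
# "If average CPU across the group goes above 80%, add an instance."
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Scaling policy: add one instance to the group whenever it is triggered.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-tier",
    PolicyName="scale-out-on-cpu",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
)

# CloudWatch alarm: trigger the policy when average CPU stays above 80%.
cloudwatch.put_metric_alarm(
    AlarmName="web-tier-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "web-tier"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```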

It’s important to mention that AWS is very enterprise ready: it has certifications and security (needed for banking apps), SLAs, worldwide ecosystem of service integrators, and other partners. Another enterprise feature: Virtual Private Clouds. Basically, carve out a small area in the Amazon Public cloud which is only accessible through a VPN, and this VPN ends in the enterprise. Hence, nobody else can access that part of the cloud. (Note: this is different from a Private Cloud, which is physically located in the enterprise. Here, it is still physically located somewhere in Amazon’s data-centers, but a VPN is used to restrict access to one enterprise.)

Multi-tenancy in the Cloud

[Image: Multi-Tenant vs. Single-Tenant Architecture – the difference between single-tenant and multi-tenant apps. Image by andreasvongunten.com via Flickr]

Vikas Hazrati (referenced earlier in this post), is talking about multi-tenancy. How do you give two different customers (or groups of customers) the impression that they are the only ones using a particular instance of a SaaS; but actually you’re using only one installation of the software.

Multi-tenancy is basically when you have a single infrastructure setup (software stack, database, hardware), but you want multiple groups to use it, and each should see a completely isolated/independent view. Basically, for security, one customer group does not want their data to be visible to anybody else. But we don’t want to give each group their own instance of the infrastructure, because that would be too expensive.

Variations on multi-tenancy. The easiest is to not do multi-tenancy – have separate hardware & software. The next step is to have multiple virtual machines on shared hardware, so the hardware is shared but the software is not. If you’re sharing the middleware, you can do the following: 1. Multiple instances of the app on the same OS with independent memory, 2. Multiple instances of the app with shared memory, and 3. True multi-tenancy.

What level do you need to do multi-tenancy at? It could be at any layer: the database of course needs to be separable for different tenants. You can also do it at the business logic layer – so different tenants want different configurations of the business logic. And finally, you could also do this at the presentation logic – different tenants want different look’n’feel and branding.

Multi-tenancy in the database. You need to add a tenant-id to the database schema (and the rest of the schema is the same). A big customer concern with this is that bugs in queries can result in data leakage (i.e. a single poorly written query will result in your competitor seeing your sales leads data). This can be a huge problem. A typical SaaS vendor does this: put smaller customers in the same database with a tenant-id, but for larger customers, offer them the option of having their data in a separate database.
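
A minimal sketch of this shared-schema approach, using an in-memory SQLite database purely for illustration (the table and column names are invented):

```python
# Shared schema with a tenant_id column; every query is scoped by tenant.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE leads (tenant_id INTEGER, name TEXT, value REAL)")
db.execute("INSERT INTO leads VALUES (1, 'Acme deal', 5000), (2, 'Rival deal', 9000)")

def leads_for_tenant(tenant_id):
    # The tenant_id filter is what keeps one customer's data invisible to another;
    # forgetting it in even one query is exactly the data-leakage risk noted above.
    return db.execute(
        "SELECT name, value FROM leads WHERE tenant_id = ?", (tenant_id,)
    ).fetchall()

print(leads_for_tenant(1))   # only tenant 1's rows
```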

Multi-tenancy in the cloud. This is really where the cloud shines. Multi-tenancy gives very low costs; especially compared to the non-multi-tenant version (also known as the on-premise version). For example, the cost of multi-tenant JIRA is $10 per month, while the on-premise version is $150 per month (for the same numbers of users).

Multi-tenancy example: SalesForce does a very fine-grained approach. Each user gets his own portion of the database based on primary-key. And there is a validation layer between the app and the database which ensures that all queries have a tenant-id. Fairly fine-grained, and fairly secure. But it is quite complex – lots of design, lots of thinking, lots of testing.

One big problem with multi-tenancy is that of the runaway customers. If a few customers are really using a large share of the resources, then the other customers will suffer. Limiting their resource usage, or moving them elsewhere are both difficult to do.

In general, some providers believe that having each app developer implement multi-tenancy in the app is inefficient. The solution to this is to virtualize the database/storage/other physical resources. In other words, for example, the database exports multiple virtual databases, one per tenant, and the underlying multi-tenant-database handles all the issues of multi-tenancy. Both Amazon’s RDS and Windows SQLAzure provide this service.

Google released the namespaces API for Google AppEngine just a few days back, and that takes a different approach. The multi-tenancy is handled at the highest level of the app, but there’s a very easy way of specifying the tenant-id and everything else is handled by the platform. However, note that multi-tenancy is currently supported only for 3 of their services, and will break if you use one of the others.

Issues in multi-tenancy:

  • Security: all clients are worried about this
  • Impact of other clients: customers hogging resources is still not a solved problem
  • Some customers are willing to pay for a separate instance: and it’s a pain for us to implement and manage
  • Multi-tenancy forces users to upgrade when the app is upgraded. And many customers don’t want to be upgraded forcefully. To handle this issue, many SaaS providers make new features available only via configuration options, not as a forced upgrade.
  • Configurations/customizations can only be done up to some level
  • There is no user acceptance testing. We test, and users have to take it when we make it live.

When should you not use multi-tenancy?

  • Obviously, when security is a concern. e.g. Google not doing this for government data
  • High customization and tight integration will make all the advantages of multi-tenancy disappear

SaaS-ifying a traditional application

Chirag Jog, CTO at Clogeny Technologies, a PICT ex-student, talking about the choices and issues faced in converting a traditional application to a SaaS, based on a real-life scenario they faced. The case study is of a customer in Pune, who was using his own infrastructure to host a standard web app (non-cloud), and occasional spikes in user requests would cause his app to go down. Dedicated hosting was too expensive for him – hence the need to move it to the cloud.

Different choices were: SaaS on top of shared infrastructure (like slicehost), or SaaS on top of PaaS (like AppEngine), or SaaS on top of IaaS (Amazon EC2, Rackspace). PaaS seems great, but real-life problems make you re-think: Your existing app is written in a language (or version of a language) that is not supported on the PaaS. Or has optimizations for a local deployment. Specific libraries might be missing. Thus, there’s lots of code change, and lots of testing, and stability will be a problem.

Hence, they decided to go with SaaS on IaaS. Basically simply moving the existing local app to the same software stack on to a server in IaaS. The app itself was largely compute intensive, so they decided to use the existing app as a ‘server’ and built a new client that talks to the server and serves up the results over the web. For this, the server(s) and client went on Amazon EC2 instances, Simple Queueing Service (SQS) was used to communicate between the ‘client’ and the ‘server’, and automatic scaling was used to scale the app (multiple compute servers). This not only helped the scalability & load balancing, but they were able to use this to easily create multiple classes of users (queue cheapo users, and prioritize priority customers) – improved business logic!

Cloud Security – Threats and Mitigations

[Image: “Internet – Good Or Bad?” If there’s a malicious hacker in your cloud, you could be in trouble. And it is your responsibility, not the cloud vendor’s. Image by Mikey G Ottawa via Flickr]

Vineet Mago and Naresh Khalasi (from the company that @rni and @dnene are associated with) are talking about the privacy and security issues they faced in putting their app on the cloud, and how to deal with those.

The good thing about the cloud is supposed to be that the cloud vendor takes care of all the nitty-gritty, and the developer need not worry about it. The presenters disagree – especially where privacy and security are concerned. You have to worry about it. It is your responsibility. And if you’re not careful, you’ll get into trouble, because the vendors are not careful on your behalf. Protecting against malicious hackers is still your responsibility; the cloud vendor does nothing about it.

The Cloud Security Alliance publishes best practices for security in the cloud. Recommended.

You need to worry about the following forms of security:

  • Physical security: who gets into the building? cameras?
  • Network security: firewall, anti-DDoS, authorization controls
  • System security: anti-virus, active directory, disabling USB
  • Application security: AAA, API security, release management (no release goes out without proper audits and checks by different departments)

And there are three aspects you need to think about:

  • Confidentiality: can your data get into the wrong hands? What if cloud provider employee gets his hands on the data?
  • Integrity: can the data be corrupted? Accidentally? Maliciously?
  • Availability: Can someone make your data unavailable for a period of time? DDoS?

Remember, if you’re using the cloud, the expectation is that you can do this with a very small team. This is a problem because the effort to take into account the security aspects doesn’t really reduce. It increases. Note: it is expected that a team of 2 people can build a cloud app (R&D). However, if a networked app needs to be deployed securely, we’d expect the team size to be 30.

State of the art in cloud security:

  • IaaS: provider gives basic firewall protection. You get nothing else.
  • PaaS: Securing the infrastructure (servers, network, OS, and storage) is the provider’s responsibility. Application security is your responsibility
  • SaaS: Network, system and app security is provider’s job. SLAs, security, liability expectation mentioned in agreements. Best. But least flexibility for developers.

Problems with cloud security:

  • Unknown risk profile: All of them, IaaS, PaaS, and SaaS, are unknowns as far as security is concerned. This industry is just 4 years old. There are areas that are dark. What to do?
    • Read all contracts/agreements carefully and ask questions.
    • Ask provider for disclosure of applicable logs and data.
    • Get partial/full disclosure of infrastructure details (e.g. patch levels, firewalls, etc.)
  • Abuse and nefarious use of cloud computing: Applies to IaaS, PaaS. If hackers are using Amazon EC2 instances to run malware, then there are two problems. First, the malware could exploit security loop-holes in the virtualization software and might be able to access your virtual machine, which happens to be on the same physical machine as the hacker’s virtual machine. Another problem is that the provider’s machines/IP-addresses enter public blacklists, and that will cause problems. What to do?
    • Look for providers that have strict initial registration requirements.
    • Check levels of credit card fraud monitoring and co-ordination used by the provider
    • Is the provider capable of running a comprehensive introspection of customer network traffic?
    • Monitor public blacklists for one’s own network IP blocks
  • Insecure Interfaces and APIs: 30% of your focus in designing an API should go into building a secure API. e.g. Twitter API does not use https. So anybody at this conference today could sniff the wi-fi here, sniff the network traffic, get the authentication token, and run a man-in-the-middle attack. Insecure API. What to do?
    • Analyze the security model of cloud provider’s interfaces
    • Build limits into your apps to prevent over-use of your apps
  • Malicious Insiders: “In 3 years of working at a company, you’ll have the root passwords of all the servers in the company!” Is your security policy based on the hope that all Amazon AWS employees are honest? What to do?
    • Know what are the security breach notification processes of your provider, and determine your contingency plans based on that information
    • Read the fine print in the contracts/agreements before deciding on cloud vendors
  • Shared Technology Issues: There are various ways in which a malicious program in a virtual machine can access underlying resources from the hypervisor and access data from other virtual machines by exploiting security vulnerabilities. What to do?
    • Implement security best practices for your virtual servers
    • Monitor environment for unauthorized changes/activity
    • Enforce vendor’s SLAs for patching and vulnerability remediation
    • Example: Amazon allows you to run penetration testing, but you need to request permission to do that

Summary

Overall, a good conference. Some great talks. Not too many boring talks. Quality of attendees was quite good. Met a bunch of interesting people that I hadn’t seen in POCC/PuneTech events. You should be able to find slides of all the talks on the conference website.

Live-Blog: Overview of High Performance Computing by Dr. Vipin Chaudhary

(This is a live-blog of Dr. Vipin Chaudhary talk on Trends in High Performance Computing, organized by the IEEE Pune sub-section. Since this is being typed while the talk is going on, it might not be as well organized, or as coherent as other PuneTech articles. Also, links will usually be missing.)

[Photo: Dr. Vipin Chaudhary, CEO of CRL, speaking on High Performance Computing at the Institute of Engineers, Pune. CRL are the makers of Eka, one of the world’s fastest privately funded supercomputers.]

Myths about High Performance Computing:

  • Commonly associated with scientific computing
  • Only used for large problems
  • Expensive
  • Applicable to niche areas
  • Understood by only a few people
  • Lots of servers and storage
  • Difficult to use
  • Not scalable and reliable

This is not the reality. HPC is:

  • Backbone for national development
  • Will enable economic growth. Everything from toilets to potato chips is designed using HPC
  • Lots of supercomputing is throughput computing – i.e. used to solve lots of small problems
  • “Mainstream” businesses like Walmart, and entertainment companies like Dreamworks Studios, use HPC.
  • (…and a bunch of other reasons that I did not catch)

China is really catching up in the area of HPC. And Vipin correlates China’s GDP with the development of supercomputers in China. Point: technology is a driver for economic growth.  We need to also invest in this.

Problems solved using HPC:

  • Movie making (like avatar)
  • Real time data analysis
    • weather forecasting
    • oil spill impact analysis
    • forest fire tracking and monitoring
    • biological contamination prediction
  • Drug discovery
    • reduce experimental costs through simulations
  • Terrain modeling for wind-farms
    • e.g. optimized site selection, maintenance scheduling
    • and other alternate energy sources
  • Geophysical imaging
    • oil industry
    • earthquake analysis
  • Designing airplanes (Virtual wind tunnel)

Trends in HPC.

The Manycore trend.

Putting many CPUs inside a single chip. Multi-core is when you have a few cores, manycore is when you have many, many cores. This has challenges. Programming manycore processors is very cumbersome. Debugging is much harder. e.g. if you need to get good performance out of these chips then you need to do parallel, assembly programming. Parallel programming is hard. Assembly programming is hard. Both together will kill you.

This will be one of the biggest challenges in computer science in the near future. A typical laptop might have 8 to 10 processes running concurrently. So there is automatic parallelism, as long as the number of cores is less than 10. But as chips get 30, 40 cores or more, individual processes will need to be parallel. This will be very challenging.

Oceans of Data but the Pipes are Skinny

Data is growing fast. In sciences, humanities, commerce, medicine, entertainment. The amount of information being created in the world is huge. Emails, photos, audio, documents etc. Genomic data (bio-informatics) data is also huge.

Note: data is growing way, way faster than Moore’s law!

Storing things is not a problem – we have lots of disk space. Fetching and finding stuff is a pain.

Challenges in data-intensive systems:

  • Amount of data to be accessed by the application is huge
  • This requires huge amounts of disk, and very fat interconnects
  • And fast processors to process that data

Conventional supercomputing was CPU bound. Now, we are in the age of data-intensive supercomputing. Difference: old supercomputing had storage elsewhere (away from the processor farm). Now the disks have to be much closer.

Conventional supercomputing was batch processed. Now, we want everything in real-time. Need interactive access. To be able to run analytic and ad hoc queries. This is a new, and difficult challenge.

While Vipin was a faculty member at SUNY Buffalo, they started a data-intensive discovery initiative (Di2). Now, CRL is participating. Large, ever-changing data sets. Collecting and maintaining the data is of course a major problem, but the primary focus of Di2 is to search this data, e.g. security (find patterns in huge logs of user actions). This requires a new, different architecture from traditional supercomputing, and the resulting Di2 system significantly outperforms the traditional system.

This also has applications in marketing analysis, financial services, web analytics, genetics, aerospace, and healthcare.

High Performance Cloud Services at CRL

Cloud computing makes sense. It is here to stay. But energy consumption of clouds is a problem.

Hence, CRL is focusing on a green cloud. What does that mean?

Data center optimization:

  • Power consumption optimization on hardware
  • Optimization of the power system itself
  • Optimized cooling subsystem
  • CFD modeling of the power consumption
  • Power dashboards

Workflow optimization (reduce computing resource consumption via efficiencies):

  • Cloud offerings
  • Virtualizations
  • Workload based power management
  • Temperature aware distribution
  • Compute cycle optimization

Green applications being run in CRL

  • Terrain modeling
  • Wind farm design and simulation
  • Geophysical imaging
  • Virtual wind tunnel

Summary of talk

  • Manycore processors are here to stay
    • Programmability has to improve
    • Must match application requirements to processor architecture (one size does not fit all)
  • Computation has to move to where the data is, and not vice versa
  • Data scale is the biggest issue
    • must co-locate data with computing
  • Cloud computing will continue to grow rapidly
    • Bandwidth is an issue
    • Security is an issue
    • These issues need to be solved

Event report: POCC session on cloud apps for your startup

This is a live-blog of the Pune Open Coffee Club session on use of cloud apps for your business. Since this is being typed as the session is in progress, it might be a bit incoherent and not completely well-structured, and there are no links.

[Image: Pune Open Coffee Club logo. The Pune Open Coffee Club is an informal group for all those interested in Pune’s startup ecosystem. As of this writing, it has more than 2700 members.]

This session is being run as a panel discussion. Santosh Dawara is the moderator. Panelists are:

  • Dhananjay Nene, Independent Software Architect/Consultant
  • Markus Hegi, CEO of CoLayer
  • Nitin Bhide, Co-founder of BootstrapToday, a cloud apps provider
  • Basant Rajan, CEO of Coriolis, which makes the Colama virtual machine management software
  • Anthony Hsiao, Founder of Sapna Solutions

The session started with an argument over the definition of cloud, SaaS, etc., which I found very boring and will not capture here.

Later, Anthony gave a list of cloud apps used by Sapna Solutions:

  • Google apps for email, calendaring, documents
  • GitHub for code
  • Basecamp for project management
  • JobScore for recruitment (handles job listings on your website, and the database of applicants, etc.)
  • GreyTip (Indian software for HR management)

Question: Should cloud providers be in the same country?
Answer: you don’t really have a choice. There are no really good cloud providers in India. So it will be outside.

Question: Are customers ready to put their sensitive data on the cloud?
Audience comment: Ashish Belagali has a startup that provides recruitment software. They can provide it as installable software, and also as a hosted, cloud app. However, they’ve found that most customers are not interested in the cloud app. They are worried about two things: a) The software will be unavailable if the internet is not available, and b) The data is outside the company premises.

Point by Nitin Bhide of BootstrapToday: Any cloud provider will take security of your data very seriously. Because, if they screw this up even once, they’ll go out of business right away. Also, as far as theft of data is concerned, it can happen even within your own premises, by your own employees.

Comment 1: Yes, the above argument makes logical sense. But most human beings are not logical, and can have an irrational fear and will defend their choice.

Comment 2: This fear is not irrational. There are valid reasons to be unhappy about having your sensitive data in the cloud.

Comment 3: Another reason why this fear is not irrational is to do with CYA: cover-your-ass. If you put data in the cloud and something goes wrong, you will be blamed. If you put the data locally and something goes wrong, you can claim that you did everything that was expected of you. As long as CYA exists (especially in enterprises), this will be a major argument against the cloud.

Question: Does anybody use accounting packages in the cloud?
Answer: No. Most people prefer to stick to Tally, because of its compliance with Indian laws (or at least its compliance with Indian CAs). There doesn’t seem to be any online alternative that’s good enough.

At this point there was a longish discussion about the availability and uptime of the cloud services. Points made:

  • Cloud app providers have lots of redundancies and lots of backups to ensure that there is no downtime
  • However, there are enough instances of even world-class providers having downtime
  • Also, most of them claim redundancies, but give no guarantees or SLAs, and even if they do give an SLA, you’re too small a player to enforce the SLA.
  • Also remember, that in the Indian context, downtime of the last mile of your internet will result in downtime of your app
  • The point to remember is that an app going down is not the real problem. The real problem is recovery time. How long does it take before it comes back up? Look at that before deciding on your app.
  • It would be great if there was a reputation service for all cloud apps, which gives statistics on availability, downtime, performance etc. There isn’t right now, and that is a problem.
  • Remember, there is an economic cost of cloud apps that you will incur due to downtime, but also remember that there are definite economic savings too. For many startups the savings outweigh the potential costs. But you need to look into this for yourself.

Question: What kind of cost savings can a startup get by going to the cloud?

Nobody had concrete answers, but general points made:

  • Can you really afford to pay a system administrator who is competent, and who can administer a mail server, a file server, a this, and a that? There were some people who said that while admins are expensive in the US, they are not that expensive in India. However, more people felt that this would be expensive.
  • All significant large cloud services cost a very tiny fraction of what it would cost to do it yourself.
  • It is not a question of cost. As a startup, with my limited team, I wouldn’t have time to do this.

Basant Rajan points out that so far the discussion has been about either something that is in the cloud, or it is something that you do entirely yourself. These are not the only options. There is a third option – called managed services, or captive clouds. He points out that there is a Pune company called Mithi software that offers a whole bunch of useful services that they manage, on their machines, in your premises.

Question: What about compatibility between your apps? If the recruitment app needs to talk to your HR app are you in trouble?
Answer: The good ones already talk to each other. But yes, if you are not careful, you could run into trouble.

Some Pune startups who are providing cloud based apps:

Pune startup BootstrapToday provides an all-in-one solution in the cloud for development:

  • Source code control (using SVN). All the rest of these services are home grown.
  • Wiki pages
  • Bug tracking
  • Project management
  • Time Tracking (coming soon)
  • Project Tracking (coming soon)

Pune startup Acism has developed an in-house tool for collaboration and project communication which they are making available to others.

Pune startup CoLayer has been around for a long time, and has a product for better collaboration within an enterprise. It is like Google Wave, but has been around for longer, and is still around (while Wave is not).

Pune startup Coriolis offers private clouds based on its Colama virtualization technology. They are currently focusing on software labs in educational institutions as customers. But this technology can also be used to create grids and private clouds for development, testing and training.

Recommendations for cloud apps:

General recommendation: if you’re not using Google Apps, you must. Mail, Documents (i.e. Office equivalent functionality), Calendar.

Bug Tracking: Jira (very good app, but expensive), Pivotal Tracker (only for those familiar with agile, suggested by @dnene), Lighthouse App (suggested by: @anthonyhsiao), Mantis.

Project Management: ActiveCollab (self hostable), DeskAway, SugarCRM on Google Apps (very good CRM, very good integration with Google Apps, has a learning curve).

For hosting your own cloud (i.e. bunch of servers with load balancing etc.): Rackspace Cloud is good but expensive. Amazon Web Services is cost effective, but has a learning curve.

Unfortunately, due to time constraints, this part of the session got truncated. Hopefully we’ll have some more time in the end to pick this up again.

IndicThreads conference pass giveaway

IndicThreads will give a free pass to their Cloud Computing conference that is scheduled for 20/21 August to the best blog or tweet either about this POCC event, or about Cloud Computing in general. The pass is normally worth Rs. 8500. To enter, tweets and blogs should be brought to the attention of @indicthreads on twitter, or conf@rightrix.com. This PuneTech blog is not eligible for the free pass (because I already have a pass), so the field is still open 🙂

Tech Trends for 2015, by Anand Deshpande, Shridhar Shukla, Monish Darda

On Monday, I participated in a Panel Discussion “Technology Trends” organized by CSI Pune at MIT college. The panelists were Anand Deshpande, CEO of Persistent Systems, Shridhar Shukla, MD of GS Lab, Monish Darda, GM of BladeLogic India (which is now a part of BMC Software), and me.

Anand asked each of us to prepare a list of 5 technology trends that we felt would be important in the year 2015, and then we would compare and contrast our lists. I had already published my own list of 5 things for students to focus on last week. Basically I cheated by listing just a couple of technology trends, and filled out the list with one technology non-trend and a couple of non-technology non-trends.

Here are my quick-n-dirty notes of the other panelists tech trends, and other points that came up during the discussion.

Here is Shridhar’s list:

  • Shridhar’s trend #1: Immersive environments for consumers – from games to education. Partial virtual reality. We will have more audio, video, multi-media, and more interactivity. Use of keyboards and menu driven interfaces will reduce. Tip for students based on trend #1: don’t look down on GUIs. On a related note, sadly, none of the students had heard of TED. Shridhar asked them all to go and google it and to check out “The Sixth Sense” TED video.
  • Shridhar’s trend #2: totally integrated communication and information dissemination.
  • Shridhar’s trend #3: Cloud computing, elastic computing. Computing on demand.
  • Shridhar’s trend #4: Analytics. Analytics for business, for government, for corporates. Analyzing data, trends. Mining databases.
  • Shridhar’s trend #5: Sophisticated design and test environments. As clouds gain prominence, large server farms with hundreds of thousands of servers will become common. As analytics become necessary, really complicated, distributed processes will run to do the complex computations. All of this will require very sophisticated environments, management tools and testing infrastructure. Hardcore computer science students are the ones who will be required to design, build and maintain this.

Monish’s list:

  • Monish’s trend #1: Infrastructure will be commoditized, and interface to the final user will assume increasing importance
  • Monish’s trend #2: Coming up with ideas – for things people use, will be most important. Actually developing the software will be trivial. Already, things like AWS make a very sophisticated server farm available to anybody. And lots of open source software makes really complex software easy to put together. Hence, building the software is no longer the challenge. Thinking of what to build will be the more difficult task.
  • Monish’s trend #3: Ideas combining multiple fields will rule. Use of technology in other areas (e.g. music) will increase. So far, software industry was driven by the needs of the software industry first, and then other “enterprise” industries (like banking, finance). But software will cross over into more and more mainstream uses. Be ready for the convergence, and meeting of the domains.
  • Monish’s trend #4: Sophisticated management of centralized, huge infrastructure setups.

Anand’s list:

  • Anand’s trend #1: Sensors. Ubiquitous tiny computing devices that don’t even look like computers. All networked.
  • Anand’s trend #2: The next billion users. Mobile. New devices. New interfaces. Non-English interfaces. In fact, non-text interfaces.
  • Anand’s trend #3: Analytics. Sophisticated processing of large amounts of data, and making sense out of the mess.
  • Anand’s trend #4: User interface design. New interfaces, non-text, non-keyboard interfaces. For the next billion users.
  • Anand’s trend #5: Multi-disciplinary products. Many different sciences intersecting with technology to produce interesting new products.

These lists of 5 trends had been prepared independently, without any collaboration. So it is interesting to note the commonalities. Usability. Sophisticated data analysis. Sophisticated management of huge infrastructure setups. The next billion users. And combining different disciplines. Thinking about these commonalities and then wondering about how to position ourselves to take advantage of these trends will form the topic of another post, another day.

Until then, here are some random observations. (Note: one of the speakers before the panel discussion was Deepak Shikarpur, and some of these observations are by him)

  • “In the world of Google, memory has no value” – Deepak
  • “Our students are in the 21st century. Teachers are from 20th century. And governance is 19th century” -Deepak
  • “Earning crores of rupees is your birthright, and you can have it.” – Deepak
  • Sad: Monish asked how many students had read Isaac Asimov. There were just a couple.
  • Monish encouraged students to go and read about string theory.

Web Scalability and Performance – Real Life Lessons (Pune TechWeekend #3)

Last Saturday, we had TechWeekend #3 in Pune, on the theme of Website Scalability and Performance. Mukul Kumar, co-founder and VP of Engineering at Pubmatic, talked about the hard lessons in scalability they learnt on their way to building a web service that serves billions of ad impressions per month.

Here are the slides used by Mukul (“Web Scalability & Performance”). If you cannot see the slides, click here.

The talk was live-tweeted by @punetechlive and @d7ylive. Here are a few highlights from the talk:

  • Keep it simple: If you cannot explain your application to your sales staff, you probably won’t be able to scale it!
  • Use JMeter to monitor performance; it helps you do a good job of scaling your site
  • Performance testing idea: Take 15-20 Amazon EC2 servers, run JMeter with 200 threads on each for 10 hours. Bang on your website! (A few days later, @d7y pointed out that using openSTA instead of JMeter can give you up to 500 threads per server even on old machines.) A minimal load-generation sketch appears after this list.
  • Scaling your application: have a loosely coupled, shared nothing, stateless, distributed architecture
  • MySQL scalability tip: Be careful before using new features, or new versions. Or don’t use them at all!
  • Website scalability: think global. Some servers in California, some servers in London, etc. Similarly, think global when designing your app. Having servers across the world will drive architecture decisions. When half your data-center is 3000 miles from the other half, interesting, non-trivial problems start cropping up. Also, think carefully about horizontal scaling (lots of cheap servers) vs vertical scaling (few big fat servers)
  • memcache tip: pre-populate memcache with the most common objects (see the cache-warming sketch after this list)
  • Scalability tip: Get a hardware load balancer (if you can afford one). Amazon AWS has some load-balancers, but they don’t perform so well
  • Remember the YouTube algo for scaling:
    while(1) {
        identify_and_fix_bottlenecks();
        eat_drink();
        sleep();
        notice_new_bottleneck();
    }

    there’s no alternative to this.
  • Scalability tip: You can’t be sure of your performance unless you test with real load, real env, real hardware, real software!
  • Scalability tip – keep the various replicated copies of data loosely consistent. Speeds up your updates. But figure out which parts of your database must be consistent at all times, and which ones can have “eventual consistency”
  • Hard lessons: keep spare servers at all times. Keep servers independent – one server’s failure shouldn’t affect the others
  • Hard lessons: Keep all commands in a script. You will have to run them at 2am. Then 3am. Then 7am.
  • Hard lessons: Have a well defined process for fault identification, communication and resolution (because figuring these things out at 2am, with a site that is down, is terrible.)
  • Hard lessons: Monitor your web service from 12 cities around the world!
  • Hard lesson: Be paranoid – at any time, servers can go down, DDOS attacks can happen, NICs can become slow or fail!
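
To make the load-testing idea above concrete, here is a minimal Python sketch (not JMeter or openSTA – just an illustration of many threads banging on one URL). The URL, thread count, and duration below are made-up parameters, not numbers from the talk.

    # minimal load-generation sketch: many threads hitting one URL in a loop.
    # URL, THREADS and DURATION are illustrative values, not from the talk.
    import time
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    URL = "http://example.com/"   # replace with the site under test
    THREADS = 200                 # JMeter-style thread count
    DURATION = 60                 # seconds; the talk suggested running for hours

    def worker(_):
        done, end = 0, time.time() + DURATION
        while time.time() < end:
            try:
                urllib.request.urlopen(URL, timeout=10).read()
                done += 1
            except Exception:
                pass              # a real test would record errors and latencies
        return done

    with ThreadPoolExecutor(max_workers=THREADS) as pool:
        totals = list(pool.map(worker, range(THREADS)))
    print("requests completed:", sum(totals))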
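
Similarly, for the memcache tip: a small sketch of warming the cache with your most common objects at startup, using the pymemcache client for Python. The server address, key names, and the load_popular_posts() helper are hypothetical.

    # pre-populate (warm) memcache with the most common objects at startup.
    # pymemcache is one of several Python clients; the server address,
    # key scheme, and load_popular_posts() are hypothetical.
    from pymemcache.client.base import Client

    def load_popular_posts():
        # stand-in for "fetch the N most-requested objects from the database"
        return {"post:1": b"hello world", "post:2": b"scaling notes"}

    cache = Client(("127.0.0.1", 11211))
    for key, value in load_popular_posts().items():
        cache.set(key, value, expire=3600)   # warm the cache before traffic arrives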

Note: a few readers of the live-tweets asked questions from Nashik and Bombay, and got them answered by Mukul. +1 for twitter. You should start following too.


Attend proto.in from home – follow the live online coverage

Click on the logo to see all PuneTech articles about proto.in

Proto.in is in Pune today, and one of the ideas they are pushing this year is live online coverage of the event. The idea is that while only 400 people can attend the event in person, many more should be able to follow the details online. With this in mind, this proto promises to be the most connected proto so far.

Here are the different ways in which you can follow proto online:

  • http://proto.in/live: proto’s live portal where you can follow all the proto activity. It aggregates all the live tweets about #protodotin. Bloggers can also submit their live-blogs and selected ones will be put up on this page. You can download information about the companies that are presenting, and you can leave feedback for the companies.
  • @PuneTechLive will be live-tweeting the event. Unfortunately, twitter search does not pick up punetechlive’s tweets, so the proto.in/live page and the twitter search pages will not show you these tweets. So you have to follow punetechlive in twitter (or go to http://twitter.com/punetechlive and refresh periodically).
  • The hashtag for proto.in is “#protodotin”, so searching for that on twitter, or on technorati should give you the latest on what is going on.
  • On an experimental basis, PuneTech will be trying to videoblog. We will have short (1 or 2 minute) interviews with various interesting people throughout the day. Check PuneTech’s youtube page and refresh periodically.

Keep checking this page also, we’ll try to keep it updated with …umm… updates throughout the day.

PuneTech is also trying out live-video updates of proto.in. Check out this video:

Highlights of Proto.in Presentations

Here is what we feel were the best parts of proto so far:

  • Vardenchi motorcycles on stage. Awesome audience impact!
  • HyCa presenting a product based on a very complicated chemical process, in words that we understood.
  • EnglishSeekho demo – The product speaks for itself. No explanation needed
  • EnglishSeekho founder asking: we could be providing pesticide info to farmers, we could be providing information about contraception to rural girls, in a convenient and confidential setting, we could be providing life-changing, life-saving information at the right time, at the right price (maybe Rs. 5). Isn’t that better than spending time building websites that sell movie or airline tickets or books online?
  • TouchMagix demo of Magix 3D Sense. Someone on twitter pointed out – proto.in felt like TED for a moment!

Demo Tips for Startups

Based on what we saw at the proto.in presentations, here are some tips for those presenting their startup:

  • Read Jason Calacanis’ “How to demo your startup”. It’s a must-read
  • Dress conservatively! You don’t want to draw attention to your dress. Definitely do NOT dress in a white suit and white-and-brown shoes.
  • If you have 30 minutes, then spending time on the educational and professional background of the team makes sense. If you have just 6 minutes, skip it. Go straight to your product.
  • You shouldn’t need to spend half your time motivating your product. Just show your product, and the audience should be able to see the motivation. Otherwise your product is not compelling enough (or you are pitching to the wrong audience)

Audience Reactions

These are reactions of Pune tech community regulars to the proto.in presentations:


Optimization: A case study

(PuneTech is honored to have Dr. Narayan Venkatasubramanyan, an Optimization Guru and one of the original pioneers in applying Optimization to Supply Chain Management, as our contributor. I had the privilege of working closely with Narayan at i2 Technologies in Dallas for nearly 10 years.

PuneTech has published some introductory articles on Supply Chain Management (SCM) and the optimization & decision support challenges involved in various real world SCM problems. Who better to write about this area in further depth than Narayan!

For Dr. Narayan Venkatasubramanyan’s detailed bio, please click here.

This is the first in a series of articles that we will publish once a week for a month. For the full series of articles, click here.)

the following entry was prompted by a request for an article on the topic of “optimization” for publication in punetech.com, a website co-founded by amit paranjape, a friend and former colleague. for reasons that may have something to do with the fact that i’ve made a living for a couple of decades as a practitioner of that dark art known as optimization, he felt that i was best qualified to write about the subject for an audience that was technically savvy but not necessarily aware of the application of optimization. it took me a while to overcome my initial reluctance: is there really an audience for this? after all, even my daughter feigns disgust every time i bring up the topic of what i do. after some thought, i accepted the challenge as long as i could take a slightly unusual approach to a “technical” topic: i decided to personalize it by rooting it in a personal-professional experience. i could then branch off into a variety of different aspects of that experience, some technical, some not so much. read on …

background

the year was 1985. i was fresh out of school, entering the “real” world for the first time. with a bachelors in engineering from IIT-Bombay and a graduate degree in business from IIM-Ahmedabad, and little else, i was primed for success. or disaster. and i was too naive to tell the difference.

for those too young to remember those days, 1985 was early in rajiv gandhi‘s term as prime minister of india. he had come in with an obama-esque message of change. and change meant modernization (he was the first indian politician with a computer terminal situated quite prominently in his office). for a brief while, we believed that india had turned the corner, that the public sector companies in india would reclaim the “commanding heights” of the economy and exercise their power to make india a better place.

CMC was a public sector company that had inherited much of the computer maintenance business in india after IBM was tossed out in 1977. quickly, they broadened well beyond computer maintenance into all things related to computers. that year, they recruited heavily in IIM-A. i was one of an unusually large number of graduates who saw CMC as a good bet.

not too long into my tenure at CMC, i was invited to meet with a mid-level manager in the electronics & telecommunications department of the oil and natural gas commission of india (ONGC). the challenge he posed us was simple: save money by optimizing the utilization of helicopters in the bombay high oilfield.

the problem

the bombay high offshore oilfield, the setting of our story

the bombay high oilfield is about 100 miles off the coast of bombay (see map). back then, it was a collection of about 50 oil platforms, divided roughly into two groups, bombay high north and bombay high south.

(on a completely unrelated tangent: while writing this piece, i wandered off into searching for pictures of bombay high. i stumbled upon the work of captain nandu chitnis, ex-navy now ONGC, biker, amateur photographer … who i suspect is a pune native. click here for a few of his pictures that capture the outlandish beauty of an offshore oil field.)

movement of personnel between platforms in each of these groups was managed by a radio operator who was centrally located.

all but three of these platforms were unmanned. this meant that the people who worked on these platforms had to be flown out from the manned platforms every morning and brought back to their base platforms at the end of the day.

at dawn every morning, two helicopters flew out from the airbase in juhu, in northwestern bombay. meanwhile, the radio operator in each field would get a set of requirements of the form “move m men from platform x to platform y”. these requirements could be qualified by time windows (e.g., need to reach y by 9am, or not available for pick-up until 8:30am) or priority (e.g., as soon as possible). each chopper would arrive at one of the central platforms and get its instructions for the morning sortie from the radio operator. after doing its rounds for the morning, it would return to the main platform. at lunchtime, it would fly lunchboxes to the crews working at unmanned platforms. for the final sortie of the day, the radio operator would send instructions that would ensure that all the crews were returned safely to their home platforms before the chopper was released to return to bombay for the night.

the challenge for us was to build a computer system that would optimize the use of the helicopter. the requirements were ad hoc, i.e., there was no daily pattern to the movement of men within the field, so the problem was different every day. it was believed that the routes charted by the radio operator were inefficient. given the amount of fuel used in these operations, an improvement of 5% over what they did was sufficient to result in a payback period of 4-6 months for our project.

this was my first exposure to the real world of optimization. a colleague of mine — another IIM-A graduate — and i threw ourselves at this problem. later, we were joined by yet another guy, an immensely bright guy who could make the lowly IBM PC-XT — remember, this was the state-of-the-art at that time — do unimaginable things. i couldn’t have asked to be a member of a team that was better suited to this job.

the solution

we collected all the static data that we thought we would need. we got the latitude and longitude of the on-shore base and of each platform (degrees, minutes, and seconds) and computed the distance between every pair of points on our map (i think we even briefly flirted with the idea of correcting for the curvature of the earth but decided against it, perhaps one of the few wise moves we made). we got the capacity (number of seats) and cruising speed of each of the helicopters.
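
(a small aside: correcting for the curvature of the earth is only a few lines of code today – the haversine formula. the sketch below is a modern python illustration, not the original code, and the coordinates in the example are made up.)

    # great-circle (haversine) distance between two points given in degrees.
    # a modern illustration, not the original code; the example coordinates are made up.
    from math import radians, sin, cos, asin, sqrt

    def haversine_km(lat1, lon1, lat2, lon2):
        # convert to radians, then apply the haversine formula
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        dlat, dlon = lat2 - lat1, lon2 - lon1
        a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
        return 2 * 6371.0 * asin(sqrt(a))     # 6371 km is the mean earth radius

    print(haversine_km(19.07, 72.88, 19.60, 71.30))   # bombay to a made-up platform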

we collected a lot of sample data of actual requirements and the routes that were flown.

we debated the mathematical formulation of the problem at length. we quickly realized that this was far harder than the classical “traveling salesman problem”. in that problem, you are given a set of points on a map and asked to find the shortest tour that starts at any city and touches every other city exactly once before returning to the starting point. in our problem, the “salesman” would pick and/or drop off passengers at each stop. the number he could pick up was constrained, so this meant that he could be forced to visit a city more than once. the TSP is known to be a “hard” problem, i.e., the time it takes to solve it grows very rapidly as you increase the number of cities in the problem. nevertheless, we forged ahead. i’m not sure if we actually completed the formulation of an integer programming problem but, even before we did, we came to the conclusion that this was too hard of a problem to be solved as an integer program on a first-generation desktop computer.

instead, we designed and implemented a search algorithm that would apply some rules to quickly generate good routes and then proceed to search for better routes. we no longer had a guarantee of optimality but we figured we were smart enough to direct our search well and make it quick. we tested our algorithm against the test cases we’d selected and discovered that we were beating the radio operators quite handily.
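
(to give a flavour of “generate good routes quickly, then search for better ones”, here is a toy python sketch – emphatically not the original algorithm. it serves one request at a time, with only a seat constraint, and then tries pairwise swaps of the request order; every name and number in it is invented.)

    # toy sketch: greedy route construction followed by a simple improvement search.
    # not the original algorithm; requests are (pickup, drop, num_men) tuples and
    # dist(a, b) is any distance function between platforms.
    def route_for(order, base):
        # fly base -> pickup -> drop for each request in the given order, then home
        route = [base]
        for pickup, drop, _men in order:
            route += [pickup, drop]
        return route + [base]

    def length(route, dist):
        return sum(dist(a, b) for a, b in zip(route, route[1:]))

    def plan(base, requests, dist, capacity):
        doable = [r for r in requests if r[2] <= capacity]    # seats only, for now
        # greedy: always fly to the nearest next pickup
        order, pos, pending = [], base, list(doable)
        while pending:
            nxt = min(pending, key=lambda r: dist(pos, r[0]))
            pending.remove(nxt)
            order.append(nxt)
            pos = nxt[1]
        # "search for better routes": try swapping pairs of requests in the order
        best = order
        for i in range(len(order)):
            for j in range(i + 1, len(order)):
                cand = best[:]
                cand[i], cand[j] = cand[j], cand[i]
                if length(route_for(cand, base), dist) < length(route_for(best, base), dist):
                    best = cand
        return route_for(best, base)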

then came the moment we’d been waiting for: we finally met the radio operators.

they looked at the routes our program was generating. and then came the first complaint. “your routes are not accounting for refueling!”, they said. no one had told us that the sorties were long enough that you could run out of fuel halfway, so we had not been monitoring that at all!

ONGC’s HAL Dhruv helicopters on sorties off the Mumbai coast. Image by Premshree Pillai via Flickr

so we went back to the drawing board. we now added a new dimension to the search algorithm: it had to keep track of fuel and, if it was running low on fuel during the sortie, direct the chopper to one of the few fuel bases. this meant that some of the routes that we had generated in the first attempt were no longer feasible. we weren’t beating the radio operators quite as easily as before.
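
(conceptually, the fuel constraint added bookkeeping of this sort – the sketch below is only an illustration; the tank range, reserve, and fuel bases are invented, and in the real system the check lived inside the search rather than as a fix-up pass.)

    # illustrative fuel bookkeeping: walk a route, burn range per hop, and divert
    # to the nearest fuel base before running dry. all numbers are invented.
    def add_refueling(route, dist, fuel_bases, tank_km=300.0, reserve_km=30.0):
        fixed, remaining = [route[0]], tank_km
        for a, b in zip(route, route[1:]):
            if remaining - dist(a, b) < reserve_km:
                depot = min(fuel_bases, key=lambda f: dist(a, f))
                fixed.append(depot)                   # divert to the nearest fuel base
                remaining = tank_km - dist(depot, b)  # full tank, minus the hop to b
            else:
                remaining -= dist(a, b)
            fixed.append(b)
        return fixed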

we went back to the users. they took another look at our routes. and then came their next complaint: “you’ve got more than 7 people on board after refueling!”, they said. “but it’s a 12-seater!”, we argued. it turns out they had a point: these choppers had a large fuel tank, so once they topped up the tank — as they always do when they stop to refuel — they were too heavy to take a full complement of passengers. this meant that the capacity of the chopper was two-dimensional: seats and weight. on a full tank, weight was the binding constraint. as the fuel burned off, the weight constraint eased; beyond a certain point, the number of seats became the binding constraint.
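
(in code, that two-dimensional capacity boils down to something like the sketch below. every number here is invented; the only point is that on a full tank the weight constraint binds, and as fuel burns off the seat count takes over.)

    # effective capacity: the lesser of seats and what the weight budget allows.
    # all numbers are invented for illustration.
    SEATS = 12
    MAX_PAYLOAD_KG = 2000.0        # budget shared by fuel and passengers
    KG_PER_PASSENGER = 90.0

    def effective_capacity(fuel_kg):
        by_weight = int((MAX_PAYLOAD_KG - fuel_kg) / KG_PER_PASSENGER)
        return max(0, min(SEATS, by_weight))

    print(effective_capacity(fuel_kg=1500.0))   # full tank: weight binds (5 seats here)
    print(effective_capacity(fuel_kg=400.0))    # light tank: all 12 seats available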

we trooped back to the drawing board. “we can do this!”, we said to ourselves. and we did. remember, we were young and smart. and too stupid to see where all this was going.

in our next iteration, the computer-generated routes were coming closer and closer to the user-generated ones. mind you, we were still beating them on an average but our payback period was slowly growing.

we went back to the users with our latest and greatest solution. they looked at it. and they asked: “which way is the wind blowing?” by then, we knew not to ask “why do you care?” it turns out that helicopters always land and take off into the wind. for instance, if the chopper was flying from x to y and the wind was blowing from y to x, the setting was perfect. the chopper would take off from x in the direction of y and make a bee-line for y. on the other hand, if the wind was also blowing from x to y, it would take off in a direction away from y, do a 180-degree turn, fly toward and past y, do yet another 180-degree turn, and land. given that, it made sense to keep the chopper generally flying a long string of short hops into the wind. when it could go no further because the fuel was running low, or it needed to go no further in that direction because there were no passengers on board headed that way, then and only then did it make sense to turn around and make a long hop back.
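
(in effect, the “distance” from x to y was no longer the same as from y to x. a crude way to model it is to charge a turnaround penalty on downwind hops, as in the sketch below – the penalty constant is invented, purely for illustration.)

    # asymmetric "effective distance": downwind hops pay a turnaround penalty,
    # since the chopper must overshoot and double back to land into the wind.
    # the penalty is an invented constant, purely for illustration.
    from math import cos, radians

    TURNAROUND_PENALTY_KM = 4.0

    def effective_km(a, b, dist, bearing, wind_from_deg):
        # bearing(a, b): compass direction of travel from a to b, in degrees
        into_wind = cos(radians(bearing(a, b) - wind_from_deg)) > 0
        return dist(a, b) + (0.0 if into_wind else TURNAROUND_PENALTY_KM)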

“bloody asymmetric distance matrix!”, we mumbled to ourselves. by then, we were beaten and bloodied but unbowed. we were determined to optimize these chopper routes, come hell or high water!

so back we went to our desks. we modified the search algorithm yet another time. by now, the code had grown so long that our program broke the limits of the editor in turbo pascal. but we soldiered on. finally, we had all of our users’ requirements coded into the algorithm.

or so we thought. we weren’t in the least bit surprised when, after looking at our latest output, they asked “was this in summer?”. we had now grown accustomed to this. they explained to us that the maximum payload of a chopper is a function of ambient temperature. on the hottest days of summer, choppers have to fly light. on a full tank, a 12-seater may now only accommodate 6 passengers. we were ready to give up. but not yet. back we went to our drawing board. and we went to the field one last time.

in some cases, we found that the radio operators were doing better than the computer. in some cases, we beat them. i can’t say no creative accounting was involved, but we did manage to eke out a few percentage points of improvement over the manually generated routes.

epilogue

you’d think we’d won this battle of attrition. we’d shown that we could accommodate all of their requirements. we’d proved that we could do better than the radio operators. we’d taken our machine to the radio operator’s cabin on the platform and installed it there.

we didn’t realize that the final chapter hadn’t been written. a few weeks after we’d declared success, i got a call from ONGC. apparently, the system wasn’t working. no details were provided.

i flew out to the platform. i sat with the radio operator as he grudgingly input the requirements into the computer. he read off the output from the screen and proceeded with his job. after the morning sortie was done, i retired to the lounge, glad that my work was done.

a little before lunchtime, i got a call from the radio operator. “the system isn’t working!”, he said. i went back to his cabin. and discovered that he was right. it is not that our code had crashed. the system wouldn’t boot. when you turned on the machine, all you got was a lone blinking cursor on the top left corner of the screen. apparently, there was some kind of catastrophic hardware failure. in a moment of uncommon inspiration, i decided to open the box. i fiddled around with the cards and connectors, closed the box, and fired it up again. and it worked!

it turned out that the radio operator’s cabin was sitting right atop the industrial-strength laundry room of the platform. every time they turned on the laundry, everything in the radio room would vibrate. there was a pretty good chance that our PC would regress to a comatose state every time they did the laundry. i then realized that this was a hopeless situation. can i really blame a user for rejecting a system that was prone to frequent and total failures?

other articles in this series

this blog entry is intended to set the stage for a series of short explorations related to the application of optimization. i’d like to share what i’ve learned over a career spent largely in the business of applying optimization to real-world problems. interestingly, there is a lot more to practical optimization than models and algorithms. each of the links below leads to a piece that dwells on one particular aspect.

optimization: a case study (this article)
architecture of a decision-support system
optimization and organizational readiness for change
optimization: a technical overview

About the author – Dr. Narayan Venkatasubramanyan

Dr. Narayan Venkatasubramanyan has spent over two decades applying a rare combination of quantitative skills, business knowledge, and the ability to think from first principles to real-world business problems. He currently consults in several areas including supply chain and health care management. As a Fellow at i2 Technologies, he tackled supply chain problems in areas as diverse as computer assembly, semiconductor manufacturing, consumer goods, steel, and automotive. Prior to that, he worked with several airlines on their aircraft and crew scheduling problems. He topped off his days at IIT-Bombay and IIM-Ahmedabad with a Ph.D. in Operations Research from the University of Wisconsin-Madison.

He is presently based in Dallas, USA and travels extensively all over the world during the course of his consulting assignments. You can also find Narayan on Linkedin at: http://www.linkedin.com/in/narayan3rdeye


Data management and data quality in business intelligence

I am liveblogging CSI Pune‘s lecture on Data Management and Data Quality in Business Intelligence, by Ashwin Deokar of SAS R&D Pune.

Huge amounts of data are being generated these days, by different technologies (from databases to RFID tags and GPS units), on different platforms (PCs, servers, cellphones), from different vendors. And all this data is often duplicated and inconsistent. All of this data needs to be collected in one place and cleaned up.

Why? Three reasons:

  • Competitive business environment: With better, and more granular data, you can increase your profits, and reduce costs. For example, Walmart forcing RFID tags on all items that are supplied to them by suppliers – and tracking their locations for very accurate and up-to-date inventory control
  • Regulatory and Compliance requirements: e.g. US government has seriously strict data gathering and storage requirements for hospitals (HIPAA). If you can’t generate this data, you go to jail. That certainly reduces your ability to increase profits.
  • Adherence to Industry standards: If you can’t produce and consume data in the format that everybody else understands, you can’t play with the big boys

The key areas of study in this area are:

  • Data governance: Policies that govern the use of data in an organization. Done usually from the point of view of increasing data security (prevent hackers from getting in, prevent data from leaking out inadvertently), ensuring compliance with regulations, and optimal use of data for organizational growth.
  • Data architecture and design: Overall architecture – data storage, ETL process design, BI architecture, etc.
  • Database management: Since there are huge quantities of data, making a mistake here will pretty much doom the whole project to failure through overload. Which database? Optimizing the performance. Backup, recovery, integrity management, etc.
  • Data security: Who should have access? Which data needs to be kept private?
  • Data quality: Lots of work needed to ensure that there is a single version of the truth in the data warehouse. Especially difficult for non-transactional data (i.e. data that is not there in a database). e.g. Ashwin Deokar is the same as A.P. Deokar. Need fancy software that will do these transformations on the data.
  • Data Warehousing and Business Intelligence: What this component does is covered in a previous PuneTech article.

Data Quality. Why this is an important problem:

  • 96000 IRS tax refund cheques did not get delivered because of incorrect addresses.
  • An acquiring company, which acquired another company mainly for its customer base, found that the acquisition was vastly overvalued – because they got 50% fewer customers than expected, due to duplicates in the database.
  • A cable company lost $500,000 because a mislabeled shipment resulted in a cable being laid at a wrong location.
  • A man defrauded a CD company by taking their “introductory” offer (of free CDs) over 1600 times, by registering that many different accounts with different addresses. Since he did not really have that many different addresses, he managed to fool their computers by making slightly different addresses using minor changes like extra punctuation marks, fictitious apartment numbers, slightly different spellings, etc. Total damage: $250,000.

There is a process – a combination of automated algorithms and human assistance – to help with improving data quality. And it is not just about duplicate data, or incorrect data. You also need to worry about missing data, and fetching it from the appropriate “other” sources.

What do you do?

  • Clean up your data by standardizing it using rules – have canonical spellings for names, addresses, etc.
  • Use fancy algorithms to detect duplicates which are not obvious by just looking at the strings. For example, “IBM” and “International Business Machines” do not look similar. But if they have the same address, same number of employees, etc., then you can say they are the same. (And you can have thresholds that adjust the sensitivity of this matching process.) See the matching sketch after this list.
  • Use external data to clean up the data for de-duplication. For example, US Postal service publishes a CD of every valid address in the US. Businesses buy that CD and use that to convert all their address data to this standard format. That will result in major de-duplication.
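
As a toy illustration of the threshold-based matching described in the list above (these are not SAS’s actual algorithms), the sketch below scores two records on fuzzy name similarity plus exact matches on other attributes, and flags them as duplicates above a tunable threshold. The records, field names, weights, and threshold are all made up.

    # toy record-matching sketch: fuzzy name similarity combined with other
    # attributes and a tunable threshold. Not SAS's algorithms; weights are made up.
    from difflib import SequenceMatcher

    def name_similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def match_score(rec1, rec2):
        score = 0.5 * name_similarity(rec1["name"], rec2["name"])
        score += 0.3 * (rec1["address"] == rec2["address"])
        score += 0.2 * (rec1["employees"] == rec2["employees"])
        return score

    r1 = {"name": "IBM", "address": "Armonk NY", "employees": 400000}
    r2 = {"name": "International Business Machines", "address": "Armonk NY", "employees": 400000}
    THRESHOLD = 0.55     # raise it for fewer false merges, lower it to catch more dupes
    print(match_score(r1, r2) >= THRESHOLD)   # True for this pair: same address and size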

SAS provides tools for all the steps in this process. And since it has all the pieces, it can ensure that there is a single meta-data repository across all the steps in the process – which is a huge advantage. SAS has the best ETL tools. It also has offerings in analytics and BI. It has OLAP capabilities, but it really excels in business intelligence applications.

SAS R&D Pune has engineers working on various core products that are used in this area – meta-data, ETL, BI components. It also has a consulting group that helps companies deploy SAS products and use them – and that ends up working on all the parts of the data management / data quality process.