Category Archives: In Depth

Technology overview – Druvaa Continuous Data Protection

Druvaa, a Pune-based product startup that makes data protection (i.e. backup and replication) software targeted at the enterprise market, has been all over the Indian startup scene recently. It was one of the few Pune startups to be funded in recent times (Rs. 1 crore by Indian Angel Network). It was one of the three startups that won the TiE-Canaan Entrepreneurial Challenge in July this year. And it was one of the three startups chosen to present at the showcase of emerging product companies at the NASSCOM Product Conclave 2008.

And this attention is not confined to national boundaries. It is one of only two Pune-based companies (as far as I know) to be featured in TechCrunch (actually TechCrunchIT), one of the most influential tech blogs in the world (the other Pune company featured there is Pubmatic).

Why all this attention for Druvaa? Other than the fact that it has a very strong team that is executing quite well, I think two things stand out:

  • It is one of the few Indian product startups targeting the enterprise market. This is a very difficult market to break into, both because of the risk-averse nature of the customers and because of the very long sales cycles.
  • Unlike many other startups (especially consumer-oriented web-2.0 startups), Druvaa’s products require some seriously difficult technology.

Rediff has a nice interview with the three co-founders of Druvaa, Ramani Kothandaraman, Milind Borate and Jaspreet Singh, which you should read to get an idea of their background, why they started Druvaa, and their journey so far. Druvaa also has a very interesting and active blog where they talk technology; it is worth reading on a regular basis.

The rest of this article talks about their technology.

Druvaa has two main products. Druvaa inSync allows enterprise desktop and laptop PCs to be backed up to a central server with over 90% savings in bandwidth and disk storage utilization. Druvaa Replicator allows replication of data from a production server to a secondary server near-synchronously and non-disruptively.

We now dig deeper into each of these products to give you a feel for the complex technology that goes into them. If you are not really interested in the technology, skip to the end of the article and come back tomorrow, when we’ll be back to talking about Google keyword searches, web-2.0 and other such things.

Druvaa Replicator

Overall schematic set-up for Druvaa Replicator

This is Druvaa’s first product, and it is a good example of how something that seems simple to you and me can become insanely complicated when the customer is an enterprise. The problem seems rather simple: imagine an enterprise server that needs to be on, serving customer requests, all the time. If this server crashes for some reason, there needs to be a standby server that can immediately take over. That is the easy part. The problem is that the standby server needs to have a copy of all the latest data, so that no data is lost (or at least very little data is lost). To do this, the replication software continuously copies all the latest updates of the data from the disks on the primary server side to the disks on the standby server side.

This is much harder than it seems. A naive implementation would simply ensure that every write of data done on the primary is also done on the standby storage at the same time (synchronously). This is unacceptable because each write would then take far too long, which would slow down the primary server too much.

If you are not doing synchronous updates, you need to start worrying about write order fidelity.

Write-order fidelity and file-system consistency

If a database writes a number of pages to the disk on your primary server, and if you have software that is replicating all these writes to a disk on a stand-by server, it is very important that the writes should be done on the stand-by in the same order in which they were done at the primary servers. This section explains why this is important, and also why doing this is difficult. If you know about this stuff already (database and file-system guys) or if you just don’t care about the technical details, skip to the next section.

Imagine a bank database. Account balances are stored as records in the database, which are ultimately stored on the disk. Imagine that I transfer Rs. 50,000 from Basant’s account to Navin’s account. Suppose Basant’s account had Rs. 3,00,000 before the transaction and Navin’s account had Rs. 1,00,000. So, during this transaction, the database software will end up doing two different writes to the disk:

  • write #1: Update Basant’s bank balance to 2,50,000
  • write #2: Update Navin’s bank balance to 1,50,000

Let us assume that Basant and Navin’s bank balances are stored on different locations on the disk (i.e. on different pages). This means that the above will be two different writes. If there is a power failure after write #1, but before write #2, then the bank will have reduced Basant’s balance without increasing Navin’s balance. This is unacceptable. When the database server restarts after power is restored, it will have lost Rs. 50,000.

After write #1, the database (and the file-system) is said to be in an inconsistent state. After write #2, consistency is restored.

It is always possible that at the time of a power failure, a database might be inconsistent. This cannot be prevented, but it can be cured. For this, databases typically do something called write-ahead-logging. In this, the database first writes a “log entry” indicating what updates it is going to do as part of the current transaction. And only after the log entry is written does it do the actual updates. Now the sequence of updates is this:

  • write #0: Write this log entry “Update Basant’s balance to Rs. 2,50,000; update Navin’s balance to Rs. 1,50,000” to the logging section of the disk
  • write #1: Update Basant’s bank balance to 2,50,000
  • write #2: Update Navin’s bank balance to 1,50,000

Now if the power failure occurs between writes #0 and #1 or between #1 and #2, then the database has enough information to fix things later. When it restarts, before the database becomes active, it first reads the logging section of the disk and checks whether all the updates that were claimed in the logs have actually happened. In this case, after reading the log entry, it needs to check whether Basant’s balance is actually 2,50,000 and Navin’s balance is actually 1,50,000. If they are not, the database is inconsistent, but it has enough information to restore consistency. The recovery procedure consists of simply going ahead and making those updates. After these updates, the database can continue with regular operations.
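As a toy illustration of the recovery idea (a simplification in the same spirit as the description above, not any real database’s implementation), here is a minimal sketch:

```python
# Toy write-ahead-logging recovery sketch (illustration only, not a real database).
# The "disk" is a dict of account -> balance; the "log" records intended updates.

def write_transaction(log, disk, updates):
    # write #0: append the log entry *before* touching the data pages
    log.append(dict(updates))
    # writes #1, #2, ...: apply the actual updates (a crash may interrupt these)
    for account, new_balance in updates.items():
        disk[account] = new_balance

def recover(log, disk):
    # On restart, replay every logged update whose effect is not on disk yet.
    for entry in log:
        for account, new_balance in entry.items():
            if disk.get(account) != new_balance:
                disk[account] = new_balance  # redo the missing write

# Simulate a crash after write #1 but before write #2:
disk = {"basant": 300_000, "navin": 100_000}
log = [{"basant": 250_000, "navin": 150_000}]   # write #0 made it to disk
disk["basant"] = 250_000                        # write #1 made it, write #2 did not
recover(log, disk)
assert disk == {"basant": 250_000, "navin": 150_000}
```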

(Note: This is a huge simplification of what really happens, and has some inaccuracies – the intention here is to give you a feel for what is going on, not a course lecture on database theory. Database people, please don’t write to me about the errors in the above – I already know; I have a Ph.D. in this area.)

Note that in the above scheme the order in which writes happen is very important. Specifically, write #0 must happen before #1 and #2. If for some reason write #1 happens before write #0 we can lose money again. Just imagine a power failure after write #1 but before write #0. On the other hand, it doesn’t really matter whether write #1 happens before write #2 or the other way around. The mathematically inclined will notice that this is a partial order.

Now if there is replication software that is replicating all the writes from the primary to the secondary, it needs to ensure that the writes happen in the same order. Otherwise the database on the stand-by server will be inconsistent, and can result in problems if suddenly the stand-by needs to take over as the main database. (Strictly speaking, we just need to ensure that the partial order is respected. So we can do the writes in this order: #0, #2, #1 and things will be fine. But #2, #0, #1 could lead to an inconsistent database.)
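To make the partial-order idea concrete, here is a tiny sketch (not Druvaa’s algorithm; the write labels and constraints are just the bank example above) that checks whether a replay order at the standby respects a set of “must happen before” constraints:

```python
# Check whether a replay order respects a partial order of writes.
# Constraints are (earlier, later) pairs, e.g. write #0 must precede #1 and #2.

def respects_partial_order(replay_order, constraints):
    position = {write: i for i, write in enumerate(replay_order)}
    return all(position[a] < position[b] for a, b in constraints)

constraints = [("#0", "#1"), ("#0", "#2")]   # the bank example above
print(respects_partial_order(["#0", "#2", "#1"], constraints))  # True  - fine
print(respects_partial_order(["#2", "#0", "#1"], constraints))  # False - inconsistent standby
```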

Replication software that ensures this is said to maintain write order fidelity. A large enterprise that runs mission critical databases (and other similar software) will not accept any replication software that does not maintain write order fidelity.

Why is write-order fidelity difficult?

I can hear you muttering, “Ok, fine! Do the writes in the same order. Got it. What’s the big deal?” Turns out that maintaining write-order fidelity is easier said than done. Imagine that your database server has multiple CPUs. The different writes are being done by different CPUs. And the different CPUs have different clocks, so the timestamps used by them are not necessarily in sync. Multiple CPUs are now the default in server-class machines. Further imagine that the “logging section” of the database is actually stored on a different disk. For reasons beyond the scope of this article, this is the recommended practice. So, the situation is that different CPUs are writing to different disks, and the poor replication software has to figure out what order this was done in. It gets even worse when you realize that the disks are not simple disks, but complex disk arrays that have a whole lot of intelligence of their own (and hence might not write in the order you specified), and that there is a volume manager layer on the disk (which can be doing striping and RAID and other fancy tricks) and a file-system layer on top of the volume manager layer that is doing buffering of the writes, and you begin to get an idea of why this is not easy.

Naive solutions to this problem, like using locks to serialize the writes, result in unacceptable degradation of performance.

Druvaa Replicator has patent-pending technology in this area, where they are able to automatically figure out the partial order of the writes made at the primary, without significantly increasing the overheads. In this article, I’ve just focused on one aspect of Druvaa Replicator, just to give an idea of why this is so difficult to build. To get a more complete picture of the technology in it, see this white paper.

Druvaa inSync

Druvaa inSync is a solution that allows desktops/laptops in an enterprise to be backed up to a central server. (The central server is also in the enterprise; imagine the central server being in the head office, and the desktops/laptops spread out over a number of satellite offices across the country.) The key features of inSync are:

  • The amount of data being sent from the laptop to the backup server is greatly reduced (often by over 90%) compared to standard backup solutions. This results in much faster backups and lower consumption of expensive WAN bandwidth.
  • It stores all copies of the data, and hence allows timeline-based recovery. You can recover any version of any document as it existed at any point of time in the past. Imagine you plugged in your friend’s USB drive at 2:30pm, and that resulted in a virus that totally screwed up your system. Simply use inSync to restore your system to the state that existed at 2:29pm and you are done. This is possible because Druvaa backs up your data continuously and automatically. This is far better than having to restore from last night’s backup and losing all data from this morning.
  • It intelligently senses the kind of network connection that exists between the laptop and the backup server, and correspondingly throttles its own usage of the network (possibly based on customer policies) to ensure that it does not interfere with the customer’s YouTube video browsing habits. (A rough sketch of this kind of throttling follows below.)
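Druvaa has not published the internals of its policy engine, so as a rough illustration of the kind of throttling described in the last bullet, here is a minimal token-bucket sketch; the rate numbers and link types are made up:

```python
import time

# Minimal token-bucket throttle sketch (illustration; not inSync's actual policy engine).
class Throttle:
    def __init__(self, rate_bytes_per_sec):
        self.rate = rate_bytes_per_sec
        self.tokens = rate_bytes_per_sec
        self.last = time.monotonic()

    def wait_for(self, nbytes):
        # Refill tokens based on elapsed time, then sleep until we can send nbytes.
        while True:
            now = time.monotonic()
            self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)

# A policy might map the detected link type to an allowed backup rate, e.g.:
POLICY = {"lan": 10_000_000, "wan": 500_000, "vpn": 100_000}   # bytes/sec (made-up numbers)
throttle = Throttle(POLICY["wan"])
for chunk in [b"x" * 64_000] * 5:       # pretend these are backup chunks
    throttle.wait_for(len(chunk))       # pace the upload
    # send(chunk) would go here
```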

Data de-duplication

Overview of Druvaa inSync. 1. Fingerprints computed on laptop sent to backup server. 2. Backup server responds with information about which parts are non-duplicate. 3. Non-duplicate parts compressed, encrypted and sent.

Let’s dig a little deeper into the claim of a 90% reduction in data transfer. The basic technology behind this is called data de-duplication. Imagine an enterprise with 10 employees. All their laptops have been backed up to a single central server. At this point, de-duplication software can realize that a lot of data has been duplicated across the different backups, i.e. the 10 different backups contain a lot of files that are common: most of the files in the C:\WINDOWS directory, or all those large PowerPoint documents that got mail-forwarded around the office. In such cases, the de-duplication software can save disk space by keeping just one copy of the file and deleting all the other copies. In place of the deleted copies, it stores a shortcut indicating that if this user tries to restore this file, it should be fetched from the other backup and then restored.

Data de-duplication doesn’t have to be at the level of whole files. Imagine a long and complex document you created and sent to your boss. Your boss simply changed the first three lines and saved it under a different name. These files have different names and different contents, but most of the data (other than the first few lines) is the same. De-duplication software can detect such copies of the data too, and is smart enough to store only one copy of this document in the first backup, and just the differences in the second backup.

The way to detect duplicates is through a mechanism called document fingerprinting. Each document is broken up into smaller chunks. (How to determine what constitutes one chunk is an advanced topic beyond the scope of this article.) Then a short “fingerprint” is created for each chunk. A fingerprint is a short string (e.g. 16 bytes) that is uniquely determined by the contents of the entire chunk. The computation of a fingerprint is done in such a way that if even a single byte of the chunk is changed, the fingerprint changes. (It’s something like a checksum, but a little more complicated, to ensure that two different chunks cannot accidentally have the same fingerprint.)

All the fingerprints of all the chunks are then stored in a database. Now, every time a new document is encountered, it is broken up into chunks, fingerprints are computed, and these fingerprints are looked up in the database. If a fingerprint is found in the database, then we know that this particular chunk already exists somewhere in one of the backups, and the database will tell us the location of the chunk. Now this chunk in the new file can be replaced by a shortcut to the old chunk. Rinse. Repeat. And we get 90% savings of disk space. The interested reader is encouraged to google Rabin fingerprinting, shingling and rsync for hours of fascinating algorithms in this area. Before you know it, you’ll be trying to figure out how to use these techniques to find out who is plagiarising your blog content on the internet.
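Here is a simplified sketch of the fingerprint-lookup flow, using fixed-size chunks and SHA-1 digests; inSync’s actual formats are not public, and real systems use content-defined chunking (e.g. Rabin fingerprinting) rather than the fixed-size chunks assumed here:

```python
import hashlib

CHUNK_SIZE = 4096          # fixed-size chunks; real systems use content-defined chunking
fingerprint_db = {}        # fingerprint -> location of the stored chunk
store = []                 # our pretend backup storage

def backup(data: bytes):
    recipe = []            # how to reconstruct this file: a list of chunk locations
    new_bytes = 0
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        fp = hashlib.sha1(chunk).digest()      # 20-byte fingerprint of the chunk
        if fp not in fingerprint_db:           # never seen: store it and remember where
            fingerprint_db[fp] = len(store)
            store.append(chunk)
            new_bytes += len(chunk)
        recipe.append(fingerprint_db[fp])      # duplicates become a cheap reference
    return recipe, new_bytes

doc = b"A long report. " * 1000
recipe1, sent1 = backup(doc)                   # first backup sends everything (15000 bytes)
edited = b"X" * 15 + doc[15:]                  # same-length edit to the first few bytes
recipe2, sent2 = backup(edited)                # only the changed chunk is sent (4096 bytes)
print(sent1, sent2)
# Note: fixed-size chunks only catch edits that don't shift the data, which is exactly
# why production systems use content-defined chunk boundaries instead.
```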

Back to Druvaa inSync. inSync does the fingerprinting on the laptop itself, before the data is sent to the central server. So it is able to detect duplicate content before it gets sent over the slow and expensive network connection, saving time and bandwidth. This is in contrast to most other systems, which do de-duplication as a post-processing step at the server. At a Fortune 500 customer site, inSync was able to reduce the backup time from 30 minutes to 4 minutes, and the disk space required on the server went down from 7TB to 680GB. (source.)

Again, this was just one example used to give an idea of the complexities involved in building inSync. For more information on other distinguishing features, check out the inSync product overview page.

Have questions about the technology, or about Druvaa in general? Ask them in the comments section below (or email me). I’m sure Milind/Jaspreet will be happy to answer them.

Also, this long, tech-heavy article was an experiment. Did you like it? Was it too long? Too technical? Do you want more articles like this, or less? Please let me know.


Interview with Mayank Jain – Co-founder of ApnaBill.com


It’s the middle of the night, your prepaid phone has run out of credit, and you need to make a call urgently. Don’t you wish you could recharge your prepaid mobile over the internet? Pune-based startup ApnaBill allows you to do just that. Fire up a browser, select your operator (they have partnerships with all major service providers), pay from your bank account or by credit card, and receive an SMS/e-mail with the recharge PIN. Done. They have extended this model to satellite TV (TataSky, Dish), with more such services in the pipeline.

PuneTech interviewed co-founder and lead developer Mayank Jain, who talks about various things: technical challenges (does your hosting provider have an upper limit on the number of emails you can send out per day?), unexpected problems that will slow down your startup (PAN card!), and advice for other budding entrepreneurs (start the paperwork for registration/bank accounts as soon as possible).

On to the interview.

Overview of ApnaBill:

Simply put, ApnaBill.com is an online service for facilitating prepaid and postpaid utility bill payments.

Available right now are prepaid utility bill payments, such as prepaid mobile recharge and prepaid vouchers for Tata Sky, World Phone, Dish TV, etc.

Organizationally, ApnaBill.com is an offshoot of Four Fractions. It aims to be the single point of contact between service providers and customers, thereby minimizing transactional costs. The benefit of this is passed directly on to our customers, as we do NOT charge our customers any transaction fees. It’s an ApnaBill.com policy and applies to our entire product line.

Apart from regular utility bill payments, we are also exploring some seemingly blue-ocean verticals which have not been targeted by the online bill payment sector – yet.

Monetization strategy:

We have structured our business model such that, despite absorbing the transaction cost, we’ll be able to make profits. Margins would definitely be low, but the sheer volume of transactions (which we expect to attract because of our no-transaction-charge policy) should put our figures in positive territory.

Moreover, profit generated from transactions is just one revenue source. Once we have good traction, our advertisement revenue sources will also become viable.

We are definitely looking at long-term brand building.

Technical Challenges – Overview

Contrary to popular belief, technology is generally the simplest ingredient in a startup – especially because the startup can generally exercise full control over how it is used and deployed. And with increasingly cheaper computing resources, this space is becoming even smoother.

However, the following problems were real challenges that we faced and solved.

  • Being a web 2.0 startup, we faced some major cross-browser issues.
  • Mail capping limits for shared hosting accounts.
  • Minimizing client-side page load and display times.
  • Database versioning.

Thankfully, ApnaBill.com runs Ruby on Rails under the hood – and all the solutions we designed just fit into the right grooves.

Technical Challenges – Details

Ruby on Rails is one of the best frameworks a web developer can ask for. Solutions to all of the above problems come bundled with it.

The Prototype javascript library solves a lot of common cross-browser issues. To completely eradicate them, we added the PNG hack from Pluit Solutions and IE7.js, which lets the IE6 browser render PNG images with transparency. Once you have sanity in terms of cross-browser issues, you can actually start focusing on feature development.

To overcome the mail capping limits of shared hosts, we devised our own modules to schedule mails if they were about to cross the mail caps. However, we later discovered that there’s a great Ruby gem – ar_mailer – that does just that. We are planning to make the shift.

Minimizing client-side page load times was an interesting problem. We used Yahoo’s YSlow to detect where we lagged in terms of page load speed, and introduced the necessary changes – like moving JS to the bottom of pages and CSS to the top – which helped us a lot in reducing load times. Yahoo also has a JS minifier – YUI Compressor – which works great, reducing JavaScript files to up to 15% of their original size. We also deployed a simple page-name-based JS deployment scheme which blocks any JavaScript from loading on particular pages (for example the homepage). This gives us ultra-fast page loads.

If you look at our homepage, no JS loads while the page is loading. However, once the page has loaded, we initiate a delayed JS load which renders our news feed at the end.

Database versioning is a built-in feature of Rails. We can effectively revert to any version of ApnaBill.com (in terms of functionality) with standard Rails framework procedures.

Non-technical challenges:

Integrating various vendors and services was clearly the biggest challenge we overcame during the (almost) 9-month development cycle of ApnaBill.com.

Getting the organization up and running was another big challenge. The paperwork takes a lot of valuable time – which, if planned properly, can be minimized to a manageable amount.

Payment gateways are a big mess for startups. They are costly, demand huge chunks of money as security deposits, and have very high transaction costs. Those that are cheap lack even basic courtesy and quality of service. Sooner or later, the backbone of your business becomes the single most painful factor in your business process – especially when you have no control over its functioning.

Thankfully, there are a few payment gateways which are above all of this. We hope to make an announcement soon.

The founders of ApnaBill - from left, Mayank, Sameer and Sandeep.

The process of founding ApnaBill:

When and how did you get the idea of founding ApnaBill? How long before you finally decided to take the plunge and start in earnest? What is your team like now?

The story described at http://www.fourfractions.com/main/our-story is very true.

In June 2007, one of the founding members of Four Fractions saw a friend of his cribbing about how he could not recharge his prepaid mobile phone from the comfort of his home. He had to walk about 1 km to the nearest local shop to get his phone connection recharged.

The idea caught the founder’s attention and he, along with others, formed Four Fractions on 20th December ’07 to launch ApnaBill.com as one of their flagship products.

ApnaBill.com was opened for public transactions on 15th June 08. The release was a birthday present to ApnaBill.com’s co-founder’s mom.

Our team is now 5 people strong, spread across New Delhi and Pune. As of now, we are self funded and are actively looking for seed funding.

What takes most of the time:

As I mentioned earlier, getting various services integrated took most of the time. If we had to just push out our own product (minus all collaborations), it would have taken us less than 3 months.

There was this funny thing that set us back by almost 1 month…

We applied for a PAN card for Four Fractions. First, our application somehow got lost in the process. Then someone in the government department managed to put down our address as 108 when it was supposed to be 10 B (8 and B look very similar).

None of us ever envisioned this – but it happened. We lost a precious month sorting this issue out. And since all activities were dependent on official papers, other things like bank accounts, payment gateway integrations etc. also got pushed back. But I am glad we sorted it out in the end. Our families supported us through this all the way.

Processes like creating bank accounts, getting PAN cards etc. are still very slow and manual in nature. If we can somehow improve them, the ecosystem can prove very helpful for budding startups.

About the co-founders:

There are 3 co-founders of ApnaBill.com:

Sameer Jain: Sameer is the brain behind our revenue generation streams and marketing policies. He is a postgraduate from Delhi University in International Marketing.

Sandeep Kumar: Sandeep comes from a (technical) billing background. He brings vast knowledge of billing processes and solid database know-how.

Myself (Mayank Jain): I come from a desktop application development background. I switched to Ruby on Rails almost 18 months ago – and since then, I have been a devoted Ruby evangelist and Rails developer.

Luckily, we have a team which is just right. We have two opposite poles – Sandeep and Sameer. One of them is constantly driving the organization to minimize costs, while the other is driven towards maximizing revenue from all possible sources. I act as the glue between the two of them. Together, we are constantly driving the organization forward.

About selection for proto.in:

Proto.in was the platform we had been preparing for, for almost 2 months. We decided our launch dates in such a way that we would launch and be LIVE just in time for Proto.in.

Being recognized for your efforts is a big satisfaction.

Proto.in was also a huge learning experience. Interacting directly with our potential users gave us an insight into how they perceive ApnaBill.com and what they want out of it. We also came across some interesting revenue generation ideas while interacting with the startup veterans at Proto.

A big thanks to Vijay Anand and the Proto Team.

Advice for other potential entrepreneurs:

There are a lot of people who are currently doing a job somewhere, but who harbor a desire to start something on their own. Since you have already gone that route, what suggestions would you have for them?

Some tips I would like to share with my fellow budding entrepreneurs…

  • Focus, focus and focus!
  • If you are an internet startup, book your domain before anything else and get the right hosting partner.
  • Start the paperwork for firm/bank account registration as soon as possible.
  • Write down your financial/investment plan on paper before you start. Some plan is way better than no plan!
  • Adopt a proper development process for the tech team. With a process in place, development activities can be tracked rationally.
  • Get someone to manage your finances – outsourcing is a very attractive option.

The most important thing for a startup, above everything else, is to keep fighting through the adverse scenarios. Almost everything will spring up in your face as a problem. But a team that can work together to find a solution makes it to the end.

Just remember, more than the destination, it is the journey that would count.


PMC launches participatory budgeting

The Pune Municipal Corporation has a scheme to include citizens’ suggestions in the budget for the year. Anybody who has an idea for work that can be carried out in their ward/locality can download a form, fill it out and submit it at their ward office, nearest multi-utility citizen kiosk location, or citizen facilitation centre.

Only projects that pertain to a neighborhood or locality and do not involve city-level infrastructure may be suggested; the suggested work has to be under ward office purview. The suggested project cost should preferably be within Rs. 5 lakhs. Examples of the kinds of work you can suggest: pavements / water supply / drainage / bus stops (in consultation with PMT) / parks and gardens (repair works only) / bhawans (repair works only) / public toilets / lights (road / traffic) / roads (resurfacing only). Examples of work that is NOT acceptable: pedestrian bridges / speed breakers (prohibited by the Supreme Court) / new gardens / construction on land not owned by PMC.

The deadline for form/map submission is 10th September 2008.

All citizens should take a copy of the submitted project form to the office and make sure to get a form ID and ‘receipt’ of the submission.

Obviously, not all the suggestions will be accepted. However, various groups and NGOs will be monitoring the process to try and ensure that at the very least, information about why projects were accepted or rejected will be made available to the public after the budgeting process is over.

For more information, see the entry for participatory budgeting in the Pune Government wiki. In general, the Pune Government wiki is a very interesting place to hang out. It is just a few weeks old, and there is already a lot of interesting information uploaded there.

There is only partial “tech” content in this post, since technology is being used to disseminate the information (the wiki, and downloadable forms). There are also plans afoot to make some of the submitted proposals browseable on a map of Pune, with the help of Pune-based SadakMap. However, forms still have to be submitted in person – that process has not gone online yet. Hopefully, that can happen next year.

Please help make this initiative a success. Forward this article to people you know who might be interested.


How social|median is developed out of Pune

Jason Goldberg is a serial entrepreneur, who founded and headed Jobster, and who is now on to his next startup, social|median, a social news website. In a long article on his blog, he talks about what lessons he learnt from his first startup, and what he is doing differently at social|median as a result. The whole article is very interesting, and I would say a must read for budding entrepreneurs (Update: unfortunately, the website seems to be gone, and the original article is no longer available). However, most interesting to me is the fact that, although Jason is based in New York, his entire development team is in Pune, with True Sparrow Systems.

He talks about why he set up development of social|median the way he did:

  • Second […] we decided to build on a tight budget. Now, don’t get me wrong, I’m not talking cheap as in 1 guy in a dorm room. I’m talking low budget as in constraining the company to <$40k/month of burn in the first 4 months and then only taking it beyond that to about $60k/month once we had shown some early initial traction. The notion here was that spending our cash is the same as spending our equity. The more we spend early on, the less the company will be worth in the long run.
  • Maintaining a burn like that forced us to think outside the box when it came to staffing the company. To put a $40k/month burn in perspective, that would get you about 3 employees at most fully loaded with office space in New York (if you’re lucky). I remember interviewing a total rock star CTO-type in January in NYC and walking away thinking there went all my initial funding and that’s just for 1 guy. Instead, we have run the company out of my apartment in New York and from our development center in Pune, India. I’m the only U.S. based socialmedian employee (besides our awesome intern Scott who joined us for the summer from Syracuse and who has been a god-send). The rest of our team is based in Pune, India. We started with 6 fulltime socialmedian employees in Pune and have since grown the socialmedian development team to 11 fulltime employees in Pune.

Finding the right company to outsource to is another interesting story.

Jason first found out about True Sparrow Systems when he saw a Facebook application they had developed. He felt that the application had been designed very well – not by someone who had done a quick and dirty job to jump on the latest bandwagon (social networking! yay!), but by someone who had spent time thinking about the application and its users. Based on this, he decided to go with True Sparrow Systems.

However, this is not your usual outsourcing relationship. Jason has set-up things rather differently from most other companies:

A few notes about working with an offshore team. If you’re gonna do it, do it right. What I mean by that is that I’ve seen it done wrong so many times it’s sickening. Folks in the U.S. all too often have this mistaken belief that there are these inexpensive coders outside the U.S. who are just on call and ready to write code based on specs. That’s a recipe for disaster. In order for software to be developed well, it takes a team that is adept at planning and strategizing and problem solving together. It takes a team that feels like a team and who is passionate about the product they are creating. It takes a team who truly feels like they are building their product not someone else’s.

So, we decided to set up things differently at socialmedian. First, our decision to go offshore was certainly based on costs, but it was equally based on abilities and mutual respect. I had worked with the future socialmedian team in Pune before socialmedian on other projects and only chose to work with them on socialmedian because I was impressed with their thought process as much as their work product. We chose to work with them because they know how to solve problems and how to figure out how to respond to customer/user needs. And, they passed the most important test of all, an earnest early interest in the problem we are trying to solve at socialmedian and fantastic ideas on how to tackle the problem.

Second, I personally committed to travel to Pune, India nearly monthly for the first year of socialmedian (I’ve been there 6 times thus far in 2008 and am headed back in a couple of weeks). The logic here was that if the team was there, I, as the lead product manager, should be there too. As per our hunch, we learned early on that in-person time was critical for planning. As such, we have evolved into this regular cadence wherein for 1 week out of every month we plan together in person, and then for 3 weeks we are more tactical as our interactions are over skype. Sure, all that travel is tough (ask my spouse who hates me for it), but it has proven to be very effective for us at socialmedian.

Third, we have made our Indian team shareholders in socialmedian, so we are one company building one product. It’s an offshore situation, not an outsourcing relationship.

Of course, this model is not for everyone, but it has worked well for us thus far. Mostly because we have an awesome team joined together working on socialmedian and we’ve created an environment where it’s all about our users and the product, and the fact that we are thousands of miles away from each other is just a fact of life, not a problem. If I had to start over today I’d choose the same team 10 out of 10 times to work with.

A lot of this is enabled by the tools:

In case you were wondering, here’s the process and tools/services we use at socialmedian to manage our New York – India operations. As noted, I travel to Pune for at least 1 work-week out of every 5 work-weeks. We ship code 3x per week within 3-4 week development milestones. We use TRAC (open source bug tracking tool) to manage bugs and feature requests. We use basecamp to share files. We talk on Skype when I’m not in Pune pretty much 6x per week from 8am Eastern Time to around 11am.

Read the whole article for a whole lot of other (non-Pune-related) advice. It is long, but worth the trouble, especially if you dream of having your own startup. (Sorry, the article is gone, but here is a copy from the Wayback Machine – thanks Pragnesh.)


Why I chose Pune for my startup

This article was written by Anthony Hsiao in an e-mail thread on the Pune OpenCoffee Club‘s mailing list, and is reproduced here with permission. Anthony is a co-founder of Pune startup Entrip. Anthony, Nick and Adil, who were in London when they decided to start Entrip, moved to Pune to actually make it happen, for the reasons given in this article.

I am putting my money on Pune as India’s startuphub.

When we first decided to just head to India to start work on our startup
(we’re London based), we had only heard of Bangalore as the next IT hub,
and Hyderabad as upcoming. We didn’t want to go to a megacity like Delhi or
Mumbai, but more of a Tech City. Then one of my best friends from
Switzerland, he’s Indian, recommended we should have a look at Pune.

After some research, Pune met exactly the kind of requirements that we were
looking for, or at least, it fared better on a decision analysis than did
Bangalore or Hyderabad: It’s a college city with lots of young, educated and
(as we hoped) creative people, not too large, IT focused, not too expensive
(at the time). Another big factor was the fact that we perceived Bangalore
and Hyderabad as HUGE IT OUTSOURCING CENTERS, cities of modern factories,
where modern labourers were robotting away, while Pune, as educational
center, appeared to offer a different perspective. My friend also told me
that the girls in Pune were very ‘interesting’, but that is just a side note
– but as a true geek this didn’t play much of a role *wink*.

When we got to Pune, I think one of the first things that struck me was
actually the apparent lack of creativity, lack of spirit that we are used to
from university towns where students just ‘do things’, the lack of ambition
just for the beauty of it and the seemingly only motivation to do anything –
working for some big company with some name, earning bucks.

It took some time for me to understand where a) people were coming from (not
everybody has parents that would happily support your little startup
adventures if they went wrong) b) the cultural and in large parts traditional
context that young people had to operate from within, and c) that in fact it
strongly depends on the kind of circles that we moved within to get these
impressions. I was a bit disappointed, and still am, every time I hear
someone ask for what company I work for rather than for what I actually do,
but my criticism was challenged by a different world that I later
discovered: the startup community in Pune.

Yes, a lot of people in Pune are neither creative nor ambitious or daring.
But that’s ok, every place in the world has a broad layer of such people. In
fact, they are vital for the ecosystem as well. But not every place has a
vibrant, connected and active startup community like Pune.

Instead of ‘cannot’, ‘big salary’ or ‘I don’t know why’, I suddenly heard ‘I
think I can, and I will try’, ‘big opportunity’ and ‘Because it’s cool’. A
180 degree turn from a lot of the students or ‘desperate’ professionals I’ve
met! What is this newly discovered startup community?

Looking at it now as I write this, I would say that Pune has what is
necessary to attract ‘the right kind of people’, young, creative,
adventurous, willing to ‘do things’ – the stuff that startups are about (in
large parts). It certainly worked for us or fellow foreigners trying it out
in Pune as well as the countless NRIs or long term expats that come back
with a more open mind and lots of experience. That, then, is a positive
feedback loop for the composition of the city and the community.

So it’s the people of Pune, or the startup community to be more precise,
which I think send out a strong message. Of course I would like to play an
active part in shaping this still relatively young community, and I think so
does everybody else. There is this community sense, where people communicate
AND understand each other, go through SIMILAR experiences and face SIMILAR
hurdles as entrepreneurs (in IT-outsourcing-India), want to help each OTHER
and want to rise TOGETHER, as a community, so that one day we can all say it
happened in Pune, and we were there.

So what message does Pune send out? I think it says ‘we are Pune, and we
have what it takes to be India’s silicon valley’.

Best regards ,

Anthony – a foreigner.

Other thoughts: Maybe I am painting a bit of a biased picture, and of course
there is still a lot of work to be done. But the composition of Pune is
there, the community is there (and growing), and the will and shared spirit
seems to be there. Now the change just needs to happen.

I would attribute a great part of this spirit or feeling to the fact that
Pune is relatively small, or at least has been. People are closer, and know
each other. As such, I see the creation of huge IT parks all over the place
OUTSIDE the city/in satellite towns, as a potential dilution to the Pune
startup community, which I hope we can somehow fend off.

Of course, one might be able to craft a similar description about other
cities like Bangalore, but I would say that the unique composition of
colleges and companies are a great edge. Also, at least in the past, the
ratio of ‘large companies’ to ‘small companies’, I’d guess, is smaller in
Pune than in other places – or at least was. If everybody around a young
graduate is going to try to work for the next big company that pays stellar
salaries, of course, startups would lose the war for talent. As such, the
intensifying competition of large companies for good people is another
threat to look out for, but one which, I think, can be addressed by a strong
and visible startup community.

I don’t want to get into politics and policies (at least not in an email
thread!)

Reblog this post [with Zemanta]

Revenue Optimization – Increasing profits for free

Air India Boeing 747-400 (image via Wikipedia)

Business intelligence and analytics company SAS announced a few days ago that it is acquiring revenue optimization company IDeaS. Both SAS and IDeaS have major development centers in Pune. This article gives an overview of the software that IDeaS sells.

I always found airfares very confusing and frustrating (more so in the US than in India). A roundtrip sometimes costs less than a one-way ticket. Saturday night stayovers cost less. The same flight can cost $900 or $90 (and those two guys might end up sitting on adjacent seats). The reason we see such bizarre behavior is because of a fascinating field of economics called Revenue Optimization.

IDeaS (which has a major development center in Pune) provides software and services that help companies (for example, hotels) determine the best price to charge customers so as to maximize revenue. The technology, called pricing and revenue optimization – also known as Revenue Management (RM) – focuses on how an organization should set and update its pricing and product availability across its various sales channels in order to maximize profitability.

First, it is necessary to understand some basic economic concepts.

Market Segmentation

If you don’t really know what market segmentation is, then I would highly recommend that you read Joel Spolsky‘s article Camels and Rubber Duckies. It is a must read – especially for engineers who haven’t had the benefit of a commerce/economics education. (And also for commerce grads who did not pay attention in class.)

Here is the very basic idea of market segmentation:

  • Poor Programmer wants to take a 6am Bombay-Delhi flight to attend a friend’s wedding. He is willing to pay up to Rs. 4000 for the flight. If the price is higher, he will try alternatives like a late-night flight, or going by train.
  • On another occasion, the same Poor Programmer is being sent to Delhi for a conference by his company. In this case, he doesn’t care if the price is Rs. 8000; he will insist on going by the 6am flight.

If the airline company charges Rs. 8000 to all customers, a lot of its seats will go empty, and it is losing out on potential revenue. If it charges Rs. 4000 for all seats, then all seats will fill up quickly, but it is leaving money on the table since there were obviously some customers who were willing to pay much more.

The ideal situation is to charge each customer how much she is willing to pay, but that involves having a salesperson involved in every sale, which has its own share of problems. Better is to partition your customers into two or three segments and charge a different price for each.
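A tiny worked example (the demand numbers here are invented) shows why segmenting beats a single price:

```python
# Made-up demand: 60 leisure travellers willing to pay up to Rs. 4000,
# 40 business travellers willing to pay up to Rs. 8000, 100 seats available.
leisure, business, seats = 60, 40, 100

single_low  = 4000 * min(seats, leisure + business)   # everyone buys at Rs. 4000
single_high = 8000 * min(seats, business)             # only business buys at Rs. 8000
segmented   = 4000 * leisure + 8000 * business        # each segment pays its own price

print(single_low, single_high, segmented)   # 400000  320000  560000
```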

Guessing a customer’s market segment

Unfortunately, customers do not come with a label on their forehead indicating the maximum amount they are willing to pay. And, even the guy paying Rs. 8000 feels cheated if he finds out that someone else paid Rs. 4000 for the same thing. This is where the real creativity of the marketers comes in.

The person in the Rs. 4000 market segment (leisure travel) books well in advance and usually stays over the weekend. The person in the Rs. 8000 market segment (business travel) books just a few days before the flight, and wants to be back home to his family for the weekend. This is why the airlines have low prices if you book in advance, and why airlines (at least in the US) have lower prices in case of a weekend stayover.

This also keeps the rich customer from feeling cheated. “Why did I pay more for the same seat?” If you try saying “Because you are rich,” he is going to blow his top. But instead if you say, “Sir, that’s because this seat is not staying over the weekend,” the customer feels less cheated. Seriously. That’s how psychology works.

Exercise for the motivated reader – figure out how supermarket discount coupons work on the same principle.

Forecasting demand

This is the key strength of IDeaS revenue optimization software.

You need to guess how many customers you will get in each market segment and then allocate your reservations accordingly. Here is an excerpt from their excellent white-paper on Revenue Management:

The objective of revenue management is to allocate inventory among price levels/market segments to maximize total expected revenue or profits in the face of uncertain levels of demand for your service.

If we reserve a unit of capacity (an airline seat or a hotel room or 30 seconds of television advertising time) for the exclusive use of a potential customer who has a 70 percent probability of wanting it and is in a market segment with a price of $100 per unit, then the expected revenue for that unit is $70 ($100 x 70%). Faced with this situation 10 times, we would expect that 7 times the customer would appear and pay us $100 and 3 times he would fail to materialize and we would get nothing. We would collect a total of $700 for the 10 units of capacity or an average of $70 per unit.

Suppose another customer appeared and offered us $60 for the unit, in cash, on the spot. Should we accept his offer? No, because as long as we are able to keep a long-term perspective, we know that a 100 percent probability of getting $60 gives us expected revenue of only $60. Over 10 occurrences we would only get $600 following the “bird in the hand” strategy.

Now, what if instead the customer in front of us was offering $80 cash for the unit. Is this offer acceptable to us? Yes; because his expected revenue (100% x $80 = $80) is greater than that of the potential passenger “in the bush”. Over 10 occurrences, we would get $800 in this situation or $80 per unit.

If the person offers exactly $70 cash we would be indifferent about selling him the unit because the expected revenue from him is equal to that of the potential customer (100% x $70 = 70% x $100 = $70). The bottom line is that $70 is the lowest price that we should accept from a customer standing in front of us. If someone offers us more than $70, we sell, otherwise we do not. This is one of the key concepts of Revenue Management:

We should never sell a unit of capacity for less than we expect to receive for it from another customer, but if we can get more for it, the extra revenue goes right to the bottom line.

What would have happened in this case if we had incorrectly assumed that we “knew” with certainty that the potential $100 customer would show up (after all, he usually does!)? We would have turned away the guy who was willing to pay us $80 per unit and at the end of 10 occurrences, we would have $700 instead of $800.

Thus we can see that by either ignoring uncertainty – assuming that what usually happens will always happen – or by always taking “the bird in the hand” because we are afraid to acknowledge and manage everyday risk and uncertainty as a normal part of doing business, we lose money.
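The arithmetic in the excerpt above fits in a few lines of code; here is a small sketch that reproduces the numbers:

```python
# Expected revenue from reserving a unit for an uncertain customer vs. selling it now.
def expected_revenue(price, probability):
    return price * probability

reserve = expected_revenue(100, 0.70)        # 70.0 - hold the unit for the likely $100 customer
print(reserve)

for cash_offer in (60, 80, 70):
    # Accept the cash-in-hand offer only if it beats the expected revenue of waiting.
    if cash_offer > reserve:
        decision = "accept"
    elif cash_offer == reserve:
        decision = "indifferent"
    else:
        decision = "reject"
    print(cash_offer, decision)              # 60 reject, 80 accept, 70 indifferent
```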

The Expected Marginal Revenue

The previous section gave an idea of the basic principle to be used in revenue maximization. In practice, the probability associated with a particular market segment is not fixed, but varies with time and with the number of units available for sale.

One of the key principles of revenue management is that as the level of available capacity increases, the marginal expected revenue from each additional unit of capacity declines. If you offer only one unit of capacity for sale the probability of selling it is very high and it is very unlikely that you will have to offer a discount in order to sell it. Thus, the expected revenue estimate for that first unit will be quite high. However, with each additional unit of capacity that you offer for sale, the probability that it will be sold to a customer goes down a little (and the pressure to discount it goes up) until you reach the point where you are offering so much capacity that the probability of selling the last additional unit is close to zero, even if you practically give it away. At this point the expected revenue estimate for that seat is close to zero ($100 x 0% = $0). Economists call this phenomenon the Expected Marginal Revenue Curve, which looks something like this:

EMR Curve

There is an EMR curve like this for each market segment. Note that it will also vary based on time of the year, day of the week (i.e. whether the flight is on a weekend or not), and a whole bunch of other parameters. By looking at historical data, and correlating it with all the interesting parameters, Revenue Management software can estimate the EMR curve for each of your market segments.

Now, for any given sale, first plot the EMR curves for the different market segments you have created, and find the point at which the rich guys’ curve crosses and goes under the poor guys’ curve. Note the number of units (on the x-axis) at this crossover point: those units are effectively protected for the higher-paying segment, so sell to the poor guy only if the number of units currently remaining is more than this.

EMR curves crossing over
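Here is a hypothetical sketch of that crossover computation; the curve shapes and probabilities below are invented, whereas a real system would estimate them from historical data:

```python
# Find the protection level: how many units to hold back for the high-fare segment.
# EMR of the n-th unit of a segment = fare * probability of selling at least n units.
def emr(fare, sell_prob_per_unit):
    # sell_prob_per_unit[n] = estimated probability of selling the (n+1)-th unit
    return [fare * p for p in sell_prob_per_unit]

# Invented, decaying probabilities for a 10-seat example.
business = emr(8000, [0.9, 0.8, 0.6, 0.4, 0.25, 0.15, 0.08, 0.04, 0.02, 0.01])
leisure  = emr(4000, [1.0, 1.0, 0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4])

# Crossover: first unit where the business EMR drops below the leisure EMR.
protect = next(n for n, (b, l) in enumerate(zip(business, leisure)) if b < l)
print(protect)   # units protected for business; here this prints 3

# Rule at booking time: accept a leisure booking only if remaining units > protect.
```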

Applications

Revenue Optimization of this type is applicable whenever you are in a business that has the following characteristics:

  • Perishable inventory (seats become useless after the flight takes off)
  • Relatively fixed capacity (can’t add hotel rooms to deal with extra weekend load)
  • High fixed costs, low variable costs (you’ve got to pay the air-hostess whether the flight is full or empty)
  • Advance reservations
  • Time variable demand
  • Appropriate cost and pricing structure
  • Segmentable markets

Due to this, almost all major Hotel, Car Rental Agencies, Cruise Lines and Passenger Railroad firms have, or are developing, revenue management systems. Other industries that appear ripe for the application of revenue management concepts include Golf Courses, Freight Transportation, Health Care, Utilities, Television Broadcast, Spa Resorts, Advertising, Telecommunications, Ticketing, Restaurants and Web Conferencing.

Revenue Management software helps you handle seasonal demand and peak/off-peak pricing, determine how much to overbook, and decide what rates to charge in each market segment. It is also useful in evaluating corporate contracts and market promotions. And there are a whole bunch of other issues that make the field much more complicated, and much more interesting. So, if you found this article interesting, you should check out IDeaS’ white paper on Revenue Management – it is very well written, and has many more fascinating insights into this field.


Cloud Computing and High Availability

This article, discussing strategies for achieving high availability of applications based on cloud computing services, is reprinted with permission from the blog of Mukul Kumar of Pune-based ad optimization startup PubMatic.

Cloud computing has become very widespread, with startups as well as divisions of banks, pharmaceutical companies and other large corporations using it for computing and storage. Amazon Web Services has led the pack with its innovation and execution, with services such as the S3 storage service, the EC2 compute cloud, and the SimpleDB online database.

Many options exist today for cloud services – for hosting, storage and application hosting. Some examples are below:

  • Hosting: Amazon EC2, MOSSO, GoGrid, AppNexus, Google AppEngine, flexiscale
  • Storage: Amazon S3, Nirvanix, Microsoft Mesh, EMC Mozy, MOSSO CloudFS
  • Applications: opSource, Google Apps, Salesforce.com

[A good compilation of cloud computing is here, with a nice list of providers here. Also worth checking out is this post.]

The high availability of these cloud services becomes more important as some of these companies come to rely on them for their critical infrastructure. Recent outages of Amazon S3 (here and here) have raised some important questions, such as this one – S3 Outage Highlights Fragility of Web Services – and this.

[A simple search on search.twitter.com can tell you things that you won’t find on web pages. Check it out with this search, this and this.]

There has been some discussion on the high availability of cloud services and some possible solutions. For example the following posts – “Strategy: Front S3 with a Caching Proxy” and “Responding to Amazon’s S3 outage“.

Here are some thoughts on how these cloud services can be made highly available by following the traditional path of redundancy.

[Image: Basic cloud computing architectures config #1 to #3]

The traditional way of using AWS S3 is to use it with AWS EC2 (config #0). Configurations such as those on the left can be used to make your computing and storage not dependent on the same service provider. Config #1, config #2 and config #3 mix and match some of the more flexible computing services with storage services. In theory, the compute and the storage can each be separately replaced by a colo service.

[Image: Cloud computing HA configuraion #4]

The configurations on the right are examples of providing high availability by making a “hot-standby”. Config #4 makes the storage service hot-standby and config #5 separates the web-service layer from the application layer, and makes the whole application+storage layer as hot-standby.

A hot-standby requires three things to be configured – rsync, monitoring and switchover. rsync needs to be configured between the online server and the hot-standby, to make sure that most of the application and data components on the standby are up to date. So, for example, in config #4 one has to rsync ‘Amazon S3’ to ‘Nirvanix’ – that’s pretty easy to set up. In fact, with a bit more automation, we can “turn off” a standby server after making sure that the data source is synced up – though that assumes that the server provisioning time is an acceptable downtime, i.e. that the RTO (Recovery Time Objective) is within acceptable limits.
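As a sketch of what the sync step might look like (the hostname and paths below are made up), one could drive rsync from a small script on the standby:

```python
import subprocess

# Hypothetical host/paths -- replace with your own online server and hot-standby locations.
ONLINE = "app@online.example.com:/var/data/"
STANDBY = "/var/data/"          # run this script on the standby machine

def sync_once():
    # -a preserves attributes, -z compresses over the WAN, --delete mirrors removals.
    subprocess.run(
        ["rsync", "-az", "--delete", ONLINE, STANDBY],
        check=True,
    )

if __name__ == "__main__":
    sync_once()   # in practice this would run from cron or a loop with a sleep
```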

[Image: Cloud computing Hot Standby Config #5]
This also requires that you monitor each of the web services. One might have to do service heartbeating – this has to be designed for the application, and designed differently for monitoring Tomcat, MySQL, Apache or their sub-components. In theory it would be nice if a cloud computing service exported an API for its status, for example for http://status.aws.amazon.com/ , http://status.mosso.com/ or http://heartbeat.skype.com/. However, most of the time the status page is updated long after the service goes down, so that wouldn’t help much.

Switchover from the online server/service to the hot-standby would probably have to be done by hand. This requires a handshake with the upper layer, so that requests stop going to the old service and start going to the new one when you trigger the switchover. This gets interesting with stateful services, and also where you cannot drop any packets, so quiescing of requests may have to be done before the switchover takes place.
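Putting the monitoring and switchover pieces together, a bare-bones watchdog might look like the sketch below; the health URL and the switchover hook are placeholders, and a real switchover would also need the quiescing and handshake described above:

```python
import time
import urllib.request

HEALTH_URL = "https://online.example.com/health"   # placeholder heartbeat endpoint
FAILURES_BEFORE_SWITCH = 3

def healthy(url, timeout=5):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def switch_to_standby():
    # Placeholder: repoint DNS / load balancer to the hot-standby after quiescing requests.
    print("switching traffic to the hot-standby")

failures = 0
while True:
    failures = 0 if healthy(HEALTH_URL) else failures + 1
    if failures >= FAILURES_BEFORE_SWITCH:   # a few consecutive misses, not one blip
        switch_to_standby()
        break
    time.sleep(30)
```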

[Image: Cloud computing multi-tier config #6]
Above are two configurations of multi-tiered web services, where each service is built on a different cloud service. This is a theoretical configuration, since I don’t know of many good cloud services – there are only a few. But it may represent a possible future, where the space becomes fragmented, with many service providers.

[Image: Multi-tier cloud computing with HA]
Config #7 is config #6 with hot-standby for each of the service layers. Again this is a theoretical configuration.

Cost Impact
Any of the hot-standby configurations has a cost impact – adding an extra layer of high availability immediately adds to the cost, at least doubling the cost of the infrastructure. This increase can be reduced by making only those parts of your infrastructure highly available that affect your business the most. It depends on how much business impact a downtime causes, and therefore how much money can be spent on the infrastructure.

One way to make the configurations more cost effective is to make them active-active, also called a load-balanced configuration – these configurations make use of all the allocated resources and send traffic to both servers. Such a configuration is much more difficult to design. For example, if you put the hot-standby storage in an active-active configuration, then every "write" (DB insert) must go to both storage servers, and a write must not be reported as complete until it has completed on all replicas (also called mirrored write consistency).
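Here is a minimal sketch of mirrored write consistency: the write is acknowledged only after both replica writes succeed. The replica writers below are hypothetical placeholders for inserts into the two storage services.

    # Minimal sketch of "mirrored write consistency" for an active-active pair:
    # a write is acknowledged to the caller only after it has succeeded on both
    # replicas; if either copy fails, the write is reported as failed so the
    # replicas cannot silently diverge.
    from concurrent.futures import ThreadPoolExecutor

    def write_replica_a(key, value):
        # Placeholder for an insert into the first storage service / database.
        return True

    def write_replica_b(key, value):
        # Placeholder for an insert into the second storage service / database.
        return True

    def mirrored_write(key, value):
        with ThreadPoolExecutor(max_workers=2) as pool:
            futures = [pool.submit(write_replica_a, key, value),
                       pool.submit(write_replica_b, key, value)]
            results = [f.result() for f in futures]
        if all(results):
            return True                  # acknowledge only when both copies succeeded
        # In a real system you would retry, roll back, or mark a replica degraded here.
        return False

    assert mirrored_write("user:42", {"plan": "pro"})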

Cloud Computing becoming mainstream
As cloud computing becomes more mainstream, larger web companies may start using these services and put a part of their infrastructure on a compute cloud. For example, I can imagine a cloud dedicated to "data mining" being used by several companies; it might have servers with large HDDs and lots of memory and specialize in cluster software such as Hadoop.

Lastly I would like to cover my favorite topic – why would I still use services that cost more for my core services instead of using cloud computing?

  1. The most important reason would be 24×7 support. Hosting providers such as ServePath and Rackspace provide support. When I call support at 2 PM India time, there is a support person picking up my call – that's a great thing. Believe me, 24×7 support is a very difficult thing to do.
  2. These hosting providers give me more configurability for RAM/disk/CPU.
  3. I can have more control over the network and storage topology of my infrastructure.
  4. Point #2 above can give me consistent throughput and latency for I/O access and network access.
  5. These services give me better SLAs.
  6. Security.

About the Author

Mukul Kumar, is a founding engineer and VP of Engineering at Pubmatic. He is based in Pune and responsible for PubMatic’s engineering team. Mukul was previously the Director of Engineering at PANTA Systems, a high performance computing startup. Previous to that he joined Veritas India as the 13th employee and was Director of Engineering for the NetBackup group, one of Veritas’ main products. He has filed for 14 patents in systems software, storage software, and application software and proudly proclaims his love of π and can recite it to 60 digits. Mukul is a graduate of IIT Kharagpur with a degree in electrical engineering.

Mukul blogs at http://mukulblog.blogspot.com/, and this article is cross posted from there.


OpenID – Your masterkey to the net

[Image: The OpenID logo, via Wikipedia]

OpenID is a secure, customizable, user-controllable, and open mechanism to share personal information (username/password, credit card numbers, address) on the web. It will eliminate the need to enter the same information over and over again in different websites, or to remember different username/password combinations. It will be a major improvement over the current system once it gains widespread adoption. PuneTech asked Hemant Kulkarni of singleid.net to give us an introduction to OpenID, its benefits, and how it works.

Overview

In 2005, a new idea took hold and spread across the internet – OpenID. The concept is very simple – to provide users with a single unique login-password set with which they will be able to access all the different sites on the internet.

In June 2007 the OpenID Foundation was formed with the sole goal to protect OpenID. The original OpenID authentication protocol was developed by Brad Fitzpatrick, creator of popular community website LiveJournal, while working at Six Apart. The OpenID Foundation received a recent boost when the internet leaders Microsoft, Google, Yahoo! and Verisign became its corporate members.

Millions of users across the internet are already using OpenID and several thousand websites have become OpenID enabled.

Need for OpenID

The internet is fast becoming an integral part of our everyday life. Many tasks such as booking tickets for movies, airlines, trains and buses, shopping for groceries, and paying your electricity bills can now be done online. Today, you can take care of all your mundane household chores at the click of a button.

When you shop online, you are usually required to use a login and a password to access these sites. This means that, as a user, you will have to maintain and remember several different login-password sets.

OpenID enables you to use just one login-password to access these different sites – making life simpler for you. With OpenID, there is no need to bother with remembering the several different logins and passwords that you may have on each different site.

Internet architecture inherently assumes that there are two key players in today's internet world – end users who use internet services, and the websites which provide those services. It is a common misconception that OpenID-based login benefits only the end users. It certainly does, but it has an equal value proposition for the websites that accept OpenID.

Later, in a separate section, we will go into the details of the benefits to the websites that accept OpenID-based logins.

Before that, though, it is important to understand a few technological aspects and the various entities involved in the OpenID world.

What is OpenID

OpenID is a digital identity solution developed by the open source community. It is a lightweight method of identifying individuals that uses the same framework used for identifying websites. The OpenID Foundation was formed with the idea that it would act as a legal entity to manage the community and provide the infrastructure required to promote and support the use of OpenID.

In essence, an OpenID is a URL like http://yourname.SingleID.net which you can put into the login box of a website to sign in. You are saved the trouble of filling in online forms with your personal information, as the OpenID provider shares that information with the website you are signing in to.

Adoption

As of July 2007, data shows that there are over 120 million OpenIDs on the Internet and about 10,000 sites have integrated OpenID consumer support. A few examples of OpenID promoted by different organizations are given below:

  • America Online provides OpenIDs in the form “openid.aol.com/screenname”.
  • Orange offers OpenIDs to its 40 million broadband subscribers.
  • VeriSign offers a secure OpenID service, which they call “Personal Identity Provider”.
  • Six Apart, which hosts the LiveJournal and Vox blogging services, supports OpenID – Vox as a provider and LiveJournal as both a provider and a relying party.
  • Springnote uses OpenID as the only sign in method, requiring the user to have an OpenID when signing up.
  • WordPress.com provides OpenID.
  • Other services accepting OpenID as an alternative to registration include Wikitravel, photo sharing host Zooomr, linkmarking host Ma.gnolia, identity aggregator ClaimID, icon provider IconBuffet, user stylesheet repository UserStyles.org, and Basecamp and Highrise by 37signals.
  • Yahoo! users can use their Yahoo! IDs as OpenIDs.
  • A complete list of sites supporting OpenID(s) is available on the OpenID Directory.

Various Entities in OpenID

Now let us look at the various entities involved in the OpenID world.

[Image: The OpenID Entities]

End user

This is the person who wants to assert his or her identity to a site.

Identifier

This is the URL or XRI chosen by the End User as their OpenID identifier.

Identity provider or OpenID provider

This is a service provider offering the service of registering OpenID URLs or XRIs and providing OpenID authentication (and possibly other identity services).

Note: The OpenID specifications use the term “OpenID provider” or “OP”.

Relying party

This is the site that wants to verify the end user's identifier; it is sometimes also called a "service provider".

Server or server-agent

This is the server that verifies the end user’s identifier. This may be the end user’s own server (such as their blog), or a server operated by an identity provider.

User-agent

This is the program (such as a browser) that the end user is using to access an identity provider or a relying party.

Consumer

This is an obsolete term for the relying party.

Technology in OpenID

Typically, a relying party website (like example.website.com) will display an OpenID login form somewhere on the page. Compared to a regular login form, which has fields for user name and password, the OpenID login form has only one field, for the OpenID identifier, and it is often accompanied by the OpenID logo. This form is in turn connected to an implementation of an OpenID client library.

[Image: The OpenID Protocol]

A user must first register an OpenID identifier (like yourname.openid.example.org) with an OpenID provider (like openid.example.org). To log in to the relying party website, the user types this OpenID identifier into the OpenID login form.

The relying party website will typically transform the OpenID identifier into a URL (like http://yourname.openid.example.org/). In OpenID 2.0, the relying party then discovers the identity provider's service URL by requesting the XRDS document (also called the Yadis document) with the content type application/xrds+xml; this document is available at the target URL and is always available for a target XRI.
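As a rough illustration of this discovery step, the sketch below fetches the claimed identifier with an Accept header of application/xrds+xml and pulls the endpoint URIs out of the returned XRDS document; the identifier is hypothetical and the XML handling is deliberately simplified compared to a full Yadis library.

    # Minimal sketch of OpenID 2.0 discovery: fetch the claimed identifier URL,
    # asking for the XRDS (Yadis) document, and pull out the OP endpoint URIs.
    import urllib.request
    import xml.etree.ElementTree as ET

    claimed_id = "http://yourname.openid.example.org/"   # hypothetical identifier

    req = urllib.request.Request(claimed_id,
                                 headers={"Accept": "application/xrds+xml"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        xrds = resp.read()

    # XRDS documents list <Service> elements; the OP endpoint appears in a <URI> child.
    root = ET.fromstring(xrds)
    endpoints = [uri.text.strip()
                 for uri in root.iter()
                 if uri.tag.endswith("URI") and uri.text]
    print("Identity provider endpoint(s):", endpoints)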

Next, the relying party and the identity provider establish an association, referenced by an association handle. The relying party stores this handle and redirects the user's web browser to the identity provider so that the authentication process can take place.

In the next step, the OpenID identity provider prompts the user for a password (or an InfoCard) and asks whether the user trusts the relying party website to receive their credentials and identity details.

The user can either agree or decline the OpenID identity provider’s request. If the user declines, the browser is redirected to the relying party with a message to that effect and the site refuses to authenticate the user. If the user accepts the request to trust the relying party website, the user’s credentials are exchanged and the browser is then redirected to the designated return page of the relying party website. Then the relying party also checks that the user’s credentials did come from the identity provider.

Once the OpenID identifier has been properly verified, the OpenID authentication is considered successful and the user is considered to be logged into the relying party website with the given identifier (like yourname.openid.example.org). The website then stores the OpenID identifier in the user’s session.
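For a relying party implemented in Python, the flow above roughly corresponds to the sketch below, written against the JanRain python-openid 2.x library; the interface shown is my assumption of that library's documented API, and the URLs, session handling and web-framework glue are placeholders.

    # Sketch of the relying-party side of the flow described above, assuming the
    # JanRain python-openid (2.x) library and roughly its documented interface.
    from openid.consumer import consumer
    from openid.store.memstore import MemoryStore

    store = MemoryStore()   # association handles / nonces (use a persistent store in production)

    REALM     = "http://example.website.com/"
    RETURN_TO = "http://example.website.com/openid/return"

    def begin_login(session, openid_identifier):
        """Step 1: discovery + association, then redirect the browser to the OP."""
        c = consumer.Consumer(session, store)
        auth_request = c.begin(openid_identifier)          # e.g. "yourname.openid.example.org"
        return auth_request.redirectURL(REALM, RETURN_TO)  # send an HTTP redirect to this URL

    def finish_login(session, query_params, current_url):
        """Step 2: the OP redirects back here; verify the response."""
        c = consumer.Consumer(session, store)
        response = c.complete(query_params, current_url)
        if response.status == consumer.SUCCESS:
            return response.getDisplayIdentifier()   # store this identifier in the user's session
        return None                                  # cancelled or failed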

Case Study

Now let us take the simple case of Sunil, who wants to buy the Comprehensive Guide to OpenID by Raffeq Rehman, CISSP. This e-book is available only online at www.MyBooks.com, a technology thought leader which believes in easing the end user's online experience by accepting OpenID-based login.

Sunil is a tech-savvy individual who has already registered himself at www.singleid.net (India's first OpenID provider), and they have provided him with his unique login identity, which is: http://sunil.singleid.net.

The easiest entity to recognize in this purchase scenario is Sunil, the End-User. Obviously Sunil will use his web browser, which is known as the User-agent, to access MyBooks.com.

So, Sunil visits www.MyBooks.com and starts to look for the book he wants. He follows the standard procedure on the website, chooses his book and clicks the check-out link to buy it. The first thing MyBooks.com does is ask him to log in, giving him the option of logging in with his OpenID.

Since Sunil has already registered himself with SingleID.net, they have provided him with his login ID (which looks a bit different from a usual one). So here, www.singleid.net is the Identity Provider or OpenID provider.

We know that OpenID identifies individuals using the same method that is commonly used for identifying websites, so his identity (the Identifier in the OpenID context) is http://sunil.singleid.net. The SingleID.net part of his identity tells MyBooks.com that he has registered himself at www.singleid.net.

At this stage MyBooks.com sends him to www.singleid.net to log in. Notice that MyBooks.com does not ask Sunil to log in itself but relies on SingleID.net; MyBooks.com (www.MyBooks.com) is therefore the Relying Party or the Consumer. Once Sunil completes his login, which is authenticated against the Server-agent (typically the Server-agent is nothing but your identity provider), SingleID.net sends him back to MyBooks.com and tells MyBooks.com that Sunil is who he says he is, and MyBooks.com can let him complete the purchase.

Leading Players in the OpenID World & Important Milestones

  • Web developer JanRain was an early supporter of OpenID, providing OpenID software libraries and expanding its business around OpenID-based services
  • In March 2006, JanRain developed a Simple Registration Extension for OpenID for primitive profile-exchange
  • With VeriSign and Sxip Identity joining and focusing on OpenID development, a new version of the protocol, OpenID 2.0, and the OpenID Attribute Exchange extension were developed
  • On January 31, 2007, computer security company Symantec announced support for OpenID in its Identity Initiative products and services. A week later, on February 6 Microsoft made a joint announcement with JanRain, Sxip, and VeriSign to collaborate on interoperability between OpenID and Microsoft’s Windows CardSpace digital identity platform, with particular focus on developing a phishing-resistant authentication solution for OpenID.
  • In May 2007, information technology company Sun Microsystems began working with the OpenID community, announcing an OpenID program.
  • In mid-January 2008, Yahoo! announced initial OpenID 2.0 support, both as a provider and as a relying party, releasing the service by the end of the month. In early February, Google, IBM, Microsoft, VeriSign, and Yahoo! joined the OpenID Foundation as corporate board members

OpenID: Issues in Discussion and Proposed Solutions

As is the case with any technology, there are some issues in discussion with regard to OpenID and its usability and implementation. Let us have a look at the points raised and the solutions offered:

Issue I:

Although OpenID may create a very user-friendly environment, several people have raised the issue of security. Phishing and digital identity theft are the main focus of this issue. It is claimed that OpenID may have security weaknesses which might leave user identities vulnerable to phishing.

Solution Offered:

Personal Icon: A Personal Icon is a picture that you can specify which is then presented to you in the title bar every time you visit the particular site. This aids in fighting phishing as you’ll get used to seeing the same picture at the top of the page every time you sign in. If you don’t see it, then you know that something might be up.

Issue II:

People have also criticized the login process on the grounds that bringing the OpenID identity provider into the authentication process adds complexity and therefore creates vulnerability in the system, because the 'quality' of such an OpenID identity provider cannot be established.

Solution Offered:

SafeSignIn: SafeSignIn is an option that users can set on their settings page so that they cannot be redirected to their OpenID provider to enter a password; you can sign in only on the provider's own login page. If you are redirected to your provider from another site, you are presented with a dialog warning you not to enter your password anywhere else.

Value Proposition

There are several benefits to using OpenID – both for the users and for the websites.

Benefits for the End User:

  • You don’t have to remember multiple user IDs and passwords – just one login.
  • Portability of your identity (especially if you own the domain you are delivering your identity from). This gives you better control over your identity.

Benefits for OpenID Enabled Websites:

  • No more registration forms: With OpenID, websites can get rid of the clutter of the registration forms and allow users to quickly engage in better use of their sites, such as for conversations, commerce or feedback.
  • Increased stickiness: Users are more likely to come back since they do not have to remember an additional username-password combination.
  • Up-to-date registration information: Due to the need of frequent registrations, users often provide junk or inaccurate personal information. With OpenID, since only a one-time registration is necessary, users are more likely to provide more accurate data.

OpenID thus provides the users with a streamlined and smooth experience and website owners can gain from the huge usability benefit and reduce their clutter.

Why should Relying Parties implement OpenID-based authentication?

  • Expedited customer acquisition: OpenID allows users to quickly and easily complete the account creation process by eliminating entry of commonly requested fields (email address, gender, birthdates etc.), thus reducing the friction to adopt a new service.
  • Outsourcing authentication saves costs: As a relying party you don't have to worry about lost user names and passwords, a costly infrastructure, or upgrading to new standards and devices; you can just focus on your core business. Research puts the average cost per user of professional authentication at approximately €34 per year. In the future, the relying party will end up paying only a few cents per authentication request (transaction based).
  • Reduced user account management costs: The primary cost for most IT organizations is resetting forgotten authentication credentials. By reducing the number of credentials, a user is less likely to forget their credentials. By outsourcing the authentication process to a third-party, the relying party can avoid those costs entirely.
  • Your customers are demanding user-centric authentication: User-centric authentication gives your customers comfort. It promises no registration hassle and low barriers of entry to your service. Offering UCA to your customers can be a unique selling point and stimulate user participation.
  • Thought leadership: There is an inherent marketing value for an organization to associate itself with activities that promote it as a thought leader. It provides the organization with the means to distinguish itself from its competitors. This is your chance to outpace your competitors.
  • Simplified user experience: This is at the end of the list because that is not the business priority. The business priority is the benefit that results from a simplified user experience, not the simplified user experience itself.
  • Open up your service to a large group of potential customers: You are probably more interested in the potential customers you don’t know, versus the customers you already service. UCA makes this possible. If you can trust the identity of new customers you can start offering services in a minute.
  • The identity provider follows new developments: When a new authentication token or protocol is introduced you don’t have to replace your whole infrastructure.
  • Time to market: Due to legislation you may suddenly be confronted with an obligation to offer two-factor authentication. UCA is very easy to integrate, so you are up and running a lot quicker.
  • Data sharing: If the identity provider also offers the option to provide additional (allowed) attributes of the end-user you don’t have to store all this data yourself. For example, if I go on a holiday for a few weeks, I just update my temporary address instead of calling the customer service of my local newspaper!
  • Quickly offer new services under your brand: If you take over a company or want to offer a third party service under your brand/ infrastructure UCA makes it much easier to manage shared users. How much time does this take at the moment?
  • Corporate image: As SourceForge states, it offers OpenID support partly to join the web 2.0 space and benefit from the first-mover buzz. Besides, adding a good authentication mechanism provided by a trusted identity provider can add value to your own service. It is like adding the trust seal of your SSL certificate provider.
  • Extra traffic: Today you get only those users whom you solicit, and miss out on those who are solicited by other businesses similar to yours. OpenID brings extra traffic to your website without you spending that extra effort.
  • Analytics: Providers can give you much more analytics on your users’ behavior patterns (this can be anonymous to keep user identity private and report something like 30% of people who visit your site also visit site ‘x’).

OpenID and Info-Cards

User-ID/password based login is the oldest, most commonly used and most easily implemented method of authenticating and establishing somebody's identity, but it is also a weak one.

To overcome this problem and enhance the security of OpenID-based login, OpenID providers are adopting new techniques such as authentication based on Info-Cards (virtual cards stored on the user's PC). Microsoft in particular is working with various leading OpenID providers to make Microsoft CardSpace the de-facto standard for OpenID authentication.

There are two types of Info-Cards: self-issued and managed (i.e. managed by a provider). Self-issued cards are created by the user, stored on his or her PC, and used during the login process. Since the information on these cards is supplied and verified only by the user, their use is limited to the self-verified category; as such, they provide a more secure replacement for the user-ID/password combination only.

On the other hand, 'Managed Cards' are managed by a specific provider – this can be your OpenID provider or your bank. In this scenario, the data on the card is validated by the provider, significantly enhancing the value of the verification. As such, these cards can easily be used in financial transactions, easing your online purchase process or proving your legal identity.

There is an emerging trend to bridge the gap between info-cards (virtual) and smart-cards (physical) and establish a link between them. Data can be copied back and forth, giving your virtual card a physical form. In this scenario, your info-card (managed by the appropriate authority, such as a bank or the RTO) can take the place of your identity proof.

Some Interesting Sites Which Accept OpenID

Circavie.Com

An interesting site where you can create your own ‘story of your life’ – an interactive and chronological blog site, but with a difference (and that difference is not about being OpenID enabled) – see it to believe it!

Doxory.Com

If you are the kind of person who simply cannot decide whether to do ‘x’ or ‘y’, then here is the place for you. Put up your question and random strangers from the internet post their advice.

Highrisehq.Com

Here is the perfect solution for all those internet-based companies – manage your contacts, to-do lists, e-mail based notifications, and what-not on this site. If the internet is where you work, then this site is perfect for managing your business smoothly!

Foodcandy.Com

If you are a foodie then this site is the place for you! Post your own recipes and access the recipes posted by other people. Read opinions of people who have tried out the different recipes. Hungry?

About SingleID

SingleID is an OpenID provider – the first in India. It allows users to register and create their OpenID(s) for free. It provides all the typical OpenID provider functions – allowing users to create their digital identity and use it to log in to OpenID-enabled websites across the internet.

OpenID is being hailed as the ‘new face of the internet’ and SingleID is bringing it close to home. The main focus area of the company is to promote usage of OpenID in India.

If a user wants, he can also create multiple SingleID(s) with different account information, to use on different sites. So it allows you – the user – to control your digital identity, much in the same way as a regular login-password would – but with the added benefits of the OpenID technology.

SingleID has created a unique platform for website owners in India to generate a smooth user experience and create a wider base of operations and access for their websites.

Other user-centric services, such as Virtual Cards (for more secure authentication) and the use of a user-specific domain name (e.g. hemant.kulkarni.name) as an OpenID, will be offered very soon.

For our partners we provide secure identity storage and authentication and authorization services, alleviating the headache of critical security issues related to personal data.

We also provide an OpenID enablement service. Using our services, companies can upgrade their user login process to accept OpenID-based logins, greatly enhancing their user base.

Links for Reference

· SingleID Home Page – http://www.singleid.net and Registration – https://www.singleid.net/register.htm

· OpenID Foundation Website – http://openid.net

· The OpenID Directory – http://openiddirectory.com/

About the author: Hemant Kulkarni is a founder director of SingleID.net. He has more than 25 years of product engineering and consulting experience in domains of networking and communications, Unix Systems and commercial enterprise software. You can reach him at hemant@singleid.net.


Introduction to Server Virtualization

Virtualization is fast emerging as a game-changing technology in the enterprise computing space. What was once viewed as a technology useful for testing and development is going mainstream and is affecting the entire data-center ecosystem. Over the course of the next few weeks, PuneTech is going to run a series of articles on virtualization from experts in the industry. This article, the first in the series, gives an introduction to server virtualization, and has been written by Anurag Agarwal and Anand Mitra, founders of KQ Infotech.

What is virtualization

Virtualization is essentially some kind of abstraction of computing resources, and there are various kinds of abstraction. Files provide an abstraction of disk blocks as a linear space. Storage virtualization products, such as a logical volume manager, virtualize multiple storage devices into a single storage device, and vice versa.

Processes are also a form of virtualization. A process provides an illusion to the programmer that she has the entire address space at her disposal and has exclusive control of hardware resources. Multiplexing of these resources between all the processes on the system is done by the OS, transparent to the process. This concept has been universally adopted.

All multi-programming operating systems are characterized by executing instructions in at least two privilege levels i.e. unprivileged for user programs, and privileged for the operating system. The user programs use “system calls” to request the operating system to perform privileged operations on its behalf. The interface which consists of the unprivileged instruction set and the set of system calls define an “extended machine” which is easier to program than the bare machine and makes user programs more portable.

Having the kernel wrap completely around the hardware, without exposing it to the upper layers, has its advantages. But in this model only one operating system can be run at a given time, and one cannot perform any activity that would disrupt the running system (for example, an upgrade, migration, or system debugging).

A virtual machine provides an abstraction of a complete physical machine. This is also known as server virtualization. The basic idea is to run more than one operating system on a single server at the same time.

The History of Server Virtualization

In 1964, IBM developed a Virtual Machine Monitor (CP) to run its various OSes on its mainframes. Hardware was too expensive to leave underutilized, and IBM addressed many of the performance challenges inherent in virtualization by designing hardware amenable to it. However, with the advent of cheap computing resources and the proliferation of commodity hardware, virtualization fell out of favor and came to be viewed as an artifact of an era when computing resources were scarce. This was reflected in the design of x86 architectures, which no longer provided enough support to implement virtualization efficiently.

With the cost of hardware going down and the complexity of software increasing, a large number of administrators started putting one application per server. This gives them isolation, so that one application does not interfere with another. Over time, however, it resulted in a problem called server sprawl: there are too many underutilized servers in data centers. Most Windows servers have an average utilization between 5% and 15%, and this utilization rate will go down further as dual-core and quad-core processors become common. In addition to the cost of the hardware, there are also power and cooling requirements for all these servers. The old problem of hardware utilization has started surfacing again.

Ironically, the very reasons which led to the demise of virtualization in the mainstream were also the cause of its resurrection. The features which made OSes attractive also made them more fragile. The renewed interest in virtualization resulted in VMWare shipping a server virtualization solution for x86 machines in 1999. Server consolidation has increased server utilization to the 60% to 80% level, resulting in a 5- to 15-fold reduction in the number of servers.

Virtual machines have introduced a whole new paradigm of looking at operating systems. Traditionally they were coupled with physical machines, and they needed to know all the peculiarities of hardware. Once hardware becomes obsolete, your operating system becomes obsolete too. But virtual machines have changed that notion. They have decoupled the operating systems from hardware by introducing a virtualization layer called virtual machine monitor (VMM).

Types of Virtualization architectures

There are many VMM architectures.

Full emulation: This is the oldest virtualization technique in use. An emulator is a software layer which tracks the memory and CPU state of the machine being emulated and interprets each instruction, applying the effect it would have on the virtual state of the machine it has constructed. In a regular server, machine instructions are directly executed by the CPU and the memory is directly manipulated. In full emulation, the instructions are handed over to the emulator, which converts them into a (possibly different) set of instructions to be executed on the actual underlying physical machine. Full emulation is routinely used during the development of software for new hardware which may not be available yet. Virtualization can be considered a special case of emulation where the machine being emulated and the host are similar, which allows unprivileged instructions to be executed natively. Qemu falls in this category.
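To make the idea of interpreting guest instructions concrete, here is a toy fetch-decode-execute loop for an invented three-instruction machine; it is only an illustration of the emulation principle, not how a production emulator such as Qemu is structured.

    # Toy illustration of full emulation: a fetch-decode-execute loop that keeps
    # the "guest" machine's state (registers, memory, program counter) in plain
    # Python data structures and interprets each instruction in software.
    def run(program):
        regs = {"r0": 0, "r1": 0}     # emulated register file
        memory = [0] * 16             # emulated memory
        pc = 0                        # emulated program counter
        while pc < len(program):
            op, *args = program[pc]
            if op == "load_imm":      # load_imm reg, value
                regs[args[0]] = args[1]
            elif op == "add":         # add dst, src  ->  dst = dst + src
                regs[args[0]] += regs[args[1]]
            elif op == "store":       # store addr, reg
                memory[args[0]] = regs[args[1]]
            else:
                raise ValueError(f"unknown instruction {op!r}")
            pc += 1                   # every effect goes through the emulated state
        return regs, memory

    regs, memory = run([
        ("load_imm", "r0", 2),
        ("load_imm", "r1", 3),
        ("add", "r0", "r1"),
        ("store", 0, "r0"),
    ])
    assert memory[0] == 5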

Hosted: In this approach, a traditional operating system (Windows or Linux) runs directly on the hardware. This is called the host OS. VMM is installed as a service in the host OS. This application creates and manages multiple virtual machines as processes. Each virtual machine process has a full operating system inside it. Each of these is called a guest OS. This approach greatly simplifies the design of the VMM as it can directly use the services provided by the host operating system. VMWare server, VMWare workstation, Virtual box, and KVM fall in this category.

Hypervisor based: Hosted VMM solutions have a high overhead, as the VMM does not directly control the hardware. In the hypervisor approach the VMM is directly installed on the hardware. The VMM provides virtual hardware abstractions to create and manage multiple virtual machines. Performance overhead in this approach is very small.

Another way to classify virtual machines is on the basis of how privileged instructions are handled. Modern processors have a privileged mode of execution that the OS kernel executes in, and a non-privileged mode that the user programs execute in. This can cause a problem for virtual machines because although the host OS (or the hypervisor) runs in privileged mode the entire guest OS runs in non-privileged mode. Most of today’s OSs are specifically designed to run in privileged mode, and hence their binaries end up having some instructions that must be run in privileged mode. (For example, there are 17 such instructions in the Intel IA-32 architecture.) This causes a problem for the virtual machine, and there are two major approaches to handling this problem.

Para virtualization: In this approach, the OS binary is statically rewritten to replace the use of privileged instructions with appropriate calls into the hypervisor. In other words, the operating system is ported to the virtual hardware abstraction provided by the VMM. This requires changes in the operating system code, but has the least performance penalty. This is the approach taken by Xen.

Full virtualization: In this approach, no change is made in the operating system code. There are two ways of supporting this.

  • Using run-time emulation of the privileged instructions: the VMM monitors program execution at runtime and takes over control whenever a privileged instruction arises in the guest OS. This approach is called binary translation. VMWare uses this technology.
  • Hardware-assisted virtualization: Both Intel and AMD have come up with virtualization extensions to their hardware. Intel calls this VT technology and AMD calls it SVM technology. These extensions provide an extra privilege level for the VMM to run in, along with a number of additional features, such as nested page tables and an IOMMU, to make virtualization more efficient. (A quick way of checking for these extensions on Linux is sketched below.)
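On Linux, the presence of these extensions shows up as the vmx (Intel VT) or svm (AMD) flag in /proc/cpuinfo; the sketch below simply looks for those flags. Note that the feature may still be disabled in the BIOS even when the flag is present.

    # Minimal sketch: detect hardware virtualization support on a Linux host by
    # looking for the "vmx" (Intel VT) or "svm" (AMD SVM) CPU flag that the
    # kernel exposes in /proc/cpuinfo.
    def hardware_virtualization():
        try:
            with open("/proc/cpuinfo") as f:
                cpuinfo = f.read()
        except OSError:
            return None                       # not a Linux host
        for line in cpuinfo.splitlines():
            if line.startswith("flags"):
                flags = set(line.split(":", 1)[1].split())
                if "vmx" in flags:
                    return "Intel VT (vmx)"
                if "svm" in flags:
                    return "AMD SVM (svm)"
                return None                   # no hardware assist on this CPU
        return None

    print(hardware_virtualization())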

Virtualization Vendors

VMWare: VMWare has a suite of products in this area. There are two hosted products, VMWare Workstation and VMWare Server. Their hypervisor product is called VMWare ESX, and there is a version of ESX that comes burned into the BIOS, called VMWare ESXi. Virtual Center is their management product for managing the complete virtual machine infrastructure in the data center. All these products are based on dynamic binary translation technology, and they support various flavors of Windows and Linux.

Xen: This is an open source project based on para-virtualization and hypervisor technologies. Linux is modified to support para-virtualization, and Xen now supports Windows using hardware-assisted virtualization. There are a number of products based on Xen: Citrix, which bought XenSource, has a couple of Xen-based products, Sun has xVM, and Oracle has Oracle VM. Red Hat and SUSE have been shipping Xen as part of their Linux distributions for some time.

Hyper-V: This is Microsoft’s entry in this space. It is similar to the Xen architecture. It also requires hardware assistance. It comes bundled with Windows server 2008, and supports running Windows and Linux guest operating systems in the virtual machines.

Advantages of Virtualization

Virtualization has also provided some new capabilities. Server provisioning becomes very easy – it is just a matter of creating and managing a virtual machine – and this has transformed the way testing and development are done. There is another interesting feature called VMotion, or live migration, where a running virtual machine can be moved from one physical machine to another. Execution of the virtual machine is briefly suspended, the entire image of the virtual machine is moved to a different machine, and execution is then restarted, with the guest OS continuing from exactly the point where it was suspended. This eliminates the need for downtime, even for things like hardware maintenance, and also enables dynamic resource management, or utility computing.

Adoption of server virtualization has been phenomenal. There are already hundreds of thousands of servers running virtual machines. Initial adoption was restricted to test and development, but the technology has now matured enough to become quite popular in production too.

About the Authors

Anurag Agarwal

Anurag Agarwal has more than 11 years of industry experience, both in India and the US. Prior to founding KQInfotech, he was a technical director at Symantec India. Anurag designed and developed various products at Symantec (earlier Veritas). During 2006-2007, Anurag conceived the idea of Software Fault Tolerance for Xen at Symantec, and was awarded the company's highest technical award, Outstanding Innovator, in 2006 for this invention. Anurag built and led a team of ten people in India to take it from the idea stage to a product.

During the same time Anurag started working with the College of Engineering, Pune, where he and his friends offered a full semester course on the Linux kernel. Anurag was also involved in mentoring a large number of students from various engineering colleges, and this involvement in teaching and mentoring resulted in the formation of KQInfotech, with its training and mentoring focus. Prior to this, Anurag architected a scalable transaction system for the cluster file system at Symantec in the USA. This architecture improved the scalability of the cluster file system from three nodes to sixteen nodes and beyond, and he was awarded a star award for this work. He has filed half a dozen patents at Symantec. Anurag has extensive knowledge of Solaris, the Linux kernel, file systems, storage technologies and virtualization. He has an ME from the Indian Institute of Science, Bangalore, and a BE from MBM Engineering College, Jodhpur.

Anand Mitra
After completing his post-graduation at IIT Bombay in 2001, Anand worked with Symantec India (formerly Veritas). Prior to founding KQInfotech, he was a Principal Software Engineer at Symantec, chartered with the task of scoping and designing support for Windows on Xen-based Fault Tolerance. He worked for 6.5 years on the clustered file system products VxFS and CFS. He architected the online upgrade for the Veritas File System and designed the write fastpath, which improved the performance of the file system. He also designed the integration of the Power6 (PowerPC) CPU feature of storage keys for the Veritas storage stack. He co-maintained technical relations with IBM for special proprietary kernel interfaces within AIX and designed a file system pre-allocation API for the IBM DB2 database.

An ILM approach to managing unstructured electronic information

(by Bob Spurzem, Director of International Business, Mimosa Systems, and T.M. Ravi, Founder, President and CEO, Mimosa Systems. This article is reposted with permission from CSI Pune‘s newsletter, DeskTalk. The full newsletter is available here.)

In this era of worldwide electronic communication and fantastic new business applications, the amount of unstructured, electronic information generated by enterprises is exploding. This growth of unstructured information is the driving force of a significant surge in knowledge worker productivity and is creating an enormous risk for enterprises with no tools to manage it. Content Archiving is a new class of enterprise application designed for managing user-generated, unstructured electronic information. The purpose of Content Archiving is to manage unstructured information over the entire lifecycle, ensuring preservation of valuable electronic information, while providing easy access to historical information by knowledge workers. With finger-tip access to years of historical business records, workers make informed decisions driving top line revenue. Workers with legal and compliance responsibility search historical records easily in response to regulatory and litigation requests; thereby reducing legal costs and compliance risk. Using Content Archiving enterprises gain “finger-tip” access to important historical information – an important competitive advantage helping them be successful market leaders.

Unstructured Electronic Information

One of the most remarkable results of the computer era is the explosion of user-generated electronic digital information. Using a plethora of software applications, including the widely popular Microsoft® Office® products, users generate millions of unmanaged data files. To share information with co-workers and anyone else, users attach files to email, and instantly files are duplicated to unlimited numbers of users worldwide. The University of California, Berkeley School of Information Management and Systems measured the impact of electronically stored information, and the results were staggering. Between 1992 and 2003, they estimated, total electronic information grew about 30% per year. Per user per year, this corresponds to almost 800 MB of electronic information. And the United States accounts for only 40% of the world's new electronic information.

  • Email generates about 4,000,000 terabytes of new information each year — worldwide.
  • Instant messaging generates five billion messages a day (750GB), or about 274 terabytes a year.
  • The World Wide Web contains about 170 terabytes of information on its surface; in volume this is seventeen times the size of the Library of Congress print collections.

This enormous growth in electronic digital information has created many unforeseen benefits. Hal R. Varian, a business professor at the University of California, Berkeley, notes that, “From 1974 to 1995, productivity in the United States grew at around 1.4 percent a year. Productivity growth accelerated to about 2.5 percent a year from 1995 to 2000. Since then, productivity has grown at a bit over 3 percent a year, with the last few years looking particularly strong. But unlike the United States, European countries have not seen the same surge in productivity growth in the last ten years. The reason for this is that United States companies are much farther up the learning curve than European companies for applying the benefits of information technology.”

Many software applications are responsible for the emergence of the electronic office place and this surge in productivity growth, but none more so than email. From its humble beginning as a simple messaging application for users of ARPANET in the early 1970s, email has grown to become the number one enterprise application. In 2006, over 171 billion emails were sent daily worldwide, a 26% increase over 2005, and this figure is forecast to keep growing by 25-30% a year through the rest of the decade. A survey by Osterman Research asked 394 email users, "How important is your email system in helping you get your work done on a daily basis?" 76% reported that it is "extremely important". The Osterman survey also revealed that email users spend on average 2 hours 15 minutes each day doing something in email, and that 28% of users spend more than 3 hours per day using email. As confirmed by this survey and many others, email has become the most important tool for business communication and contributes significantly to user productivity.

The explosive growth in electronically stored information has created many challenges and has fundamentally changed the way electronic digital information is accessed. Traditionally, electronic information was managed in closely guarded applications used by manufacturing, accounting and engineering, and accessed by only a small number of trained professionals. These forms of electronic information are commonly referred to as structured electronic information. User-generated electronic information is quite different because it is in the hands of all workers – trained and untrained – and is commonly referred to as unstructured electronic information. Where many years of IT experience have solved the problems of managing structured information, the tools and methods necessary to manage unstructured information, for the most part, do not exist. For a typical enterprise, as much as 50% of total storage capacity is consumed by unstructured data and another 15-20% is made up of email data; the remaining 25-30% of enterprise storage is made up of structured data in enterprise databases.

Content Archiving

User-generated, unstructured electronic information is creating a chasm between the IT staff whose responsibility it is to manage electronic information and the knowledge workers who want to freely access current and historical electronic information. Knowledge workers desire "finger-tip" access to years of information, which strains IT's ability to provide information protection and availability cost effectively. Compliance officers want tools to search electronic information and preserve it for regulatory audits. And overshadowing everything is the need for information security: user-generated electronic information is often highly sensitive and requires secure access. As opposed to information that exists on the World Wide Web, electronic information that exists in organizations is meant only for authorized access.

Content Archiving represents a new class of enterprise application designed for managing user-generated unstructured electronic information in a way that addresses the needs of IT, knowledge workers and compliance officers. The nature of Content Archiving is to engage seamlessly with the applications that generate unstructured electronic information in a continuous manner for information capture, and to provide real-time end-user access for fast search and discovery. The interfaces currently used to access unstructured information (e.g. Microsoft Outlook®) are the same interfaces used by Content Archiving to provide end users with secure “finger-tip” access to volumes of electronic information.

Content Archiving handles a large variety of user-generated electronic information. Email is the dominant form of user-generated electronic information and is included in this definition. So too are Microsoft Office files (e.g. Word, Excel, PowerPoint, etc.) and countless other file formats such as .PDF and .HTML. Files that are commonly sent via email as attachments are included both in the context of email and as standalone files. In addition to email and files, there are many information types that are not text based, including digital telephony, digital pictures and digital movies. The growing popularity of digital pictures (.JPG), audio and voice mail files (.WAV, .WMA) and video files is paving the way for a new generation of communication applications. It is within reason that in the near future full-length video recordings will be shared just as easily as Excel spreadsheets are today. All these user-generated data types fall under the definition of Content.

Content Archiving distinguishes itself from traditional data protection applications. Data protection solves the important problem of restoring electronic information, but does little more. Archiving, on the other hand, is a business intelligence application that solves problems such as providing secure access to electronic information for quick search and legal discovery; measuring how much information exists; identifying what type of data exists; locating where data exists and determining when data was last accessed. For managing unstructured electronic information, Content Archiving delivers important benefits for business intelligence and goes far beyond the simple recovery function that data protection provides. Using tools that archiving provides, knowledge users can easily search years of historical information and benefit from the business information contained within.

Information Life-Cycle Management

[Image: The three phases of unstructured information]

Content Archiving recognizes that electronic information must be managed over its entire life-cycle. Information that was recently created has different needs and requirements than the same information years later, and should be managed accordingly. Three distinct phases exist for the management of electronic information: the recovery phase, the discovery phase and the compliance phase (see figure). It is the strategic purpose of Content Archiving to manage electronic information throughout this entire life-cycle, recognizing the value of information in the short term for production and in the long term as a record of business, while continually driving information storage levels down to reduce storage costs and preserving access to information.

During the recovery phase all production information requires equal protection and must be available for fast recovery should a logical error occur or a hardware failure strike the production servers. Continuous capture of information reduces the risk of losing information and supports fast disk-based recovery. The same information stores can be accessed by end users who want an easy way to restore deleted files. Content Archiving supports the recovery phase by performing as a disk-based continuous data protection application. Compared to tape-based recovery, Content Archiving can restore information more quickly and with less loss of information. It captures all information in real time and keeps all electronic information on cost-efficient storage, where it is available for fast recovery and can also be easily accessed by end users and auditors for compliance and legal discovery. The length of the recovery phase varies according to individual needs, but is typically 6-12 months.

At a point in time, which varies by organization, the increasing volume of current and historical information puts an unmanageable strain on production servers. At the same time the value of the historical electronic information decreases because it is no longer required for recovery. This is called the discovery phase. The challenge in the discovery phase is to reduce the volume of historical information while continuing to provide easy access to all information for audits and legal discovery. Content Archiving provides automated retention and disposition policies that are intelligent and can distinguish between current information and information that has been deleted by end users. Retention rules automatically dispose of information according to policies defined by the administrator, and further reduction is achieved by removing duplicates. For audits and legal discovery, Content Archiving keeps information in a secure, indexed archive and provides powerful search tools that allow auditors quick access to all current and historical information. By avoiding backup tapes, searches of historical information can be performed quickly and reliably, thereby reducing legal discovery costs.

Following the discovery phase, electronic information must be managed and preserved according to industry rules for records retention. This phase is called the compliance phase. Depending on the content, information may be required to be archived indefinitely. Storing information long-term is a technical challenge and costly if not done correctly. Content Archiving addresses the challenges of the compliance phase in two ways. First, Content Archiving provides tools which allow in-house experts, who know best what information is a record of business, to preserve information. Discovery tools enable auditors and legal counsel to flag electronic information as a business record or disposable. Second, Content Archiving manages electronic information in dedicated file containers. File containers are designed for long-term retention on tiered storage (e.g. tape, optical) for economic reasons and have self-contained indexes for reliable long-term access of information.
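To make the phase boundaries concrete, here is a minimal sketch of an age- and flag-based retention pass over an archive index; the thresholds, record structure and policy rules are invented for illustration and are not any particular product's policy engine.

    # Minimal sketch of an age-based retention/disposition pass over an archive
    # index: items move from the recovery phase to the discovery phase, items
    # flagged by legal/compliance staff are kept for the compliance phase, and
    # items past their retention period are disposed of.
    from datetime import datetime, timedelta

    RECOVERY_WINDOW  = timedelta(days=365)       # e.g. 12 months of fast recovery
    RETENTION_PERIOD = timedelta(days=7 * 365)   # e.g. 7 years before disposition

    def classify(item, now=None):
        """Return 'recovery', 'discovery', 'compliance' or 'dispose' for an item."""
        now = now or datetime.utcnow()
        age = now - item["created"]
        if age <= RECOVERY_WINDOW:
            return "recovery"
        if item.get("business_record"):          # flagged by legal/compliance staff
            return "compliance"
        if age > RETENTION_PERIOD:
            return "dispose"                     # retention rule: safe to delete
        return "discovery"

    archive_index = [
        {"id": "msg-1", "created": datetime(2008, 6, 1)},
        {"id": "doc-2", "created": datetime(2001, 3, 15), "business_record": True},
        {"id": "doc-3", "created": datetime(2000, 1, 1)},
    ]
    for item in archive_index:
        print(item["id"], "->", classify(item, now=datetime(2008, 9, 1)))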

Conclusion

The explosive growth in user-generated electronic information has been a powerful boost to knowledge worker productivity but has created many challenges for enterprises. IT staff are challenged to manage the rapidly growing information stores while keeping applications running smoothly. Compliance and legal staff are challenged to respond to regulatory audits and litigation requests by searching and accessing electronic information quickly. Content Archiving is a new class of enterprise application designed to manage unstructured electronic information over its entire life-cycle. Adhering to architectural design rules that ensure no interruption to the source application, secure access and scalability, Content Archiving manages information from the moment of creation through the recovery phase, the discovery phase and the compliance phase, where information is preserved as a long-term business record. Content Archiving provides IT staff, end users, and compliance and legal staff with the business intelligence tools they require to manage unstructured information economically while meeting demands for quick, secure access and legal and regulatory preservation.

About the Authors

Bob Spurzem
Director International Business
Mimosa Systems

Bob has 20+ years of experience in high technology product development and marketing, and is currently Director of International Business at Mimosa Systems Inc. With significant experience throughout the product life cycle – from market requirements and competitive research through positioning, sales collateral development and product launch – he has a strong focus on bringing new products to market. Prior to this, his experience includes work as a Senior Product Marketing Manager for Legato Systems and Veritas Software. Bob has an MBA from Santa Clara University and a Master's degree in Biomedical Engineering from Northwestern University.

T. M. Ravi
Co-founder, President, and CEO
Mimosa Systems

T. M. Ravi has had a long career with broad experience in enterprise management and storage. Before Mimosa Systems, Ravi was founder and CEO of Peakstone Corporation, a venture-financed startup providing performance management solutions for Fortune 500 companies. Previously, Ravi was vice president of marketing at Computer Associates (CA). At Computer Associates, Ravi was responsible for the core line of CA enterprise management products, including CA Unicenter, as well as the areas of application, systems and network management; software distribution; and help desk, security, and storage management. He joined CA through the $1.2 billion acquisition of Cheyenne Software, the market leader in storage management and antivirus solutions. At Cheyenne Software, Ravi was the vice president responsible for managing the company’s successful Windows NT business with products such as ARCserve backup and InocuLAN antivirus. Prior to Cheyenne, Ravi founded and was CEO of Media Blitz, a provider of Windows NT storage solutions that was acquired by Cheyenne Software. Earlier in his career, Ravi worked in Hewlett-Packard’s Information Architecture Group, where he did product planning for client/server and storage solutions.
