Tag Archives: archival

CSI-Pune’s ILM Seminar – A Report

CSI-Pune conducted a half-day workshop on Information Life-cycle management. T.M. Ravi, founder and CEO of Mimosa Systems gave the keynote presentation. There were product/project pitches from IBM, Zmanda, Coriolis. A talk on storage trends by Abhinav Jawadekar. Finally a panel discussion with representation from Symantec (V. Ganesh), BMC (Bladelogic; Monish Darda), Zmanda (K K George), IBM, Symphony (Surya Narayanan), and nFactorial (Hemant Joshi).

Here are my cryptic notes from the conference:

  • T.M. Ravi, CEO of Mimosa, gave talk on what he sees as challenges in storage/ILM. New requirements coming from the customers – Huge amounts of user-generated unstructured data in enterprises. Must manage it properly for legal, security and business reasons. Interesting new trends coming from the technology side – new/cheap disks. De-duplication. Storage intensive apps (eg. video). Flash storage. Green storage (i.e. energy conscious storage). SaaS and storage in the cloud (e.g. Amazon S3). Based on this, storage software should focus on these things: 1. Increase Information content of data 2. Improve security. 3. Reduce legal risk. Now he segues into a pitch for Mimosa’s products. i.e. You must have an enterprise-wide archive: 1. continuous capture (i.e. store all versions of the data). 2. Full text indexing of all the content and allow users to search by keyword. 3. Single instance storage (SIS) aka De-duplication, to reduce the storage requirements. 4. Retention policies. Mimosa is an archiving appliance that can be used for 1. ediscovery, 2. recovery, 3. end-user searches, 4. storage cost reduction.
  • Then there was a presentation from IBM on General Parallel File System (GPFS). Parallel, highly available distributed file system. I did not really understand how this is significantly different from all the other such products already out there. Also, I am not sure what part of this work is being done in Pune. Caching of files over WAN in GPFS (to improve performance when it is being accessed from a remote location) is being developed here (Ujjwal Lanjewar).
  • There was also a presentation on the SAN simulator tool. This is something that allows you to simulate a storage area network, including switches and disk arrays. It has been open-sourced and can be downloaded here. A lot of the work for this tool happens in Pune (Pallavi Galgali).
  • KKG from Zmanda demonstrating recovery manager for MySQL. This whole product has been architected and developed in Pune
  • Bernali from Coriolis demonstrated CoLaMa – a virtual machine lifecycle manager a virtual machine lifecycle manager. This is essentially CVS for virtual machine images. A version management software to keep track of all your VM images. Check out image. Work on it. Check it in. A new version gets stored in the repository. And it only stores the differences between the image – so space savings. It auto-extracts info like OS info, patchlevel etc.
  • Coriolis’ was the only live demo. The others were flash demos which looked lame (and had audio problems). Suggestion to all – if you are going to give a flash demo, at least turn off the audio and do the talking yourself. This would involve the audience much better.
  • Abhinav Jawadekar gave nice introductory talk on the various interesting technologies and trends in storage. It would have been very useful and helpful for someone new to the field. However, in this case, I think it was wasted on audience most of who’ve been doing this for 5+ years. The only new stuff was in the last few slides that were about energy aware storage (aka green storage). (For example, he pointed out that data-center class storage in Pune is very expensive due to the high storage costs – due to power, cooling, UPS, genset, the operating costs of a 42U rack are $800 to $900 per month.)
  • The panel discussion touched upon a number of topics, not all of them interesting. I did not really capture notes of that.

Overall, it was an interesting evening. With about 50 people attending, the turnout was a little lower than I expected. I’m not sure what needs to be done in Pune to get people to attend. If you have suggestions, let me know. If you are interested in getting in touch with any of the people mentioned above, let me know, and I can connect you.

Archival, e-Discovery and Compliance

Archival of e-mails and other electronic documents, and the use of such archives in legal discovery is an emerging and exciting new field in enterprise data management. There are a number of players in this area in Pune, and it is, in general, a very interesting and challenging area. This article gives a basic background. Hopefully, this is just the first in a series of articles and future posts will delve into more details.


In the US, and many other countries, when a company files a lawsuit against another, before the actual trial takes place, there is a pre-trial phase called “discovery”. In this phase, each side asks the other side to produce documents and other evidence relating to specific aspects of the case. The other side is, of course, not allowed to refuse and must produce the evidence as requested. This discovery also applies to electronic documents – most notably, e-mail. Now unlike other documents, electronic documents are very easy to delete. And when companies were asked to produce certain e-mails in court as part of discovery, more and more of them began claiming that relevant e-mails had already been deleted, or they were unable to find the e-mails in their backups. The courts are not really stupid, and quickly decided that the companies were lying in order to avoid producing incriminating evidence. This gave rise to a number of laws in recent times which specify, essentially, that relevant electronic documents cannot be deleted for a certain number of years, and that they should be stored in an easily searchable archive, and that failure to produce these documents in a reasonable amount of time is a punishable offense. This is terrible news for most companies, because now, all “important” emails (for a very loose definition of “important”) must be stored for many years. Existing backup systems are not good enough, because those are not really searchable by content. And the archives cannot be stored on cheap tapes either, because those are not searchable. Hence, they have to be on disk. This is a huge expense. And a management nightmare. But failure to comply is even worse. There have been actual instances of huge fines (millions of dollars, and in once case, a billion dollars) imposed by courts on companies that were unable to produce relevant emails in court. In some cases, the company executives were slapped with personal fines (in addition to fines on the company). On the other hand, this is excellent news for companies that sell archival software that helps you store electronic documents for the legally sufficient number of years in a searchable repository. The demand for this software, driven by draconian legal requirments, is HUGE, and an entire industry has burgeoned to service this market. Just e-mail archival soon be a billion dollar market. (Update: Actually it appears that in 2008, archival software alone is expected to touch 2 billion dollars with a growth rate of 47% per year, and e-discovery and litigation support software market will be 4 billion growing at 27%. And this doesn’t count the e-discovery services market which is much much larger.) There are three major chunks to this market:

  • Archival – Ability to store (older) documents for a long time on cheaper disks in a searchable repository
  • Compliance – Ensuring that the archival store complies with all the relevant laws. For example, the archive must be tamperproof.
  • e-Discovery – The archive should have the required search and analysis tools to ensure that it is easy to find all the relevant documents required in discovery


Archival software started its life before the advent of these compliance laws. Basic email archival is simply a way to move all your older emails out of your expensive MS exchange database, into cheaper, slower disks. And shortcuts are left in the main Exchange database so that if the user ever wants to access one of these older emails, it is fetched on demand from the slower archival disks. This is very much like a page fault in virtual memory. The net effect is that for most practical purposes, you’ve increased the sizes of peoples’ mailboxes without a major increase in price, without a decrease in performance for recent emails, and some decrease in performance for older emails. Unfortunately, these guys had only middling success. Think about it – if your IT department is given a choice between spending money on an archival software that will allow them to increase your mailbox size, or simply telling all you users to learn to live with small mailbox sizes, what would they choose? Right. So the archival software companies saw only moderate growth. All of this changed when the e-discovery laws came into effect. Suddenly, archival became a legal requirement instead of a good-to-have bonus feature. Sales exploded. Startups started. And it also added a bunch of new requirements, described in the next two sections.


Before I start, I should note that “IT Compliance” in general is really a huge area and includes all kinds of software and services required by IT to comply with any number of laws (like Sarbanes Oxley for accounting, HIPAA for medical records, etc.) That is not the compliance I am referring to in this article. Here we only deal with compliance as it pertains to archival software. The major compliance requirement is that for different types of e-mails and electronic documents, the laws prescribe the minimum number of years for which they must be retained by the company. And, no company wants to really keep any of these documents a day more than is minimally required by the law. Hence, for each document, the archival software must maintain the date until which the document must be retained, and on that day, it must automatically delete that document. Except, if the document is “relevant” to a case that is currently running. Then the document cannot be deleted until the case is over. This allows us to introduce the concept of a legal hold (or a deletion hold) that is placed on a document or a set of documents as soon as it is determined that it is relevant to a current case. The archival software ensures that documents with a deletion hold are not deleted even if their retention period expires. The deletion hold is only removed after the case is over. The archival software needs to ensure that the archive is tamperproof. Even if the CEO of the company walks up to the system one day in the middle of the night, he should not be able to delete or modify anything. Another major compliance requirement is that the archival software must make it possible to find “relevant” documents in a “reasonable” amount of time. The courts have some definition of what “relevant” and “reasonable” mean in this context, but we’ll not get into that. What it really means for the developers, is that there should be a fairly sophisticated search facility that allows searches by keywords, by regular expressions, and by various fields of the metadata (e.g., find me all documents authored by Basant Rajan from March to September 2008).


Sadly, just having a compliant archive is no longer good enough. Consider a typical e-discovery scenario. A company is required to produce all emails authored by Basant Rajan pertaining to the volume manager technology in the period March to September 2008. Now just producing all the documents by Basant for that period which contain the words “volume manager” is not good enough. Because he might have referred to it as “VM“. Or he might have just talked about space optimized snapshots without mentioning the words volume manager. So, what happens is that all emails written by Basant in that period are retreived, and a human has to go through each email, to determine if it is relevant to volume manager or not. And this human must be a lawyer. Who charges $4 per email because he has to pay off his law school debt. For a largish company, a typical lawsuit might involve millions of documents. Literally. Now you know why there is so much money in this market. Just producing all documents by Basant and dumping them on the opposing lawyers is not an option. Because the company does not want to disclose to the opposing side anything more than is absolutely necessary. Who knows what other smoking guns are present in Basant’s email? Thus, a way for different archival software vendors to differentiate themselves is the sophistication they can bring to this process. The ability to search for concepts like “volume management” as opposed to the actual keywords. The ability to group/cluster a set of emails by concepts. The ability to allow teams of people to collaboratively work on this job. The ability to search for “all emails which contain a paragraph similar to this paragraph“. If you know how to do this last part, I know a few companies that would be desperate to hire you.

What next?

In Pune, there are at least two companies Symantec, and Mimosa Systems, working in this area. (Mimosa’s President and CEO, T.M. Ravi, is currently in town and will give the keynote for CSI-Pune’s ILM Seminar this Thursday. Might be worth attending if you are interested in this area.) I also believe that CT Summation’s CaseVault system also has a development team here, but I am unable to find any information about that – if you have a contact there, please let me know. For some (possibly outdated) details of the other (worldwide) players in this market, see this report from last year. If you are from one of these companies, and can write an article on what exactly your software does in this field, and how it is better than the others, please let me know. I also had a very interesting discussion with Paul C. Easton of True Legal Partners, an e-Discovery outsourcing firm, where we talked about how they use archiving and e-discovery software, but more generally we also talked about how legal outsourcing to India, suitability of Pune for the same, competition from China, etc. I will write an article on that sometime soon – stay tuned (by subscribing to the PuneTech by email or via RSS)

CSI-Pune Seminar on Information Lifecycle Management – 29 May

What: Seminar on Information Lifecycle Management. ILM consists all the technologies required during the lifetime of some data stored in an enterprise. How data comes in, where it is stored, the storage hardware/software and architecture, how it is archived and backed up, retention policies, and deletion policies.

When: Thursday, 29 May 2008, 4pm to 9pm

Where: National Insurance Academy 25,Balewadi, Baner Road

Fees: Rs. 400 for CSI members, Rs. 500 for others

Registration: register online

Detailed Program:

3:30 pm – 4:00 pm : Registration

4:00 pm – 4:15 pm : Inauguration and release of CSI Newsletter

4:15 pm – 5:15 pm : Keynote address – T. M. Ravi (President and CEO Mimosa Systems)

5:15 pm – 5:45 pm : Tea Break

5:45 pm – 6:45 pm
Demonstration of products – IBM, Zmanda Technologies, Coriolis

6:45 pm – 7:30 pm
Technical talk – Most promising new technologies in Storage and ILM space by by Abhinav Jawadekar – Founder, Sound Paradigm, Software Engineering Services.

7:30 pm – 8:15 pm
Panel discussion – Technology Trends in Storage and its correlation with Career Opportunities . Panelists are Surya Narayanan (Symphony), K K George (Zmanda Technologies), Monish Darda (Bladelogic), Bhushan Pandit (Nes technologies), V. Ganesh (Symantec). Moderated by Hemant Joshi (nFactorial Software)

8:15 pm onwards: Dinner

The event is open to everybody, but you have to register online. Fees are Rs. 400 for members, Rs. 500 for non-members.