Archival, e-Discovery and Compliance

Archival of e-mails and other electronic documents, and the use of such archives in legal discovery, is an emerging and exciting new field in enterprise data management. There are a number of players in this area in Pune, and it is, in general, a very interesting and challenging area. This article gives a basic background. Hopefully, this is just the first in a series of articles, and future posts will delve into more details.

Background

In the US, and many other countries, when a company files a lawsuit against another, there is a pre-trial phase called “discovery” before the actual trial takes place. In this phase, each side asks the other side to produce documents and other evidence relating to specific aspects of the case. The other side is, of course, not allowed to refuse and must produce the evidence as requested.

This discovery also applies to electronic documents – most notably, e-mail. Unlike other documents, electronic documents are very easy to delete. When companies were asked to produce certain e-mails in court as part of discovery, more and more of them began claiming that the relevant e-mails had already been deleted, or that they were unable to find the e-mails in their backups. The courts are not really stupid, and quickly decided that the companies were lying in order to avoid producing incriminating evidence. This gave rise to a number of laws in recent times which specify, essentially, that relevant electronic documents cannot be deleted for a certain number of years, that they should be stored in an easily searchable archive, and that failure to produce these documents in a reasonable amount of time is a punishable offense.

This is terrible news for most companies, because now all “important” emails (for a very loose definition of “important”) must be stored for many years. Existing backup systems are not good enough, because those are not really searchable by content. And the archives cannot be stored on cheap tapes either, because those are not searchable. Hence, they have to be on disk. This is a huge expense. And a management nightmare. But failure to comply is even worse. There have been actual instances of huge fines (millions of dollars, and in one case, a billion dollars) imposed by courts on companies that were unable to produce relevant emails in court. In some cases, the company executives were slapped with personal fines (in addition to fines on the company).

On the other hand, this is excellent news for companies that sell archival software that helps you store electronic documents for the legally sufficient number of years in a searchable repository. The demand for this software, driven by draconian legal requirements, is HUGE, and an entire industry has burgeoned to service this market. Just e-mail archival will soon be a billion dollar market. (Update: Actually it appears that in 2008, archival software alone is expected to touch 2 billion dollars with a growth rate of 47% per year, and the e-discovery and litigation support software market will be 4 billion, growing at 27%. And this doesn’t count the e-discovery services market, which is much, much larger.) There are three major chunks to this market:

  • Archival – Ability to store (older) documents for a long time on cheaper disks in a searchable repository
  • Compliance – Ensuring that the archival store complies with all the relevant laws. For example, the archive must be tamperproof.
  • e-Discovery – The archive should have the required search and analysis tools to ensure that it is easy to find all the relevant documents required in discovery

Archival

Archival software started its life before the advent of these compliance laws. Basic email archival is simply a way to move all your older emails out of your expensive MS Exchange database onto cheaper, slower disks. Shortcuts are left in the main Exchange database so that if the user ever wants to access one of these older emails, it is fetched on demand from the slower archival disks. This is very much like a page fault in virtual memory. The net effect is that for most practical purposes, you’ve increased the sizes of people’s mailboxes without a major increase in price, without a decrease in performance for recent emails, and with some decrease in performance for older emails.

Unfortunately, these guys had only middling success. Think about it – if your IT department is given a choice between spending money on archival software that will allow them to increase your mailbox size, or simply telling all your users to learn to live with small mailbox sizes, what would they choose? Right. So the archival software companies saw only moderate growth.

All of this changed when the e-discovery laws came into effect. Suddenly, archival became a legal requirement instead of a good-to-have bonus feature. Sales exploded. Startups started. And it also added a bunch of new requirements, described in the next two sections.
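To make the “shortcut plus fetch on demand” idea concrete, here is a minimal sketch in Python. It is only an illustration of the concept: the `Mailbox` and `Stub` classes and their methods are hypothetical names made up for this example, and this is not how Exchange or any particular archival product actually implements it.

```python
from dataclasses import dataclass, field


@dataclass
class Stub:
    """Shortcut left in the primary mailbox after a message is archived."""
    message_id: str
    subject: str


@dataclass
class Mailbox:
    # Primary (fast, expensive) store: message_id -> (subject, body) or Stub.
    primary: dict = field(default_factory=dict)
    # Archive (slow, cheap) store: message_id -> body.
    archive: dict = field(default_factory=dict)

    def deliver(self, message_id, subject, body):
        """Store a new message in the primary store."""
        self.primary[message_id] = (subject, body)

    def archive_message(self, message_id):
        """Move the body to the archive and leave a stub behind."""
        subject, body = self.primary[message_id]
        self.archive[message_id] = body
        self.primary[message_id] = Stub(message_id, subject)

    def read(self, message_id):
        """Return the body, fetching it from the archive if only a stub is left."""
        item = self.primary[message_id]
        if isinstance(item, Stub):
            # The "page fault": the body is no longer in the primary store.
            return self.archive[item.message_id]
        _subject, body = item
        return body
```

The user calls `read()` the same way for recent and archived messages; only the archived ones pay the cost of going to the slower store.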

Compliance

Before I start, I should note that “IT Compliance” in general is really a huge area and includes all kinds of software and services required by IT to comply with any number of laws (like Sarbanes Oxley for accounting, HIPAA for medical records, etc.). That is not the compliance I am referring to in this article. Here we only deal with compliance as it pertains to archival software.

The major compliance requirement is that for different types of e-mails and electronic documents, the laws prescribe the minimum number of years for which they must be retained by the company. And no company really wants to keep any of these documents a day more than is minimally required by the law. Hence, for each document, the archival software must maintain the date until which the document must be retained, and on that day, it must automatically delete that document.

Except if the document is “relevant” to a case that is currently running. Then the document cannot be deleted until the case is over. This allows us to introduce the concept of a legal hold (or a deletion hold) that is placed on a document or a set of documents as soon as it is determined that it is relevant to a current case. The archival software ensures that documents with a deletion hold are not deleted even if their retention period expires. The deletion hold is only removed after the case is over.

The archival software needs to ensure that the archive is tamperproof. Even if the CEO of the company walks up to the system one day in the middle of the night, he should not be able to delete or modify anything.

Another major compliance requirement is that the archival software must make it possible to find “relevant” documents in a “reasonable” amount of time. The courts have some definition of what “relevant” and “reasonable” mean in this context, but we’ll not get into that.
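The retention and legal hold rules described above can be sketched in a few lines of Python. This is a toy model under my own assumptions: `ArchivedDoc`, `Archive` and the method names are hypothetical, and tamperproofing (which in practice needs things like write-once storage and audit trails) is not modelled at all.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ArchivedDoc:
    doc_id: str
    retain_until: date        # date until which the document must be retained
    legal_hold: bool = False  # True while the document is relevant to a running case


class Archive:
    def __init__(self):
        self._docs = {}

    def add(self, doc):
        self._docs[doc.doc_id] = doc

    def place_hold(self, doc_id):
        """Mark a document as relevant to a current case."""
        self._docs[doc_id].legal_hold = True

    def release_hold(self, doc_id):
        """Remove the hold once the case is over."""
        self._docs[doc_id].legal_hold = False

    def purge_expired(self, today):
        """Delete documents whose retention date has arrived, unless they are on hold."""
        expired = [d.doc_id for d in self._docs.values()
                   if d.retain_until <= today and not d.legal_hold]
        for doc_id in expired:
            del self._docs[doc_id]
        return sorted(expired)

    def __contains__(self, doc_id):
        return doc_id in self._docs
```

Running `purge_expired()` once a day gives the “delete on the expiry date, except under hold” behaviour. A document whose hold is released after its retention date simply gets deleted on the next run.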
For developers, the requirement to find relevant documents quickly really means that there should be a fairly sophisticated search facility that allows searches by keywords, by regular expressions, and by various fields of the metadata (e.g., find me all documents authored by Basant Rajan from March to September 2008).
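As a rough illustration of what such a search facility has to support, here is a small Python sketch that combines metadata filters (author, date range) with keyword and regular expression matching. The `Email` class and the `search()` function are hypothetical names for this example, and the linear scan is only for clarity. A real archive would presumably answer such queries from an index.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class Email:
    author: str
    sent: date
    body: str


def search(emails, author=None, start=None, end=None, keywords=(), pattern=None):
    """Return the emails that match every criterion that was supplied."""
    regex = re.compile(pattern, re.IGNORECASE) if pattern else None
    results = []
    for e in emails:
        # Metadata filters.
        if author is not None and e.author != author:
            continue
        if start is not None and e.sent < start:
            continue
        if end is not None and e.sent > end:
            continue
        # Content filters: all keywords must appear (case-insensitive).
        text = e.body.lower()
        if any(k.lower() not in text for k in keywords):
            continue
        if regex is not None and not regex.search(e.body):
            continue
        results.append(e)
    return results
```

For example, `search(emails, author="Basant Rajan", start=date(2008, 3, 1), end=date(2008, 9, 30), keywords=["volume manager"])` corresponds to the query in the paragraph above, restricted to one keyword phrase.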

e-Discovery

Sadly, just having a compliant archive is no longer good enough. Consider a typical e-discovery scenario. A company is required to produce all emails authored by Basant Rajan pertaining to the volume manager technology in the period March to September 2008. Now just producing all the documents by Basant for that period which contain the words “volume manager” is not good enough. Because he might have referred to it as “VM”. Or he might have just talked about space optimized snapshots without mentioning the words volume manager.

So, what happens is that all emails written by Basant in that period are retrieved, and a human has to go through each email to determine if it is relevant to volume manager or not. And this human must be a lawyer. Who charges $4 per email because he has to pay off his law school debt. For a largish company, a typical lawsuit might involve millions of documents. Literally. Now you know why there is so much money in this market.

Just producing all documents by Basant and dumping them on the opposing lawyers is not an option. Because the company does not want to disclose to the opposing side anything more than is absolutely necessary. Who knows what other smoking guns are present in Basant’s email?

Thus, a way for different archival software vendors to differentiate themselves is the sophistication they can bring to this process. The ability to search for concepts like “volume management” as opposed to the actual keywords. The ability to group/cluster a set of emails by concepts. The ability to allow teams of people to collaboratively work on this job. The ability to search for “all emails which contain a paragraph similar to this paragraph”. If you know how to do this last part, I know a few companies that would be desperate to hire you.
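To give a feel for the “paragraph similar to this paragraph” problem, here is a deliberately naive baseline in Python: break each paragraph into overlapping word triples (shingles) and compare the sets using Jaccard similarity. This is my own illustrative choice, not a description of what any vendor does; the function names and the 0.5 threshold are assumptions. It only catches near-verbatim reuse of text, not paragraphs that express the same concept in different words, which is the genuinely hard part.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_paragraphs(query, emails, threshold=0.5, k=3):
    """Return (email_id, score) pairs for emails with a paragraph similar to query.

    emails maps an email id to its body; paragraphs are separated by blank lines.
    """
    q = shingles(query, k)
    hits = []
    for email_id, body in emails.items():
        paragraphs = [p for p in body.split("\n\n") if p.strip()]
        best = max((jaccard(q, shingles(p, k)) for p in paragraphs), default=0.0)
        if best >= threshold:
            hits.append((email_id, best))
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

Comparing the query against every paragraph of every email does not scale to millions of documents, so a real system would need some form of indexing on top of an idea like this.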

What next?

In Pune, there are at least two companies, Symantec and Mimosa Systems, working in this area. (Mimosa’s President and CEO, T.M. Ravi, is currently in town and will give the keynote for CSI-Pune’s ILM Seminar this Thursday. Might be worth attending if you are interested in this area.) I believe that CT Summation’s CaseVault system also has a development team here, but I am unable to find any information about that – if you have a contact there, please let me know. For some (possibly outdated) details of the other (worldwide) players in this market, see this report from last year. If you are from one of these companies, and can write an article on what exactly your software does in this field, and how it is better than the others, please let me know.

I also had a very interesting discussion with Paul C. Easton of True Legal Partners, an e-Discovery outsourcing firm, where we talked about how they use archiving and e-discovery software, and more generally about legal outsourcing to India, the suitability of Pune for the same, competition from China, etc. I will write an article on that sometime soon – stay tuned (by subscribing to PuneTech by email or via RSS).

7 thoughts on “Archival, e-Discovery and Compliance”

  1. The over-arching themes here are retention and discovery. The two need not necessarily be linked, but it helps if they are. And so, archival is a tool that enables retention, compliance is sold as a side-effect of the archival process and e-discovery is typically sold as an add-on (on top of the archival systems). This is typically so because the archival system vendors want it that way – it helps them sell their products and also helps customers in making consolidated buying decisions.

    But, it all need not be so. In fact, the market (and user-base also?) for non-archival based e-discovery is much much bigger than archival-based e-discovery software.

    Archival, compliance and e-discovery, or retention and discovery, could be done in isolation. For example, according to an e-discovery standard (EDRM), one step in e-discovery is “data collection” – but it doesn’t specify exactly how the data should be collected. Archival is just one way to do it. Other ways are crawling, document/record management systems, etc. Take the example of Kazeon, which sells e-discovery solutions using its enterprise crawler appliance. Is it legally complete? Maybe not. But for a focussed discovery scenario (litigation or internal inquiry), it should suffice. There are more such examples (Guidance, Index Engines). Or, just look at what EMC is doing with Documentum, EmailXtender and e-discovery. Or, consider e-discovery software like Clearwell or Attenex, which claim to work with a variety of document repositories.

    This is a very complex domain and from what I’ve read and heard, it’s still a consultant’s and lawyer’s market. They dominate it and are way, way ahead of software numbers. Retention is a space where they can’t play much, but can still affect it in some ways. And discovery is where they make hay – it’s a semi-automatic process (especially “document review”) and, as you said, they charge as much as $4 per email.

    Software (and LPO :-) squeezes into discovery by claiming to bring that $4 down to $2 or even $1. It can effectively automate “data collection” (e.g. archival, crawling), make “data processing” (format conversions, basic filtering, de-dupe, etc.) faster, semi-automate “data review” (using techniques like indexing, concept extraction, threading, clustering, entity extraction, summarization and so on), standardize “data production” and help in “discovery planning”. I don’t know how much software has been able to help and bring down costs in reality, but I know everybody is sure that it can or will. It would be difficult (impossible, I guess) to get rid of the human angle, but the goal is to bring it down to the minimum necessary. Otherwise, a lot of our friends in the LPO community will start starving ;-).

  2. Rajesh, that is exactly right. Google’s Postini is a significant competitor in this space. Specifically, the fact that it is a hosted solution (i.e. the customer does not have to install and administer any hardware locally) is an important reason why smaller businesses are attracted to it. Its main drawback however, is that it does not really have the sophisticated analytics and workflows required to do eDiscovery well.

    If Google manages to fix that, it should be able to cause significant problems to the leaders in this market.

  3. Ceralon’s Acumen complements Postini in its ability to provide the workflow/audit trail along with the remaining blocks of the EDRM model, etc., for an end-to-end electronic discovery application. Acumen provides the pre-processing, processing, analytics, native file review, tagging, redacting, production and export, all with granular security and permissions, along with workflow management and an audit trail for each document and event through the life cycle of the document.

  4. Capital legal solution’s eZReview provides all the necessary solutions, like keyword analytics, case clustering, dynamic clustering and email threading, in their software.
    The data collection and data culling techniques used by this firm have really brought the cost down to its minimum.
