Category Archives: In Depth

Data Leakage Prevention – Overview

A few days ago, we posted a news article on how Reconnex has been named a top leader in Data Leakage Prevention (DLP) technology by Forrester Research. We asked Ankur Panchbudhe of Reconnex, Pune to write an article giving us a background on what DLP is, and why it is important.

Data leakage protection (DLP) is a solution for identifying, monitoring and protecting sensitive data or information in an organization according to policies. Organizations can have varied policies, but typically they tend to focus on preventing sensitive data from leaking out of the organization and identifying people or places that should not have access to certain data or information.

DLP is also known by many other names: information security, content monitoring and filtering (CMF), extrusion prevention, outbound content management, insider threat protection, information leak prevention (ILP), etc.

Need for DLP

Until a few years ago, organizations thought of data/information security only in terms of protecting their network from intruders (e.g. hackers). But with the growing amount of data, the rapid growth in the size of organizations (e.g. due to globalization), the rise in the number of data points (machines and servers) and easier modes of communication (e.g. IM, USB, cellphones), accidental or even deliberate leakage of data from within the organization has become a painful reality. This has led to growing awareness about information security in general and about outbound content management in particular.

Following are the major reasons (and examples) that make an organization think about deploying DLP solutions:

  • growing cases of data and IP leakages
  • regulatory mandates to protect private and personal information
    • for example, the case of Monster.com losing over a million private customer records due to phishing
  • protection of brand value and reputation
    • see above example
  • compliance (e.g. HIPAA, GLBA, SOX, PCI, FERPA)
    • for example, Ferrari and McLaren engaging in anti-competitive practices by allegedly stealing internal technical documents
  • internal policies
    • for example, Facebook leaking some pieces of their code
  • profiling for weaknesses
    • Who has access to what data? Is sensitive data lying on public servers? Are employees doing what they are not supposed to do with data?

Components of DLP

Broadly, the core DLP process has three components: identification, monitoring and prevention.

The first, identification, is a process of discovering what constitutes sensitive content within an organization. For this, an organization first has to define “sensitive”. This is done using policies, which are composed of rules, which in turn could be composed of words, patterns or something more complicated. These rules are then fed to a content discovery engine that “crawls” data sources in the organization for sensitive content. Data sources could include application data like HTTP/FTP servers, Exchange, Notes, SharePoint and database servers, repositories like filers and SANs, and end-user data sources like laptops, desktops and removable media. There could be different policies for different classes of data sources; for example, the policies for SharePoint could try to identify design documents whereas those for Oracle could be tuned to discover credit card numbers. All DLP products ship with pre-defined policy “packages” for well-known scenarios, like PCI compliance, credit card and social security leakage.
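
To make the policy/rule idea concrete, here is a minimal sketch (in Python; the policy structure, rule names, patterns and source names are invented for illustration and are not the syntax of any particular DLP product) of rules that a discovery engine could apply while crawling data sources:

    import re

    # A hypothetical policy: a named set of rules, each a regular expression
    # plus the data sources it should be applied to.
    POLICY_PCI = {
        "name": "PCI compliance (sketch)",
        "rules": [
            # 13-16 digit runs, optionally separated by spaces/dashes: a crude
            # first-pass test for payment card numbers
            {"id": "card-number",
             "pattern": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
             "sources": ["sharepoint", "oracle", "file-shares"]},
            # documents explicitly marked as confidential
            {"id": "confidential-marking",
             "pattern": re.compile(r"proprietary and confidential", re.I),
             "sources": ["file-shares", "exchange"]},
        ],
    }

    def matches(policy, source, text):
        """Return ids of all rules in `policy` that apply to `source` and match `text`."""
        return [r["id"] for r in policy["rules"]
                if source in r["sources"] and r["pattern"].search(text)]

    # matches(POLICY_PCI, "oracle", "card: 4111 1111 1111 1111") -> ["card-number"]

A real policy package would layer many more rules, per-source tuning and exceptions on top of something like this.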

The second component, monitoring, typically deployed at the network egress point or on end-user endpoints, is used to flag data or information that should not be going out of the organization. This flagging is done using a set of rules and policies, which could be written independently for monitoring purposes, or could be derived from information gleaned during the identification process (previous paragraph). The monitoring component taps into raw data going over the wire, does some (optional) semantic reconstruction and applies policies to it. Raw data can be captured at many levels – network level (e.g. TCP/IP), session level (e.g. HTTP, FTP) or application level (e.g. Yahoo! Mail, GMail). The level at which raw data is captured decides whether, and how much, semantic reconstruction is required. The reconstruction process tries to assemble fragments of raw data into processable information, on which policies can be applied.

The third component, prevention, is the process of taking some action on the data flagged by the identification or monitoring component. Many types of actions are possible – blocking the data, quarantining it, deleting, encrypting, compressing, notifying and more. Prevention actions are also typically configured using policies and hook into identification and/or monitoring policies. This component is typically deployed along with the monitoring or identification component.

In addition to the above three core components, there is a fourth piece which can be called control. This is the component through which the user can centrally manage and monitor the whole DLP process. It typically includes the GUI, the policy/rule definition and deployment module, process control, reporting and various dashboards.

Flavors of DLP

DLP products are generally sold in three “flavors”:

  • Data in motion. This is the flavor that corresponds to a combination of the monitoring and prevention components described in the previous section. It is used to monitor and control outgoing traffic. This is the hottest selling DLP solution today.
  • Data at rest. This is the content discovery flavor that scours an organization’s machines for sensitive data. This solution usually also includes a prevention component.
  • Data in use. This solution consists of agents that run on end-servers and end users’ laptops or desktops, keeping a watch on all activities related to data. They typically monitor and prevent activity on file systems and removable media like USB drives, CDs and Bluetooth.

These individual solutions can be (and are) combined to create a much more effective DLP setup. For example, data at rest could be used to identify sensitive information, fingerprint it and deploy those fingerprints with data in motion and data in use products for an all-scenario DLP solution.
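
To illustrate the fingerprinting idea mentioned above (a toy sketch only – commercial products use far more robust, partial-match fingerprinting), the data-at-rest component could hash overlapping chunks of known-sensitive documents and hand those hashes to the data-in-motion and data-in-use components:

    import hashlib

    def fingerprints(text, window=50, step=25):
        """Hash overlapping character windows of a document (a toy shingling scheme)."""
        chunks = [text[i:i + window]
                  for i in range(0, max(1, len(text) - window + 1), step)]
        return {hashlib.sha1(c.encode("utf-8")).hexdigest() for c in chunks}

    # Built once by the data-at-rest scanner over a known-sensitive document
    # (placeholder text here; in practice the crawler supplies the real content):
    sensitive_prints = fingerprints("Project X design: the cache coherence protocol uses " * 5)

    def looks_sensitive(outbound_text, known_prints, threshold=1):
        """Data-in-motion check: does outbound content share fingerprints with known sensitive data?"""
        return len(fingerprints(outbound_text) & known_prints) >= threshold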

Technology

DLP solutions classify data in motion, at rest, and in use, and then dynamically apply the desired type and level of control, including the ability to perform mandatory access control that can’t be circumvented by the user. DLP solutions typically:

  • Perform content-aware deep packet inspection on outbound network communication including email, IM, FTP, HTTP and other TCP/IP protocols
  • Track complete sessions for analysis, not individual packets, with full understanding of application semantics
  • Detect (or filter) content based on policy-driven rules
  • Use linguistic analysis techniques beyond simple keyword matching for monitoring (e.g. advanced regular expressions, partial document matching, Bayesian analysis and machine learning); see the sketch after this list
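
As an example of going beyond plain keyword matching (my illustration, not any vendor’s actual implementation), a credit card rule typically pairs a regular expression with a Luhn checksum so that random 16-digit numbers do not trigger false positives:

    import re

    CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

    def luhn_ok(digits):
        """Standard Luhn checksum used to validate payment card numbers."""
        total, parity = 0, len(digits) % 2
        for i, ch in enumerate(digits):
            d = int(ch)
            if i % 2 == parity:      # double every second digit from the right
                d *= 2
                if d > 9:
                    d -= 9
            total += d
        return total % 10 == 0

    def find_card_numbers(text):
        """Return candidate card numbers that also pass the Luhn check."""
        hits = []
        for m in CARD_RE.finditer(text):
            digits = re.sub(r"[ -]", "", m.group())
            if 13 <= len(digits) <= 16 and luhn_ok(digits):
                hits.append(digits)
        return hits

    # find_card_numbers("order ref 1234 5678 9012, card 4111 1111 1111 1111")
    # -> ["4111111111111111"]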

Content discovery makes use of crawlers to find sensitive content in an organization’s network of machines. Each crawler is composed of a connector, browser, filtering module and reader. A connector is a data-source-specific module that helps in connecting to, browsing and reading from a data source. So, there are connectors for various types of data sources like CIFS, NFS, HTTP, FTP, Exchange, Notes, databases and so on. The browser module lists all the data that is accessible within a data source. This listing is then filtered depending on the requirements of discovery. For example, if the requirement is to discover and analyze only source code files, then all other types of files will be filtered out of the listing. There are many dimensions (depending on meta-data specific to a piece of data) on which filtering can be done: name, size, content type, folder, sender, subject, author, dates etc. Once the filtered list is ready, the reader module does the job of actually downloading the data and any related meta-data.
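
A sketch of that connector/browser/filter/reader pipeline, using the local file system as a stand-in data source (the class and function names, and the example path, are invented for the illustration):

    import os

    class FileSystemConnector:
        """Toy connector for one kind of data source: a directory tree on disk."""
        def __init__(self, root):
            self.root = root

        def browse(self):
            """Browser module: list everything reachable in the data source, with metadata."""
            for dirpath, _dirs, files in os.walk(self.root):
                for name in files:
                    path = os.path.join(dirpath, name)
                    yield {"path": path, "name": name, "size": os.path.getsize(path)}

        def read(self, item):
            """Reader module: actually download the data for an item."""
            with open(item["path"], "rb") as f:
                return f.read()

    def crawl(connector, keep=lambda item: True):
        """Filter the listing, then read only what survives the filter."""
        for item in connector.browse():
            if keep(item):
                yield item, connector.read(item)

    # Example: discover only source-code files under 1 MB
    # for meta, data in crawl(FileSystemConnector("/srv/projects"),
    #                         keep=lambda i: i["name"].endswith((".c", ".py")) and i["size"] < 1_000_000):
    #     ...feed `data` to the content analysis / policy engine...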

The monitoring component is typically composed of the following modules: data tap, reassembly, protocol analysis, content analysis, indexing engine, rule engine and incident management. The data tap captures data from the wire for further analysis (much like WireShark, a.k.a. Ethereal). As mentioned earlier, this capture can happen at any protocol level – this differs from vendor to vendor, depending on design philosophy. After data is captured from the wire, it is massaged into a form that is suitable for further analysis. For example, captured TCP packets could be reassembled into a higher level protocol like HTTP and further into application level data like Yahoo! Mail.

Once the data is in an analyzable form, the first level of policy/rule evaluation is done using protocol analysis. Here, the data is parsed for protocol-specific fields like IP addresses, ports, possible geographic locations of IPs, To, From, Cc, FTP commands, Yahoo! Mail XML tags, GTalk commands and so on. Policy rules that depend on any such protocol-level information are evaluated at this stage; an example is “outbound FTP to any IP address in Russia”. If a match occurs, it is recorded with all relevant information in a database.

The next step, content analysis, is more involved: first, actual data and meta-data are extracted out of the assembled packets, and then the content type of the data (e.g. PPT, PDF, ZIP, C source, Python source) is determined using signatures and rule-based classification techniques (the Unix “file” command does something similar, but less powerful). Depending on the content type, text is extracted along with as much meta-data as possible. Now, content-based rules are applied – for example, disallow all Java source code. Again, matches are stored. Depending on the rules, more involved analysis like classification (e.g. Bayesian), entity recognition, tagging and clustering can also be done. The extracted text and meta-data are passed on to the indexing engine, where they are indexed and made searchable. Another set of rules, which depend on the contents of data, are evaluated at this point; an example: stop all MS Office or PDF files containing the words “proprietary and confidential” with a frequency of at least once per page. The indexing engine typically makes use of an inverted index, but there are other ways too. This index can also be used later for ad-hoc searches (e.g. for deeper analysis of a policy match).

All along this process, the rule engine keeps evaluating many rules against many pieces of data and keeps track of all the matches. The matches are collated into what are called incidents (i.e. actionable events, from an organization’s perspective) with as much detail as possible. These incidents are then notified or shown to the user and/or sent to the prevention module for further action.
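
Condensed into a small Python sketch, the flow after the data tap and reassembly stages looks roughly like this (the rule format, field names and the watched IP range are made up for illustration; a real engine is far more elaborate):

    import re

    # Invented, minimal rule formats for illustration.
    protocol_rules = [
        # flag outbound FTP sessions to a watched network range
        {"id": "ftp-to-watched-range",
         "test": lambda f: f.get("protocol") == "ftp" and f.get("dst_ip", "").startswith("203.0.113.")},
    ]
    content_rules = [
        # stop documents marked confidential
        {"id": "confidential-marking",
         "test": lambda text, meta: re.search(r"proprietary and confidential", text, re.I) is not None},
    ]

    index = []        # stand-in for the indexing engine
    incidents = []    # stand-in for incident management

    def monitor(message):
        """One reassembled message at a time: protocol analysis, content analysis,
        indexing, rule evaluation, incident creation."""
        fields, text, meta = message["fields"], message["text"], message["meta"]
        hits = [r["id"] for r in protocol_rules if r["test"](fields)]
        hits += [r["id"] for r in content_rules if r["test"](text, meta)]
        index.append((text, meta))                    # keep everything searchable
        if hits:
            incidents.append({"rules": hits, "fields": fields, "meta": meta})

    # A message as it might look after the data tap and reassembly stages:
    monitor({"fields": {"protocol": "ftp", "dst_ip": "203.0.113.9"},
             "text": "design notes... proprietary and confidential",
             "meta": {"content_type": "pdf"}})
    # incidents -> [{'rules': ['ftp-to-watched-range', 'confidential-marking'], ...}]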

The prevention module contains a rule engine, an action module and (possibly) connectors. The rule engine evaluates incoming incidents to determine the action(s) that need to be taken. Then the action module kicks in and does the appropriate thing, like blocking the data, encrypting it and sending it on, quarantining it and so on. In some scenarios, the action module may require help from connectors to take the action. For example, for quarantining, a NAS connector may be used, or for putting a legal hold in place, a CAS system like Centera may be deployed. Prevention during content discovery also needs connectors to take actions on data sources like Exchange, databases and file systems.
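
And in the same spirit, a bare-bones sketch of the prevention side, with a logging stand-in where real connectors (NAS, mail gateway, Exchange, etc.) would plug in – the action names and the incident-to-action mapping are illustrative only:

    class LogConnector:
        """Stand-in connector: a real one would talk to a NAS, mail gateway, Exchange, etc."""
        def __init__(self, name):
            self.name = name
        def act(self, verb, incident):
            print(f"[{self.name}] {verb}: rules={incident['rules']}")

    CONNECTORS = {"nas": LogConnector("nas"), "gateway": LogConnector("gateway"),
                  "mail": LogConnector("mail")}

    def decide_actions(incident):
        """Rule engine: map an incident to one or more actions (policy-driven in a real product)."""
        actions = []
        if "confidential-marking" in incident["rules"]:
            actions.append(("gateway", "block"))
        if incident.get("meta", {}).get("content_type") == "pdf":
            actions.append(("nas", "quarantine"))
        return actions or [("mail", "notify-admin")]

    def enforce(incident):
        """Action module: carry out each decided action through the right connector."""
        for connector, verb in decide_actions(incident):
            CONNECTORS[connector].act(verb, incident)

    # enforce({"rules": ["confidential-marking"], "meta": {"content_type": "pdf"}})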

Going Further

There are many “value-added” things that are done on top of the functionality described above. These are sometimes sold as separate features or products altogether.

  • Reporting and OLAP. Information from matches and incidents is fed into cubes and data warehouses so that OLAP and advanced reporting can be done with it.
  • Data mining. Incident/match information or even stored captured data is mined to discover patterns and trends, plot graphs and generate fancier reports. The possibilities here are endless and this seems to be the hottest field of research in DLP right now.
  • E-discovery. Here, factors important from an e-discovery perspective are extracted from the incident database or captured data and then pushed into e-discovery products or services for processing, review or production purposes. This process may also involve some data mining.
  • Learning. Incidents and mined information are used to provide feedback into the DLP setup. Eventually, this can improve existing policies and even suggest new policy options.
  • Integration with third-parties. For example, integration with BlueCoat provides setups that can capture and analyze HTTPS/SSL traffic.

DLP in Reconnex

Reconnex is a leader in the DLP technology and market. Its products and solutions deliver accurate protection against known data loss and provide the only solution in the market that automatically learns what your sensitive data is, as it evolves in your organization. As of today, Reconnex protects information for more than one million users. Reconnex starts with the protection of obvious sensitive information like credit card numbers, social security numbers and known sensitive files, but goes further by storing and indexing up to all communications and up to all content. It is the only company in this field to do so. Capturing all content and indexing it enables organizations to learn what information is sensitive and who is allowed to see it, or conversely who should not see it. Reconnex is also well-known for its unique case management capabilities, where incidents and their disposition can be grouped, tracked and managed as cases.

Reconnex is also the only solution in the market that is protocol-agnostic. It captures data at the network level and reconstructs it to higher levels – from TCP/IP to HTTP, SMTP and FTP to GMail, Yahoo! Chat and Live.

Reconnex offers all three flavors of DLP through its three flagship products: iGuard (data-in-motion), iDiscover (data-at-rest) and Data-in-Use. All its products have consistently been rated highly in almost all surveys and opinion polls. Industry analysts Forrester and Gartner also consider Reconnex a leader in this domain.

About the author: Ankur Panchbudhe is a principal software engineer at Reconnex, Pune. He has more than 6 years of R&D experience in the domains of data security, archiving, content management, data mining and storage software. He has 3 patents granted and more than 25 pending in the fields of electronic discovery, data mining, archiving, email systems, content management, compliance, data protection/security, replication and storage. You can find Ankur on Twitter.


Archival, e-Discovery and Compliance

Archival of e-mails and other electronic documents, and the use of such archives in legal discovery is an emerging and exciting new field in enterprise data management. There are a number of players in this area in Pune, and it is, in general, a very interesting and challenging area. This article gives a basic background. Hopefully, this is just the first in a series of articles and future posts will delve into more details.

Background

In the US, and many other countries, when a company files a lawsuit against another, before the actual trial takes place, there is a pre-trial phase called “discovery”. In this phase, each side asks the other side to produce documents and other evidence relating to specific aspects of the case. The other side is, of course, not allowed to refuse and must produce the evidence as requested. This discovery also applies to electronic documents – most notably, e-mail.

Now unlike other documents, electronic documents are very easy to delete. And when companies were asked to produce certain e-mails in court as part of discovery, more and more of them began claiming that relevant e-mails had already been deleted, or that they were unable to find the e-mails in their backups. The courts are not really stupid, and quickly decided that the companies were lying in order to avoid producing incriminating evidence. This gave rise to a number of laws in recent times which specify, essentially, that relevant electronic documents cannot be deleted for a certain number of years, that they should be stored in an easily searchable archive, and that failure to produce these documents in a reasonable amount of time is a punishable offense.

This is terrible news for most companies, because now all “important” emails (for a very loose definition of “important”) must be stored for many years. Existing backup systems are not good enough, because those are not really searchable by content. And the archives cannot be stored on cheap tapes either, because those are not searchable. Hence, they have to be on disk. This is a huge expense. And a management nightmare. But failure to comply is even worse. There have been actual instances of huge fines (millions of dollars, and in one case, a billion dollars) imposed by courts on companies that were unable to produce relevant emails in court. In some cases, the company executives were slapped with personal fines (in addition to fines on the company).

On the other hand, this is excellent news for companies that sell archival software that helps you store electronic documents for the legally sufficient number of years in a searchable repository. The demand for this software, driven by draconian legal requirements, is HUGE, and an entire industry has burgeoned to service this market. E-mail archival alone will soon be a billion dollar market. (Update: Actually, it appears that in 2008, archival software alone is expected to touch 2 billion dollars with a growth rate of 47% per year, and the e-discovery and litigation support software market will be 4 billion growing at 27%. And this doesn’t count the e-discovery services market, which is much, much larger.) There are three major chunks to this market:

  • Archival – Ability to store (older) documents for a long time on cheaper disks in a searchable repository
  • Compliance – Ensuring that the archival store complies with all the relevant laws. For example, the archive must be tamperproof.
  • e-Discovery – The archive should have the required search and analysis tools to ensure that it is easy to find all the relevant documents required in discovery

Archival

Archival software started its life before the advent of these compliance laws. Basic email archival is simply a way to move all your older emails out of your expensive MS Exchange database, onto cheaper, slower disks. Shortcuts are left in the main Exchange database so that if the user ever wants to access one of these older emails, it is fetched on demand from the slower archival disks. This is very much like a page fault in virtual memory. The net effect is that for most practical purposes, you’ve increased the size of people’s mailboxes without a major increase in price, without a decrease in performance for recent emails, and with some decrease in performance for older emails.

Unfortunately, these vendors had only middling success. Think about it – if your IT department is given a choice between spending money on archival software that will allow them to increase your mailbox size, or simply telling all users to learn to live with small mailbox sizes, what would they choose? Right. So the archival software companies saw only moderate growth. All of this changed when the e-discovery laws came into effect. Suddenly, archival became a legal requirement instead of a good-to-have bonus feature. Sales exploded. Startups started. And it also added a bunch of new requirements, described in the next two sections.
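
The stub-and-fetch mechanism is simple enough to sketch (illustrative Python; the message and archive structures are invented, and the Exchange/MAPI details a real product deals with are glossed over):

    class ArchiveStore:
        """Stand-in for the cheap, slower archive tier."""
        def __init__(self):
            self._data = {}
        def put(self, msg_id, body):
            self._data[msg_id] = body
        def get(self, msg_id):
            return self._data[msg_id]

    def archive_old_mail(mailbox, archive, cutoff_days, today):
        """Move old message bodies to the archive, leaving a small stub behind."""
        for msg in mailbox:
            if (today - msg["received"]).days > cutoff_days and not msg.get("stub"):
                archive.put(msg["id"], msg["body"])
                msg["body"], msg["stub"] = None, True   # shortcut left in Exchange

    def open_message(msg, archive):
        """On access, fault the body back in from the archive (like a page fault)."""
        if msg.get("stub"):
            msg["body"], msg["stub"] = archive.get(msg["id"]), False
        return msg["body"]

    # from datetime import date
    # mailbox = [{"id": "m1", "received": date(2007, 1, 5), "body": "..."}]
    # archive_old_mail(mailbox, ArchiveStore(), cutoff_days=365, today=date(2008, 6, 1))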

Compliance

Before I start, I should note that “IT compliance” in general is really a huge area and includes all kinds of software and services required by IT to comply with any number of laws (like Sarbanes-Oxley for accounting, HIPAA for medical records, etc.). That is not the compliance I am referring to in this article. Here we only deal with compliance as it pertains to archival software.

The major compliance requirement is that for different types of e-mails and electronic documents, the laws prescribe the minimum number of years for which they must be retained by the company. And no company really wants to keep any of these documents a day more than is minimally required by the law. Hence, for each document, the archival software must maintain the date until which the document must be retained, and on that day, it must automatically delete that document. Except, if the document is “relevant” to a case that is currently running, then the document cannot be deleted until the case is over. This gives rise to the concept of a legal hold (or deletion hold) that is placed on a document or a set of documents as soon as it is determined that it is relevant to a current case. The archival software ensures that documents with a deletion hold are not deleted even if their retention period expires. The deletion hold is only removed after the case is over.

The archival software also needs to ensure that the archive is tamperproof. Even if the CEO of the company walks up to the system one day in the middle of the night, he should not be able to delete or modify anything. Another major compliance requirement is that the archival software must make it possible to find “relevant” documents in a “reasonable” amount of time. The courts have some definition of what “relevant” and “reasonable” mean in this context, but we’ll not get into that. What it really means for the developers is that there should be a fairly sophisticated search facility that allows searches by keywords, by regular expressions, and by various fields of the metadata (e.g., find me all documents authored by Basant Rajan from March to September 2008).
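
The retention-plus-legal-hold bookkeeping described above boils down to something like the following sketch (my simplification, not any product’s actual schema):

    from datetime import date

    class ArchivedDocument:
        def __init__(self, doc_id, retain_until):
            self.doc_id = doc_id
            self.retain_until = retain_until   # minimum retention date prescribed by the applicable law
            self.holds = set()                 # ids of open legal cases this document is relevant to

        def place_hold(self, case_id):
            self.holds.add(case_id)

        def release_hold(self, case_id):
            self.holds.discard(case_id)        # only when the case is over

        def can_delete(self, today=None):
            """Deletable only after the retention period AND with no outstanding legal hold."""
            today = today or date.today()
            return today > self.retain_until and not self.holds

    def purge(archive, today=None):
        """Automatic deletion pass: keep exactly those documents that may not legally be deleted yet."""
        return [d for d in archive if not d.can_delete(today)]

    # doc = ArchivedDocument("msg-42", retain_until=date(2011, 3, 31))
    # doc.place_hold("case-2008-17")   # relevant to a running case: survives even past 2011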

e-Discovery

Sadly, just having a compliant archive is no longer good enough. Consider a typical e-discovery scenario. A company is required to produce all emails authored by Basant Rajan pertaining to the volume manager technology in the period March to September 2008. Now just producing all the documents by Basant for that period which contain the words “volume manager” is not good enough, because he might have referred to it as “VM”. Or he might have just talked about space optimized snapshots without mentioning the words volume manager. So, what happens is that all emails written by Basant in that period are retrieved, and a human has to go through each email to determine whether it is relevant to volume manager or not. And this human must be a lawyer. Who charges $4 per email because he has to pay off his law school debt. For a largish company, a typical lawsuit might involve millions of documents. Literally. Now you know why there is so much money in this market.

Just producing all documents by Basant and dumping them on the opposing lawyers is not an option, because the company does not want to disclose to the opposing side anything more than is absolutely necessary. Who knows what other smoking guns are present in Basant’s email? Thus, a way for different archival software vendors to differentiate themselves is the sophistication they can bring to this process. The ability to search for concepts like “volume management” as opposed to the actual keywords. The ability to group/cluster a set of emails by concepts. The ability to allow teams of people to collaboratively work on this job. The ability to search for “all emails which contain a paragraph similar to this paragraph”. If you know how to do this last part, I know a few companies that would be desperate to hire you.
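
One common building block for that kind of “similar paragraph” search is shingling combined with a set-similarity measure. Here is a toy sketch (not how any particular e-discovery product implements it); at real scale this would sit behind an index such as minhashing/LSH rather than a linear scan, but the idea is the same:

    import re

    def shingles(text, k=4):
        """Set of overlapping k-word sequences ('shingles') from a piece of text."""
        words = re.findall(r"[a-z0-9]+", text.lower())
        return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

    def jaccard(a, b):
        """Jaccard similarity of two shingle sets: |intersection| / |union|."""
        return len(a & b) / len(a | b) if a or b else 0.0

    def similar_paragraphs(query_paragraph, email_body, threshold=0.4):
        """Return paragraphs of an email whose shingle sets are close to the query's."""
        q = shingles(query_paragraph)
        return [p for p in email_body.split("\n\n")
                if jaccard(q, shingles(p)) >= threshold]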

What next?

In Pune, there are at least two companies, Symantec and Mimosa Systems, working in this area. (Mimosa’s President and CEO, T.M. Ravi, is currently in town and will give the keynote for CSI-Pune’s ILM Seminar this Thursday. Might be worth attending if you are interested in this area.) I also believe that CT Summation’s CaseVault system has a development team here, but I am unable to find any information about that – if you have a contact there, please let me know. For some (possibly outdated) details of the other (worldwide) players in this market, see this report from last year. If you are from one of these companies, and can write an article on what exactly your software does in this field, and how it is better than the others, please let me know. I also had a very interesting discussion with Paul C. Easton of True Legal Partners, an e-Discovery outsourcing firm, where we talked about how they use archiving and e-discovery software; more generally, we also talked about legal outsourcing to India, the suitability of Pune for the same, competition from China, etc. I will write an article on that sometime soon – stay tuned (by subscribing to PuneTech by email or via RSS).

A Vision for e-Governance in Pune

In an earlier article, I wrote about how Pune now has a CIO, who is pushing various initiatives to make Pune the city with the best use of technology for governance.

At my request, Dr. Anupam Saraph, the CIO of Pune, has written two articles about this aspect of his work. The first one is a vision piece painting a picture of Pune in 2015. An excerpt:

The pain of providing the same information over and over at different counters is history. The first time I registered myself to ilife, through my computer at home, I was asked to provide information to identify myself. I was requested to visit any one of the 14 ward offices to provide a photograph and my thumbprint to receive my Pune-card, my username and a password to access ilife. That was it.

My Pune-card provides me with cashless bus-travel, parking and entry into all electronic access public locations as well as electronic entry enabled private locations. It works as a cash-card and also replaces time-consuming procedures with countless forms to make applications. It simplifies and secures transactions as I can simply allow the service providers to swipe my card and take my thumbprint to access information. Only information that I have marked as allow through Pune-card will be accessed at points-of-transaction. The transaction is updated in my account on ilife.

If you read the whole article, you’ll notice that none of the ideas contained there are futuristic, or taken from sci-fi. They are all things that can be implemented relatively easily using today’s technology. All that is needed is execution and political will. And there are indications that the political will is there.

While a vision statement might be good as an inspiration, it is worthless without concrete short-term goals and projects. Dr. Saraph has written another article that lists some of the specific projects that are already underway. There is already industry interest for some of these projects, for example, Unwire Pune, and Pune Cards. Others, like Design for Pune and MyWard, will depend more upon community participation.

This is where you come in. All of these projects can do with help. From web-design and usability, to server and database tuning. Or, if you are a non-technology person, you can help with spreading the word, or simply by participating. I am planning to start a discussion on these topics at IdeaCampPune tomorrow (Saturday). Dr. Saraph will also try and attend those discussions. (Registration for that event is now closed, so you will not be able to attend unless you’ve already registered. However, if there is a good discussion, and any concrete actions result from it, I’ll write an article on that in the next week. Stay tuned. If you’ve already registered, please note that the venue has shifted to Persistent’s Aryabhatta facility near Nal Stop.)

SEAP is already behind these initiatives (in fact, the appointment of Anupam Saraph is a joint partnership between PMC, SEAP and Dr. Saraph). Civic commissioner Praveensinh Pardeshi is very supportive of the project. Companies like Persistent, Eclipsys and nVidia have already pitched in by providing free manpower or resources.

But given the scope of the project, more volunteers are welcome. I have already committed to spending some time every week on projects that can use my expertise, like Design for Pune and MyWard.

It is very easy to get cynical about any projects undertaken by the government. Especially PMC. And that was my first reaction too. However, I have now come to believe that a few people can make a difference. Participate. Enthusiastically. Passionately. Try to convince your friends. One out of 50 will join you. That might be enough. Isn’t it worth trying?


PMC vision for the future needs your help

Pune now has a CIO – whose job it is to guide all use of information technology related to PMC. This includes external-facing services like property tax payments and marriage/birth/death registration, and also internal use of IT like MIS and ERP. Dr. Anupam Saraph, who has been appointed the CIO of PMC, is an industry veteran with a good understanding of the latest trends in both technology and e-governance. As a result, his vision for PMC goes far beyond simple computerization of services – it includes initiatives to encourage citizen participation through the use of wikis and social networking, games and competitions to increase citizen involvement, and the use of maps, GIS, and mashups to increase the usability and usefulness of the services and websites.

However, I don’t think this is something that can be done without active community participation. For really successful implementation of some of these ideas, what is really needed, in my opinion, is the involvement of the tech community to help with the execution – frontends, backends, usability, evangelization. I would like to start a discussion on how we can help.

Dr. Saraph has agreed to attend IdeaCampPune for a few hours in the first half of the day. If we can get a few discussions started around this topic, he can participate, clarify his vision for us, and answer questions. I have also requested him to write an article giving some more details on his ideas and initiatives, so we can start thinking about how best the community can help in each of those areas. He hopes to have it done by Monday or Tuesday, and I’ll post it here as soon as I get it. Please check this site again on Tuesday. (Or better yet, subscribe to the RSS feed or email updates.)

If you have any immediate questions or suggestions please post them in the comments below, and I can have Dr. Saraph answer them.

Related articles:
Upcoming Events: IdeaCampPune
PMC to re-charge Pune wi-fi project
Pune Municipal Corporation gets CIO, new website, wiki

Building EKA – The world’s fastest privately funded supercomputer

Eka, built by CRL, Pune, is the world’s 4th fastest supercomputer, and the fastest one that didn’t use government funding. This is the same supercomputer referenced in Yahoo!’s recent announcement about cloud computing research at the Hadoop Summit. This article describes some of the technical details of Eka’s design and implementation. It is based on a presentation by the Eka architects, organized by CSI Pune and MCCIA Pune.

Interconnect architecture

The most important decision in building a massively parallel supercomputer is the design of how the different nodes (i.e. processors) of the system are connected together. If all nodes are connected to each other, parallel applications scale really well (linear speedup), because communication between nodes is direct and has no bottlenecks. But unfortunately, building larger and larger such systems (i.e. ones with more and more nodes) becomes increasingly difficult and expensive because the complexity of the interconnect increases as n². To avoid this, supercomputers have typically used sparse interconnect topologies like Star, Ring, Torus (e.g. IBM’s Blue Gene/L), or hypercube (Cray). These are more scalable as far as building the interconnect for really large numbers of nodes is concerned. However, the downside is that nodes are not directly connected to each other and messages have to go through multiple hops before reaching the destination. Here, unless the applications are designed very carefully to reduce message exchanges between different nodes (especially those that are not directly connected to each other), the interconnect becomes a bottleneck for application scaling.

In contrast to those systems, Eka uses an interconnect designed using concepts from projective geometry. The details of the interconnect are beyond the scope of this article. (Translation: I did not understand the really complex mathematics that goes on in those papers. Suffice it to say that before they are done, fairly obscure branches of mathematics get involved. However, one of these days, I am hoping to write a fun little article on how a cute little mathematical concept called Perfect Difference Sets (first described in 1938) plays an important role in designing supercomputer interconnects over 50 years later. Motivated readers are encouraged to try and see the connection.)

To simplify – Eka uses an interconnect based on Projective Geometry concepts. This interconnect gives linear speedup for applications but the complexity of building the interconnect increases only near-linearly.
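
To give a flavor of the mathematics without claiming this is CRL’s actual design: a perfect difference set D modulo n = k² − k + 1 generates the lines of a projective plane when you shift it cyclically. If you read points as compute nodes and lines as switches, any two nodes share exactly one switch, so every pair can communicate in two hops while each node needs only k links instead of n − 1. The toy check below uses the classic difference set {0, 1, 3} mod 7 (the Fano plane):

    from itertools import combinations

    # Perfect difference set {0, 1, 3} mod 7: every nonzero residue mod 7 occurs
    # exactly once as a difference of two elements. Shifting it cyclically
    # generates the 7 lines of the Fano plane (the projective plane of order 2).
    D, n = (0, 1, 3), 7
    lines = [frozenset((d + shift) % n for d in D) for shift in range(n)]

    # Interconnect reading: points = compute nodes, lines = switches.
    # Any two nodes share exactly one switch, so any pair is reachable in two hops.
    for a, b in combinations(range(n), 2):
        shared = [ln for ln in lines if a in ln and b in ln]
        assert len(shared) == 1

    # Each node attaches to only len(D) = 3 switches instead of n - 1 = 6 direct links.
    degree = {p: sum(p in ln for ln in lines) for p in range(n)}
    assert set(degree.values()) == {len(D)}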

The upshot of this is that to achieve a given application speed (i.e. number of Teraflops), Eka ends up using fewer nodes than its peers. This means that it costs less and uses less power, both of which are major problems that need to be tackled in designing a supercomputer.

Handling Failures

A computer that includes 1000s of processors, 1000s of disks, and 1000s of network elements soon finds itself on the wrong side of the law of probability as far as failures are concerned. If one component of a system has an MTBF (mean time between failures) of 10,000 hours, and the system has 3,000 such components, then (assuming independent failures) the system as a whole has an MTBF of roughly 10,000/3,000 ≈ 3.3 hours – you can start expecting something to fail every few hours.

If an application is running on 500 nodes, and has been running for the last 20 hours, and one of the nodes fails, the entire application has to be restarted from scratch. And this happens often, especially before an important deadline.

A simple solution is to save the state of the entire application every 15 minutes. This is called checkpointing. When there is a failure, the system is restarted from the last checkpoint and hence ends up losing only 15 minutes of work. While this works well enough, it can get prohibitively expensive. If you spend 5 minutes out of every 15 minutes in checkpointing your application, then you’ve effectively reduced the capacity of your supercomputer by 33%. (Another way of saying the same thing is that you’ve increased your budget by 50%.)
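
To put rough numbers on that trade-off (a back-of-the-envelope model, not Eka’s actual scheme): if a checkpoint takes C minutes, is taken every T minutes, and the machine fails on average every M minutes, the fraction of capacity lost is roughly C/T (time spent checkpointing) plus T/(2M) (about half an interval of work redone per failure). Minimizing the sum gives the classic rule of thumb T ≈ √(2CM):

    from math import sqrt

    def overhead(C, T, M):
        """Approximate fraction of capacity lost: C/T to checkpointing itself,
        plus about T/2 of recomputed work once every M minutes."""
        return C / T + T / (2.0 * M)

    C, M = 5.0, 600.0                 # 5-minute checkpoints, a failure every ~10 hours
    naive = overhead(C, T=15.0, M=M)  # checkpoint every 15 minutes -> ~0.35 (about a third lost)
    T_opt = sqrt(2 * C * M)           # ~77 minutes between checkpoints
    tuned = overhead(C, T_opt, M)     # ~0.13 -- cheaper or less frequent checkpoints help a lot

The partitioning described in the next paragraph attacks C: if only the affected subset of nodes needs to be checkpointed and restarted, checkpoints get cheaper and the whole overhead curve drops.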

The projective geometry architecture also allows for a way to partition the compute nodes in such a way that checkpointing and status saving can be done only for a subset of the nodes involved. The whole system need not be reset in case of a failure – only the related subset. In fact, with the projective geometry architecture, this can be done in a provably optimal manner. This results in improved efficiency. Checkpoints are much cheaper/faster, and hence can be taken more frequently. This means that the system can handle failures much better.

Again, I don’t understand the details of how projective geometry helps in this – if someone can explain that easily in a paragraph or two, please drop me a note.

The infrastructure

The actual supercomputer was built in just 6 weeks. However, other aspects took much longer. It took a year of convincing to get the project funded, and another year to build the physical building and the rest of the infrastructure. Eka uses:

  • 2.5 MW of electricity
  • 400 tons of cooling capacity
  • 10 km of electrical cabling
  • 10 km of ethernet cabling
  • 15 km of InfiniBand cabling

The computing infrastructure itself consists of:

  • 1800 blades, 4 cores each, at 3 GHz per core
  • HP SFS clusters
  • 28 TB of memory
  • 80 TB of storage on simple SATA disks, with 5.2 Gbps throughput
  • Lustre distributed file-system
  • 20 Gbps InfiniBand DDR. Eka was on the cutting edge of InfiniBand technology. They sourced their InfiniBand hardware from an Israeli company and were amongst the first users of their releases – including beta, and even alpha quality stuff.
  • Multiple Gigabit ethernets
  • Linux is the underlying OS. Any Linux will work – RedHat, SuSe, your favorite distribution.

It’s the software, stupid!

One of the principles of the Eka project is to be the one-stop shop for tackling problems that require huge amounts of computational power. Their tagline for this project has been: from atoms to applications. They want to ensure that the project takes care of everything for their target users, from the hardware all the way up to the application. This meant that they had to work on:

  • High speed low latency interconnect research
  • System architecture research
  • System software research – compilers etc.
  • Mathematical library development
  • Large scientific problem solving.
  • Application porting, optimization and development.

Each of the bullet items above is a non-trivial bit of work. Take, for example, “Mathematical library development.” Since they came up with a novel architecture for the interconnect of Eka, all parallel algorithms that run on Eka also have to be adapted to work well with that architecture. To get the maximum performance out of a supercomputer, you have to rewrite your algorithms to take advantage of the strengths of the interconnect design while avoiding its weaknesses. Requiring users to understand and code for such things has always been the bane of supercomputing research. Instead, the Eka team has gone about providing mathematical libraries of the important functions needed by applications, specifically tailored to the Eka architecture. This means that people who have existing applications can run them on Eka without major modifications.

Applications

Of the top 10 supercomputers in the world, Eka is the only system that was fully privately funded. All other systems used government money, so all of them are for captive use. This means that Eka is the only system in the top 10 that is available for commercial use without strings attached.

There are various traditional applications of HPC (high-performance computing), which is what Eka is mainly targeted towards:

  • Aerodynamics (Aircraft design). Crash testing (Automobile design)
  • Biology – drug design, genomics
  • Environment – global climate, ground water
  • Applied physics – radiation transport, supernovae, simulate exploding galaxies.
  • Lasers and Energy – combustion, ICF
  • Neurobiology – simulating the brain

But as businesses go global and start dealing with huge quantities of data, it is believed that Eka-like capabilities will soon be needed to tackle these business needs:

  • Integrated worldwide supply chain management
  • Large scale data mining – business intelligence
  • Various recognition problems – speech recognition, machine vision
  • Video surveillance, e-mail scanning
  • Digital media creation – rendering; cartoons, animation

But that is not the only output the Tatas expect from their investment (of $30 million). They are also hoping to tap the expertise gained during this process for consulting and services:

  • Consultancy: Need & Gap Analysis and Proposal Creation
  • Technology: Architecture & Design & Planning of high performance systems
  • Execution: Implement, Test and Commissioning of high performance system
  • Post sales: HPC managed services, Operations Optimization, Migration Services
  • Storage: Large scale data management (including archives, backups and tapes), Security and Business Continuity
  • Visualization: Scalable visualization of large amounts of data

and more…

This article is based on a presentation given by Dr. Sunil Sherlekar, Dr. Rajendra Lagu, and N. Seetha Rama Krishna, of CRL, Pune, who built Eka. For details of their background, see here. However, note that I’ve filled in gaps in my notes with my own conjectures, so errors in the article, if any, should be attributed to me.

LordsOfOdds – Betting on Prediction Markets

LordsOfOdds is a startup based on the concept of prediction markets. It enables users to “bet” on the outcome of Indian events in sports, politics or entertainment. But betting is just a small part of it; the prediction market concept has far greater potential as a source of information. Did you know that Rahul Gandhi has a 44% chance of becoming the Congress prime ministerial candidate?

What is a prediction market

A prediction market is like a stock market, except that instead of buying or selling stocks in a company, you are buying or selling “stocks” in a prediction. The prediction can be something like “India will win more than 3 gold medals at the 2008 Olympics”. At the end of the 2008 Olympics, if India has actually won more than 3 gold medals, each stock will pay out a dividend of 100 units. Otherwise, you get 0 units. After this point, this particular stock ceases to exist.

However, before the 2008 Olympics, nobody knows for sure whether India will win 3 medals or not. Hence, it is not clear whether the price of the stock will be 100 units or 0 units. Somebody who thinks that the probability of the prediction coming true is about 30%, should be willing to pay about 30 units for each stock. Someone else who thinks that the probability is 60% should be willing to pay 60 units for each stock. And the guy who bought at 30 units should be happy to sell it to the guy who is willing to pay 60.
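
In other words, the price of a prediction stock is just the crowd’s probability estimate scaled to the 0-100 payout. A tiny calculation (illustrative numbers only) makes the trade concrete:

    def expected_value(probability, payout=100):
        """Fair price of a prediction stock that pays `payout` if the event happens, else 0."""
        return probability * payout

    # A trader who believes the event is 60% likely values the stock at 60 units,
    # so buying at 30 from someone who believes 30% looks like a 30-unit expected gain to him.
    buyer_view  = expected_value(0.60)   # 60.0
    seller_view = expected_value(0.30)   # 30.0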

As new information becomes available, the price goes up or down. If Anju Bobby George gets injured, the probability of winning 3 medals goes down and the stock will fall. If KPS Gill is removed as IHF chief, the stock will probably go up.

LordsOfOdds has a really nice, detailed example if you want to understand it better.

Why bother?

At the very least, the game can be a lot of fun once you get hooked.

But there is more. Prediction markets are actually useful to get an idea of what the crowd thinks of the probability of success of any prediction. (The current stock price of the prediction directly gives the predicted success percentage.) Combine this with research that says that the average wisdom of the crowds is as good as that of highly-paid experts, and you can suddenly see how a prediction market is a great tool for getting “expert opinion” on any topic.

This has been found to be rather useful in the context of large corporations. Anybody who has worked in one can attest to the fact that communication and information flow are rather pathetic and nobody really has an idea of what is going on. Enter prediction markets. They become a sneaky way of getting information out of the employees while they think they are playing a fun game.

HP started using prediction markets internally in sales forecasting. Now they use prediction markets in several business units. Intel uses them in relation to manufacturing capacity. Google uses them to forecast product launch dates, new office openings, and many other things of strategic importance to Google. Microsoft uses them to predict the number of bugs in a software package. GE uses them to generate new product ideas from employees. (Sources: Wikipedia, and LordsOfOdds)

LordsOfOdds – What’s there now

Currently, LordsOfOdds is being pitched more as an online betting site. You pick favorites in sports events (“Sachin Tendulkar will be man of the match for the first India-South Africa Test” is trading at 23.5), or entertainment (“The movie Race will be a hit” is trading at 34.4), or politics (“Rahul Gandhi to be Congress Prime Ministerial candidate” is trading at 44).

And since betting is illegal in India, the site operates using a virtual currency (“Loots”). And the only thing players get out of the exercise is bragging rights. But stay tuned, because they are building a prize inventory.

LordsOfOdds – The future

I am hoping to get answers from the founders on the following questions:

  • In the future, are you planning on using the predictions that your market produces in some way?
  • What is your monetization strategy?

The answers to these should be interesting.

LordsOfOdds – The founders

Rajesh Kallidumbil, who was working in London, and Siddhartha Saha, who was in Hyderabad, debated the idea for over a month on Google Talk and then decided to start it up in Pune. Hariharan K, who has a tremendous passion for sports, decided that his job as an Investment Banking analyst wasn’t half as interesting as LordsOfOdds and joined the team.

Read the company blog if you want to follow their progress. LordsOfOdds has also been covered by StartupDunia.

InfoBeanz: Free web-based platform for “digital signage”

Have you seen the TV screens at McDonalds or Inox that are showing advertisements? Have you ever wondered what exactly it takes to set up a system like this – in terms of software, hardware, and how much it costs?

Well, I don’t know the general answer to that question, but Pune-based company InfoBeanz is trying to ensure that people don’t ever need to find out, because they have just released a web-based software platform for “digital signage” that allows anybody to do this using any old computer and monitor (or in a pinch, even an old TV screen will do). No software download is required. Just upload the content that needs to be shown on the screen to the InfoBeanz website from a regular internet browser (Windows XP+ and IE6+ only). Then hook up the screen to any computer (Windows or Linux) that is running any browser (IE or Firefox) and point it towards the InfoBeanz site. The InfoBeanz webpage will display ads (or whatever the customer wants) on the screen.

All of this is available to anybody free of cost. Basically, InfoBeanz is trying to democratize the process of digital signage. According to their press release:

Globally players in this segment are charging a hefty price for their digital signage solution and licenses.

InfoSignz plans to serve the largest and the smallest of the digital signage customers across the world and aims to break the entry barrier of cost and proprietary hardware.

So how does InfoBeanz plan to make any money out of this venture? The standard open source model. From their FAQ:

There are various revenue models that we will earn money from. One of them is advertisements on the network. Another is paid premium subscription services.

The paid service will have enhanced file, playlist and location management features. Apart from that, the paid service will also have enhanced interactivity features.

The paid service will also be able to connect to the inventory backend of the customer. Consider this:

What use is it to keep on selling something that is not in stock? I am frustrated when there is a display in a store selling a 27″ TV for $149.99 but when I make up my mind to buy (after much haggling with my wife) the item is out of stock. The marketer was successful in capturing the moment of truth, but the supply chain guy missed out because the two of them did not talk after every piece was sold. The marketer not only lost out on selling something that is not even available, he could have shown something else and lost out on selling something which was readily available. Double whammy!!
[…]
How nice would it be if the display stopped showing the promotion related to the television when the TV went out of stock? Wouldn’t it be even better if the display started promoting something that was in stock?

When all the other systems are interconnected and act intelligently, why should the digital display network be treated poorly?

(From the CEO’s blog)

In general, I think this announcement is very cool from a number of perspectives. It is a new and disruptive way to enter into a field dominated by expensive and proprietary solutions. It is a leap of faith to be able to release a free product and hope that you can figure out how to make money later. It is also technically challenging to be able to deliver on the promise of “no proprietary hardware and no installation of software required”. And finally, scaling to the demands of all the freeloaders who will want to use this service will also be a challenge.

CaptainPad: Automating restaurant order processing

You must be the change you wish to see in the world

— Mahatma Gandhi

Pune based brothers Abhay and Vijay Badhe appear to have taken this quote to heart. They got tired of the long delays in restaurants in getting tables, food and even the bills. So, instead of just whining about it, they went and built a business that sells solutions to restaurants that fix the problem. Two years ago, they formed Wings iNet, with the intention of providing “complete end-to-end solutions to hotel and restaurant industry through its innovative approach and use of the latest cutting edge technology”. Its first product is CaptainPad, a wireless system that uses handheld devices to automate restaurant ordering.

CaptainPad workflow

In Pune restaurants like Rajwada and Green Park, which now use CaptainPad, gone are the days when the waiter/captain takes an order on a paper pad and physically carries it to the billing counter and the kitchen. Now the captain carries a smart wireless touch pad device with the whole menu card loaded on it. The captain can send the order from his device wirelessly to the kitchen and the billing station, where the KOT (Kitchen Order Ticket) and billing information are generated instantaneously. The captain does not need to go anywhere.

Cooks see a printout of the order immediately after the order is taken. No delay. Also, the system automatically routes the appropriate orders to the appropriate cooks. Waiters also have devices which tell them which dishes to take where. When the customer wants the bill, the captain just clicks an icon on his screen and the bill is printed.
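
Wings iNet hasn’t published how CaptainPad works internally, but the routing step described above amounts to something like this sketch (station names and the menu mapping are invented for the example):

    # Which kitchen station prepares which category of dish (illustrative mapping).
    STATIONS = {"tandoor": ["roti", "naan", "kebab"],
                "curry":   ["paneer masala", "dal"],
                "bar":     ["lassi", "soda"]}

    def route_order(order):
        """Split one table's order into per-station KOTs (kitchen order tickets)."""
        tickets = {}
        for item in order["items"]:
            station = next((s for s, dishes in STATIONS.items() if item["dish"] in dishes), "curry")
            tickets.setdefault(station, []).append(item)
        return [{"table": order["table"], "station": s, "items": items}
                for s, items in tickets.items()]

    # route_order({"table": 12, "items": [{"dish": "naan", "qty": 4}, {"dish": "dal", "qty": 1}]})
    # -> one ticket for the tandoor station and one for the curry station, while the
    #    same line items flow to the billing station for the final bill.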

Sounds great in theory, but does it work in practice? I went to Green Park this weekend and to me it seemed that the processing was indeed faster. (But this could simply be a case of confirmation bias.) In any case, the owners/managers of the restaurant seem to be happy with the system. Processing is faster, customers are happier, and for owners who have a GPRS cellphone, complete information about the restaurant’s operations is available on their mobile device, irrespective of location. Restaurants are reporting that after the installation of CaptainPad, the table recurrence ratio has also gone up – i.e. they serve more customers per table per day than earlier.

But what about the actual grunts who have to deal with the system – the captains and the waiters? I tried enticing one of them to complain about the system – to see if there were problems, and whether he preferred the older way of doing things. No dice. He was quite happy with the system.

For a large restaurant, this system appears to be a good investment at Rs 2.5 lakh to Rs 6 lakh, depending on the number of captains the restaurant has. No wonder Wings iNet is hoping to have 100 customers by the end of the year.