Data Leakage Prevention – Overview

A few days ago, we posted a news article on how Reconnex has been named a top leader in Data Leakage Prevention (DLP) technology by Forrester Research. We asked Ankur Panchbudhe of Reconnex, Pune to write an article giving us a background on what DLP is, and why it is important.

Data leakage prevention (DLP) is a solution for identifying, monitoring and protecting sensitive data or information in an organization according to policies. Organizations can have varied policies, but typically they tend to focus on preventing sensitive data from leaking out of the organization and identifying people or places that should not have access to certain data or information.

DLP is also known by many other names: information security, content monitoring and filtering (CMF), extrusion prevention, outbound content management, insider threat protection, information leak prevention (ILP), etc.

Need for DLP

Until a few years ago, organizations thought of data/information security only in terms of protecting their network from intruders (e.g. hackers). But with the growing amount of data, rapid growth in the sizes of organizations (e.g. due to globalization), a rise in the number of data points (machines and servers) and easier modes of communication (e.g. IM, USB, cellphones), accidental or even deliberate leakage of data from within the organization has become a painful reality. This has led to growing awareness about information security in general and about outbound content management in particular.

The following are the major reasons (with examples) that make an organization think about deploying DLP solutions:

growing cases of data and IP leakages
- for example, credit card numbers and social security numbers being leaked inadvertently or being hacked
regulatory mandates to protect private and personal information
- for example, the case of Monster.com losing over a million private customer records due to phishing
protection of brand value and reputation
- see above example
compliance (e.g. HIPAA, GLBA, SOX, PCI, FERPA)
- for example, Ferrari and McLaren engaging in anti-competitive practices by allegedly stealing internal technical documents
internal policies
- for example, Facebook leaking some pieces of their code
profiling for weaknesses
- Who has access to what data? Is sensitive data lying on public servers? Are employees doing what they are not supposed to do with data?

Components of DLP

Broadly, the core DLP process has three components: identification, monitoring and prevention.

The first, identification, is a process of discovering what constitutes sensitive content within an organization. For this, an organization first has to define “sensitive”. This is done using policies, which are composed of rules, which in turn could be composed of words, patterns or something more complicated. These rules are then fed to a content discovery engine that “crawls” data sources in the organization for sensitive content. Data sources could include application data like HTTP/FTP servers, Exchange, Notes, SharePoint and database servers, repositories like filers and SANs, and end-user data sources like laptops, desktops and removable media. There could be different policies for different classes of data sources; for example, the policies for SharePoint could try to identify design documents whereas those for Oracle could be tuned to discover credit card numbers. All DLP products ship with pre-defined policy “packages” for well-known scenarios, like PCI compliance, credit card and social security leakage.
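The policy, rule and pattern hierarchy described above can be sketched in a few lines of Python. This is only an illustration: the names Rule, Policy, luhn_ok and PCI_POLICY are hypothetical, the regular expression is a deliberately loose assumption, and real products ship far richer policy packages. The Luhn checksum is used here as a plausible way to cut down false positives on card-like digit runs.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


def luhn_ok(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn checksum."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


@dataclass
class Rule:
    """A named pattern plus an optional extra check on each match."""
    name: str
    pattern: "re.Pattern[str]"
    validator: Optional[Callable[[str], bool]] = None

    def matches(self, text: str) -> List[str]:
        hits = [m.group(0) for m in self.pattern.finditer(text)]
        if self.validator is not None:
            hits = [hit for hit in hits if self.validator(hit)]
        return hits


@dataclass
class Policy:
    """A policy is simply a named collection of rules."""
    name: str
    rules: List[Rule]

    def evaluate(self, text: str) -> Dict[str, List[str]]:
        """Map each rule name to its matches, omitting rules with none."""
        results: Dict[str, List[str]] = {}
        for rule in self.rules:
            hits = rule.matches(text)
            if hits:
                results[rule.name] = hits
        return results


# Hypothetical policy "package": card-like numbers plus one keyword rule.
PCI_POLICY = Policy(
    name="pci-sketch",
    rules=[
        Rule("card-number", re.compile(r"\b(?:\d[ -]?){12,18}\d\b"), luhn_ok),
        Rule("keyword", re.compile(r"proprietary and confidential", re.IGNORECASE)),
    ],
)
```

A discovery engine would call PCI_POLICY.evaluate() on the text extracted from each crawled item and record any non-empty result against that item.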

The second component, monitoring, typically deployed at the network egress point or on end-user endpoints, is used to flag data or information that should not be going out of the organization. This flagging is done using a bunch of rules and policies, which could be written independently for monitoring purposes, or could be derived from information gleaned during the identification process (previous para). The monitoring component taps into raw data going over the wire, does some (optional) semantic reconstruction and applies policies on it. Raw data can be captured at many levels – network level (e.g. TCP/IP), session level (e.g. HTTP, FTP) or application level (e.g. Yahoo! Mail, GMail). At what level raw data is captured decides whether and how much semantic reconstruction is required. The reconstruction process tries to assemble together fragments of raw data into processable information, on which policies could be applied.
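To make the idea of semantic reconstruction concrete, here is a minimal sketch that orders captured TCP segments by sequence number and then parses the result as an HTTP request. It assumes segments arrive as hypothetical (sequence number, payload) tuples for one direction of one connection, and it ignores sequence-number wraparound, connection tracking and every other complication a real monitor has to handle.

```python
from typing import Dict, Iterable, Tuple


def reassemble(segments: Iterable[Tuple[int, bytes]], initial_seq: int) -> bytes:
    """Order (seq, payload) segments into one contiguous byte stream.

    Retransmitted or overlapping bytes are dropped; reassembly stops at
    the first gap, because the missing bytes cannot be guessed.
    """
    stream = bytearray()
    next_seq = initial_seq
    for seq, payload in sorted(segments, key=lambda segment: segment[0]):
        if seq > next_seq:
            break  # a segment is missing
        overlap = next_seq - seq
        if overlap < len(payload):
            stream += payload[overlap:]
            next_seq = seq + len(payload)
    return bytes(stream)


def parse_http_request(stream: bytes) -> Dict[str, object]:
    """Split a reassembled stream into request line, headers and body."""
    head, _, body = stream.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    method, target, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return {
        "method": method,
        "target": target,
        "version": version,
        "headers": headers,
        "body": body,
    }
```

Capturing at the session or application level would skip the first function entirely, which is why the capture level decides how much reconstruction is needed.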

The third component, prevention, is the process of taking some action on the data flagged by the identification or monitoring component. Many types of actions are possible – blocking the data, quarantining it, deleting, encrypting, compressing, notifying and more. Prevention actions are also typically configured using policies and hook into identification and/or monitoring policies. This component is typically deployed along with the monitoring or identification component.

In addition to the above three core components, there is a fourth piece which can be called control. This is the component through which the user can centrally manage and monitor the whole DLP process. It typically includes the GUI, the policy/rule definition and deployment module, process control, reporting and various dashboards.

Flavors of DLP

DLP products are generally sold in three “flavors”:

Data in motion. This is the flavor that corresponds to a combination of the monitoring and prevention components described in the previous section. It is used to monitor and control outgoing traffic. This is the hottest selling DLP solution today.
Data at rest. This is the content discovery flavor that scours an organization’s machines for sensitive data. This solution usually also includes a prevention component.
Data in use. This solution consists of agents that run on end-servers and end-users’ laptops or desktops, keeping a watch on all activities related to data. They typically monitor and prevent activity on file systems and removable media options like USB, CDs and Bluetooth.

These individual solutions can be (and are) combined to create a much more effective DLP setup. For example, data at rest could be used to identify sensitive information, fingerprint it and deploy those fingerprints with data in motion and data in use products for an all-scenario DLP solution.
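One common way to fingerprint a document so that excerpts of it can be recognized elsewhere is to hash overlapping word sequences ("shingles"). The sketch below assumes that approach; it is not a description of how any particular product computes fingerprints, and the function names, the shingle length k=5 and the truncated SHA-256 digests are all illustrative choices.

```python
import hashlib
import re
from typing import Set


def shingles(text: str, k: int = 5) -> Set[str]:
    """Return the set of overlapping k-word sequences in `text`."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return set()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def fingerprint(text: str, k: int = 5) -> Set[str]:
    """Hash each shingle so the fingerprint does not expose the text itself."""
    return {
        hashlib.sha256(shingle.encode("utf-8")).hexdigest()[:16]
        for shingle in shingles(text, k)
    }


def containment(candidate_text: str, registered: Set[str], k: int = 5) -> float:
    """Fraction of the candidate's shingles found in a registered fingerprint."""
    candidate = fingerprint(candidate_text, k)
    if not candidate:
        return 0.0
    return len(candidate & registered) / len(candidate)
```

In such a setup the data-at-rest product would compute fingerprint() for each sensitive document it discovers, and the data-in-motion and data-in-use products would call containment() on outgoing content and flag anything above a chosen threshold.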

Technology

DLP solutions classify data in motion, at rest, and in use, and then dynamically apply the desired type and level of control, including the ability to perform mandatory access control that can’t be circumvented by the user. DLP solutions typically:

Perform content-aware deep packet inspection on outbound network communication including email, IM, FTP, HTTP and other TCP/IP protocols
Track complete sessions for analysis, not individual packets, with full understanding of application semantics
Detect (or filter) content based on policy rules
Use linguistic analysis techniques beyond simple keyword matching for monitoring (e.g. advanced regular expressions, partial document matching, Bayesian analysis and machine learning)

Content discovery makes use of crawlers to find sensitive content in an organization’s network of machines. Each crawler is composed of a connector, browser, filtering module and reader. A connector is a data-source specific module that helps in connecting to, browsing and reading from a data source. So, there are connectors for various types of data sources like CIFS, NFS, HTTP, FTP, Exchange, Notes, databases and so on. The browser module lists all the data that is accessible within a data source. This listing is then filtered depending on the requirements of discovery. For example, if the requirement is to discover and analyze only source code files, then all other types of files will be filtered out of the listing. There are many dimensions (depending on the meta-data specific to a piece of data) on which filtering can be done: name, size, content type, folder, sender, subject, author, dates etc. Once the filtered list is ready, the reader module does the job of actually downloading the data and any related meta-data.
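The connector, browser, filter and reader split can be illustrated with a small crawler over a local directory tree. Everything here is an assumption made for the sake of the example: LocalFsConnector stands in for a real CIFS, NFS or Exchange connector, and the filter looks only at file extension and size.

```python
import os
from dataclasses import dataclass
from typing import Callable, Iterator, Tuple


@dataclass
class Entry:
    """Meta-data the browser reports for one item in a data source."""
    path: str
    size: int
    modified: float


class LocalFsConnector:
    """Hypothetical connector for a local directory tree."""

    def __init__(self, root: str) -> None:
        self.root = root

    def browse(self) -> Iterator[Entry]:
        """Browser step: list everything accessible under the root."""
        for dirpath, _dirnames, filenames in os.walk(self.root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                info = os.stat(full)
                yield Entry(full, info.st_size, info.st_mtime)

    def read(self, entry: Entry) -> bytes:
        """Reader step: download the actual content of one entry."""
        with open(entry.path, "rb") as handle:
            return handle.read()


def source_code_only(entry: Entry) -> bool:
    """Filter step: keep only files that look like source code by name."""
    return entry.path.lower().endswith((".c", ".h", ".py", ".java"))


def crawl(
    connector: LocalFsConnector,
    keep: Callable[[Entry], bool],
    max_size: int = 10 * 1024 * 1024,
) -> Iterator[Tuple[Entry, bytes]]:
    """Browse, filter on meta-data, then read only what survived the filter."""
    for entry in connector.browse():
        if entry.size <= max_size and keep(entry):
            yield entry, connector.read(entry)
```

Filtering on meta-data before reading matters in practice, because the read step is usually the expensive one on remote data sources.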

The monitoring component is typically composed of the following modules: data tap, reassembly, protocol analysis, content analysis, indexing engine, rule engine and incident management. The data tap captures data from the wire for further analysis (e.g. as Wireshark, formerly Ethereal, does). As mentioned earlier, this capture can happen at any protocol level; this differs from vendor to vendor (depending on design philosophy). After data is captured from the wire, it is beaten into a form that is suitable for further analysis. For example, captured TCP packets could be reassembled into a higher-level protocol like HTTP and further into application-level data like Yahoo! Mail.

Once the data is in an analyzable form, the first level of policy/rule evaluation is done using protocol analysis. Here, the data is parsed for protocol-specific fields like IP addresses, ports, possible geographic locations of IPs, To, From, Cc, FTP commands, Yahoo! Mail XML tags, GTalk commands and so on. Policy rules that depend on any such protocol-level information are evaluated at this stage. An example is: outbound FTP to any IP address in Russia. If a match occurs, it is recorded with all relevant information into a database.

The next step, content analysis, is more involved: first, the actual data and meta-data are extracted out of the assembled packets, and then the content type of the data (e.g. PPT, PDF, ZIP, C source, Python source) is determined using signatures and rule-based classification techniques (a similar but less powerful thing is the “file” command in Unix). Depending on the content type of the data, text is extracted along with as much meta-data as possible. Now, content-based rules are applied; for example, disallow all Java source code. Again, matches are stored. Depending on the rules, more involved analysis like classification (e.g. Bayesian), entity recognition, tagging and clustering can also be done.
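As a rough illustration of content-type detection by signatures plus rule-based classification, the sketch below checks a few well-known leading byte signatures and then falls back to crude text heuristics. The signature values for PDF, ZIP and the legacy OLE2 Office container are standard; the text patterns and all names are assumptions for this example and would misclassify plenty of real files.

```python
import re
from typing import Iterable

# Leading-byte signatures ("magic numbers") for a few binary formats.
MAGIC_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),  # also the container for .docx/.pptx/.xlsx
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy .doc/.ppt/.xls
]

# Crude, illustrative rules for telling source languages apart.
TEXT_HEURISTICS = [
    ("c-source", re.compile(r"^\s*#\s*include\s*[<\"]", re.MULTILINE)),
    ("python-source", re.compile(r"^\s*def\s+\w+\s*\(.*\)\s*(?:->.*)?:\s*$", re.MULTILINE)),
    ("java-source", re.compile(r"^\s*(?:public|private|protected)\s+(?:\w+\s+)*class\s+\w+", re.MULTILINE)),
]


def detect_content_type(data: bytes) -> str:
    """Classify a payload by signature first, then by text heuristics."""
    for magic, name in MAGIC_SIGNATURES:
        if data.startswith(magic):
            return name
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "binary"
    for name, pattern in TEXT_HEURISTICS:
        if pattern.search(text):
            return name
    return "text"


def violates_type_rule(data: bytes, blocked_types: Iterable[str]) -> bool:
    """Content rule of the form 'disallow all payloads of these types'."""
    return detect_content_type(data) in set(blocked_types)
```

A rule such as "disallow all Java source code" then reduces to violates_type_rule(payload, ["java-source"]).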
The extracted text and meta-data are passed on to the indexing engine, where they are indexed and made searchable. Another set of rules, which depend on the contents of the data, is evaluated at this point; an example: stop all MS Office or PDF files containing the words “proprietary and confidential” with a frequency of at least once per page. The indexing engine typically makes use of an inverted index, but there are other ways also. This index can also be used later to do ad-hoc searches (e.g. for deeper analysis of a policy match). Throughout this whole process, the rule engine keeps evaluating many rules against many pieces of data and keeping track of all the matches. The matches are collated into what are called incidents (i.e. actionable events from an organization’s perspective) with as much detail as possible. These incidents are then shown or notified to the user and/or sent to the prevention module for further action.
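A positional inverted index is enough to support both ad-hoc searches and a phrase-frequency rule like the one above. The following is a minimal in-memory sketch; the class and function names are hypothetical, the page count is assumed to come from the document's meta-data, and "at least once per page" is interpreted as an average over the document.

```python
import re
from collections import defaultdict
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Maps term -> document id -> positions of the term in that document."""

    def __init__(self) -> None:
        self.postings: Dict[str, Dict[str, List[int]]] = defaultdict(
            lambda: defaultdict(list)
        )

    def add(self, doc_id: str, text: str) -> None:
        for position, term in enumerate(tokenize(text)):
            self.postings[term][doc_id].append(position)

    def search(self, term: str) -> List[str]:
        """Ad-hoc search: ids of all documents containing the term."""
        return sorted(self.postings.get(term.lower(), {}))

    def phrase_count(self, doc_id: str, phrase: str) -> int:
        """Count occurrences of a phrase using consecutive term positions."""
        terms = tokenize(phrase)
        if not terms:
            return 0
        position_sets = [
            set(self.postings.get(term, {}).get(doc_id, ())) for term in terms
        ]
        return sum(
            1
            for start in position_sets[0]
            if all(start + offset in positions
                   for offset, positions in enumerate(position_sets))
        )


def violates_per_page_rule(index: InvertedIndex, doc_id: str,
                           page_count: int, phrase: str) -> bool:
    """True if the phrase occurs, on average, at least once per page."""
    if page_count <= 0:
        return False
    return index.phrase_count(doc_id, phrase) / page_count >= 1
```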

The prevention module contains a rule engine, an action module and (possibly) connectors. The rule engine evaluates incoming incidents to determine the action(s) that need to be taken. Then the action module kicks in and does the appropriate thing, like blocking the data, encrypting it and sending it on, quarantining it and so on. In some scenarios, the action module may require help from connectors for taking the action. For example, for quarantining, a NAS connector may be used, or for placing a legal hold, a CAS system like Centera may be used. Prevention during content discovery also needs connectors to take actions on data sources like Exchange, databases and file systems.
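The division of labor between the rule engine, the action module and a connector might look like the sketch below. The incident fields, the severity scale, the ordered first-match rule list and the in-memory quarantine (standing in for a NAS connector) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    policy: str     # name of the policy that matched
    severity: int   # hypothetical scale, 1 (low) to 5 (high)
    channel: str    # e.g. "smtp", "ftp", "usb"
    payload: bytes


class InMemoryQuarantine:
    """Stand-in for a NAS connector; a real one would write to a share."""

    def __init__(self) -> None:
        self.items: List[Incident] = []

    def put(self, incident: Incident) -> None:
        self.items.append(incident)


# Rule engine: ordered rules, the first matching predicate decides the actions.
PREVENTION_RULES = [
    (lambda incident: incident.severity >= 4, ["block", "notify"]),
    (lambda incident: incident.channel == "smtp", ["quarantine", "notify"]),
    (lambda incident: True, ["notify"]),
]


def decide(incident: Incident) -> List[str]:
    """Return the action names chosen for this incident."""
    for predicate, actions in PREVENTION_RULES:
        if predicate(incident):
            return list(actions)
    return []


def apply_actions(incident: Incident, quarantine: InMemoryQuarantine,
                  notifications: List[str]) -> bool:
    """Action module: run the actions; return True if the data may go on."""
    forward = True
    for action in decide(incident):
        if action == "block":
            forward = False
        elif action == "quarantine":
            quarantine.put(incident)  # needs the connector
            forward = False
        elif action == "notify":
            notifications.append(f"{incident.policy} on {incident.channel}")
    return forward
```

Keeping the decision (decide) separate from the execution (apply_actions) mirrors the text: the same rules can drive different connectors depending on where the prevention module is deployed.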

Going Further

There are many “value-added” things that are done on top of the functionality described above. These are sometimes sold as separate features or products altogether.

Reporting and OLAP. Information from matches and incidents is fed into cubes and data warehouses so that OLAP and advanced reporting can be done with it.
Data mining. Incident/match information or even stored captured data is mined to discover patterns and trends, plot graphs and generate fancier reports. The possibilities here are endless and this seems to be the hottest field of research in DLP right now.
E-discovery. Here, factors important from an e-discovery perspective are extracted from the incident database or captured data and then pushed into e-discovery products or services for processing, review or production purposes. This process may also involve some data mining.
Learning. Incidents and mined information are used to provide feedback into the DLP setup. Eventually, this can improve existing policies and even provide new policy options.
Integration with third-parties. For example, integration with BlueCoat provides setups that can capture and analyze HTTPS/SSL traffic.

DLP in Reconnex

Reconnex is a leader in DLP technology and in the DLP market. Its products and solutions deliver accurate protection against known data loss and provide the only solution in the market that automatically learns what your sensitive data is, as it evolves in your organization. As of today, Reconnex protects information for more than one million users. Reconnex starts with the protection of obvious sensitive information like credit card numbers, social security numbers and known sensitive files but goes further by storing and indexing up to all communications and up to all content. It is the only company in this field to do so. Capturing all content and indexing it enables organizations to learn what information is sensitive and who is allowed to see it, or conversely who should not see it. Reconnex is also well-known for its unique case management capabilities, where incidents and their disposition can be grouped, tracked and managed as cases.

Reconnex is also the only solution in the market that is protocol-agnostic. It captures data at the network level and reconstructs it to higher levels – from TCP/IP to HTTP, SMTP and FTP to GMail, Yahoo! Chat and Live.

Reconnex offers all three flavors of DLP through its three flagship products: iGuard (data-in-motion), iDiscover (data-at-rest) and Data-in-Use. All its products have consistently been rated highly in almost all surveys and opinion polls. Industry analysts Forrester and Gartner also consider Reconnex a leader in its domain.

About the author: Ankur Panchbudhe is a principal software engineer at Reconnex, Pune. He has more than 6 years of R&D experience in the domains of data security, archiving, content management, data mining and storage software. He has 3 patents granted and more than 25 pending in the fields of electronic discovery, data mining, archiving, email systems, content management, compliance, data protection/security, replication and storage. You can find Ankur on Twitter.

5 thoughts on “Data Leakage Prevention – Overview”

Joel says:

June 12, 2008 at 2:30 pm

Great article Ankur!
Benjamin Wright says:

June 12, 2008 at 6:45 pm

Ankur: In the US, data leakage is a big legal topic because the law requires firms that lose privacy data to notify the subjects of the data that their data has been compromised. However, we don’t have a good definition of what is and what is not a data leakage. The result is too many notices. I argue it is irresponsible for law and legal practice to bury consumers with an excessive number of data breach notices.

–Ben
http://hack-igations.blogspot.com/2007/12/does-lost-tape-equate-to-lost-data.html
Kyle says:

June 13, 2008 at 4:03 pm

Ankur, really great article and a topic of vast importance. We talk a lot about deep packet inspection (DPI) over at our blog, not only from a DLP standpoint, but also in terms of performance of application traffic. This post certainly helped me get a bit smarter about the DLP benefits.

Looking forward to reading more and following you on Twitter!

/kff
Ankur P says:

June 16, 2008 at 4:52 am

Thanks Joel, Kyle.

Ben, you have a good point. Although for an organization it might be possible to exactly define what does and does not constitute data leakage, for law makers it must be very tough. IMHO, such laws are generally written to enforce and maintain a state of paranoia (“prevention is better than cure”).

I agree with you that data being stolen does not necessarily mean that it has also been compromised. But then the question is of striking a balance somewhere: do we lose millions of dollars by being reactive (act only after some data is compromised) or do we spend millions of dollars by being pro-active (buy DLP, anti-malware, anti-phishing and all that)? I think governments and/or organizations generally prefer the latter option to create a general sense of security (or rather an illusion? :-).
sumi says:

June 30, 2008 at 8:19 pm

Very well written article that talks about DLP in a very lucid manner! Thanks

punetech.com

Connecting together Pune's Technologists