Monthly Archives: June 2008

Data Leakage Prevention – Overview

A few days ago, we posted a news article on how Reconnex has been named a top leader in Data Leakage Prevention (DLP) technology by Forrester Research. We asked Ankur Panchbudhe of Reconnex, Pune to write an article giving us a background on what DLP is, and why it is important.

Data leakage prevention (DLP) is a solution for identifying, monitoring and protecting sensitive data or information in an organization according to policies. Organizations can have varied policies, but typically they tend to focus on preventing sensitive data from leaking out of the organization and identifying people or places that should not have access to certain data or information.

DLP is also known by many other names: information security, content monitoring and filtering (CMF), extrusion prevention, outbound content management, insider threat protection, information leak prevention (ILP), etc.

Need for DLP

Until a few years ago, organizations thought of data/information security only in terms of protecting their network from intruders (e.g. hackers). But with the growing amount of data, rapid growth in the size of organizations (e.g. due to globalization), a rise in the number of data points (machines and servers) and easier modes of communication (e.g. IM, USB, cellphones), accidental or even deliberate leakage of data from within the organization has become a painful reality. This has led to growing awareness about information security in general, and about outbound content management in particular.

Following are the major reasons (and examples) that make an organization think about deploying DLP solutions:

  • growing cases of data and IP leakages
  • regulatory mandates to protect private and personal information
    • for example, the case of Monster.com losing over a million job seekers’ records to attackers, who then used them for phishing
  • protection of brand value and reputation
    • see above example
  • compliance (e.g. HIPAA, GLBA, SOX, PCI, FERPA)
    • for example, the Formula 1 dispute in which McLaren was found in possession of Ferrari’s internal technical documents
  • internal policies
    • for example, Facebook leaking some pieces of their code
  • profiling for weaknesses
    • Who has access to what data? Is sensitive data lying on public servers? Are employees doing what they are not supposed to do with data?

Components of DLP

Broadly, the core DLP process has three components: identification, monitoring and prevention.

The first, identification, is a process of discovering what constitutes sensitive content within an organization. For this, an organization first has to define “sensitive”. This is done using policies, which are composed of rules, which in turn could be composed of words, patterns or something more complicated. These rules are then fed to a content discovery engine that “crawls” data sources in the organization for sensitive content. Data sources could include application data like HTTP/FTP servers, Exchange, Notes, SharePoint and database servers, repositories like filers and SANs, and end-user data sources like laptops, desktops and removable media. There could be different policies for different classes of data sources; for example, the policies for SharePoint could try to identify design documents whereas those for Oracle could be tuned to discover credit card numbers. All DLP products ship with pre-defined policy “packages” for well-known scenarios, like PCI compliance, credit card and social security leakage.
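
To make the identification step concrete, here is a minimal sketch, assuming a policy is just a named set of regular-expression rules and a discovery engine that walks a directory tree. The `Policy`, `Rule` and `discover` names, and the sample pattern, are illustrative assumptions, not any product’s API.

```python
# Minimal sketch of policy-driven content discovery (illustrative only;
# class and function names are hypothetical, not any DLP product's API).
import os
import re
from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    pattern: re.Pattern

@dataclass
class Policy:
    name: str
    rules: list

# Example policy: flag files containing credit-card-like numbers.
PCI_POLICY = Policy(
    name="PCI compliance",
    rules=[Rule("credit-card", re.compile(r"\b(?:\d[ -]?){13,16}\b"))],
)

def discover(root_dir, policy):
    """Crawl a directory tree and report files matching any policy rule."""
    matches = []
    for dirpath, _dirs, files in os.walk(root_dir):
        for fname in files:
            path = os.path.join(dirpath, fname)
            try:
                with open(path, errors="ignore") as f:
                    text = f.read()
            except OSError:
                continue  # unreadable file: skip it; a real crawler would log this
            for rule in policy.rules:
                if rule.pattern.search(text):
                    matches.append((path, policy.name, rule.name))
    return matches

# Usage: discover("/mnt/fileshare", PCI_POLICY)
```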

The second component, monitoring, typically deployed at the network egress point or on end-user endpoints, is used to flag data or information that should not be going out of the organization. This flagging is done using a set of rules and policies, which could be written independently for monitoring purposes, or could be derived from information gleaned during the identification process (previous paragraph). The monitoring component taps into raw data going over the wire, does some (optional) semantic reconstruction and applies policies to it. Raw data can be captured at many levels – network level (e.g. TCP/IP), session level (e.g. HTTP, FTP) or application level (e.g. Yahoo! Mail, GMail). The level at which raw data is captured determines whether, and how much, semantic reconstruction is required. The reconstruction process tries to assemble fragments of raw data into processable information, to which policies can be applied.
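
As a toy illustration of semantic reconstruction (a sketch under assumed data shapes, not how any particular product does it): with session-level capture, the segments of one TCP session may arrive out of order and must be reordered before a content rule can match the message they carry.

```python
# Toy illustration of semantic reconstruction: reorder captured TCP segments
# by sequence number so content rules can be applied to the full message.
def reconstruct(segments):
    """segments: list of (sequence_number, payload_bytes) for one session."""
    ordered = sorted(segments, key=lambda s: s[0])
    return b"".join(payload for _seq, payload in ordered).decode("utf-8", errors="ignore")

# The fragments of "proprietary and confidential" arrive out of order,
# but the reconstructed text is still matchable by a content rule.
message = reconstruct([(2, b" and confi"), (1, b"proprietary"), (3, b"dential")])
assert "proprietary and confidential" in message
```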

The third component, prevention, is the process of taking action on the data flagged by the identification or monitoring component. Many types of actions are possible – blocking the data, quarantining it, deleting it, encrypting it, compressing it, notifying someone and more. Prevention actions are also typically configured using policies and hook into identification and/or monitoring policies. This component is typically deployed along with the monitoring or identification component.
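
A minimal sketch of how such an action might be dispatched, assuming a hypothetical `Incident` record that carries the configured action; the action names are simply the ones listed above, and the whole shape is an assumption rather than a product interface.

```python
# Illustrative sketch of a prevention step: map a policy verdict to an action.
# The Incident shape and action names are assumptions, not a product API.
from dataclasses import dataclass

@dataclass
class Incident:
    data: bytes
    policy: str
    action: str  # e.g. "block", "quarantine", "notify"

def prevent(incident, quarantine_store, notifier):
    """Apply the configured action for a flagged piece of data."""
    if incident.action == "block":
        return None                        # drop the data; nothing goes out
    if incident.action == "quarantine":
        quarantine_store.append(incident)  # hold the data for review
        return None
    if incident.action == "notify":
        notifier(f"policy '{incident.policy}' matched")  # let it pass, but alert
        return incident.data
    return incident.data                   # default: allow

# Usage:
# store, alerts = [], []
# prevent(Incident(b"file contents here", "PCI compliance", "quarantine"), store, alerts.append)
```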

In addition to the above three core components, there is a fourth piece, which can be called control. This is the component through which the user can centrally manage and monitor the whole DLP process. It typically includes the GUI, the policy/rule definition and deployment module, process control, reporting and various dashboards.

Flavors of DLP

DLP products are generally sold in three “flavors”:

  • Data in motion. This flavor corresponds to a combination of the monitoring and prevention components described in the previous section. It is used to monitor and control outgoing traffic. This is the hottest-selling DLP solution today.
  • Data at rest. This is the content discovery flavor that scours an organization’s machines for sensitive data. This solution usually also includes a prevention component.
  • Data in use. This solution consists of agents that run on servers and on end users’ laptops or desktops, keeping a watch on all activities related to data. They typically monitor and prevent activity on file systems and removable media like USB drives, CDs and Bluetooth.

These individual solutions can be (and are) combined to create a much more effective DLP setup. For example, data at rest could be used to identify sensitive information, fingerprint it and deploy those fingerprints with data in motion and data in use products for an all-scenario DLP solution.
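
As a rough illustration of that combination, the sketch below fingerprints a document found at rest by hashing overlapping word windows, and then checks outbound text against those fingerprints. The chunk-hashing approach and the function names are assumptions for illustration; real products use more robust partial-document matching.

```python
# Sketch of combining flavors: fingerprint sensitive documents found at rest,
# then check outbound (in-motion) content against those fingerprints.
import hashlib

def fingerprint(text, chunk_words=8):
    """Hash overlapping word windows of a sensitive document."""
    words = text.split()
    return {
        hashlib.sha1(" ".join(words[i:i + chunk_words]).encode()).hexdigest()
        for i in range(max(1, len(words) - chunk_words + 1))
    }

def leaks(outbound_text, known_fingerprints, chunk_words=8):
    """True if any window of the outbound text matches a stored fingerprint."""
    return bool(fingerprint(outbound_text, chunk_words) & known_fingerprints)

# Usage:
# prints = fingerprint(open("design_doc.txt").read())   # data at rest
# leaks("excerpt pasted into webmail", prints)          # data in motion
```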

Technology

DLP solutions classify data in motion, at rest, and in use, and then dynamically apply the desired type and level of control, including the ability to perform mandatory access control that can’t be circumvented by the user. DLP solutions typically:

  • Perform content-aware deep packet inspection on outbound network communication including email, IM, FTP, HTTP and other TCP/IP protocols
  • Track complete sessions for analysis, not individual packets, with full understanding of application semantics
  • Detect (or filter) content based on policy-defined rules
  • Use linguistic analysis techniques beyond simple keyword matching for monitoring (e.g. advanced regular expressions, partial document matching, Bayesian analysis and machine learning); a minimal sketch of one such technique follows this list
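
The following is a tiny naive-Bayes-style sketch of what “beyond keyword matching” can mean: scoring a message as sensitive or normal from labeled examples. It is illustrative only and omits class priors and the far richer features and training data a real engine would use.

```python
# Tiny naive-Bayes-style sketch of classifying text as "sensitive" vs "normal".
# Illustrative only; class priors are omitted and the training set is made up.
import math
from collections import Counter

def train(examples):
    """examples: list of (text, label) pairs; label is 'sensitive' or 'normal'."""
    counts = {"sensitive": Counter(), "normal": Counter()}
    for text, label in examples:
        counts[label].update(text.lower().split())
    return counts

def log_score(text, counts, label):
    """Log-likelihood of the text under one class (class priors omitted)."""
    total = sum(counts[label].values()) or 1
    vocab = len(set(counts["sensitive"]) | set(counts["normal"])) or 1
    return sum(
        math.log((counts[label][w] + 1) / (total + vocab))  # Laplace smoothing
        for w in text.lower().split()
    )

def classify(text, counts):
    return max(("sensitive", "normal"), key=lambda lbl: log_score(text, counts, lbl))

# Usage (tiny, made-up training set):
# counts = train([("q3 revenue forecast draft", "sensitive"),
#                 ("lunch menu for friday", "normal")])
# classify("attached: revenue forecast", counts)   # -> "sensitive"
```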

Content discovery makes use of crawlers to find sensitive content in an organization’s network of machines. Each crawler is composed of a connector, a browser, a filtering module and a reader. A connector is a data-source-specific module that helps in connecting to, browsing and reading from a data source; there are connectors for various types of data sources like CIFS, NFS, HTTP, FTP, Exchange, Notes, databases and so on. The browser module lists all the data that is accessible within a data source. This listing is then filtered depending on the requirements of discovery. For example, if the requirement is to discover and analyze only source code files, then all other types of files are filtered out of the listing. There are many dimensions (depending on the meta-data specific to a piece of data) on which filtering can be done: name, size, content type, folder, sender, subject, author, dates, etc. Once the filtered list is ready, the reader module does the job of actually downloading the data and any related meta-data.
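
Here is a minimal sketch of that crawler structure for one kind of data source, a mounted file system. The class and function names are hypothetical; a real product ships a separate connector per source type (NFS, Exchange, databases, and so on).

```python
# Minimal sketch of a crawler: connector + browser + filter + reader, for one
# data-source type (a locally mounted file system). Interfaces are hypothetical.
import os

class FileSystemConnector:
    """Connector, browser and reader for a mounted file system."""
    def __init__(self, root):
        self.root = root

    def browse(self):
        """List everything accessible in the data source."""
        for dirpath, _dirs, files in os.walk(self.root):
            for name in files:
                yield os.path.join(dirpath, name)

    def read(self, path):
        with open(path, "rb") as f:
            return f.read()

def filter_listing(paths, extensions=(".c", ".py", ".java")):
    """Keep only items relevant to this discovery run (here: source code files)."""
    return [p for p in paths if p.endswith(extensions)]

def crawl(connector):
    """Browse, filter, then read each remaining item for analysis."""
    for path in filter_listing(connector.browse()):
        yield path, connector.read(path)

# Usage: for path, blob in crawl(FileSystemConnector("/mnt/share")): ...
```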

The monitoring component is typically composed of the following modules: data tap, reassembly, protocol analysis, content analysis, indexing engine, rule engine and incident management. The data tap captures data from the wire for further analysis (much like Wireshark, formerly Ethereal, does). As mentioned earlier, this capture can happen at any protocol level; this differs from vendor to vendor, depending on design philosophy. After data is captured from the wire, it is transformed into a form suitable for further analysis. For example, captured TCP packets could be reassembled into a higher-level protocol like HTTP, and further into application-level data like Yahoo! Mail.

Once the data is in an analyzable form, the first level of policy/rule evaluation is done using protocol analysis. Here, the data is parsed for protocol-specific fields like IP addresses, ports, possible geographic locations of IPs, To, From, Cc, FTP commands, Yahoo! Mail XML tags, GTalk commands and so on. Policy rules that depend on any such protocol-level information are evaluated at this stage; an example is flagging outbound FTP to any IP address in Russia. If a match occurs, it is recorded with all relevant information in a database.

The next step, content analysis, is more involved: first, actual data and meta-data are extracted out of the assembled packets, and then the content type of the data (e.g. PPT, PDF, ZIP, C source, Python source) is determined using signatures and rule-based classification techniques (the Unix “file” command does something similar, though less powerful). Depending on the content type, text is extracted along with as much meta-data as possible. Now content-based rules are applied – for example, disallow all Java source code. Again, matches are stored. Depending on the rules, more involved analysis like classification (e.g. Bayesian), entity recognition, tagging and clustering can also be done.

The extracted text and meta-data are passed on to the indexing engine, where they are indexed and made searchable. Another set of rules, which depend on the contents of the data, is evaluated at this point; an example: stop all MS Office or PDF files containing the words “proprietary and confidential” with a frequency of at least once per page. The indexing engine typically makes use of an inverted index, but other approaches exist as well. This index can also be used later for ad-hoc searches (e.g. for deeper analysis of a policy match). Throughout this process, the rule engine evaluates many rules against many pieces of data and keeps track of all the matches. The matches are collated into what are called incidents (i.e. actionable events, from an organization’s perspective) with as much detail as possible. These incidents are then reported or shown to the user and/or sent to the prevention module for further action.
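
The skeleton below sketches the order of those stages with deliberately simplified stubs. The stage boundaries and the data shapes are assumptions of this sketch; the internals differ from vendor to vendor even when the overall flow does not.

```python
# Skeleton of the monitoring pipeline: data tap output -> reassembly ->
# protocol rules -> content rules -> index + incident store. Stubs only;
# the data shapes here are assumptions made for illustration.
def reassemble(packets):
    """Rebuild one application-level message from a session's raw packets.
    A real implementation parses TCP/HTTP/SMTP; here we just join payloads."""
    return {"protocol": "unknown",
            "body": b"".join(packets).decode("utf-8", errors="ignore")}

def protocol_rules(message):
    """First-pass rules on protocol-level fields (addresses, ports, commands)."""
    return ["outbound-ftp"] if message["protocol"] == "ftp" else []

def content_rules(message):
    """Content-type detection and text extraction would happen here; this stub
    only checks for a confidentiality marking in the extracted text."""
    body = message["body"].lower()
    return ["confidential-marking"] if "proprietary and confidential" in body else []

def process(captured_sessions, index, incidents):
    """Run each captured session through the pipeline and collect incidents."""
    for packets in captured_sessions:
        message = reassemble(packets)
        index.append(message["body"])    # make the content searchable later
        matches = protocol_rules(message) + content_rules(message)
        if matches:
            incidents.append({"message": message, "matches": matches})

# Usage:
# idx, inc = [], []
# process([[b"Subject: roadmap\r\n", b"Proprietary and Confidential"]], idx, inc)
```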

The prevention module contains a rule engine, an action module and (possibly) connectors. The rule engine evaluates incoming incidents to determine the action(s) that need to be taken. Then the action module kicks in and does the appropriate thing, like blocking the data, encrypting it and sending it on, quarantining it and so on. In some scenarios, the action module may require help from connectors to take the action. For example, for quarantining, a NAS connector may be used, or for putting data on legal hold, a CAS system like EMC Centera may be deployed. Prevention during content discovery also needs connectors to take actions on data sources like Exchange, databases and file systems.

Going Further

There are many “value-added” things that are done on top of the functionality described above. These are sometimes sold as separate features or products altogether.

  • Reporting and OLAP. Information from matches and incidents is fed into cubes and data warehouses so that OLAP and advanced reporting can be done with it.
  • Data mining. Incident/match information or even stored captured data is mined to discover patterns and trends, plot graphs and generate fancier reports. The possibilities here are endless and this seems to be the hottest field of research in DLP right now.
  • E-discovery. Here, factors important from an e-discovery perspective are extracted from the incident database or captured data and then pushed into e-discovery products or services for processing, review or production purposes. This process may also involve some data mining.
  • Learning. Incidents and mined information are used to provide feedback into the DLP setup. Eventually, this can improve existing policies and even suggest new policy options.
  • Integration with third-parties. For example, integration with BlueCoat provides setups that can capture and analyze HTTPS/SSL traffic.

DLP at Reconnex

Reconnex is a leader in DLP technology and the DLP market. Its products and solutions deliver accurate protection against known data loss and provide the only solution in the market that automatically learns what your sensitive data is, as it evolves in your organization. As of today, Reconnex protects information for more than one million users. Reconnex starts with the protection of obviously sensitive information like credit card numbers, social security numbers and known sensitive files, but goes further by storing and indexing up to all communications and all content; it is the only company in this field to do so. Capturing all content and indexing it enables organizations to learn what information is sensitive and who is allowed to see it, or conversely who should not see it. Reconnex is also well known for its unique case management capabilities, where incidents and their disposition can be grouped, tracked and managed as cases.

Reconnex is also the only solution in the market that is protocol-agnostic. It captures data at the network level and reconstructs it to higher levels – from TCP/IP to HTTP, SMTP and FTP to GMail, Yahoo! Chat and Live.

Reconnex offers all three flavors of DLP through its three flagship products: iGuard (data-in-motion), iDiscover (data-at-rest) and Data-in-Use. All its products have consistently been rated highly in almost all surveys and opinion polls. Industry analysts Forrester and Gartner also consider Reconnex a leader in this domain.

About the author: Ankur Panchbudhe is a principal software engineer at Reconnex, Pune. He has more than 6 years of R&D experience in the domains of data security, archiving, content management, data mining and storage software. He has 3 patents granted and more than 25 pending in the fields of electronic discovery, data mining, archiving, email systems, content management, compliance, data protection/security, replication and storage. You can find Ankur on Twitter.


Reconnex named a leader in DLP by Forrester

(News item forwarded to PuneTech by Anand Kekre of Reconnex)

Reconnex has been named a “top leader” in the data leak prevention space by Forrester in its DLP Q2 2008 report.

DLP software allows a company to monitor all data movements within the organization and ensure that “sensitive” data (e.g. intellectual property, financial information, etc.) does not go out of the company. Reconnex and Websense have been named as the two top leaders in this space by Forrester.

Forrester employed approximately 74 criteria in the categories of current offering, strategy, and market presence to evaluate participating vendors on a scale from 0 (weak) to 5 (strong).

  • Reconnex received a perfect score of 5.0 in the sub-categories of data-in-motion (i.e., the network piece of DLP), unified management, and administration
  • Reconnex tied for the top scores in the sub-categories of data-at-rest (i.e., discovery) and forensics

“Reconnex offers best-in-class product functionality through its automated classification and analysis engine, which allows customers to sift through the actual data that the engine monitors to learn what is important to protect,” according to the Forrester Wave: Data Leak Prevention, Q2 2008 Report. “This solution stands out because it is the only one that automatically discovers and classifies sensitive data without prior knowledge of what needs to be protected.”

For more information about this recognition, see Reconnex’s press release.

For more information about Reconnex technology, see the PuneTech wiki profile of Reconnex.


User groups in Pune & meetings this weekend: Java, Linux, Flex, Ruby

I just added information about four user groups in Pune to the Groups and Organizations page in the PuneTech wiki: PuneRuby, PuneJava, Pune Linux Users Group (PLUG), and Pune Flex Users Group (PuneFUG). Please take a look at their pages to get an idea of their activities. PLUG and PuneJava have regular meetings. PuneRuby and PuneJava have very active mailing lists. PuneFUG is relatively new, but it looks like they will have regular meetings.

PuneJava has a talk on agile development this Saturday at 6pm. (You should already know this; otherwise subscribe to PuneTech updates!) Just before that, at the same location, PLUG is holding its monthly meeting (see their website for details). Pune Flex Users Group is holding a meeting on Sunday at 5pm (see their wiki for details).

Also, the Pune OpenCoffee Club (for entrepreneurs, and others interested in the startup ecosystem in Pune) is planning on a meeting next weekend. Chip in on that discussion if you want to influence the time/location or agenda.

Update: Rohit points out in the comments that there’s a Pune bloggers lunch on Saturday at 12:30pm.

Agile Development for Java Enterprise Applications – 7 June

What: Talk on “Agile Development for Java Enterprise Applications” by Prerna Patil from Oracle Financial Services (formerly i-Flex solutions)

When: Saturday, 7th June 2008, 6.00 pm – 7.30 pm
Where: Symbiosis Institute of Computer Studies and Research (SICSR), 7th floor, Atur Center, Model Colony, Pune
* Entry is free. No registration required; entry is on a first-come, first-served basis.
This event is an activity of the PuneJava group.

Session Overview: Agile is a way to quickly develop working applications by focusing on evolving requirements rather than processes. Agile development is done in an iterative manner with short requirement cycles, quick builds and frequent releases. Compared to traditional practices like the waterfall model, agile methodology makes development easier, faster and more adaptive.

The session would provide a roadmap for building enterprise-class Java applications using agile methods. It would include an introduction to agile methodology and when and why it should be used. Various practices used for agile development (Agile Modeling, Agile Draw, agile estimation) would be discussed. An agile-development case study would be built using lightweight technologies like Spring, ORM for database handling, a test-driven development approach, and build management and configuration control in a concurrent development environment. The session would also include coding practices that make code adaptable to new requirements, and tips for using IDEs (Eclipse and NetBeans) for agile development.

Speaker Bio: Prerana Patil has over 5 years of experience working with Java and Java enterprise applications. She currently works in the Technology Practice group of Oracle Financial Services (formerly i-flex Solutions Limited). She has a Masters in Computer Science from UoP and loves exploring new things in the software world. She has been involved in various trainings on Java and Java EE.
* For those not in Pune or unable to attend, add your queries in the comments section at http://www.indicthreads.com/news/1228/agile_development_enterprise_java_meet.html and the organizers promise to get them answered at the meet.

Linux Kernel Internals – 7 June

Posted on behalf of Anurag Agarwal of KQ Infotech. This is a paid training program; see the end of this article for fees and other logistics. Disclaimer: PuneTech does not accept any remuneration, monetary or otherwise, for publishing content. Postings of a commercial nature (e.g. a paid training program) are published solely on the basis of whether they fit in with the charter of PuneTech and whether readers would find them interesting. Please let me know your views on this issue. I’m posting this one because it involves “deep” technology, I think it would be of interest to a number of PuneTech readers, and because I can recommend the program based on the reputation of the trainers.

Are you planning to make a career in Linux kernel development but don’t have the right skills?
Have you burned your fingers writing your own code in the Linux kernel?

KQ Infotech is launching a unique training and mentoring program for you. There are two parts to this program: the first is a forty-hour intensive training, and the second is long-term mentoring.

The training will provide you with a good understanding of all the subsystems of Linux. It will enable you to debug and modify the Linux kernel and write various device drivers. There will be a number of theory sessions, practical sessions, peer code reviews and code walkthroughs. All the assignments will be targeted towards specific Linux kernel functionality.

But a forty-hour training alone is unlikely to make you a good Linux kernel programmer; without a real kernel project to work on, the skills rarely stick. A special mentoring program, launched after the training, addresses this.

All students will be provided with a medium-term Linux kernel project. KQ Infotech will facilitate a two-hour weekly meeting to discuss and solve problems related to this project. This program will enable one to become a good Linux kernel programmer in three to six months’ time.

About the Trainers
We have 7 to 11 years of experience developing systems storage software and virtualization solutions at Symantec (formerly Veritas) and are well versed in the challenges and techniques of engineering in kernel space. We believe that our strong development background on heterogeneous platforms (Linux, Solaris and AIX) will enable us to give our audience a better perspective on developing in the Linux kernel.

Course Contents
This course assumes a good knowledge of C, familiarity with editors (vi/emacs) and a basic understanding of user space and kernel space.

This course is designed for people with a day job. It will be delivered on Saturdays and Sundays, for five hours each day. There will be assignments, code reviews and code walkthroughs.

A brief overview of the course contents is below

Day 1: Introduction to Linux

  • Linux Kernel Features

  • Code layout

  • Build Linux Kernel

  • Introduction to the Kernel Module API

  • Legal Issues with Linux Kernel

  • Introduction to System Calls

Day 2: Process management and scheduling

  • Introduction to Process Management

  • Linux kernel O(1) scheduler

  • Fork system call

  • Signals and signal handlers

  • Debugging techniques for the Linux kernel

  • Proc file system

Day 3: Synchronization

  • Important Data structures in the kernel

  • Synchronization primitives in the linux kernel

Day 4: Interrupts

  • Introduction to Interrupts and ISRs

  • Introduction to bottom halves, tasklet

Day 5: File System

  • Introduction to VFS

  • ext2/ext3 filesystem

Day 6: Device Drivers

  • Introduction to character device drivers

  • Introduction to block device drivers

Day 7 and 8: Memory Management

  • Introduction to Linux Memory Management

  • Allocator, allocation schemes

  • Process view of memory, kernel view of memory

  • Drill-down on x86 specifics: L1 cache, L2 cache, TLB
  • Paging and swapping

Logistics:
The first batch for this course will start on 7th June.
This course will be delivered at Pune IT park.
There will be a five-hour session on both Saturday and Sunday.

Fee:
There is one combined fee for the training and mentoring program: Rs. 25,000/-. There is a special discount of 20% for the first batch, so the fee for the first batch will be Rs. 20,000/-.

Contact us:
Anurag Agarwal: talk2anurag@gmail.com or 9881254401
Anand Mitra: anand.mitra@gmail.com or 9881296791