I am liveblogging CSI Pune's lecture on Data Management and Data Quality in Business Intelligence, by Ashwin Deokar of SAS R&D Pune.
Huge amounts of data are being generated these days. Different technologies (from databases, to RFID tags and GPS units), different platforms (PCs, servers, cellphones), different vendors. And all this data is often duplicated and inconsistent. All of this data needs to be collected in one place, and cleaned up.
Why? Three reasons:
- Competitive business environment: With better, more granular data, you can increase your profits and reduce costs. For example, Walmart forcing RFID tags on all items supplied to it by its suppliers – and tracking their locations for very accurate and up-to-date inventory control.
- Regulatory and Compliance requirements: e.g. US government has seriously strict data gathering and storage requirements for hospitals (HIPAA). If you can’t generate this data, you go to jail. That certainly reduces your ability to increase profits.
- Adherence to Industry standards: If you can’t produce and consume data in the format that everybody else understands, you can’t play with the big boys
The key areas of study in this field are:
- Data governance: Policies that govern the use of data in an organization. Done usually from the point of view of increasing data security (prevent hackers from getting in, prevent data from leaking out inadvertently), ensuring compliance with regulations, and optimal use of data for organizational growth.
- Data architecture and design: Overall architecture – data storage, ETL process design, BI architecture, etc.
- Database management: Since there are huge quantities of data, making a mistake here will pretty much doom the whole project to failure through overload. Which database? Optimizing the performance. Backup, recovery, integrity management, etc.
- Data security: Who should have access? Which data needs to be kept private?
- Data quality: Lots of work needed to ensure that there is a single version of the truth in the data warehouse. Especially difficult for non-transactional data (i.e. data that doesn't already live in a database). e.g. recognizing that Ashwin Deokar is the same as A.P. Deokar. Need fancy software that will do these transformations on the data (a toy sketch follows this list).
- Data Warehousing and Business Intelligence: What this component does is covered in a previous PuneTech article.
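(Not part of the talk, just to make the "Ashwin Deokar" / "A.P. Deokar" example concrete: a toy match-code sketch in Python. This is my own assumption of how such matching could work, not the algorithm SAS actually uses.)

```python
# Toy "match code" for person names (illustrative sketch only, not SAS's
# actual matching algorithm): collapse a name to first initial + surname so
# that variants like "Ashwin Deokar" and "A.P. Deokar" get the same key.
import re

def name_match_code(name: str) -> str:
    # Lowercase, strip punctuation, split into tokens.
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    if not tokens:
        return ""
    surname = tokens[-1]
    first_initial = tokens[0][0] if len(tokens) > 1 else ""
    return f"{first_initial}|{surname}"

print(name_match_code("Ashwin Deokar"))  # a|deokar
print(name_match_code("A.P. Deokar"))    # a|deokar -> flagged as a potential duplicate
```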
Data Quality. Why this is an important problem:
- 96,000 IRS tax refund cheques did not get delivered because of incorrect addresses.
- An acquiring company, which acquired another company mainly for its customer base, found that the acquisition was vastly overvalued – it got 50% fewer customers than expected, due to duplicates in the database.
- A cable company lost $500,000 because a mislabeled shipment resulted in a cable being laid at a wrong location.
- A man defrauded a CD company by taking their “introductory” offer (of free CDs) over 1600 times, by registering that many different accounts with different addresses. Since he did not really have that many different addresses, he managed to fool their computers by making slightly different addresses using minor changes like extra punctuation marks, fictitious apartment numbers, slightly different spellings, etc. Total damage: $250,000.
There is a process, a combination of automated algorithms and human assistance, to help improve data quality. And it is not just about duplicate data or incorrect data. You also need to worry about missing data, and fetching it from the appropriate “other” sources.
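(A quick illustration, not from the talk: the field names and lookup table below are made up. The idea is that the automated part fills in what it can from a reference source, and everything else goes to a human.)

```python
# Sketch of handling missing data: fill gaps from a reference source where
# possible, and queue the rest for human review. All names/data are hypothetical.
reference_pincodes = {"Pune": "411001", "Mumbai": "400001"}  # assumed lookup table

customers = [
    {"name": "A. Kumar", "city": "Pune", "pincode": None},
    {"name": "S. Rao", "city": "Nagpur", "pincode": None},
]

needs_review = []
for record in customers:
    if record["pincode"] is None:
        filled = reference_pincodes.get(record["city"])
        if filled is not None:
            record["pincode"] = filled   # completed automatically
        else:
            needs_review.append(record)  # escalate to a human

print(customers)      # the Pune record now has a pincode
print(needs_review)   # the Nagpur record still needs manual attention
```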
What do you do?
- Clean up your data by standardizing it using rules – have canonical spellings for names, addresses, etc.
- Use fancy algorithms to detect duplicates that are not obvious from just looking at the strings. For example, “IBM” and “International Business Machines” do not look similar. But if they have the same address, same number of employees, etc., then you can say they are the same. (And you can have thresholds that adjust the sensitivity of this matching process; see the sketch after this list.)
- Use external data to clean up the data for de-duplication. For example, US Postal service publishes a CD of every valid address in the US. Businesses buy that CD and use that to convert all their address data to this standard format. That will result in major de-duplication.
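Putting these steps together, here is a minimal sketch (my own illustration with made-up rules, weights, and threshold, not a SAS tool): standardize the fields first, then score candidate pairs on several fields, and flag pairs whose weighted score crosses an adjustable threshold.

```python
# Illustrative duplicate detection: standardize fields, score pairs on
# multiple fields, and compare against an adjustable threshold.
from difflib import SequenceMatcher

# Hypothetical canonical-form rules (a real system would have far more).
ABBREVIATIONS = {"st": "street", "rd": "road", "intl": "international"}

def standardize(text: str) -> str:
    # Lowercase, drop simple punctuation, expand known abbreviations.
    words = text.lower().replace(".", " ").replace(",", " ").split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, standardize(a), standardize(b)).ratio()

def is_duplicate(rec_a: dict, rec_b: dict, threshold: float = 0.6) -> bool:
    # Name alone may not match ("IBM" vs "International Business Machines"),
    # so the address and employee count are weighted in as corroborating evidence.
    score = (0.4 * similarity(rec_a["name"], rec_b["name"])
             + 0.4 * similarity(rec_a["address"], rec_b["address"])
             + 0.2 * (1.0 if rec_a["employees"] == rec_b["employees"] else 0.0))
    return score >= threshold

a = {"name": "IBM", "address": "1 New Orchard Rd, Armonk", "employees": 280000}
b = {"name": "International Business Machines",
     "address": "1 New Orchard Road, Armonk", "employees": 280000}
print(is_duplicate(a, b))  # True: address and employee count outweigh the dissimilar names
```

Raising or lowering the threshold is the sensitivity knob mentioned above: a higher value means fewer false merges but more missed duplicates.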
SAS provides tools for all the steps in this process. And since it has all the pieces, it can ensure that there is a single meta-data repository for all the steps – which is a huge advantage. SAS has the best ETL tools. It also has offerings in analytics and BI. It has OLAP capabilities, but it really excels in business intelligence applications.
SAS R&D Pune has engineers working on various core products that are used in this area – meta-data, ETL, BI components. It also has a consulting group that helps companies deploy SAS products and use them – and that ends up working on all the parts of the data management / data quality process.