Talk Dirty (Data) to Us
Predictive models built on dirty data are inaccurate. And inaccurate models lead to wrong courses of action. Thus “Dirty Data” costs businesses millions in lost opportunities. Unless dirty data is “cleaned”, it leads to waste.
Dirty data is data that is inaccurate, incomplete, inconsistent, or contains erroneous information. Manual data entry, poor data validation, and system limitations are its major causes.
Here are some examples of dirty data:
Incorrect data – this occurs when a field value falls outside the valid range of values. For instance, the month field should hold a value from 1 to 12, and an individual’s age should not exceed 120. The stakes can be high: a misplaced decimal point reportedly drove an Indian man to take his own life.
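To make this concrete, here is a minimal range-validation sketch in Python using pandas. The DataFrame and its `month` and `age` columns are hypothetical, purely for illustration:

```python
import pandas as pd

# Hypothetical records containing out-of-range values.
df = pd.DataFrame({
    "month": [1, 5, 13, 12],    # 13 is not a valid month
    "age":   [34, 150, 28, 61], # 150 exceeds a plausible human age
})

# Flag rows whose values fall outside the valid ranges.
invalid = df[~df["month"].between(1, 12) | ~df["age"].between(0, 120)]
print(invalid)
```

Running simple checks like this before modeling surfaces incorrect values early, while they are still cheap to fix.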
Duplicate data – this may occur due to repeated submissions, improper data joins, or user error. The same values end up stored in different places, creating redundancy and inconsistency. Duplicate data also skews some machine learning algorithms.
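A quick deduplication sketch, again assuming a hypothetical pandas DataFrame of customer records:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 102, 101, 103],
    "email": ["a@x.com", "b@x.com", "a@x.com", "c@x.com"],
})

# Count exact duplicate rows before removing them.
print(df.duplicated().sum())  # -> 1

# Keep only the first occurrence of each duplicated row.
deduped = df.drop_duplicates(keep="first")
print(deduped)
```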
Inaccurate data – a data value can be technically correct yet inaccurate within the business context. It is best to check the value against other files or fields to verify its accuracy. A common example is an error in a customer’s address: the address on file may be a real address, but it is tagged to the wrong owner. For numeric fields, outlier detection removes anomalies that would otherwise skew a regression analysis.
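One common way to flag numeric anomalies is the interquartile-range (IQR) rule. This is a sketch of that approach, not the only method, with made-up sales figures:

```python
import pandas as pd

sales = pd.Series([120, 135, 128, 140, 132, 9999])  # 9999 is an entry error

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = sales.quantile(0.25), sales.quantile(0.75)
iqr = q3 - q1
mask = sales.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

clean = sales[mask]       # values kept for the regression
outliers = sales[~mask]   # anomalies to review (here: 9999)
print(outliers)
```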
Business rule violations – data values should follow the business rules of a company’s industry or business context. For instance, a liquid product should have a unit of measure in “liters” or “milliliters”.
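Such rules can be encoded as simple checks. Here is a sketch of the liquid-product rule above, with hypothetical `form` and `unit` columns:

```python
import pandas as pd

products = pd.DataFrame({
    "sku":  ["P1", "P2", "P3"],
    "form": ["liquid", "liquid", "solid"],
    "unit": ["liters", "kg", "kg"],
})

# Business rule: liquid products must be measured in liters or milliliters.
LIQUID_UNITS = {"liters", "milliliters"}
violations = products[(products["form"] == "liquid")
                      & ~products["unit"].isin(LIQUID_UNITS)]
print(violations)  # -> P2 is a liquid measured in "kg"
```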
Incomplete data – these are records with missing values. Several methods can address them: delete the record, impute an average value for the missing data, or task a dedicated data team with researching and filling in the missing values.
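The first two options look like this in pandas, sketched with a hypothetical `order_total` column:

```python
import pandas as pd

df = pd.DataFrame({"order_total": [250.0, None, 310.0, 275.0, None]})

# Option 1: drop records with missing values.
dropped = df.dropna()

# Option 2: impute the column mean for the missing values.
imputed = df.fillna({"order_total": df["order_total"].mean()})
print(imputed)
```

Deletion is safest when missing records are few and random; imputation preserves sample size but can dampen real variation, so the choice depends on the business context.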
As Dataversity points out, accurate data can save lives and increase revenues:
…the Ponemon Institute revealed that 86% of all healthcare practitioners know of an error caused by incorrect patient data. Patient misidentification is also responsible for 35% of denied insurance claims, costing hospitals up to $1.2 million annually.
Admittedly, identifying and cleaning dirty data is time-consuming and labor-intensive. But have you considered getting outside help? We have expert data cleansing teams available for one-off or long-term assignments. Our teams can speed up your ETL (Extract, Transform, Load) tasks, which in turn speeds up your machine learning. Create your competitive advantage by getting predictive models built on cleansed data faster!