Data should be used only for purposes to which it is relevant, and it should be accurate, complete, and kept up to date. This is hard to do, so data collectors should give users a way to update their own data. Data is considered high quality if it is fit for its intended uses in operations, decision making, and planning. Despite having a basic understanding of data quality, many people don’t quite grasp what is meant by ‘quality’.
One of the biggest myths about data quality is that it must be completely error free. With the sheer volume of data being collected, getting zero errors is next to impossible. Instead, the data only needs to conform to the standards that have been set for it.
Every step in data collection, such as adapting the data to fit the company’s needs, opens it up to potential errors. Having data that is 100% complete and 100% accurate is not only expensive but also time consuming. With so much data coming in, decisions must be made quickly, and this is why data quality is a delicate balancing act. Data profiling offers a way out of this mess.
Data profiling involves looking at all the information in your database to determine whether it is accurate and complete, and deciding what to do with the entries that are not. For example, if a launch date is entered as 1/7/15, does the system record 1915 or 2015? You may also uncover duplicates and other issues in the collected information. Profiling the data in this way gives us a starting point for making sure that the information we are using is of the best possible quality.
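A profiling pass of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample records, the field names `name` and `launch_date`, and the `profile` helper are all hypothetical, and a real profile would cover many more checks. It counts empty fields, exact duplicate rows, and dates written with a two-digit year.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"name": "Ann Lee", "launch_date": "1/7/15"},
    {"name": "Ann Lee", "launch_date": "1/7/15"},   # exact duplicate
    {"name": "Bob Roy", "launch_date": ""},         # missing date
    {"name": "", "launch_date": "03/02/2014"},      # missing name
]

def profile(rows):
    """Count empty fields, exact duplicate rows, and dates with a two-digit year."""
    missing = Counter()
    for row in rows:
        for field, value in row.items():
            if not value.strip():
                missing[field] += 1

    # Rows with identical field/value pairs are treated as duplicates.
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())

    # A two-character final segment means the century is left to the parser.
    two_digit_years = [
        row["launch_date"]
        for row in rows
        if row["launch_date"] and len(row["launch_date"].rsplit("/", 1)[-1]) == 2
    ]
    return {
        "missing": dict(missing),
        "duplicates": duplicates,
        "two_digit_years": two_digit_years,
    }

report = profile(records)

# How Python itself resolves the ambiguous year from the example above.
parsed_year = datetime.strptime("1/7/15", "%m/%d/%y").year
```

Python's `%y` directive maps 00 to 68 onto 2000 to 2068 and 69 to 99 onto 1969 to 1999, so `parsed_year` is 2015 here. Other systems may pick a different pivot, which is exactly the kind of silent assumption profiling is meant to surface.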
Now that you know the starting point, how do you ensure that information is complete and accurate? What do you do when you find errors or issues? Typically, you can do any of these four things:
- Accept the Error – If it falls within an acceptable standard, you can decide to accept it and move on to the next entry.
- Reject the Error – Sometimes, particularly with data imports, the information is so severely damaged or incorrect that it would be better to simply delete the entry altogether than try to correct it.
- Correct the Error – Misspellings of customer names are a common error that can easily be corrected. If there are variations on a name, you can set one as the “Master” and keep the data consolidated and correct across all the databases.
- Create a Default Value – If you don’t know the value, it can be better to have something there (unknown or n/a) than nothing at all.
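The four options above could be wired together roughly as follows. Treat this as a minimal sketch rather than a recommended rule set: the `resolve` function, the `MASTER_NAMES` lookup, the required fields, and the `region` field are hypothetical names chosen for illustration, and every value is assumed to be a string.

```python
# Hypothetical master list mapping known name variants to one canonical spelling.
MASTER_NAMES = {"jon smith": "John Smith", "john smyth": "John Smith"}

# Hypothetical required fields for an entry to be worth keeping.
REQUIRED_FIELDS = ("name", "email")

def resolve(entry):
    """Return (action, entry), where action is accept, reject, correct, or default."""
    # Reject: every required field is empty, so there is nothing to repair.
    if all(not entry.get(field, "").strip() for field in REQUIRED_FIELDS):
        return "reject", None

    cleaned = dict(entry)
    action = "accept"

    # Correct: replace a known variant with the master spelling.
    key = cleaned.get("name", "").strip().lower()
    if key in MASTER_NAMES:
        cleaned["name"] = MASTER_NAMES[key]
        action = "correct"

    # Default: fill an unknown optional value with a placeholder.
    if not cleaned.get("region", "").strip():
        cleaned["region"] = "unknown"
        if action == "accept":
            action = "default"

    return action, cleaned

action, fixed = resolve({"name": "Jon Smith", "email": "jon@example.com", "region": "EU"})
```

In this sketch a correction takes precedence over a default when both apply to the same entry; which action to report first is a design choice that depends on how the results are audited.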
No single approach can maintain the accuracy and completeness of every type of data for every business. With big data’s appetite for information growing every day, it’s becoming more important than ever to tackle data quality issues head-on. Even though it’s tiresome, it’s worth maintaining data hygiene so that computers can do what they do best: crunch numbers.