Data quality plays a vital role across organizations and is fundamental to any AI or analytics initiative.
Data quality requirements in AI projects differ from conventional ones, but the two are not entirely different. General data quality checks are usually a subset of the checks needed for AI, because AI places additional requirements on data before it is fit for use.
Before AI became the norm, data quality was widely mentioned in reports delivered to internal or external clients. The focus was on checking the most common ‘visible’ traits, such as data completeness and consistency. However, with advances in data analysis tools and the emergence of sophisticated techniques like AI, the scope has widened to include ‘invisible’ traits such as data drift and outliers.
Data Quality in general
In general, data is of high quality when it meets the requirements of its intended use by customers, decision-makers, sub-applications, and processes, and especially for building AI models. A good analogy is the quality of a product produced by a manufacturer. A quality product is not itself the business outcome, but it drives customer satisfaction and contributes to the value and life cycle of the product. Similarly, data quality is an important attribute that drives data value and therefore affects business outcomes such as compliance, customer satisfaction, and decision-making accuracy.
Five key parameters are used to measure data quality in general:
Accuracy: Data must correctly describe the real-world entity or event it represents.
Eligibility: The data must meet the requirements for the intended use.
Completeness: Data should not contain missing values or missing records.
Timeliness: Data must be up to date.
Consistency: Data should be in the expected format, and different references to the same entity should give the same result.
Data Quality for AI
In addition to the five parameters above, data quality for AI involves three further categories of checks:
Business/Domain context-based:
• Column values should be unique, with no duplicate records (for example, Product ID or Transaction ID)
• In some cases the business expects a one-to-one mapping between two columns (Product ID vs XYZ)
• Some numerical columns are expected to always be positive (Sales)
• Some columns are expected to have a specific length (Card number), among other rules
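The business rules above can be sketched as small checks over a list of records. This is a minimal illustration using only the Python standard library, not a prescribed implementation. The column names (`transaction_id`, `product_id`, `sku`, `sales`, `card_number`), the 16-character card length, and the helper function names are all hypothetical; `sku` simply stands in for the unspecified second column in the one-to-one rule.

```python
from collections import defaultdict

def find_duplicates(rows, column):
    """Return the set of values that appear more than once in `column`."""
    seen, dupes = set(), set()
    for row in rows:
        value = row[column]
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes

def is_one_to_one(rows, left, right):
    """True if each `left` value maps to exactly one `right` value and vice versa."""
    forward, backward = defaultdict(set), defaultdict(set)
    for row in rows:
        forward[row[left]].add(row[right])
        backward[row[right]].add(row[left])
    return (all(len(v) == 1 for v in forward.values())
            and all(len(v) == 1 for v in backward.values()))

def non_positive_rows(rows, column):
    """Return indices of rows whose `column` value is not strictly positive."""
    return [i for i, row in enumerate(rows) if row[column] <= 0]

def wrong_length_rows(rows, column, expected_length):
    """Return indices of rows whose `column` value has an unexpected length."""
    return [i for i, row in enumerate(rows)
            if len(str(row[column])) != expected_length]

# Hypothetical sample data with one violation of each row-level rule.
transactions = [
    {"transaction_id": "T1", "product_id": "P1", "sku": "A-100",
     "sales": 25.0, "card_number": "4111111111111111"},
    {"transaction_id": "T2", "product_id": "P2", "sku": "B-200",
     "sales": -5.0, "card_number": "4111111111111111"},
    {"transaction_id": "T2", "product_id": "P1", "sku": "A-100",
     "sales": 12.5, "card_number": "41111111"},
]

duplicate_ids = find_duplicates(transactions, "transaction_id")
mapping_ok = is_one_to_one(transactions, "product_id", "sku")
bad_sales = non_positive_rows(transactions, "sales")
bad_cards = wrong_length_rows(transactions, "card_number", 16)
```

Each helper returns the offending values or row indices rather than a simple pass/fail flag, so the results can be reported back to the data owner for remediation.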
Statistics-based:
• Missing values
• Outliers
• Duplicate records, among others
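The statistical checks listed above might look like the following sketch, which uses only the Python standard library. The article does not say how outliers should be detected; the interquartile-range rule with a factor of 1.5 is an assumption chosen for illustration. The `order_id` and `amount` columns and the function names are also hypothetical.

```python
import statistics

def count_missing(rows, column):
    """Count rows where `column` is absent or None."""
    return sum(1 for row in rows if row.get(column) is None)

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed outlier rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def duplicate_records(rows):
    """Return rows that repeat an earlier row exactly (all columns equal)."""
    seen, dupes = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            dupes.append(row)
        seen.add(key)
    return dupes

# Hypothetical sample: one missing amount, one extreme amount, one repeated row.
orders = [
    {"order_id": 1, "amount": 10},
    {"order_id": 2, "amount": 12},
    {"order_id": 2, "amount": 12},
    {"order_id": 3, "amount": None},
    {"order_id": 4, "amount": 11},
    {"order_id": 5, "amount": 13},
    {"order_id": 6, "amount": 500},
    {"order_id": 7, "amount": 12},
    {"order_id": 8, "amount": 11},
    {"order_id": 9, "amount": 10},
]

missing = count_missing(orders, "amount")
present_amounts = [r["amount"] for r in orders if r["amount"] is not None]
outliers = iqr_outliers(present_amounts)
repeated = duplicate_records(orders)
```

Missing values are filtered out before the outlier check because quartiles cannot be computed over `None`. In practice the order of these checks, and whether duplicates are dropped before computing quartiles, would depend on the data set.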
Logic-based:
• Values in column X cannot be equal to or greater than the values in column Y (for example, Clicks versus Impressions)
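A cross-column rule like the one above can be expressed as a single comparison per row. The sketch below is illustrative only; the function name, the `allow_equal` parameter, and the sample values are assumptions. By default it follows the rule as stated (X must be strictly less than Y), while `allow_equal=True` is offered for data sets where equal values are legitimate.

```python
def cross_column_violations(rows, smaller, larger, allow_equal=False):
    """Return indices of rows where `smaller` is not below `larger`.

    With allow_equal=True, equal values are accepted as valid.
    """
    violations = []
    for i, row in enumerate(rows):
        a, b = row[smaller], row[larger]
        ok = a <= b if allow_equal else a < b
        if not ok:
            violations.append(i)
    return violations

# Hypothetical ad-campaign records.
campaigns = [
    {"clicks": 5, "impressions": 100},
    {"clicks": 120, "impressions": 100},  # more clicks than impressions
    {"clicks": 50, "impressions": 50},    # equal values
]

strict_violations = cross_column_violations(campaigns, "clicks", "impressions")
relaxed_violations = cross_column_violations(
    campaigns, "clicks", "impressions", allow_equal=True
)
```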
Approach to getting the best data for AI:
Most of the time, AI professionals focus on creating state-of-the-art models and overlook the importance of data. Andrew Ng’s data-centric approach highlights how investing in better data can have a greater impact on model performance than hyperparameter tuning. He has emphasized that improving the data deserves more attention than building ever more complex models. Data quality cannot be fundamentally improved just by finding problems and fixing them after the fact. Instead, the whole organization should aim to produce good-quality data right from the beginning.
Incorrect data fed into AI produces wrong predictions. It also hurts the productivity of data scientists, who by a commonly cited estimate spend 80% of their time cleaning and preparing data sets. Organizations should prioritize data management and data quality implementation to minimize this impact. Data quality tools can help by automatically building data quality rules and fixing bad data, which is often a cost-effective way to improve business data quality.
Why is continuous data monitoring required for AI?
Data quality checks are not a one-time activity; data has to be monitored frequently, especially once AI models go into production. Most of the time, we perform data quality checks at the initial stage, remediate the issues found, and build the model. But that is not sufficient. When new data arrives in production for retraining or predictions, it may contain discrepancies that were never seen while building the model. Hence, frequent monitoring is required.
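One simple way to picture this kind of monitoring is to record a profile of a column when the model is built and compare every incoming batch against it. The sketch below is an assumption-laden illustration, not a description of any particular tool: the profile fields, the alert names, the thresholds (`max_missing_increase`, `max_mean_shift`), and the sample values are all hypothetical, and it assumes each batch is non-empty.

```python
import statistics

def build_profile(values):
    """Summarize a numeric column at training time (None marks a missing value)."""
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "mean": statistics.fmean(present),
        "stdev": statistics.stdev(present),
        "missing_rate": 1 - len(present) / len(values),
    }

def check_batch(values, profile, max_missing_increase=0.05, max_mean_shift=3.0):
    """Compare a new batch with the training profile and return alert names."""
    present = [v for v in values if v is not None]
    alerts = []
    missing_rate = 1 - len(present) / len(values)
    if missing_rate > profile["missing_rate"] + max_missing_increase:
        alerts.append("missing_rate")
    if any(v < profile["min"] or v > profile["max"] for v in present):
        alerts.append("out_of_range")
    if present and profile["stdev"] > 0:
        # Shift of the batch mean, measured in training standard deviations.
        shift = abs(statistics.fmean(present) - profile["mean"]) / profile["stdev"]
        if shift > max_mean_shift:
            alerts.append("mean_shift")
    return alerts

# Hypothetical training column and two production batches.
training_values = [20, 22, 21, 23, 19, 20, 22, 21]
profile = build_profile(training_values)

healthy_alerts = check_batch([21, 20, 22, 19], profile)
drifted_alerts = check_batch([40, None, 42, None], profile)
```

A check like this would typically run on a schedule or on every batch before retraining or scoring, with alerts routed to whoever owns the data source.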
Many tools help automate important data quality rules and enforce data integrity. Governance tools for continuous data monitoring tend to significantly improve data accuracy, from 60% to 90%, and help improve AI model performance.
Famous Quotes on Data:
- Data drives speed to value.
- 77% believe data quality hurts their competitive advantage.
- 7% of Analytics models fail to make it to production and amplify business problems.
- Data is Food for AI.