Data quality — What is it and how good does my data have to be?

Inaccurate data causes problems and costs. We explain what data quality is and how good your data needs to be.

Münster, Muenster, MUNSTER or MÜNSTER, 0000-0000-00 as a customer contact number, 99/99/99 as the date of purchase... the list of examples of incorrect data is long, and the problems and costs of poor data quality are real: from failing to reach a customer, to addressing the wrong contact in a newsletter, to incorrect invoicing, to name just a few. Decisions made on the basis of bad data cannot be good ones. According to a survey by Experian Marketing Services, 73% of German companies believe that inaccurate data prevents them from providing an outstanding customer experience. Good data quality is therefore decisive for a company's day-to-day operations and, above all, a key success factor for data science projects. But what does data quality actually mean, how good does the data have to be for a data science project, and how can you check the quality of your data? We address these questions in this article.

WHAT IS DATA QUALITY AND WHY IS DATA QUALITY SO IMPORTANT?

Definition: Data quality describes how well a data set is suited to its intended application. In this context, one also speaks of "fitness for use," meaning the suitability of the data. The quality of data is therefore highly context-dependent: data quality that is sufficient for one use case may still be insufficient for another.

And why is it so important? In a data science project, everything is built on one resource: data. Data from a wide variety of sources is brought together and then analyzed, and your data serves as the input for every analysis model. True to the saying "garbage in, garbage out," even the most sophisticated algorithm is useless if the quality of the data is poor. Although a data science project can fail for many reasons, project success often hinges on the quality of the available data.

Investing in measures that ensure data quality is therefore decisive for project success, and more than worthwhile beyond that, because poor data quality can cause significant costs for a company.

THE COSTS OF POOR DATA QUALITY

  • 50% of IT budgets are spent on reprocessing data (Zoominfo).
  • Once a record has been captured, it costs 1 dollar to verify it, 10 dollars to clean it up, and 100 dollars if it remains incorrect (the 1-10-100 rule; Zoominfo).

Beyond financial losses, however, poor data quality has far more wide-ranging consequences. These range from eroded employee confidence in decisions and lower customer satisfaction to productivity losses (e.g., extra time needed to prepare data) and compliance problems.

WHAT ARE THE SOURCES OF POOR DATA QUALITY?

The sources of poor data quality can be very diverse. In most cases, however, the problems originate in the data entry process, whether on the part of employees or customers.

HOW CAN YOU MEASURE DATA QUALITY?

In practice, a variety of criteria can be used to assess the quality of data. The most common evaluation criteria include the following (a minimal programmatic sketch of some of these checks follows the list):

  • Correctness
    Is the data factually consistent with reality?
  • Consistency
    Does data from different systems match?
  • Completeness
    Does the data set contain all necessary attributes and values?
  • Uniformity
    Is the data available in an appropriate, consistent format?
  • Freedom from redundancy
    Are the data sets free of duplicates?
  • Accuracy
    Is the data recorded with sufficient precision?
  • Timeliness
    Does the data reflect the current state of affairs?
  • Intelligibility
    Is every data set clearly interpretable?
  • Reliability
    Is the origin of the data traceable?
  • Relevance
    Does the data meet the respective information requirements?
  • Availability
    Is the data accessible to authorized users?
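
To make these criteria tangible: a few of them, such as completeness, freedom from redundancy, and uniformity, can be checked programmatically. The following is a minimal sketch using pandas; the table, column names, and values are invented for illustration:

```python
import pandas as pd

# Invented customer data containing typical entry errors
df = pd.DataFrame({
    "city":        ["Münster", "MÜNSTER", "Münster", None],
    "customer_no": ["0000-0000-00", "1234-5678-90", "0000-0000-00", "9876-5432-10"],
    "purchase":    ["2023-05-01", "99/99/99", "2023-05-01", "2023-06-15"],
})

# Completeness: share of missing values per attribute
print(df.isna().mean())

# Freedom from redundancy: number of exact duplicate rows
print(df.duplicated().sum())

# Uniformity: how many dates deviate from the expected ISO format?
valid = pd.to_datetime(df["purchase"], format="%Y-%m-%d", errors="coerce")
print(valid.isna().sum(), "rows with a non-uniform or invalid date")

# Consistency: differing spellings that collapse to the same city
print(df["city"].str.casefold().nunique(), "distinct city after case-folding")
```

Such checks do not replace a proper data quality process, but they give a quick first impression of where a data set stands.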

The criteria of correctness, completeness, uniformity, accuracy, and freedom from redundancy generally relate to the content and structure of the data and cover the sources of error most commonly associated with poor data quality. These typically include data entry errors such as typos and duplicate entries, but also missing or incorrect data values.

Using examples, the following chart provides an overview of the errors behind the individual criteria as well as possible causes and countermeasures.

WHAT IS GOOD ENOUGH DATA QUALITY?

Of course, the more complete, consistent, and error-free your data, the better. However, it is nearly impossible to ensure that all data meets the above criteria 100%. In fact, your data does not need to be perfect at all; it needs to meet the requirements of the people and purposes it is used for.

How good does data quality have to be for a data science project? Unfortunately, there is no universal answer to this question. As is often the case, several aspects affect the required data quality, including the purpose for which the data is to be used, the application, and the intended modeling method. It also depends on the type of errors the data contains and the extent to which they can be corrected during data preparation in a data science project.

Which errors in data quality can be corrected?

  • Errors that can be corrected with relatively little effort include duplicate data entries.
  • Errors that can be corrected with increased effort include mixed or deviating formats (both cases are sketched after this list).
  • Errors that cannot be corrected, however, include invalid data, missing entries, or errors caused by swapping input fields.
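
As a rough illustration of the first two cases, again in pandas with invented columns: deduplication is a one-liner, while unifying deviating formats takes noticeably more care.

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["Meyer", "Meyer", "Schulz"],
    "purchase": ["2023-05-01", "2023-05-01", "15.06.2023"],  # mixed date formats
})

# Little effort: remove duplicate data entries
df = df.drop_duplicates()

# More effort: unify deviating formats, e.g. dates written both as
# ISO (YYYY-MM-DD) and German style (DD.MM.YYYY)
iso = pd.to_datetime(df["purchase"], format="%Y-%m-%d", errors="coerce")
german = pd.to_datetime(df["purchase"], format="%d.%m.%Y", errors="coerce")
df["purchase"] = iso.fillna(german)

# Not correctable from the data alone: an invalid or missing value
# stays unknown unless the source system can supply it.
```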

Data quality problems can therefore be resolved afterwards only to varying degrees. To process the data successfully, data scientists and the specialist departments must work together so that it is clear which data is correct and which needs to be corrected. A so-called data dictionary, as sketched below, can help ensure that everyone understands what the data contains.
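
Such a data dictionary can start out very simply, for example as a shared table with one entry per field. A minimal sketch, with fields and rules invented for illustration:

```python
# A minimal data dictionary: one entry per field, agreed between
# data scientists and the specialist department (fields are invented)
data_dictionary = {
    "customer_no": {
        "description": "Unique customer number assigned by the CRM",
        "format": "NNNN-NNNN-NN",
        "source": "CRM system",
        "rule": "must not be the placeholder 0000-0000-00",
    },
    "purchase_date": {
        "description": "Date of the customer's most recent purchase",
        "format": "YYYY-MM-DD (ISO 8601)",
        "source": "order management",
        "rule": "must not lie in the future",
    },
}
```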

So even though some errors can be fixed after the fact, the better approach is always not to let things get that far in the first place. The following checklist is intended to help you subject your data to an initial quality check.

CONCLUSION

Data is now considered the fourth factor of production alongside land, capital, and labor. It should therefore be treated as a critical resource and managed accordingly, if you are not doing so already. Ensuring high data quality requires comprehensive data quality management, because data quality is by no means purely an IT task but a management task. The issue of data quality is a small but important part of an overall data strategy and calls for various measures, both initial one-off steps and activities to be carried out continuously.

In a nutshell, we would therefore like to provide you with the following best practice measures:

  • Make the quality of your data a priority.
  • Automate the capture of your data.
  • Maintain your master data and metadata.
  • Prevent errors instead of just dealing with them.

After all, problems with data quality not only affect the success of a data science project but have far-reaching consequences for the company as a whole. The good news for your data science project, however, is that you do not need a perfect data set. And some mistakes, though far from all (!), can be fixed by data scientists as part of data preparation.