What is data quality and why do companies need to pay close attention to it?

Inaccurate data entails problems and costs. We explain what data quality is and how good your data should be.



Sound familiar?

Name: Am Mustergraben 8

Telephone number: 000-0000-000

Purchase date: 32/32/32

...

The list of incorrect data is long, and the problems and costs of poor data quality are an everyday reality in the German corporate landscape: customers who cannot be reached, newsletters sent to the wrong address, incorrect invoices, to name just a few examples. True to the “garbage in, garbage out” principle, bad data leads to bad decisions. Experian Marketing Services found that 73% of German companies believe inaccurate data prevents them from offering an outstanding customer experience.

Good data quality is therefore crucial for a company's day-to-day operations and, above all, a significant success factor, not just for data science projects. But what does data quality actually mean, what are the data quality criteria, how good does data have to be for a data science project, and how can companies measure data quality? We answer these questions in this article.

What is data quality and why is data quality so important?

Data quality definition: In general, data quality refers to how accurate, complete, consistent, and up-to-date data is. High data quality means that the data is free from errors, inconsistencies, and outdated information. Low data quality leads to erroneous insights and poor decisions based on inaccurate or incomplete data.

However, data quality also describes how well data sets are suited to their intended applications. In this context, we also speak of “fitness for use”, i.e. the suitability of the data for its purpose. The quality of data is therefore highly context-dependent: data that is sufficient for one use case may still be inadequate for another.

Why is it so important? Data quality is the basis for trustworthy analyses, business processes and decision-making. In a data science project, everything builds on data as a resource: data from various sources is brought together and then analyzed, and it serves as the input for every analysis model. Even the most sophisticated algorithm is useless if the quality of the data is poor. Although a data science project can fail for many reasons, its success depends primarily on the quality of the available data. Please also read our article “How to master data science projects”.

You are well advised to invest in measures that ensure the high quality of your data. This is decisive for project success, but it matters beyond individual projects as well, because poor data quality can result in significant (follow-up) costs for a company. Let's take a look at that.

Poor data quality costs you more than once

Poor data quality has a name: dirty data. This data is characterized by low scores for consistency, completeness, accuracy and timeliness. Here are a few facts about the monetary impact of these shortcomings:

  • Gartner estimates the average revenue loss of companies due to incorrect data at up to 13 million US dollars (Gartner's Data Quality Market Study). Put differently, the costs of poor data quality amount to 15% to 25% of revenue (according to a study published in the MIT Sloan Management Review).
  • 50% of the IT budget is spent on reprocessing data (Zoominfo).
  • Once a data record has been captured, verifying it costs 1 dollar, cleansing it costs 10 dollars, and leaving it uncorrected costs 100 dollars (Figure 1).

(Figure 1: The 1-10-100 Rule of Bad Prospect Data, ebq)

In addition, poor data quality has consequences that go far beyond financial losses. These include effects on employees' confidence in decisions, customer satisfaction and brand image, lost productivity (e.g. due to the additional time required to prepare data), compliance problems, negative consequences for sales and marketing teams, slowed sales cycles, and more. Today, every business runs on data, so build a solid foundation. And that means clean data quality management.

(Figure 2: Root cause analysis of data errors, Researchgate)

What are the sources of poor data quality?

The sources of poor data quality can be very diverse, as shown in Figure 2. However, manual human data entry usually comes first (see Figure 3).

1. Manual data entry: When people manually enter data, it's easy to make careless errors, typos, or inconsistencies. Even small deviations can accumulate in the data sets and lead to massive quality problems. This is the main reason for low data quality.

2. Outdated data: If records aren't regularly updated and cleaned, they lose accuracy and relevance over time. Outdated data leads to distorted analyses and incorrect business decisions.

3. Data silos: Isolated data sets in different systems make data consistency and integration extremely difficult. Redundant, contradictory data is the result.

4. Lack of data management: Without clear processes, metrics, roles, and responsibilities for data quality management, this critical area remains uncoordinated and neglected. Under the GDPR, public bodies and authorities must always appoint a data protection officer (DPO). Appointing a DPO can also be useful for companies that are not obligated to do so, in order to help ensure appropriate data quality.

5. Complex data sources: The more heterogeneous data sources of different structures and origins have to be integrated, the more complex data cleansing and harmonization becomes.

6. System failure: Even small errors or bugs in databases, interfaces, or ETL processes can lead to significant data errors on a large scale.

7. Lack of workforce qualification: Without training and awareness-raising among employees, data quality remains an underrated issue that is susceptible to human error. Data must become part of the corporate culture.

Overall, organizations must strategically address data quality. That means continuous monitoring, automated rules, and an enlightened data culture.

(Figure 3: Human error types, Researchgate)

How is data quality measured?

In practice, there are a variety of criteria that can be used to assess the quality of data. The most common evaluation criteria include the following:

  • Correctness: Is the data factually consistent with reality?
  • Consistency: Does the data from different systems match?
  • Completeness: Does the data set contain all necessary attributes and values?
  • Uniformity: Is the data available in the appropriate, consistent format?
  • Freedom from redundancy: Are the data sets free of duplicates?
  • Accuracy: Is the data available at the required level of precision?
  • Timeliness: Does the data reflect the current state of affairs?
  • Intelligibility: Is every data set clearly interpretable?
  • Reliability: Is the origin of the data traceable?
  • Relevance: Does the data meet the respective information requirements?
  • Availability: Is the data accessible to authorized users?

The criteria of correctness, completeness, uniformity, accuracy and freedom from redundancy generally relate to the content and structure of the data. They cover a wide range of sources of error that are most commonly associated with poor data quality. These usually include data entry errors, such as typos or duplicate data entries, but also missing or incorrect data values.
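
Several of these criteria can be quantified directly on a data set. The following minimal sketch, assuming a customer table in a pandas DataFrame with hypothetical columns name, phone and purchase_date (mirroring the example from the introduction), computes simple scores for completeness, freedom from redundancy, uniformity and timeliness; the column names and patterns are illustrative assumptions, not a fixed standard.

```python
import re
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Compute simple, illustrative data quality scores for a customer table."""
    report = {}

    # Completeness: share of non-missing cells across the whole data set
    report["completeness"] = float(df.notna().mean().mean())

    # Freedom from redundancy: share of rows that are not exact duplicates
    report["uniqueness"] = float(1.0 - df.duplicated().mean())

    # Uniformity: share of phone numbers that follow one assumed house format
    phone_pattern = re.compile(r"^\+?[0-9][0-9 /-]{6,}$")
    phones = df["phone"].dropna().astype(str)
    report["phone_uniformity"] = float(phones.str.match(phone_pattern).mean())

    # Timeliness/validity: share of purchase dates that parse and are not in the future
    dates = pd.to_datetime(df["purchase_date"], errors="coerce")
    report["date_validity"] = float((dates.notna() & (dates <= pd.Timestamp.now())).mean())

    return report

# Example with obviously dirty rows, similar to the intro example
df = pd.DataFrame({
    "name": ["Max Mustermann", "Am Mustergraben 8", "Erika Musterfrau"],
    "phone": ["+49 30 1234567", "000-0000-000", None],
    "purchase_date": ["2024-05-12", "32/32/32", "2024-06-01"],
})
print(quality_report(df))
```

Scores like these make the abstract criteria tangible and comparable over time, which also prepares the ground for the continuous monitoring discussed later.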

What is good enough data quality?

Of course, the more complete, consistent, and error-free your data, the better. However, it is almost impossible to ensure that all data always meets 100% of the above criteria. In fact, your data doesn't have to be perfect — instead, it must primarily meet the requirements or purpose for which the data is to be used.

How good does the quality of the data have to be for a data science project? Unfortunately, there is no universal answer to this question. Depending on the application, individual aspects determine the required data quality, above all the purpose for which the data is to be used, i.e. the application and the desired modelling method. In principle, the following guidelines should be observed:

1. Purpose-oriented: Data quality must meet specific business requirements and uses. Lower quality may be sufficient for simple operational processes than for critical analyses.

2. Risk assessment: The higher the potential risks and costs of incorrect data, the higher the quality standards must be.

3. Stakeholder acceptance: Data quality should meet the expectations and minimum requirements of all stakeholders and data users.

4. Balance: A balance must be found between acceptable quality levels and reasonable costs for data preparation and data cleansing.

5. Continuous improvement: Quality requirements should be regularly reassessed and gradually increased when appropriate.

Ultimately, there is no blanket “sufficient” level of data quality. It must be defined individually, taking into account usage scenarios, costs, risks and the company's development. Holistic data quality management is the key.

Which errors in data quality can be corrected?

There are different types of data quality errors that need to be handled differently depending on their severity and nature. You guessed it: The more difficult the treatment, the more expensive the correction.

  • Errors that can be corrected with relatively little effort, such as spelling mistakes or duplicate data entries.
  • Errors that can be corrected with increased effort, such as mixed or inconsistent formats (see the sketch after this list).
  • Errors that cannot be corrected, such as invalid, missing or outdated entries.
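
As a minimal sketch of the first two categories, again assuming a pandas DataFrame with made-up column names: exact duplicates and stray whitespace can be removed almost mechanically, while harmonizing mixed date formats already requires an assumption about which format was intended.

```python
import pandas as pd

# Illustrative raw data with a near-duplicate row and mixed date formats
df = pd.DataFrame({
    "customer": ["Max Mustermann", "Max Mustermann ", "Erika Musterfrau"],
    "order_date": ["2024-05-12", "12.05.2024", "2024-06-01"],
})

# Increased effort: harmonize mixed date formats into one canonical representation
# (this assumes the dotted values use German day-first notation)
iso = pd.to_datetime(df["order_date"], format="%Y-%m-%d", errors="coerce")
dotted = pd.to_datetime(df["order_date"], format="%d.%m.%Y", errors="coerce")
df["order_date"] = iso.fillna(dotted)

# Little effort: strip stray whitespace, then remove the now-exact duplicates
df["customer"] = df["customer"].str.strip()
df = df.drop_duplicates()

# Values that match no known format stay missing (NaT) and cannot be reconstructed
print(df)
```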

For data preparation to succeed, data scientists and the business departments need to work together so that it is clear which data is correct and which needs to be corrected. A so-called data dictionary can help ensure that everyone understands what the data contains. A data dictionary is an important tool for monitoring and improving data quality: it is a collection of metadata that describes the structure, content, and use of the data.
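
What such a data dictionary can look like in its simplest form is sketched below; the fields and example columns are illustrative assumptions, not a prescribed schema.

```python
# A minimal, illustrative data dictionary: metadata describing the structure,
# content and intended use of each column in a hypothetical customer table.
data_dictionary = {
    "purchase_date": {
        "description": "Date of the customer's purchase",
        "type": "date",
        "format": "YYYY-MM-DD",
        "required": True,
        "source": "web shop order system",
        "used_by": ["invoicing", "churn model"],
    },
    "phone": {
        "description": "Customer contact number",
        "type": "string",
        "format": "+<country code> <number>",
        "required": False,
        "source": "CRM, manual entry",
        "used_by": ["customer service"],
    },
}
```

In practice, the dictionary becomes the shared reference against which data scientists and business departments agree on checks and corrections.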

So even though some mistakes can be fixed, the better approach is always not to let things get that far in the first place. Our following best practice checklist will help you to subject your data to an initial quality check.

Data quality management best practice checklist

Here is a quick checklist for your data quality management.

Data Strategy and Data Governance

  • Define clear data quality goals and metrics
  • Appoint data controllers and create policies
  • Establish privacy and security policies

Data collection and data integration

  • Validate data as it is entered using verification rules (see the sketch after this list)
  • Eliminate data silos and integrate data sources
  • Make use of Master Data Management (MDM) for consistent master data
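
A minimal sketch of such entry-time verification rules, assuming that new records arrive as plain dictionaries and using made-up field names and patterns: invalid entries are reported immediately instead of silently entering the database.

```python
import re
from datetime import datetime

def validate_record(record: dict) -> list[str]:
    """Return a list of rule violations for a single, newly entered record."""
    errors = []

    # Required fields must be present and non-empty
    for field in ("name", "purchase_date"):
        if not record.get(field):
            errors.append(f"missing required field: {field}")

    # Phone numbers must match an assumed house format and be plausible
    phone = record.get("phone", "")
    if phone:
        digits = re.sub(r"\D", "", phone)
        if not re.fullmatch(r"\+?[0-9][0-9 -]{6,}", phone) or len(set(digits)) < 2:
            errors.append(f"implausible phone number: {phone!r}")

    # Dates must be valid calendar dates in the assumed ISO format
    date = record.get("purchase_date", "")
    if date:
        try:
            datetime.strptime(date, "%Y-%m-%d")
        except ValueError:
            errors.append(f"invalid purchase date: {date!r}")

    return errors

# The intro example would be flagged at entry instead of polluting the database
print(validate_record({"name": "Max Mustermann",
                       "phone": "000-0000-000",
                       "purchase_date": "32/32/32"}))
```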

Data cleansing and data correction

  • Implement deduplication and data reconciliation rules
  • Address erroneous, inconsistent, and outdated data
  • Enrich data with external sources

Continuous monitoring

  • Continuously monitor data quality (see the sketch after this list)
  • Conduct regular data audits
  • Create data quality reports for management (& stakeholders)
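
One simple way to operationalize continuous monitoring is to compute the same quality scores on a schedule and compare them against agreed thresholds. The sketch below assumes the hypothetical quality_report scores from the measurement section above and uses made-up threshold values.

```python
# Illustrative monitoring step: compare current quality scores against agreed
# thresholds and collect warnings for the data quality report.
THRESHOLDS = {
    "completeness": 0.95,
    "uniqueness": 0.99,
    "phone_uniformity": 0.90,
    "date_validity": 0.98,
}

def check_thresholds(report: dict) -> list[str]:
    """Return human-readable warnings for every metric below its threshold."""
    return [
        f"{metric}: {report.get(metric, 0.0):.2%} is below the agreed minimum of {minimum:.0%}"
        for metric, minimum in THRESHOLDS.items()
        if report.get(metric, 0.0) < minimum
    ]

# In practice this would run as a scheduled job (e.g. nightly) and feed a report
warnings = check_thresholds({"completeness": 0.91, "uniqueness": 1.0,
                             "phone_uniformity": 0.85, "date_validity": 0.99})
for line in warnings:
    print(line)
```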

People and processes

  • Train your employees in data quality management
  • Implement error message and resolution processes
  • Automate data quality routines where possible

Tooling and technology

  • Use dedicated data quality tools
  • Integrate data quality rules with existing systems
  • Use data governance and metadata features

Improvement cycle

  • Analyze and prioritize data quality issues
  • Continuously optimize processes and rules
  • Aim for data quality as a core competency

Conclusion

Data is now considered the fourth factor of production alongside land, capital and labor. It should therefore be treated as a critical resource and managed accordingly. Ensuring high data quality requires comprehensive data quality management. Data quality is a management task and by no means an IT issue alone; it is the foundation of the entire data strategy. This requires a range of measures, including both one-off initial steps and activities to be carried out continuously.

This is because problems with data quality not only affect the success of a data science project, they also have far-reaching consequences for the company as a whole. The good news for your data science project, however, is that you don't need a perfect data set, and some errors can be fixed by data scientists as part of data preparation. Still, save yourself the costs and headaches with solid data quality management and our best practice checklist.

Schedule a free initial consultation now

No matter where you currently stand, our team is happy to offer you a free initial consultation. In around 30 minutes, we will look at your challenges and our solution together.