How to master data science projects

The global data volume is expected to grow to 175 zettabytes by 2025, according to the International Data Corporation (IDC). That is an almost unimaginably large amount: a single zettabyte is a 1 followed by 21 zeros. Companies are responsible for a large share of these bytes and therefore have more and more data at their disposal that is, in principle, available for analysis. Given the volume and heterogeneity of this data, insights can no longer be obtained just by “looking closely” or through purely statistical evaluations; machine support for employees in data analysis is indispensable. Against this background, data science is becoming increasingly important, and many companies are now aware of its potential. However, there is still uncertainty about what is special about a data science project and what is necessary to carry one out successfully. Reason enough for us to show what makes data science projects so special and how a typical data science project proceeds, and at the same time to give an insight into the work of a data scientist.

What is data science?

Definition: The term data science combines the two words “data” and “science” and can be translated literally as the science of data. In terms of content, it describes the extraction of knowledge from data.

The aim is to generate new insights for users by evaluating company data and thus create added value for corporate management: improving the quality of entrepreneurial decisions and making work processes more efficient.

Data science is an applied, interdisciplinary science that draws on the following fields: mathematics (in particular statistics), computer science and programming, and domain-specific expertise.

Data science has application potential in almost all corporate functions and industries. Typical applications include:

  • Forecasting: prediction of, for example, sales and returns in retail and e-commerce, call volumes in call centers, or incoming goods in logistics
  • Predictive quality: prediction and explanation of defectively produced parts for predictive quality management and the reduction of scrap
  • Predictive maintenance: prediction of machine and component failures to determine the optimal maintenance time and prevent machine downtime
  • Next best offer: targeted prediction of individual customers' potential for additional sales

HOW DOES A DATA SCIENCE PROJECT WORK?

The typical approach of a data science project is discussed below on the basis of the Cross Industry Standard Process for Data Mining, or CRISP-DM for short. CRISP-DM has established itself as by far the best-known and most widely used approach for data science projects and is intended to ensure their quality and success. It can be applied to data science projects across all industries and aims to provide a uniform process model with step-by-step instructions.

The CRISP-DM process is shown in the graphic below and comprises six process steps. It is by no means a one-off, linear sequence, but an iterative process.

The six-part CRISP-DM process for successful data science projects (Source: Own presentation based on Smart Vision Europe, Phases of the CRISP-DM reference model)

STEP 1: BUSINESS UNDERSTANDING — UNDERSTANDING THE BUSINESS

Which problems and questions arise in the company? And can these be answered using data? The aim of the first step is a clearly defined question or project goal. It is important to find out which problem the employees of a company are facing and how this problem can be solved with data. The first phase is therefore all about finding the appropriate use case and defining clear goals and acceptance criteria for the evaluation.

WHAT IS THE RIGHT USE CASE?

With so many options, the choice is often difficult: where to start, and which application is the right one? This task involves certain challenges, but it can be mastered through early cooperation between the specialist department and data scientists and with the help of the right methods. Support from external resources can often be useful here.

In practice, an initial workshop is usually held at the beginning to identify a user-oriented use case with high business potential (such as cost savings, a better customer experience, higher turnover or lower risk). The top priority is: the question and the data basis must go together! Involving data scientists early (whether internal or external) helps here, as they already take potential data sources and their potential into account. This minimizes the risk that companies set unrealistic goals as a result of an incorrect assessment of the data situation. In general, it has proven useful to focus on smaller use cases at the beginning in order to build up experience and achieve quick wins.

Once the right use case has been found, it is also important to determine the key performance indicators (KPIs) that define the success of a data science project. In addition to traditional controlling KPIs such as return on investment (ROI), it is important to think in particular about measuring real business value, for example reducing transportation costs by 4% or improving cross-selling for items X and Y by 15%. Beyond business value, additional goals related to usage and acceptance should also be considered, such as the number of users who actively work with the results compared to the previous system.
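
To make this concrete, here is a minimal, purely illustrative Python sketch of how such KPIs could be tracked. All figures are hypothetical placeholders, not values from a real project.

```python
# Minimal sketch: quantifying business-value and adoption KPIs.
# All figures are hypothetical placeholders.

def relative_improvement(before: float, after: float) -> float:
    """Relative change between a baseline and a new value."""
    return (after - before) / before

# Hypothetical transportation costs before and after the project.
cost_change = relative_improvement(before=250_000, after=240_000)
print(f"Transportation cost change: {cost_change:.1%}")  # -4.0%

# Hypothetical cross-selling revenue for items X and Y.
cross_sell_uplift = relative_improvement(before=80_000, after=92_000)
print(f"Cross-selling uplift: {cross_sell_uplift:.1%}")  # 15.0%

# Adoption: share of target users actively working with the results.
active_users, target_users = 42, 60
print(f"Adoption rate: {active_users / target_users:.0%}")  # 70%
```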

This first step is often underestimated, but is central to the success of any data science project. Because only when it is clear how added value is generated for internal company stakeholders can everyone work towards a common vision.

As part of the first process step, the following questions must therefore be clarified:

— What is the company's problem?

— What are the project requirements?

— How can you ensure that the use case generates added value?

STEP 2: DATA UNDERSTANDING — UNDERSTANDING THE DATA

The data forms the basis of every data science project, and project success depends on it. The aim of this second step is therefore to obtain an overview of the available data and to evaluate its quality.

HOW MUCH DATA IS REQUIRED?

That is the question everyone (including data scientists) would like answered. Although it sounds simple, unfortunately it isn't. The frequently circulated rule of thumb “the more, the better” remains wishful thinking. You may have collected data for decades, but if this was done without a real purpose, your data most likely does not contain the answers to all the questions your company has.

To answer the question in practice, several aspects that affect the required amount of data must be considered, ranging from the use case and the complexity of the problem to be solved to the desired analysis method.
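
One way to approach the question empirically is a learning curve: train the model on increasing fractions of the available data and observe how the validation score develops. The following sketch uses scikit-learn on synthetic data; the data set, model choice and parameters are illustrative assumptions, not recommendations.

```python
# Minimal sketch: estimating data needs via a learning curve.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for a real project data set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

train_sizes, _, test_scores = learning_curve(
    RandomForestClassifier(random_state=0),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # 10% .. 100% of the training data
    cv=5,
)

# If the score is still rising at the largest size, more data would likely
# help; if it has plateaued, data quality becomes the limiting factor.
for size, score in zip(train_sizes, test_scores.mean(axis=1)):
    print(f"{size:5d} samples -> mean CV accuracy {score:.3f}")
```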

However, there is agreement that the quality of the data plays a decisive role. This immediately raises the next burning questions: What is data quality? And what counts as good enough?

HOW GOOD DOES THE DATA QUALITY HAVE TO BE?

Data quality describes how well data sets are suited to their intended applications. In this step, the data scientist therefore checks whether the data provided by the company contains the information necessary to fulfill the project goal and whether it is worthwhile to supplement it with external data sources.

Fortunately, data quality can be assessed against defined criteria. In addition to the accuracy, relevance and completeness of the data, the evaluation criteria also include consistency and availability. Which specific attributes are used to evaluate data quality depends on the context.
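
As an illustration, several of these criteria can be checked with a few lines of pandas. The following sketch uses a hypothetical order table; the column names and checks are examples, not an exhaustive quality audit.

```python
# Minimal sketch: checking simple data quality criteria with pandas.
import pandas as pd

# Hypothetical order data with typical quality problems.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "amount":   [19.99, None, 5.00, -3.50],
    "country":  ["DE", "DE", "de", "FR"],
})

# Completeness: share of non-missing values per column.
print(orders.notna().mean())

# Consistency: non-uniform encoding of categorical values.
print(orders["country"].value_counts())

# Accuracy/plausibility: values outside the valid range.
print(orders[orders["amount"] < 0])

# Uniqueness: duplicated keys point to consolidation problems.
print(orders[orders.duplicated("order_id", keep=False)])
```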

In short, focus not only on the range and quantity of your data, but above all on its quality. If the quality is poor, even the most sophisticated algorithm is useless.

In summary, the following questions must therefore be clarified as part of the second process step:

— What data is currently available?

— What data still needs to be collected?

— What data is necessary to fulfill the project goal?

— What are the data quality problems?

— How can the quality of the data be ensured?

Further information on data quality can be found in our blog post:
What is data quality and how good does my data have to be? ➞

STEP 3: DATA PREPARATION — PREPARING THE DATA

The data preparation phase is used to create a final data set for the subsequent analysis. This step often takes up the most time in a data science project.

This step essentially comprises three parts, all of which are illustrated in the short sketch after the list:

1. Data consolidation: combining data from different, often heterogeneous sources into a single analysis data set.

2. Data cleansing: cleaning the data by correcting errors and removing invalid records.

3. Feature engineering: deriving additional variables (features) from the cleaned data.
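
The following minimal pandas sketch walks through all three parts on hypothetical order and customer tables; the table and column names are invented for the example.

```python
# Minimal sketch of the three data preparation steps with pandas.
import pandas as pd

# Hypothetical source tables.
orders = pd.DataFrame({"customer_id": [1, 2], "amount": ["19.99", "oops"],
                       "order_date": ["2024-01-05", "2024-02-10"]})
customers = pd.DataFrame({"customer_id": [1, 2], "segment": ["B2C", "B2B"]})

# 1. Consolidation: combine the sources into one analysis data set.
df = orders.merge(customers, on="customer_id", how="left")

# 2. Cleansing: enforce correct types and drop invalid records.
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df["order_date"] = pd.to_datetime(df["order_date"])
df = df.dropna(subset=["amount"])

# 3. Feature engineering: derive new variables from the cleaned data.
df["order_month"] = df["order_date"].dt.month
df["is_b2b"] = (df["segment"] == "B2B").astype(int)
print(df)
```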

STEPS 4 AND 5: ANALYSIS AND EVALUATION — CREATING THE MODEL AND EVALUATING SUCCESS

In the analysis and modelling step, appropriate analysis methods are selected and applied to the problem. The goal is to create models that represent the underlying problem with sufficient accuracy. The methods used can range from simple statistical procedures through machine learning algorithms to more complex artificial intelligence solutions for image recognition and speech processing. This phase therefore typically takes the form of a feasibility study, usually referred to as a proof of concept (POC).
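
As a simple illustration of this step, the following sketch trains a machine learning model on synthetic data and measures its quality on held-out data, as one might in a POC. The data, model choice and metric are illustrative assumptions.

```python
# Minimal sketch: training and validating a model for a POC.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared analysis data set.
X, y = make_classification(n_samples=1000, n_features=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Model quality on held-out data, to be compared against the acceptance
# criteria defined in the business understanding phase.
print(f"Test accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```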

It is then important to evaluate the results of the analyses; if the goals are not achieved, the analysis process can be revised and run again. While the evaluation of the model is usually based on model quality, the specialist departments usually evaluate the project's success more broadly. In addition to the resulting business value, the trust and acceptance of employees and decision makers in the analysis results is an important factor for successful use. To achieve this, the decisions of the models must be made transparent and comprehensible, and the results presented in a simple, understandable form. It is therefore important to strike a balance between high accuracy, interpretability, and the goals and KPIs set in the first step (business understanding).
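
One common technique for making model decisions more transparent is permutation importance, which measures how much a model's score degrades when the values of a single feature are shuffled. The sketch below demonstrates it with scikit-learn on synthetic data; it stands in for whatever interpretability method suits the concrete project.

```python
# Minimal sketch: explaining a model via permutation importance.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# How much does shuffling each feature degrade the test score?
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)
for i, importance in enumerate(result.importances_mean):
    print(f"feature_{i}: {importance:.3f}")
```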

The POC creates the basis for deciding on the further course of the project by, ideally, confirming the project concept. However, this step may also reveal that the available data is insufficient to solve the defined problem. In short, after the evaluation phase it is finally time to decide whether a deployment will be carried out.

STEP 6: DEPLOYMENT — INTEGRATION INTO OPERATIONAL PROCESSES

If the analysis results confirm operational added value for the company and its users, the last step is the operationalization of the model, also known as deployment. For this purpose, an individual software solution is developed that permanently integrates the previously created model and the analysis results into the company's IT infrastructure and operational business processes. This step is the key to using data profitably in the long term and achieving the actual added value of data science projects, because the end result is rarely a singular, static analysis, but rather a tool intended to support the company in everyday work. This step is therefore all about developing the existing analytical prototype into your individual data product.

Among other things, the following two questions must be answered:

— How are analysis results provided?

— How is sustained improvement ensured?
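
How analysis results are provided depends on the IT landscape, but one common pattern is to wrap the trained model in a small web service that other systems can call. The following Flask sketch is a minimal illustration of that pattern, assuming a hypothetical model artifact "model.joblib" saved during the modelling phase; it is a sketch, not a production-ready deployment.

```python
# Minimal sketch: serving a trained model as a small web service.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
# "model.joblib" is a hypothetical artifact from the modelling phase.
model = joblib.load("model.joblib")

@app.post("/predict")
def predict():
    # Expects JSON like {"features": [[0.1, 0.2, ...], ...]}.
    features = request.get_json()["features"]
    predictions = model.predict(features).tolist()
    return jsonify({"predictions": predictions})

if __name__ == "__main__":
    app.run(port=8080)
```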

For this step to be successful, a structured approach is also required. The completion of CRISP-DM and a successful POC are thus the starting signal for a subsequent project for systematic operationalization. At this point, the data science project therefore becomes a software development project.

A PRACTICAL EXAMPLE OF THE CRISP-DM PROCESS IN USE

To make the beginning and end, as well as the individual phases, of the analysis process more tangible, consider the following project example: the evaluation of customer feedback for a manufacturer and distributor of household electronics. The CRISP-DM process is explained by developing a tool for evaluating this customer feedback.

CONCLUSION

Regardless of whether you want to predict future sales figures or machine failures, data science enables you as a company to exploit the full potential of your data. However, every data science project involves a certain degree of uncertainty: no matter how experienced the team is, you cannot always predict the outcome and challenges of the project. In a data science project, everything ultimately depends on the data. Yet it is not the sheer mass of data that is decisive, but its quality and significance with regard to the question.

The following points, among others, are therefore decisive for a successful data science project:

— The question and the data basis must match. The data must therefore definitely be checked right at the start.

— A specific goal definition is required so that everyone involved in the project works towards a vision.

— The various tasks of a data science project require various skills and therefore the close cooperation of a team of different people. You can find out exactly which roles and competencies are required in our blog articles.

— An iterative and agile approach is necessary because new insights can be gained at every stage, not just at the end.

If these points are taken into account, data science can generate sustainable added value for your company and unlock previously hidden knowledge.

WE WILL IMPLEMENT YOUR DATA SCIENCE PROJECT!

As data science experts, we at pacemaker are your point of contact for the implementation of your data project. We support you from the idea and the search for a suitable use case to seamless integration into your IT infrastructure and operational business processes.