Defining success: A guide to effective problem formulation in data science

8 Nov

Written by Simon Althoff

In data science, the formulation of the problem is a critical step that significantly influences the success of any project. Properly defining the problem not only sets the direction for the entire analytical process but also shapes the choice of methodologies, data collection strategies, and ultimately, the interpretation of results. For data scientists, a well-formulated problem helps in homing in on the right questions to ask, allowing them to design experiments and models that are aligned with business objectives. It ensures that the analytical effort is relevant and impactful, leading to actionable insights rather than merely technical achievements.

For business leaders and other stakeholders, understanding how to identify and formulate problems is equally essential. They must be able to articulate their goals and challenges clearly, enabling data scientists to grasp the nuances of the business context. This collaboration is crucial; when leaders present a vague or misaligned problem, it can lead to wasted resources, misguided analyses, and missed opportunities. Conversely, when both data scientists and business leaders engage in a constructive dialogue about problem formulation, they can uncover deeper insights that address root causes rather than just symptoms.

Moreover, the landscape of data science is continually evolving, with new tools and methodologies emerging regularly. A strong foundational understanding of problem formulation allows both parties to adapt to these changes effectively. By fostering a culture of clear communication and shared understanding, organizations can better harness data to drive strategic decision-making and innovation. Ultimately, the ability to precisely define and address the right problems is what transforms data science from a technical exercise into a powerful driver of business success.

Key aspects of formulating a data science problem

Defining a data science problem requires a careful balance of three key perspectives: business goals, operational considerations, and technical considerations.

Business goals

The first step is to align the data science problem with clear business goals. Understanding the organization’s strategic objectives—whether it’s increasing market share, enhancing customer experience, or optimizing operational efficiency—is crucial. Engaging stakeholders across departments helps ensure that the problem definition accurately reflects the organization's needs. Establishing success metrics at this stage is also essential, as it provides a way to evaluate the effectiveness of the data science solution in achieving these goals.

Operational considerations

Operational considerations focus on the practical implementation of the data science solution. This includes assessing existing workflows, determining who is responsible for current relevant processes, who will have ownership of the finished solution, and identifying any resource constraints. Clear ownership is vital; defining who is responsible for the project and its outcomes can prevent misunderstandings and ensure accountability. Additionally, understanding the organizational culture and readiness for a data-driven approach is critical. If there are gaps in skills or knowledge, it may be necessary to provide training or hire new talent. Considerations around scalability and how the solution will fit into current operations are essential to ensure sustainable implementation.

Technical considerations

Technical considerations encompass the data as well as methodologies and technologies required to analyze the data effectively. This involves selecting appropriate algorithms and techniques for the problem at hand, as well as ensuring that the necessary data infrastructure is in place. Evaluating data quality, availability, and compliance with privacy regulations is crucial, as these factors directly influence the feasibility of the project. Furthermore, understanding the computational resources needed and the integration of the data science solution with existing systems will help streamline workflows and enhance collaboration.

By integrating these perspectives—business goals, operational readiness, and technical feasibility—organizations can define data science problems that are not only relevant and impactful but also executable within their unique contexts.

Understanding the technical considerations

Understanding the technical considerations of data science is crucial for all stakeholders involved, not just data scientists. A high-level grasp of the technical side fosters better communication and collaboration, ensuring that everyone is aligned on the capabilities and limitations of data-driven solutions. This shared understanding can facilitate informed decision-making, allowing teams to set realistic expectations and effectively manage resources.

At the heart of many data science projects are machine learning (ML) models, which can be viewed as function approximations. Essentially, these models learn patterns in data and use these patterns to make predictions or decisions based on new inputs. By treating ML models as approximations of underlying functions, stakeholders can appreciate their role in translating complex data into actionable insights. This perspective helps clarify how models can generalize from training data to real-world applications, emphasizing the importance of proper problem formulation, quality data and appropriate model selection in achieving reliable outcomes.
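To make the function-approximation view concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: `true_function` stands in for a real-world relationship that would normally be unknown, the training data is synthetic, and `knn_predict` is a simple k-nearest-neighbours average rather than a method the article prescribes.

```python
import random

def true_function(x):
    # Hypothetical "real-world" relationship; in practice this is unknown.
    return x * x

# Synthetic training data: noisy observations of the unknown function.
rng = random.Random(42)
train_x = [i / 20 for i in range(-40, 41)]
train_y = [true_function(x) + rng.gauss(0.0, 0.05) for x in train_x]

def knn_predict(x, k=5):
    """Approximate f(x) by averaging the k training outputs nearest to x."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k

x_new = 1.23  # an input that does not appear in the training data
approx = knn_predict(x_new)
print(f"approximation: {approx:.3f}, true value: {true_function(x_new):.3f}")
```

The model never sees the formula, only examples, yet it gives a reasonable answer for a new input. How close it gets depends on how much data there is, how noisy it is, and whether the chosen model suits the problem.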

In summary, a foundational understanding of the technical aspects of data science empowers teams to engage more effectively with data-driven initiatives, ultimately enhancing the success of their projects.

Categories of Data Science problems

To gain a better understanding of the technical side of data science, let us categorize the different types of problems that ML models solve. This list is not exhaustive, and category definitions may vary depending on who you talk to, but it should serve as a good starting point.

Classification problems

Classification is perhaps one of the most classic problems to encounter, especially in machine learning tutorials. The model is given input data and outputs one or several predefined classes that the input belongs to. For instance, the MNIST dataset, which is commonly used in introductions to ML, contains a large collection of pictures of handwritten digits, and the model should classify which digit each picture shows. Other potential classification problems are:

Medical diagnosis (classify positive or negative)
Classify emails as spam or not
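As a rough illustration of the spam example, the sketch below uses a nearest-centroid classifier in plain Python. The two features (number of links and number of exclamation marks per email), the toy training data, and the function names are all hypothetical; a real spam filter would use far richer features and a properly evaluated model.

```python
from math import dist

# Hypothetical toy data: (number of links, number of exclamation marks) per email.
training_data = {
    "spam": [(8, 5), (10, 7), (7, 6)],
    "not spam": [(1, 0), (0, 1), (2, 1)],
}

def centroid(points):
    """Average point of a list of equally sized tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

# "Training" here is just computing one average point per predefined class.
centroids = {label: centroid(points) for label, points in training_data.items()}

def classify(email_features):
    """Return the predefined class whose centroid is closest to the input."""
    return min(centroids, key=lambda label: dist(email_features, centroids[label]))

print(classify((9, 6)))
print(classify((1, 1)))
```

The key property of classification is visible in `training_data`: the set of possible outputs is fixed in advance, and the model can only ever answer with one of those labels.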

Regression problems

Anyone with experience in statistics has probably encountered a regression problem before, linear regression being one of the most fundamental techniques in the field. In general, regression problems involve predicting continuous numerical values. If the goal were instead to predict one of a fixed set of discrete categories, it would be a classification problem. Examples include:

Predicting the price of something (a stock for instance, though it is difficult)
Predicting the temperature, wind speed or any other numerical weather data
Predicting the number of passengers on public transport (this is naturally a discrete number, though depending on the problem, it is usually preferable to define it as a regression problem)
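The passenger example can be sketched with ordinary least squares in plain Python. The data and the single input feature (number of large events in the city on a given day) are invented for illustration, and a straight line is only one of many possible regression models.

```python
# Hypothetical data: large events in the city that day vs. observed passengers.
events = [0, 1, 2, 3, 4]
passengers = [120, 150, 185, 210, 242]

def fit_line(xs, ys):
    """Ordinary least squares for a single input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(events, passengers)

def predict(x):
    """Continuous prediction from the fitted line."""
    return slope * x + intercept

raw_prediction = predict(5)
# The model outputs a continuous value; round it to get a whole passenger count.
print(raw_prediction, round(raw_prediction))
```

The model itself produces a continuous number, and the rounding happens only at the end. This is what it means to treat a naturally discrete quantity as a regression problem.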

Clustering problems

These problems involve grouping data points together based on how similar they are. This sounds similar to classification. The main difference is that there is no clear target value: we have not predefined a set of classes that the data should belong to. Examples include:

Analyze what products are typically bought together
Analyze segments within a market, for instance the different groups that customers belong to
Text embeddings: models that capture the semantic meaning of words and texts by analyzing which words often occur together
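A minimal sketch of the market segmentation idea, using a one-dimensional k-means in plain Python. The customer spend figures and the starting centroids are hypothetical; in practice one would cluster on several features and choose the number of clusters with care.

```python
def k_means_1d(values, centroids, iterations=10):
    """Plain k-means on one-dimensional data with given starting centroids."""
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical monthly spend per customer; no labels are provided.
monthly_spend = [12, 15, 14, 95, 102, 99, 13, 101]
centroids, clusters = k_means_1d(monthly_spend, centroids=[12, 102])
print(centroids)
print(clusters)
```

Unlike the classification case, nothing in the input says which group a customer belongs to. The two segments emerge from the similarity of the data points alone, and it is up to the analyst to interpret what they mean.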

Each of these problem types has its own specific models and methods, and the choice of model depends on the specifics of the problem to be solved. The process of tackling a problem in data science usually involves putting the problem in one of the buckets: classification, regression or clustering. Many problems could potentially be solved in several different ways, which is where experience and an understanding of the limitations of different models and techniques come into play. For instance, predicting the number of passengers on a cruise ship might look like a classification problem. However, setting it up as a regression model and rounding the result to whole numbers is likely more appropriate, because of the sheer number of classes one would otherwise have to define and how that would affect the predictions.

Other problems

There are many other types of problems that are frequently discussed in different situations. These are useful to be familiar with, although one can usually argue that they belong to one of the three main types presented above. Here are some examples:

Time series problems

Problems that involve forecasting data in a time series format
A time series is data that has a temporal component, for instance stock prices over time
Time series have their own type of methods and models for forecasting, but one can make the claim that they generally fall under the regression umbrella
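One way to see why forecasting can be viewed as regression is to turn a series into input and target pairs, where the previous observations are the inputs and the next observation is the value to predict. The sketch below shows only this reshaping step; the helper name `make_lagged_pairs` and the sample series are assumptions for illustration, and dedicated time series methods do considerably more than this.

```python
def make_lagged_pairs(series, n_lags):
    """Turn a series into (previous n_lags values -> next value) pairs."""
    inputs, targets = [], []
    for t in range(n_lags, len(series)):
        inputs.append(series[t - n_lags:t])  # the most recent observations
        targets.append(series[t])            # the value to forecast
    return inputs, targets

# Hypothetical daily measurements.
series = [10, 12, 13, 15, 18]
inputs, targets = make_lagged_pairs(series, n_lags=2)

for features, target in zip(inputs, targets):
    print(features, "->", target)
```

Once the data is in this shape, any regression model can be trained to map the lagged inputs to the numerical target.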

Recommendation systems

Serving recommendations to users based on factors like previous preferences
Such systems can be found for instance on streaming services that recommend movies and shows to you
Can be claimed to fall under the clustering umbrella

Generative AI

Usually seen as separate from “Classical ML”; the use cases do differ, so it can be beneficial to talk about it as its own category
However, LLMs, for instance, can be formulated as sequential token classification, so one can argue that they fall within the classification type
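The token classification framing can be illustrated with a deliberately tiny stand-in for a language model: a bigram counter that, given the current token, picks the most likely next token from a fixed vocabulary. This is only a sketch of the framing under toy assumptions; the corpus and function names are hypothetical, and real LLMs use neural networks conditioned on much longer contexts.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; the vocabulary plays the role of the set of classes.
corpus = "the cat sat on the mat the cat ate".split()

# Count how often each token follows each other token.
next_counts = defaultdict(Counter)
for current_token, next_token in zip(corpus, corpus[1:]):
    next_counts[current_token][next_token] += 1

def predict_next(token):
    """Classify the next token: return the most frequent follower, if any."""
    followers = next_counts.get(token)
    if not followers:
        return None  # token never seen with a follower in the corpus
    return followers.most_common(1)[0][0]

def generate(start, length):
    """Generate text by repeatedly classifying the next token."""
    tokens = [start]
    for _ in range(length):
        nxt = predict_next(tokens[-1])
        if nxt is None:
            break
        tokens.append(nxt)
    return tokens

print(predict_next("the"))
print(generate("sat", 3))
```

Each generation step is a classification over the vocabulary, and the generated text is simply the sequence of predicted classes.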

Optimization

Generally separate from Machine Learning (not counting the dependence on optimization for model training), though often talked about within the same contexts
Awareness of the similarities and differences between optimization and ML is good to have when discussing potential avenues for solving a problem

Again, this list is not exhaustive, but it gives an overall picture of how problems can be defined for an ML model. It is a good framework to use when thinking about possible ML problems, especially in the idea phase. Let's look at an example to make this clearer.

Example

Let’s imagine a hypothetical company that specializes in creating sustainable packaging solutions for businesses. It offers biodegradable, compostable, and recyclable packaging products for various industries, including food, e-commerce, and retail. The company has two business goals:

Increase Market Penetration
Enhance Product Development

To reach these goals, our hypothetical company has identified a few different initiatives:

Targeted marketing

Here we are aware of clustering algorithms that can help us segment the market into categories for better targeted marketing

Analyze market trends

By defining a regression problem we can create models that forecast demand within different markets

Analyze customer feedback

We can use a language model to classify the sentiment regarding certain products, increasing efficiency in feedback analysis

By using the goals as a basis, we can find areas where data science projects can help reach them. Knowledge of the different problem types clearly gives a framework for finding potential use cases. We should not limit ourselves to those three types, however, since that can be restrictive. Optimization and other types of problems may be a better match for your needs and provide clearer opportunities. Even when a problem can be categorized under one of the three main types, framing it as one of the other types may be more intuitive.

Conclusion

In conclusion, correctly formulating a data science problem is essential for achieving meaningful results. This process hinges on three key perspectives: aligning with business goals, considering operational factors, and addressing technical requirements. A strong grasp of technical aspects, regardless of your position, significantly improves communication and enhances problem formulation.

Data science challenges typically fall into three primary categories: classification, regression, and clustering. While other types exist, they often fit within these main categories. By familiarizing yourself with these classifications, while simultaneously not limiting yourself to them, you establish a valuable framework for developing and analyzing projects that can drive a business toward its long-term objectives. Embracing this structured approach will ultimately enable more effective problem-solving and contribute to sustained success in the ever-evolving field of data science.
