BossaBox

Envisioning and Problem Formulation

Before beginning a data science investigation, we need to define a problem statement which the data science team can explore; this problem statement can have a significant influence on whether the project is likely to be successful.

Envisioning goals

The main goals of the envisioning process are:

The envisioning process usually entails a series of ‘envisioning’ sessions where the data science team works alongside subject-matter experts to formulate the problem in such a way that there is a shared understanding of the problem domain, a clear goal, and a predefined approach to evaluating a potential solution.

Understanding the problem domain

Generally, before defining a project scope for a data science investigation, we must first understand the problem domain:

However, establishing this understanding can prove difficult, especially for those unfamiliar with the problem domain. To ease this process, we can approach problems in a structured way by taking the following steps:

Once an understanding of the problem domain has been established, it may be necessary to break down the overall problem into smaller, meaningful chunks of work to maintain team focus and ensure a realistic project scope within the given time frame.

Listening to the end user

These problems are complex and need to be understood from a variety of perspectives. It is not uncommon for the stakeholders not to be the end users of the solution framework. In these cases, listening to the actual end users is critical to the success of the project.

The following questions can help guide discussion in understanding the stakeholders’ perspectives:

Envisioning Guidance

During envisioning sessions, the following may prove useful for guiding the discussion. Many of these points are taken directly from, or adapted from, [1] and [2].

Problem Framing

  1. Define the objective in business terms.
  2. How will the solution be used?
  3. What are the current solutions/workarounds (if any)? What work has been done in this area so far? Does this solution need to fit into an existing system?
  4. How should performance be measured?
  5. Is the performance measure aligned with the business objective?
  6. What would be the minimum performance needed to reach the business objective?
  7. Are there any known constraints around non-functional requirements that would have to be taken into account? (e.g., computation times)
  8. Frame this problem (supervised/unsupervised, online/offline, etc.)
  9. Is human expertise available?
  10. How would you solve the problem manually?
  11. Are there any restrictions on the type of approaches which can be used? (e.g., does the solution need to be completely explainable?)
  12. List the assumptions you or others have made so far. Verify these assumptions if possible.
  13. Define some initial hypothesis statements to be explored.
  14. Highlight and discuss any responsible AI concerns if appropriate.
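
Items 4 to 6 above (how performance is measured, whether the measure matches the business objective, and the minimum performance needed) can be made concrete with a small calculation. The sketch below is only an illustration under a simplifying assumption that is not part of the checklist: each action taken on a model prediction has a fixed cost, and each correct prediction yields a fixed value. The function names and all figures are hypothetical.

```python
def break_even_precision(value_per_hit: float, cost_per_action: float) -> float:
    """Smallest fraction of actions that must succeed for value to cover cost.

    Assumes a fixed cost per action and a fixed value per correct prediction,
    which is a simplification made for this illustration.
    """
    if value_per_hit <= 0:
        raise ValueError("value_per_hit must be positive")
    return cost_per_action / value_per_hit


def meets_objective(measured_precision: float,
                    value_per_hit: float,
                    cost_per_action: float) -> bool:
    """Check a measured precision against the break-even threshold."""
    return measured_precision >= break_even_precision(value_per_hit, cost_per_action)


# Hypothetical figures, invented purely for illustration.
VALUE_PER_HIT = 20.0    # value gained when a prediction turns out correct
COST_PER_ACTION = 2.0   # cost of acting on any single prediction

min_precision = break_even_precision(VALUE_PER_HIT, COST_PER_ACTION)
print(f"Minimum precision to break even: {min_precision:.2f}")
print("Candidate model at 0.15 meets objective:",
      meets_objective(0.15, VALUE_PER_HIT, COST_PER_ACTION))
```

Expressing the threshold in business terms in this way makes it easier to check, during envisioning, whether the chosen performance measure is actually aligned with the objective.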

Workflow

  1. What data science skills exist in the organization?
  2. How many data scientists/engineers would be available to work on this project? In what capacity would these resources be available (full-time, part-time, etc.)?
  3. What do the team’s current workflow practices look like? Do they work on the cloud/on-prem? In notebooks/IDE? Is version control used?
  4. How are data, experiments and models currently tracked?
  5. Does the team employ an Agile methodology? How is work tracked?
  6. Are there any ML solutions currently running in production? Who is responsible for maintaining these solutions?
  7. Who would be responsible for maintaining a solution produced during this project?
  8. Are there any restrictions on tooling that must/cannot be used?

Example - a recommendation engine problem

To illustrate how the above process can be applied to a tangible problem domain, consider implementing a recommendation engine for a clothing retailer. This example was, in part, inspired by [3].

Often, the objective may be simply presented, in a form such as “to improve sales”. However, whilst this is ultimately the main goal, we would benefit from being more specific here. Suppose that we were to deploy a solution in November and then observed a December sales surge; how would we distinguish how much of this was due to the new recommendation engine, as opposed to December being a peak buying season?

A better objective, in this case, would be “to drive additional sales by presenting the customer with items that they would not otherwise have purchased without the recommendation”. Here, the inputs that we can control are the choice of items that are presented to each customer, and the order in which they are displayed; considering factors such as how frequently these should change, seasonality, etc.

The data required to evaluate a potential solution in this case would be which recommendations resulted in new sales, and an estimate of a customer’s likelihood of purchasing a specific item without a recommendation. Note that, whilst this data could also be used to build a recommendation engine, it is unlikely to be available before a recommendation system has been implemented, so we will likely have to use an alternative data source to build the model.
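
One possible way to estimate purchases that would not have happened without a recommendation is to compare customers who saw recommendations against a holdout group who did not, over the same period. The sketch below assumes such a holdout group exists, which the text above does not state; the class, the function names, and all figures are hypothetical. Because both groups experience the same December peak, the seasonal effect largely cancels out in the comparison.

```python
from dataclasses import dataclass


@dataclass
class GroupStats:
    """Aggregate outcome for one group of customers over the same period."""
    customers: int
    purchasers: int  # customers who bought at least one item


def conversion_rate(group: GroupStats) -> float:
    """Fraction of customers in the group who made a purchase."""
    if group.customers <= 0:
        raise ValueError("group must contain at least one customer")
    return group.purchasers / group.customers


def incremental_purchasers(treated: GroupStats, holdout: GroupStats) -> float:
    """Estimate purchasers attributable to recommendations.

    The holdout conversion rate stands in for a customer's likelihood of
    purchasing without a recommendation (an assumption of this sketch).
    """
    expected_without = conversion_rate(holdout) * treated.customers
    return treated.purchasers - expected_without


# Hypothetical December figures, invented purely for illustration.
treated = GroupStats(customers=10_000, purchasers=1_200)  # saw recommendations
holdout = GroupStats(customers=2_000, purchasers=200)     # saw none

uplift = conversion_rate(treated) - conversion_rate(holdout)
extra = incremental_purchasers(treated, holdout)
print(f"Conversion uplift: {uplift:.3f}")
print(f"Estimated incremental purchasers: {extra:.0f}")
```

This is a deliberately simple point estimate; a real evaluation would also need to consider sample sizes and whether customers were assigned to the two groups at random.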

We can get an initial idea of how to approach a solution to this problem by considering how it would be solved by a subject-matter expert. Thinking of how a personal stylist may provide a recommendation, they are likely to recommend items based on one or more of the following:

Whilst this list is by no means exhaustive, it provides a good indication of the data that is likely to be useful to us:

We would then be able to use this data to explore:

which can be used to create and rank recommendations. Depending on the project scope, and available data, one or more of these areas could be selected to create hypotheses to be explored by the data science team. Some examples of such hypothesis statements could be:

Next Steps

To ensure clarity and alignment, it is useful to summarize the findings of the envisioning stage, focusing on the proposed detailed scenarios, assumptions and agreed decisions, as well as next steps.

We suggest confirming that you have access to all necessary resources (including data) as a next step before proceeding with data exploration workshops.

Below are the links to the exit document template and to some questions which may be helpful in confirming resource access.

References

Many of the ideas presented here - and much more - were inspired by, and can be found in, the following resources, all of which are highly recommended.

  1. Aurélien Géron’s Machine learning project checklist
  2. Fast.ai’s Data project checklist
  3. Designing great data products. Jeremy Howard, Margit Zwemer and Mike Loukides