Data Science Use Case Evaluation
This guide provides a three-step process for evaluating and prioritizing data science use cases. First, information is gathered about the projects to enable informed decision-making. Then, obviously bad candidates are delisted. Finally, the remaining use cases are prioritized.
Understanding Use Cases
To understand a use case well enough to judge its feasibility and prioritize it, some information has to be gathered. The questions below facilitate a discussion that aims to turn use cases from vague ideas into concrete project concepts.
Problem Statement
- What is the business problem to be solved?
- How is the problem being solved today? What approaches have been tried in the past to solve the problem? Why did they fail, or why do they need to be improved?
Business Case
- What is the desired outcome of the project in terms of a business metric[1]?
- How does your project influence that business metric?
- What is the order-of-magnitude ROI and cost? State assumptions. Use back-of-envelope math.
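The back-of-envelope math asked for in the last question can be sketched in a few lines of Python. Every number and name below (hours saved, hourly rate, project cost, the two helper functions) is a hypothetical placeholder, not a figure from this guide. The point is only to write the assumptions down explicitly and to aim for an order of magnitude rather than a precise value.

```python
import math

def back_of_envelope_roi(annual_benefit, project_cost, annual_run_cost=0.0, years=1):
    """Simple ROI: (total benefit - total cost) / total cost."""
    total_benefit = annual_benefit * years
    total_cost = project_cost + annual_run_cost * years
    return (total_benefit - total_cost) / total_cost

def order_of_magnitude(value):
    """Round a positive amount down to its power of ten, e.g. 78_000 -> 10_000."""
    return 10 ** math.floor(math.log10(value))

# Hypothetical assumptions, stated explicitly:
hours_saved_per_week = 30    # manual work the solution would remove
hourly_rate = 50.0           # fully loaded cost per hour
weeks_per_year = 52

annual_benefit = hours_saved_per_week * hourly_rate * weeks_per_year
project_cost = 40_000.0      # one-off build cost
annual_run_cost = 10_000.0   # hosting and maintenance per year

roi = back_of_envelope_roi(annual_benefit, project_cost, annual_run_cost, years=2)
print(f"Annual benefit is on the order of {order_of_magnitude(annual_benefit):,}")
print(f"ROI over two years: {roi:.0%}")
```

Changing a single assumption (for example, halving the hours saved) and rerunning the calculation is a quick way to see which estimates the business case is most sensitive to.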
Delivery
- Who will consume this and what form factor will the final product take (report, app, API)?
- What are other possible delivery mechanisms, especially ones that are lighter weight or easier to test first?
- What user training is necessary to ensure adoption?
Success Measures
- How will you know if it’s working as expected, or otherwise get feedback?
- What is your “monitoring” plan, even if it’s manual and subjective?
Required Resources and Support
- What data will be required? Is it readily available in, e.g., the data warehouse, or does it need to be gathered first?
- Who are the subject matter experts and do they have enough free capacity to work on this project?
- What support is required from other parts of the business, e.g., IT?
- Is there enough buy-in in the business for the solution to be tested and potentially adopted?
When is machine learning NOT a good idea?
The following criteria can be used to decide quickly whether to pursue a project idea further or reject it right away.
1) No data (due to budget or access)
If for some reason it cannot be guaranteed that the required data can be made available to the project team (in time), this is a showstopper. The reason could be that the data is not being collected yet (i.e., not available digitally), that provisioning the data for the project is too expensive or time-consuming, or that there is, e.g., no capacity in the BI team for provisioning the data in the foreseeable future.
Note: When supervised machine learning is to be used, the lack of "labels" also falls into this category.
2) A rules-based solution works
If a small, simple set of rules that stays stable over a long period of time solves the problem at hand, this is a software engineering project, not a machine learning project, and should be handled by the corresponding team (i.e., not the data science team). If, on the other hand, the problem requires a large number of fuzzy, complicated rules that need to be updated often, machine learning might be better suited than traditional software.
3) Low ROI for your business
This should be obvious, yet project ideas with low ROI are commonplace in machine learning, one of the main reasons being the one described in 6).
4) No tolerance for mistakes
Machine learning models are statistical models, and even a model that works very well must be expected to make mistakes some of the time. The problem to be solved in the project has to be of a nature that tolerates such occasional mistakes.
Examples:
- Fraud detection: A machine learning model decides whether a case is fraudulent or valid. If a fraudulent case slips through the system, this means increased costs to the business, but the whole premise of the fraud detection system is not invalidated. In fact, an automated system with a higher error rate than a team of humans might still be a good option if the savings from reduced human workload offset the increased costs of missing some fraudulent cases.
- Product classification for customs declaration: If a machine learning model classifies a product into the wrong Harmonized System[2] category, the company might end up breaking the law when shipping the product across country borders.
Note: Machine learning systems can often provide suggestions to humans instead of making final decisions, which keeps them useful even in cases where every single decision must be correct.
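The fraud detection trade-off described above can be made concrete with a rough expected-cost comparison. This is a sketch under invented assumptions: the case volume, fraud rate, miss rates, and costs are hypothetical, the function `annual_cost` is illustrative only, and nothing but missed fraud and review workload is counted.

```python
def annual_cost(cases, fraud_rate, miss_rate, cost_per_missed_fraud, review_cost_per_case):
    """Expected yearly cost = cost of missed fraud + cost of reviewing all cases."""
    missed_fraud = cases * fraud_rate * miss_rate
    return missed_fraud * cost_per_missed_fraud + cases * review_cost_per_case

# Hypothetical numbers for illustration only.
cases = 100_000      # cases per year
fraud_rate = 0.01    # share of cases that are fraudulent

# Human team: misses fewer fraudulent cases, but every review is expensive.
human = annual_cost(cases, fraud_rate, miss_rate=0.05,
                    cost_per_missed_fraud=2_000, review_cost_per_case=3.0)

# Model: misses twice as many fraudulent cases, but each decision is cheap.
model = annual_cost(cases, fraud_rate, miss_rate=0.10,
                    cost_per_missed_fraud=2_000, review_cost_per_case=0.10)

print(f"human team: {human:,.0f}  model: {model:,.0f}")
```

With these made-up inputs the model is cheaper overall despite its higher miss rate. The same calculation does not apply to the customs example, where a single wrong decision can have legal consequences that cannot be averaged out.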
5) No one to maintain it
Depending on the outcome of a project, there can be technical artifacts that need to be maintained long term, e.g., model APIs, batch processing pipelines, etc. Because reality keeps changing, the performance of machine learning models usually needs to be monitored over time, and models need to be retrained on fresh data regularly. Without a plan for who will do the maintenance, it doesn’t make sense to start building a solution.
6) You just want something cool
This point should speak for itself, but it is worth mentioning that having your project ripped apart at the final presentation by a simple question like "What is the ROI / business impact of this project?" after working on the project for months is a disheartening experience.
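The six criteria above can be applied as a simple checklist. The sketch below is one hypothetical way to record the answers and list the reasons for rejecting an idea. The class `UseCase`, its field names, and the function `rejection_reasons` are assumptions made for illustration and are not part of this guide.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class UseCase:
    name: str
    data_available: bool        # data (and labels, if supervised) can be provided in time
    simple_rules_suffice: bool  # a small, stable rule set already solves the problem
    roi_acceptable: bool        # back-of-envelope ROI justifies the effort
    mistakes_tolerable: bool    # occasional wrong predictions are acceptable
    has_maintainer: bool        # someone will monitor and retrain the model
    has_business_goal: bool     # there is a business reason beyond "it is cool"

def rejection_reasons(uc: UseCase) -> List[str]:
    """Return the criteria that speak against the use case (empty list = keep it)."""
    checks = [
        (not uc.data_available, "1) No data"),
        (uc.simple_rules_suffice, "2) A rules-based solution works"),
        (not uc.roi_acceptable, "3) Low ROI for your business"),
        (not uc.mistakes_tolerable, "4) No tolerance for mistakes"),
        (not uc.has_maintainer, "5) No one to maintain it"),
        (not uc.has_business_goal, "6) You just want something cool"),
    ]
    return [reason for failed, reason in checks if failed]

# Hypothetical project idea.
idea = UseCase("intranet chatbot", data_available=False, simple_rules_suffice=False,
               roi_acceptable=True, mistakes_tolerable=True,
               has_maintainer=False, has_business_goal=True)
reasons = rejection_reasons(idea)
print(idea.name, "->", reasons or "keep for prioritization")
```

For criterion 4, `mistakes_tolerable` can reasonably be set to `True` when the model only makes suggestions and a human takes the final decision.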
Prioritization of Use Cases
After unfit projects have been removed using the criteria in the above section, the remaining projects need to be prioritized to determine what will be worked on (first).
Many unquantifiable factors can go into the prioritization of use cases, such as interdependencies between projects or office politics.
Further factors are the applicability of data science methods (is this even a data science project?) and the chance that the selected data science method performs well. There is often no simple, clear-cut answer to the first question, as there might be many different ways to solve a given problem. As to the second, how well a machine learning system works can generally only be determined after it has been built and tested. However, for some use cases one can have an intuition that a machine learning solution is very unlikely to work well.
The following is a list of quantifiable factors that can be used in the prioritization process:
- Reach (How many employees or customers or sales opportunities, etc. will be impacted?)
- Impact (How strongly will something be improved?)
- ROI (in terms of newly generated business, saved costs or saved FTEs)
- Effort / Project Costs
- Chance of creating a satisfactory solution for the business problem
- Confidence (How high is the confidence in the estimated numbers for the above factors?)
One way of using these quantifiable factors is to draw a chart with two axes (e.g., business benefit and effort) and place the projects on it to get a better picture of the trade-offs.
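The factors can also be combined into a rough ranking. The sketch below uses one possible formula, chosen for illustration and not prescribed by this guide: benefit is the product of reach, impact, chance of success, and confidence, and the score is benefit divided by effort. The class `Candidate`, the project names, and all numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    reach: float       # e.g., number of customers or employees affected
    impact: float      # strength of improvement, on a scale chosen by the team
    chance: float      # chance of a satisfactory solution, between 0 and 1
    confidence: float  # confidence in the estimates above, between 0 and 1
    effort: float      # e.g., person-months

    @property
    def benefit(self) -> float:
        # Discount the raw benefit by chance of success and confidence.
        return self.reach * self.impact * self.chance * self.confidence

    @property
    def score(self) -> float:
        return self.benefit / self.effort

# Hypothetical candidates that survived the screening step.
candidates = [
    Candidate("churn prediction", reach=5000, impact=2.0, chance=0.7, confidence=0.8, effort=6),
    Candidate("invoice routing", reach=800, impact=3.0, chance=0.9, confidence=0.9, effort=2),
    Candidate("demand forecast", reach=2000, impact=1.0, chance=0.5, confidence=0.5, effort=8),
]

# Highest benefit per unit of effort first.
ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
for c in ranked:
    print(f"{c.name:18s} benefit={c.benefit:8.1f} effort={c.effort:4.1f} score={c.score:7.1f}")
```

The `benefit` and `effort` values are the two coordinates for the chart described above. In this made-up example the project with the largest benefit does not rank first, because a smaller project delivers more benefit per unit of effort. A score like this is a starting point for discussion and does not capture the unquantifiable factors mentioned earlier.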