If you are a data scientist or plan on becoming one, you probably know this short sensation of unease in your stomach every time you see somebody on the internet claim that data science is ripe for automation. Just yesterday I read the following question directed at a data scientist in a discussion on Hacker News:

Can you please describe what part of your job CANNOT be automated?

Although it was a serious question (see the context), I feel the question has a threatening ring to it.

Before I give my own answer to this question, let’s first understand where it comes from. What is the thing that is supposed to automate the data science job? The answer is automated machine learning (AutoML). AutoML aims at automating data preprocessing, model selection and hyperparameter tuning. Given an AutoML solution, you can get from a clean dataset to a well-trained machine learning model in a single, fully automated step.
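To make this concrete, here is a minimal sketch of what such a search boils down to: try every combination of preprocessing step, model and hyperparameter, score each on held-out data, and keep the best. Everything in it is an assumption made for illustration (the toy data, the two preprocessing functions, the two model types and the name `auto_ml`); real AutoML tools search far larger spaces with smarter strategies.

```python
import itertools
import statistics

# Made-up toy data for illustration: y is roughly 2*x + 1.
train = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8), (4.0, 9.1), (5.0, 11.0)]
valid = [(0.4, 1.8), (2.6, 6.2), (4.4, 9.8)]

# --- candidate preprocessing steps (hypothetical) ---
def identity(xs, ref):
    return list(xs)

def standardize(xs, ref):
    # Scale with statistics of the training inputs only.
    mu, sigma = statistics.mean(ref), statistics.pstdev(ref)
    return [(x - mu) / sigma for x in xs]

# --- candidate models (hypothetical) ---
def fit_mean(xs, ys, _param):
    mu = statistics.mean(ys)
    return lambda x: mu  # always predicts the training mean

def fit_knn(xs, ys, k):
    def predict(x):
        nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x))[:k]
        return statistics.mean(y for _, y in nearest)
    return predict

def mse(model, xs, ys):
    return statistics.mean((model(x) - y) ** 2 for x, y in zip(xs, ys))

PREPROCESSORS = [identity, standardize]
MODELS = [(fit_mean, None), (fit_knn, 1), (fit_knn, 2), (fit_knn, 3)]

def auto_ml(train, valid):
    """Exhaustively try every (preprocessing, model, hyperparameter) combination."""
    x_tr, y_tr = zip(*train)
    x_va, y_va = zip(*valid)
    results = []
    for prep, (fit, param) in itertools.product(PREPROCESSORS, MODELS):
        model = fit(prep(x_tr, x_tr), y_tr, param)
        score = mse(model, prep(x_va, x_tr), y_va)
        results.append((score, prep.__name__, fit.__name__, param))
    return min(results, key=lambda r: r[0])  # lowest validation error wins

best = auto_ml(train, valid)
print(best)
```

Note what the sketch takes for granted: a clean dataset, a train/validation split, and mean squared error as the right thing to minimize. None of those choices is made by the search itself.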

Why AutoML will not automate data science

The CRISP-DM process consists of six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. It is immediately apparent that the only phase covered by AutoML is the modeling phase. All other phases still need to be done by a human.

A data science project starts with a business problem, not with a clean dataset of features and labels. Typically, the person hiring the data scientist doesn’t have the background required to decide whether a problem can be solved with machine learning, let alone to devise what the solution should look like.

Also, I have yet to encounter a project where the data to be used is clean and immediately usable. In my experience, data is always dirty, and dirty in many different ways (people seem to get very creative when entering data into systems).

Further, someone has to define a suitable error metric for the model training and make an educated evaluation of the result. Given that machine learning has such a broad and diverse field of applications, most machine learning scenarios are non-standard in one way or another, making it very hard to standardize and automate all of the aforementioned tasks.
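A small sketch shows why the choice of metric is not a detail. The scenario and all numbers are hypothetical: a fraud-detection task where a missed fraud case is assumed to cost 100 and a false alarm 1. The function names `accuracy` and `business_cost` are mine, not part of any library.

```python
# Hypothetical costs, chosen only for illustration.
COST_FALSE_NEGATIVE = 100.0  # a missed fraud case
COST_FALSE_POSITIVE = 1.0    # a false alarm that someone has to review

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def business_cost(y_true, y_pred):
    """Total cost of the errors under the assumed cost model above."""
    cost = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 0:
            cost += COST_FALSE_NEGATIVE
        elif t == 0 and p == 1:
            cost += COST_FALSE_POSITIVE
    return cost

# Ten made-up transactions: eight legitimate (0), two fraudulent (1).
y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
model_a = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # never flags anything
model_b = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # catches both frauds, three false alarms

for name, pred in [("A", model_a), ("B", model_b)]:
    print(name, accuracy(y_true, pred), business_cost(y_true, pred))
# A: accuracy 0.8, cost 200.0
# B: accuracy 0.7, cost 3.0
```

Optimizing for accuracy would pick model A, which is useless for the business. Which metric is the right one depends on knowledge of the problem that no automated search has on its own.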

Finally, if the ML model doesn’t behave as intended, one will have a hard time fixing the problems without understanding what is under the hood.

The job of a data scientist is a multi-faceted one. That’s what makes it so attractive to many of its practitioners. If data science consisted of only data preprocessing, model selection and hyperparameter tuning, we probably wouldn’t be afraid to lose our jobs to machines.

This is it. That’s why I think data scientists can rest assured that they will be able to keep their jobs for the foreseeable future, at least until AGI becomes available. But that is a whole different discussion ;).

Side note on AutoML

I actually think AutoML will be welcomed by many data scientists because it automates some of the boring parts of the job. However, as of now it has some downsides:

  • From grid search to random search to black-box optimization, finding the optimal combination of data preprocessing operations, model selection and hyperparameters has become more and more efficient, but AutoML still needs a lot of iterations to arrive at that optimal combination.
  • Training a high number of models to arrive at the optimal setup is very expensive (in terms of actual money), whether you train on premises or in the cloud.
  • In many contexts, it is not important to end up with an optimal set of parameters. An OK set of parameters already provides most of the business value and increasing the model performance by 0.1 percent doesn’t justify doubling the project budget. The real world is not Kaggle.
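
The trade-off in these points can be sketched as a random search that stops as soon as the result is good enough instead of spending the whole budget. This is a toy under stated assumptions: `validation_score` is a made-up stand-in for training and evaluating a model, and the parameter ranges, budget and threshold are arbitrary.

```python
import random

def validation_score(learning_rate, depth):
    """Hypothetical stand-in for an expensive training run; peaks at 1.0."""
    return 1.0 - 10 * (learning_rate - 0.1) ** 2 - 0.005 * (depth - 6) ** 2

def random_search(budget, good_enough, seed=0):
    """Sample random configurations until one is good enough or the budget is spent."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for trial in range(1, budget + 1):
        params = {
            "learning_rate": rng.uniform(0.01, 0.3),
            "depth": rng.randint(2, 10),
        }
        score = validation_score(**params)  # in reality, every call costs money
        if score > best_score:
            best_score, best_params = score, params
        if best_score >= good_enough:
            break  # an OK configuration is enough; stop paying for more trials
    return best_score, best_params, trial

score, params, trials = random_search(budget=200, good_enough=0.95)
print(f"stopped after {trials} trials with score {score:.3f}: {params}")
```

Deciding what `good_enough` should be, and how much budget the last fraction of a percent is worth, is again a business decision rather than something the search can answer for itself.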