Docker is widely used in software development nowadays and, together with Kubernetes, has become the de facto standard for creating scalable systems in the cloud. Sooner or later, many data scientists need to use Docker to deploy their projects or at least become interested in trying it out. This article will show you how to create a Docker image from your conda environment step-by-step.

We assume you have already installed Docker; apart from that, the article should be pretty self-contained. However, if you have never used Docker before, it may be a good idea to go through the official Docker quickstart first.

Project Skeleton

First, let’s set up a basic project skeleton for which we will develop the Dockerfile later on. The source code, containing the final Dockerfile, is also available on GitHub.

As the project’s python code, we will use a simple hello world program located at src/hello.py:

import click


@click.command()
@click.option("--who", default="world", help="Whom to greet.")
def say_hello(who):
    """ Simple CLI to greet someone! """
    print(f"hello {who}!")


if __name__ == "__main__":
    say_hello()

As you can see, it uses click to create a simple CLI that prints out a greeting.
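
If you want to try it out before building any images, you can run it directly on your machine, assuming click is installed in your currently active environment (for example via the conda environment we define next):

python src/hello.py --who reader
# should print: hello reader!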

Conda allows us to specify our project’s dependencies in a YAML file called environment.yml:

name: my_project
dependencies:
  - python=3.7.6
  - click>=7.1

The only requirements are a modern python version and the click library. If you have never used a conda environment file before, check out our article on managing python dependencies in data science projects to find out why that’s a good idea.

Now, we are only missing the Dockerfile itself and we will go through that one in more detail below. If you want to follow along with the steps, simply create an empty file for now. The project folder should now look like this:

.
├── Dockerfile
├── environment.yml
└── src
    └── hello.py

Dockerfile

To get our project up and running inside a Docker container, we have to complete at least the following steps:

  1. Choose a base image

  2. Set up a python environment with the required libraries

  3. Add the project source code

Choose a Base Image

The first step in creating your Dockerfile is to choose a base image. It will provide a root file system, pre-installed software and some basic configuration. This step already presents us with a multitude of choices — there are over 100,000 images available on Docker Hub! Even if we narrow it down to a python project, there are many options like starting with a vanilla OS image, using the official python image or choosing a more specialized one.

Since many of us data scientists work with Anaconda / Miniconda python distributions, we will choose the official continuumio/miniconda3 image here. At the time of writing, it is based on the debian:buster-slim OS image. Debian-based images are great for beginners since they are widely used and you will find a ton of online resources about them; beyond that, they are generally a good choice for python projects.

The official miniconda3 image already has a miniconda python distribution installed. It also adds conda’s base environment executable directory to the PATH (as can be seen in the Dockerfile), so you can run python related commands directly without knowing where they are located. This will become important later on.
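
If you want to verify this yourself, you can run a throwaway container off the base image (the exact PATH contents may differ slightly between image versions):

docker run --rm continuumio/miniconda3:4.8.2 bash -c 'echo "$PATH"; which python'
# conda's bin directory should be listed first, so python resolves to /opt/conda/bin/python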

For now, let’s choose it as our base image by putting a FROM instruction into our Dockerfile:

FROM continuumio/miniconda3:4.8.2

Note that I chose a specific version tag for the base image (the most recent version at the time of writing). There are some good reasons why we should not simply use the image with the latest tag.

Now, we can build the first version of our project Docker image and tag (-t) it as my_project:latest by executing the following command in the project folder:

docker build -t my_project:latest .

Let’s now run the container to enter the shell and run some commands to inspect what it looks like. In order to use a shell, make sure to set the interactive (-i) and TTY (-t) flags:

docker run -it my_project

To find out which python version is currently installed we run:

python --version

We find out that Python 3.7.6 is installed.

It is also interesting to find out where the python executable lives:

which python

It says /opt/conda/bin/python. If you’re interested in a full list of the installed conda packages and their versions, simply run conda list.
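
If you prefer not to keep an interactive shell open, the same checks can also be run as one-off commands; the container exits as soon as the command finishes:

docker run --rm my_project python --version
docker run --rm my_project which python
docker run --rm my_project conda list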

Actually, we could have found out all of the above by simply running a container off the base image itself. But this way, we already tested building and running our own image.

It’s time to add our requirements to make it more useful.

Set Up a Python Environment With the Required Libraries

On our local machine we can simply run

conda env create

in our project folder to create a conda environment with the desired libraries. Then, we need to activate the environment with

conda activate my_project

before we can actually run our python programs. Doing the same in a Docker image, however, is not as straightforward.

It turns out, we have two options here:

  1. Create a new conda environment and use conda run to run our program as described here.

  2. Update the base environment with our requirements.

I am using the latter option in my daily work for three main reasons:

  • The Dockerfile is easier to read and understand than the one needed for Option 1.

  • If you run a shell in the container, you can directly interact with the project’s python environment without manually activating a conda environment or creating a custom entrypoint that does that for you.

  • If your project runs under the python version of the base image, your image will stay smaller since there is no need to install a second python version. (This is why I pinned the python version to 3.7.6 in the conda environment file above!)

This is how the Dockerfile looks after adding two more commands to create the desired python environment:

FROM continuumio/miniconda3:4.8.2

COPY environment.yml /opt/env/
RUN conda env update -n base -f /opt/env/environment.yml \
    && conda clean -afy

The conda environment file is copied into the folder /opt/env/ and then used to update the conda base environment. The -n flag overrides the environment name given in the file, so instead of creating a new environment called “my_project”, the existing base environment is updated. This is great, since you can still use the environment called “my_project” locally on your system without renaming it to “base”.

Also, conda clean is run to clean up index and package caches to keep our image small. It is important to run the update and clean commands in the same RUN instruction: every RUN creates a new layer in the image, and previous layers are immutable. If we ran conda clean in a separate RUN instruction, the deleted files would no longer be visible, but they would still take up disk space and could still be extracted from the image.

If you want, build the image again, run it and execute conda list inside the container to verify that the correct requirements are now installed.
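
If you prefer a non-interactive check, something like the following should work; the grep filter just narrows down the output, and docker history additionally shows the size of each image layer, which makes the effect of conda clean visible:

docker build -t my_project:latest .
docker run --rm my_project conda list | grep click
docker history my_project:latest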

Add the Project Source Code

Now that our project environment is ready, we can copy the project source code and add a CMD instruction so that our program starts when the container is run without additional arguments:

FROM continuumio/miniconda3:4.8.2

COPY environment.yml /opt/env/
RUN conda env update -n base -f /opt/env/environment.yml \
    && conda clean -afy

WORKDIR /opt/src
COPY src/ /opt/src/
CMD [ "python" , "hello.py" ]

If we build the image again and run

docker run -t my_project

we are finally greeted with “hello world!”. Note that we set the working directory (WORKDIR). That means we can reference our program hello.py directly instead of providing the full path. Also, we start off in the project folder if we run a shell in the container with docker run -it my_project bash.
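
Since the working directory contains our source code, we can also override the default CMD and pass our own arguments when running the container, for example:

docker run --rm my_project python hello.py --who Docker
# should print: hello Docker!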

Great! We were able to do all this by using only six instructions. But are we really done yet? We could certainly use the image above. However, there are some details that should be considered for deployment.

Advanced Considerations

Firstly, in the above image, the python program runs under the root user inside the container. Most of the time, particularly if you deploy a machine learning model, your program does not need root privileges. As a general rule, if no elevated rights are needed, the program should run under a non-root user. This is also stated clearly in the Docker best practices. We will add a non-root user in our final version of the Dockerfile below.

Secondly, the python program runs as PID 1. Normally, in a Linux operating system, init runs as the first process. The PID 1 process has some special responsibilities, for instance forwarding signals. Since most of the python programs we write do not implement signal handling, a SIGTERM (which docker stop sends to gracefully shut down the container) will be ignored by our container. That’s why docker stop resorts to forcefully terminating the container with SIGKILL after ten seconds.

This can be easily verified by printing the greeting in an endless loop every second in the python program above and then trying to stop the container using docker stop (which is what our cluster manager might try to do). You will see that the container ignores the polite request to stop printing "hello world" and will be forcefully removed from the club by the SIGKILL bouncer ten seconds later.
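
As a rough experiment, assuming you have temporarily changed hello.py to print the greeting in an endless loop, the timing could look like this (pid1_test is just a throwaway container name):

docker run -d --name pid1_test my_project
time docker stop pid1_test
# takes roughly ten seconds, because the SIGTERM is ignored and SIGKILL is sent after the timeout
docker rm pid1_test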

This behavior can lead to problems, which is why it is better to handle this case. Again, we have multiple options here. If you are using Docker 1.13 or later and you are able to specify run arguments, you can add the --init flag to your run command:

docker run --init -t my_project

It will use the built-in version of tini as a lightweight replacement for init to run as PID 1.
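
If you repeat the stop experiment from above with the --init flag (again assuming the endless-loop version of hello.py and a throwaway container name), the container should now shut down almost immediately, because tini forwards the SIGTERM to our python process:

docker run -d --init --name pid1_init_test my_project
time docker stop pid1_init_test
# should return almost immediately
docker rm pid1_init_test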

A better alternative is to add tini (or dumb-init) to the image in any case, since you may not be able to set the --init flag in your deployment environment.

Adding a non-root user, installing tini, and setting it as the entrypoint, we arrive at the final version of our Dockerfile:

FROM continuumio/miniconda3:4.8.2

COPY environment.yml /opt/env/
RUN conda env update -n base -f /opt/env/environment.yml \
    && conda install --no-update-deps tini \
    && conda clean -afy

RUN useradd --shell /bin/bash my_user
USER my_user

WORKDIR /opt/src
COPY src/ /opt/src/

ENTRYPOINT [ "tini", "-g", "--" ]
CMD [ "python" , "hello.py" ]

This final version of the Dockerfile including the example project code is available on GitHub.
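
As a final sanity check, you can rebuild the image and confirm that the program still runs and that it no longer runs as root:

docker build -t my_project:latest .
docker run --rm my_project
# should print: hello world!
docker run --rm my_project whoami
# should print: my_user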

That’s it for now, and I hope you found this step-by-step walk-through useful! I’d love to hear about your workflow and your suggestions for improving mine. Just write a mail to mail@haveagreatdata.com.