The sheer size of the Python package management ecosystem can be daunting not only to beginners but also to Python veterans, as the number of available package managers seems to be ever-increasing. Should you go with the default pip and its requirements.txt file, or should you use Pipenv and a corresponding Pipfile? Maybe you should go with the modern Poetry and a pyproject.toml. Or should you rather use the data-science-focused Conda and an environment.yml file?

Each tool has its own way of doing things, which might or might not fit your workflow. There is no one perfect tool for every use case, and deciding which one fits you best takes some digging into the details. On top of the number of options, you will face the problem that each package manager might have multiple ways of doing the same thing. For example, you can use pip to install packages individually with

pip install <package_name>

while you can also keep your dependencies in a requirements.txt file and run

pip install -r requirements.txt

to install them. As you can see, there are many ways to do package management in Python.

Below I show you the way that works best for me in my data science projects.

My Approach

I use the Anaconda Python distribution and the conda package manager. Conda is not only a package manager; it also manages environments. Each conda environment contains its own set of Python packages in addition to its own version of the Python interpreter.

Each project I work on gets its own conda environment. This way, the base environment is not littered with packages from past projects. It also means that there are never conflicts between the required package versions of different projects. Each environment is lean because it contains only the packages required for a single project. When the project is over, I can remove its dependencies by deleting the entire environment.

I never install packages directly with

conda install <package_name>

Instead, I specify a project’s dependencies in an environment.yml file and create the conda environment from this file. Whenever my dependencies change, I change the file and use it to update the environment. This way, the file always reflects the current state of my working environment. Since this file is part of the project’s git repository and is versioned together with the code, collaborators can easily recreate the environment on their machine and run the code.

This is what an environment.yml looks like:

name: great_data_science_project
channels:
  - defaults
  - conda-forge
dependencies:
  - python=3.8.2
  - pandas>1.0.0
  - jupyter
  - ipython
  - pylint
  - invoke
  - joblib
  - black
  - nb_black
  - pip
  - pip:
      - azureml-sdk==1.4.0

As you can see, the file contains

  • the name of the environment

  • the conda channels to use (What are conda channels?)

  • the environment’s Python version

  • a list of conda dependencies with optional version numbers

  • a list of pip dependencies with optional version numbers

Being able to specify pip dependencies is very handy: it means you can use conda and an environment.yml for package management and still have access to all of PyPI. Generally speaking, you should use packages from the conda channels whenever possible and install a package with pip only when it is not available on any conda channel (Why?).
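When I am not sure whether a package is available on a conda channel, a quick search usually settles it. A minimal sketch using conda's built-in search command, with <package_name> as a placeholder for whatever you are looking for:

conda search -c conda-forge <package_name>

If the search comes up empty on the channels you use, the pip section of the environment.yml is the place for that dependency.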

The ability to specify version numbers is useful if you need a feature from a Python package that was only introduced in a certain version. Another reason to specify a version is to shield yourself from potentially backwards-incompatible changes in future versions of a package. Note how version specification has to be done differently for conda dependencies and for pip dependencies.
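To make the difference concrete, here is a small sketch of both syntaxes side by side; the package names and version numbers are only illustrative. As far as I understand, conda treats a single = as a fuzzy match and == as an exact pin, while the pip section uses the usual pip (PEP 440) specifiers, where an exact pin always needs ==:

dependencies:
  - python=3.8          # conda: a single '=' is a fuzzy match (3.8, 3.8.1, 3.8.2, ...)
  - pandas==1.0.3       # conda: '==' pins an exact version
  - joblib>=0.14        # conda: comparison operators work as well
  - pip:
      - azureml-sdk==1.4.0   # pip: an exact pin always uses '==', a single '=' is not valid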

Managing Environments

To create an environment, you simply run the following command from within the directory containing the environment.yml file. Conda automatically looks for a file with the name environment.yml in the current working directory and uses it to create the environment.

conda env create
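If the file has a different name or lives somewhere else, you can point conda at it explicitly with the -f/--file flag (the path below is just an example):

conda env create -f path/to/environment.yml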

To activate it, run

conda activate great_data_science_project

After the environment has been activated, a command such as

jupyter notebook

will run the corresponding executable (jupyter) from the activated environment.
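A quick way to convince yourself that the executables really come from the environment is to check their path (shown here for Linux or macOS; the exact location depends on where Anaconda is installed):

which jupyter
# something like /home/<user>/anaconda3/envs/great_data_science_project/bin/jupyter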

If you want to activate an environment, but can’t remember its exact name, you can list all environments available on the machine with

conda env list

Whenever your dependencies change, you edit the environment.yml file and run

conda env update --prune

The --prune argument makes conda remove packages that are not required anymore according to the dependency file. When omitted, conda only adds new dependencies to the environment and never removes any.
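You can also spell out the file and the environment name explicitly with the -f and -n flags; as far as I can tell, this is equivalent:

conda env update -n great_data_science_project -f environment.yml --prune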

To remove the whole environment, run

conda env remove -n great_data_science_project

Although conda is a tool with many features and different ways of doing things, the five commands above (create, activate, list, update, and remove) are virtually everything I ever need. The result is a set of cleanly separated environments that always reflect their revision-controlled dependency specifications, which is a big win for reproducibility.

If you need more determinism than this, there are ways to exactly reproduce conda environments. conda-lock, for example, can create lock files for different platforms.
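As a rough sketch of what that looks like (the exact flags and file names may differ between conda-lock versions), you render platform-specific lock files from the same environment.yml and later recreate the environment from them:

# pin the environment for the platforms you care about
conda-lock -f environment.yml -p linux-64 -p osx-64

# recreate the environment from the resulting lock file
conda-lock install -n great_data_science_project conda-lock.yml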

One topic that is not covered in my approach is the separation between dependencies that are only required during development (dev dependencies) and dependencies that are required at runtime. This appears to be a problem without a straightforward solution when using conda.
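One workaround I have seen, though it is not part of the approach described above, is to keep the development-only packages in a second, hypothetical environment-dev.yml with the same environment name and to layer it on top with an additional update call:

conda env update -f environment-dev.yml

It works, but keeping two files consistent by hand is part of why I would not call it straightforward.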

That’s it. That is how I manage Python dependencies. What do you think about my approach? What are you doing differently? Shoot me an email at mail@haveagreatdata.com. I’d love to hear your thoughts!