Overview

Docker is a platform to develop, deploy, and run applications inside containers. A Docker container is similar to a physical container: it lets you pack and hold things, it is portable and can be run locally or on a cloud provider in much the same way, and it provides clear interfaces to access its content.

Data scientists work on complex experiments that need to be portable and reproducible: they need to share results with the rest of the team, and they need to be able to rerun the same experiments and reach the same results.

Docker makes it easy to share machine learning workloads and run them across different environments by encapsulating the code and all of its dependencies in a container. This container abstraction makes the code self-contained and independent of the operating system.

Docker terminology

Before using Docker, there are a few terms you should be familiar with:

  • Image: An immutable (unchangeable) snapshot that contains the source code, libraries, dependencies, tools, and other files needed for an application to run.
  • Container: A virtualized run-time environment where users can isolate applications from the underlying system. These containers are compact, portable units in which you can start up an application quickly and easily.
  • Dockerfile: A specification file containing a list of commands to call when creating a Docker Image.

Why data scientists should use Docker

When writing code for data science or machine learning projects, using Docker alleviates the following concerns:

  • Ensure that the application works the same way in every environment.
  • Avoid the trouble of handling dependencies and installation problems.

In other words, the main advantage Docker provides is standardization: you define the container specification once and can run it as many times as needed. This means that a data science project benefits from the following advantages:

  • Reproducibility: Everyone has the same OS, the same versions of tools, etc.
  • Portability: This means that moving from local development to a super-computing cluster is easy.

Using Docker

Install Docker

To use Docker, you first need to install it by following the official installation instructions for your operating system.
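Once Docker is installed, a quick sanity check (independent of any particular project) is to print the version and run the hello-world test image:

docker --version
docker run hello-world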

Creating a Dockerfile

The next step is to create a Dockerfile that specifies how the application should be packaged and run. A Dockerfile is a text file that contains the instructions to build an image. The Dockerfile specification provides commands such as FROM, COPY, RUN, CMD, etc.

  • FROM: Every Dockerfile must start with the FROM instruction, which specifies the base image.
  • COPY: Copies files from a local system onto the Docker image.
  • ENV: Used to define environment variables.
  • RUN: Instructs Docker to execute a command while building the image. For example, to install a set of requirements, users can add a step RUN pip install -r requirements.txt.
  • WORKDIR: Sets the working directory. For example WORKDIR /app.
  • EXPOSE: Documents the port on which the container listens at runtime (the port still has to be published with the -p flag of docker run).
  • ENTRYPOINT: Defines the executable that always runs when the image is started as a container.
  • CMD: Specifies the default command, or the default arguments passed to the ENTRYPOINT, executed when the container starts; it can be overridden on the docker run command line. See the sketch after this list for how the two interact.
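To illustrate how ENTRYPOINT and CMD work together, here is a minimal sketch (the image and script names are placeholders, not part of the project below): CMD supplies the default arguments to the ENTRYPOINT and can be overridden at run time.

# The executable is fixed; the script to run is only a default
ENTRYPOINT ["python"]
CMD ["main.py"]

With this Dockerfile, docker run my-image runs python main.py, while docker run my-image evaluate.py keeps the ENTRYPOINT and replaces only the CMD, running python evaluate.py instead.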

An example of a Dockerfile that could be used in a data science project:

# Base image
FROM python:3.9
# Working directory
WORKDIR /app
# Copy files to the working directory
COPY . .
# Install requirements
RUN pip install -r requirements.txt
# Command that runs when container starts
CMD ["python", "-u", "/app/main.py"]

Build the image

Now that the Dockerfile is created, the next step is to build a binary artifact from it, i.e. a Docker image. This is needed both to run the Dockerfile's logic and to share the resulting image with other users. If the image is not shared, every user who needs to execute the logic of the Dockerfile will have to rebuild the image. Rebuilding an image can be slow, and if the requirements are not pinned it can result in a slightly different image.

To build an image from the Dockerfile, Docker exposes a build command:

docker build . -t ml-project:0.2

This builds an image and stores it on the local machine. The -t flag sets the image name to “ml-project” and gives it the tag “0.2”.
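To share the image with other users, it is usually pushed to a container registry. A minimal sketch, assuming a hypothetical registry and repository name (replace them with your own):

docker tag ml-project:0.2 registry.example.com/my-team/ml-project:0.2
docker push registry.example.com/my-team/ml-project:0.2

Other users can then docker pull the image instead of rebuilding it from the Dockerfile.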

You can always list all the images visible on the local machine:

docker image list

To delete an image:

docker rmi ...
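Repeated rebuilds tend to leave dangling image layers behind; they can be cleaned up with the prune command:

docker image prune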

Run the container

We can now run our machine learning project by creating an instance of the image as a container.

docker run ml-project:0.2 ...
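In practice the run command is often combined with a few common flags. A minimal sketch, where the container name and port mapping are placeholders chosen for illustration:

docker run --name ml-experiment -d -p 8000:8000 ml-project:0.2

Here -d runs the container in the background, --name gives it a human-readable name, and -p maps a container port to a port on the host.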

By using docker run we get a reproducible execution of the code in any environment that supports Docker. A container runs until its main process ends. To list running containers, use:

docker container list
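By default only running containers are shown; add the -a flag to also include stopped ones:

docker container list -a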

The Docker CLI provides several commands to manage containers. To stop a container based on its ID:

docker stop CONTAINER_ID

To completely destroy a container based on its ID:

docker rm CONTAINER_ID

To stop and delete the container at the same time:

docker rm -f CONTAINER_ID
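For long-running training jobs it is also useful to inspect a container's output. The logs command prints whatever the process writes to stdout and stderr (the -u flag in the CMD above keeps Python output unbuffered so it appears here immediately), and -f follows the output live:

docker logs -f CONTAINER_ID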

Best practices

Handling datasets

Although it’s possible to package a dataset within a Docker image, it’s generally better to keep the image light and mount the data at runtime using a volume, or to access the data directly from cloud storage.
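For example, a local dataset directory can be mounted into the container at run time with the -v flag; both paths below are placeholders:

docker run -v /path/to/local/datasets:/app/data ml-project:0.2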

Ignoring patterns

You can configure your Docker build to avoid accidentally including data or large folders containing training datasets in the image. Docker supports a file called .dockerignore that lets users specify paths and patterns to exclude from the build context, e.g.

# ignore files patterns
*.png
*.jpeg
*.wav
*.rar

# ignore dataset folders
imaging-datasets/

Avoid saving secrets

In machine learning projects, users often deal with datasets stored on S3 or GCS, or need to load data from a data warehouse or a database. It is very important to remember that you should never build your images with passwords, tokens, key codes, certificates, or any other sensitive data baked in, even when the repository and image are private. Secrets should instead be exposed as environment variables, requested from a vault system, or mounted from a volume at runtime.
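For instance, credentials can be passed from the host environment at run time rather than written into the image. A minimal sketch, using AWS variable names and a .env file purely as examples:

# Forward variables from the host environment; their values never end up in the image
docker run -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY ml-project:0.2

# Or load a set of variables from a local file (make sure it is also listed in .dockerignore)
docker run --env-file .env ml-project:0.2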