See the video version of this Lab Teaching on YouTube.
A brief note — I will use the term “Docker” to refer to a container management system, even though Docker is not the only one. I will use the term “container” to refer to a running instance of a Docker image.
A container typically runs a single process and uses little more memory than that process would on its own, while a virtual machine (VM) runs a full-blown “guest” operating system with virtualized access to host resources through a hypervisor. In general, containers are considered much more lightweight, and they are becoming popular for deploying applications reliably because they are portable, reproducible, and fast to start.
Anaconda is a distribution of Python and R, along with many of their most popular packages for scientific computing. It is a great tool for data scientists and analysts who want a consistent environment for their work, but it is not a container. Containers package an application together with everything it needs, such as libraries and other dependencies, and ship it all as one unit. Thanks to the containerization software, you can rest assured that the application will run quickly and reliably from one computing environment to another.
Docker is useful for scientific computing because it allows you to package your entire scientific computing environment into a single “image” that you can easily share with others. This image can be run on any computer that has Docker installed, regardless of the operating system or other applications that are already installed.
This means that collaborators can reproduce your environment exactly, on a laptop or on a shared server, without installing anything other than Docker.
Our first venture into Docker will be to create an Image (blueprint) and then use that image to create and run a simple container.
FROM ubuntu:20.04
LABEL maintainer="Jordan Matelsky <tutorial@example.com>"
CMD echo "Hello World"
This is a simple Dockerfile that will create an image based on Ubuntu 20.04. It also adds a label to the image with the name and email address of the person who created it. Finally, it runs the command echo "Hello World"
when the container is started.
To build this image, we will use the docker build
command. This command will look for a file called Dockerfile
in the current directory and use the instructions in that file to build the image.
docker build -t hello-world .
The -t
flag lets us specify a name (and optionally a tag) for the image that will be created. In this case, we are using the name hello-world
; since we didn't specify a tag, Docker uses the default tag, latest
. The .
at the end of the command tells Docker to look for the Dockerfile
in the current directory.
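If you want to pin an explicit tag, or build from a Dockerfile with a non-standard name, both can be specified on the command line (a small sketch; Dockerfile.hello is a hypothetical filename):
# Tag the image explicitly (name:tag) and point at a custom Dockerfile (hypothetical name):
docker build -t hello-world:v1 -f Dockerfile.hello .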
You can see a list of all of the images that you have built on your computer by running:
docker image ls
You should see output that looks something like this:
REPOSITORY     TAG       IMAGE ID       CREATED         SIZE
hello-world    latest    9f1b2c3d4e5f   3 minutes ago   72.9MB
ubuntu         20.04     4ab4c602aa5e   3 weeks ago     72.9MB
The first time we build an image, Docker will download the base image (in this case, Ubuntu 20.04) from Docker Hub, a registry of Docker images. This image will be stored on your computer so that you don’t have to download it again the next time you want to build an image.
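Images can take up a fair amount of disk space over time. If you want to clean up, Docker has built-in commands for removing images you no longer need:
# Remove a specific image by name and tag
docker image rm hello-world:latest
# Remove all dangling (untagged) images
docker image prune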
Docker tends to be pretty smart and will only re-run the steps that are necessary to update an image. For example, if we change the CMD
in our Dockerfile and then run docker build
again, Docker will only have to re-run that last step, since the rest of the image is the same.
FROM ubuntu:20.04
LABEL maintainer="Jordan Matelsky <tutorial@example.com>"
CMD echo "Greetings, World!"
Note that just changing the Dockerfile is NOT enough to change the image on disk. If we run a container built from the current image, it will still say “Hello World”. We need to rebuild the image for the changes to take effect.
docker build -t hello-world .
Now, if we run a container from the new image, we will see the updated message.
docker run hello-world
Greetings, World!
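If you ever suspect the cache is handing you a stale layer, you can force Docker to re-run every step from scratch with the --no-cache flag:
docker build --no-cache -t hello-world .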
…and I was serious about containers being isolated. What do you think happens here?
FROM ubuntu:20.04
CMD ls
docker build -t tutorial-isolation .
docker run tutorial-isolation
Did you expect to see the Dockerfile in the ls output? Remember, as far as the container is concerned, it is the only thing that exists on the computer. It doesn't know that there is a Dockerfile in the current directory; in fact (modulo some special cases), it doesn't even know that it's a container.
What if we want it to see files from our host computer?
There are two ways of doing this; we can either mount a directory from our host computer into the container, or we can copy files from our host computer into the container when we build the image.
Let’s think about the first option for a second. If we mount a directory from our host computer into the container, then the container will be able to see the files in that directory. But what happens if we change the files in that directory? Will the container see those changes? What if we delete the files in that directory? Will the container still be able to see them? What if we delete the directory? Will the container still be able to see it?
Now that you’ve thought about it for a second, let’s try it out.
FROM ubuntu:20.04
CMD ls /mnt
docker build -t tutorial-mount .
docker run -v $(pwd):/mnt tutorial-mount
Did you see what you expected?
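As a side note, if you want the container to be able to read your files but not modify them, you can make the mount read-only by appending :ro to the volume flag:
docker run -v $(pwd):/mnt:ro tutorial-mount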
In the last section, we let the container see files from our host machine. But what if we want the image to see files from our host machine? (Think for a moment about what the difference is!)
We can do this by copying files from our host machine into the image when we build it.
FROM ubuntu:20.04
COPY Dockerfile /mnt
CMD ls /mnt
docker build -t tutorial-copy .
docker run tutorial-copy
Did you see what you expected?
Because we COPIED the file into the image when we built it, the container will still be able to see the file even if we delete it from our host machine, and if we delete the file inside the container, it won’t impact the file on our host machine.
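You can convince yourself of this with a quick experiment: delete the copied file inside a throwaway container, then confirm that the Dockerfile on your host is untouched.
# The rm only affects the container's own copy of the file:
docker run --rm tutorial-copy rm /mnt/Dockerfile
ls Dockerfile   # still present on the host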
It’s good to treat a container as a relatively immutable object: If you put all of your analysis code in the image, then anyone who runs a container from your image will get a reproducible output.
But it is also possible to “ssh” into a container (really, you’re just running a shell inside the container) and make changes to the container. This is useful for working in a more powerful virtual environment, debugging, or just poking around.
docker run -it hello-world /bin/bash
We pass the -it
flag to docker run
to tell Docker that we want to run the container in interactive (and TTY) mode. This will give us a shell inside the container.
We then pass the program that we want to run (in this case, bash). This overrides the image's default CMD and becomes the command that runs when the container starts.
root@e0b0b0b0b0b0: echo "I'm in a container!"
You can exit the container by quitting the shell (Ctrl-D) or by running exit
.
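Keep in mind that every docker run creates a new container, and changes you make inside it are not written back to the image. You can list stopped containers and clean them up later, or pass --rm so Docker removes the container automatically when the shell exits:
docker ps -a                                  # list all containers, including stopped ones
docker run --rm -it hello-world /bin/bash     # container is deleted as soon as you exit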
So far, we've learned about three capabilities that will be very useful to us: building images from a Dockerfile (and copying files into them), mounting host directories into a running container, and opening an interactive shell inside a container.
Let’s apply these three capabilities to a real-world problem.
Let’s do a simple experiment — we’ll use a Python script to generate some random numbers and save them to a file.
First, we’ll write a requirements.txt
file. Unlike our usual workflow, we WON’T install the packages in this file into our host environment. Instead, we’ll install them into a container.
# requirements.txt
numpy
WAIT! We should version-pin our dependencies! We can do it manually, like this:
# requirements.txt
numpy==1.23.4
Or we can use the power of Docker!
FROM python:3.9
COPY requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt
docker build -t tutorial-env .
Now we can version-pin using Docker, by asking the container to report the versions of the packages it just installed:
docker run --rm tutorial-env pip freeze -r /tmp/requirements.txt
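If you want to capture those pinned versions on your host, you can redirect the output into a file (requirements.lock.txt is just an illustrative name):
docker run --rm tutorial-env pip freeze -r /tmp/requirements.txt > requirements.lock.txt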
Docker containers are cheap and fast!
Each lab member can now begin using Docker to run their own experiments. It’s common to start off with some exploratory data analysis in a Jupyter notebook. Here’s a command that starts a notebook server in a Docker container with GPU access, and mounts the current directory as a volume inside the container:
docker run --rm -it --gpus all -p 8888:8888 -v $(pwd):/home/jovyan/work jupyter/datascience-notebook
This command will start a Jupyter notebook server in a Docker container, and mount the current directory as a volume inside the container. The --rm
flag tells Docker to remove the container when it exits, and the -it
flag tells Docker to run the container in interactive mode. The --gpus all
flag tells Docker to give the container access to all GPUs on the host machine. The -p 8888:8888
flag tells Docker to forward port 8888 on the host machine to port 8888 on the container. (You may want to use another port if other users are also running notebooks; for example, -p 8889:8888
will let you point to http://myfancyserver:8889
in a browser.)
The -v $(pwd):/home/jovyan/work
flag tells Docker to mount the current directory as a volume inside the container, at the path /home/jovyan/work
. Finally, the jupyter/datascience-notebook
argument tells Docker which image to create the container from.
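One practical note: when it starts, the notebook server prints a URL containing a login token. If you lose that output, you should be able to recover it from the container's logs (the container name is whatever docker ps reports for your notebook container):
docker ps                      # find the notebook container's name or ID
docker logs <container-name>   # re-prints the startup output, including the token URL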
It is possible to connect to a Docker container in Visual Studio Code. This is useful if you want to use Visual Studio Code to edit files in the container, or to use its debugger on code running in the container. To do this, install the Remote - Containers extension. Once the extension is installed, click the green remote indicator in the bottom-left corner of the Visual Studio Code window and choose "Remote-Containers: Open Folder in Container…". You can then select the folder you want to open in the container, and Visual Studio Code will build a Docker image for it and start the container.
Because Docker completely isolates an environment, it is also a perfect tool for running a hyperparameter search. In this example, we’ll write a simple sklearn demo, and then try a few different hyperparameter sets, writing out evaluation results to a text file.
First, create a file called main.py
with the following contents:
# For parsing command line arguments:
import argparse
# For writing out evaluation results:
import json
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
def main():
# Parse command line arguments:
parser = argparse.ArgumentParser()
parser.add_argument("--max_depth", type=int, default=2)
parser.add_argument("--n_estimators", type=int, default=100)
args = parser.parse_args()
# Load the iris dataset:
iris = load_iris()
X = iris.data
y = iris.target
# Split the dataset into training and test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Train a random forest classifier:
clf = RandomForestClassifier(max_depth=args.max_depth, n_estimators=args.n_estimators)
clf.fit(X_train, y_train)
# Evaluate the classifier:
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
# Write out evaluation results:
with open(
f"results/results-depth_{args.max_depth}-ests_{args.n_estimators}.txt", "w"
) as f:
json.dump({
"accuracy": accuracy,
"max_depth": args.max_depth,
"n_estimators": args.n_estimators
}, f)
if __name__ == "__main__":
main()
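Before containerizing anything, it can be reassuring to sanity-check the script in a throwaway Python container, so that nothing gets installed on the host (a sketch; /work is an arbitrary mount point and the hyperparameter values are arbitrary):
docker run --rm -v $(pwd):/work -w /work python:3.10 \
    bash -c "pip install scikit-learn && mkdir -p results && python main.py --max_depth 3 --n_estimators 50"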
Realistically, we could reuse the jupyter/datascience-notebook
image from earlier. But to show how to install custom dependencies, let's look at a slightly contrived example:
Create a file called Dockerfile
with the following contents:
FROM python:3.10
# Install dependencies:
RUN pip install scikit-learn
# Copy the experiment code into the container:
COPY main.py /main.py
# Run the experiment:
CMD ["python", "/main.py"]
Docker runs the image's ENTRYPOINT when the container starts, and appends any extra arguments you pass to docker run. In this case, we want to run the main.py script, so we set the ENTRYPOINT to ["python", "/main.py"]; anything we pass after the image name (for example, --max_depth 4) is handed straight to the script. (If we had used CMD instead, extra arguments would replace the command entirely rather than being appended. And because we set defaults for argparse, we don't need to pass any arguments to main.py in the default case.)
Let’s make sure it works. First, build the Docker image:
docker build -t hyperparam .
Then, run the Docker container:
docker run -it -v $(pwd):/results hyperparam
You should see a new `results-depth_2-ests_100.txt` file in your current directory. If you open it, you should see something like this:
```json
{"accuracy": 0.95, "max_depth": 2, "n_estimators": 100}
```
Let’s run a hyperparameter search. We’ll try a few different values for max_depth
and n_estimators
, and write out the results to a text file.
Create a file called search.sh
with the following contents:
#!/bin/bash
for max_depth in 2 4 6 8 10
do
for n_estimators in 10 20 30 40 50
do
docker run -v $(pwd):/results hyperparam \
--max_depth $max_depth \
--n_estimators $n_estimators
done
done
This script will run the main.py
script with different values for max_depth
and n_estimators
. It will write out the results to a file called results/results-depth_<max_depth>-ests_<n_estimators>.txt
.
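If you created search.sh from scratch, you may need to make it executable first:
chmod +x search.sh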
Let’s run the script:
./search.sh
When it’s done, you should see a bunch of new files in the results
directory.
Note that this script runs the experiments one at a time, in sequence. To speed things up, consider using parallel
or xargs
to run several experiments at once:
for max_depth in 2 4 6 8 10
do
for n_estimators in 10 20 30 40 50
do
echo "docker run -v $(pwd):/results hyperparam --max_depth $max_depth --n_estimators $n_estimators"
done
done | parallel -j 4 # Run 4 experiments in parallel
For more complex job dependency graphs, consider using a DAG job scheduler like frof.
Because everything is running in containers, it’s easy to see what’s running on the server. Just run docker ps
:
docker ps
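If you also want a quick look at how much CPU and memory each container is using, docker stats gives a live, top-like view (press Ctrl-C to exit):
docker stats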
In this post, we looked at how to use Docker to run a machine learning experiment. We used Docker to create a reproducible environment, to run experiments on a remote server, to perform a hyperparameter search, and to see at a glance what is running on that server. This is just a small sample of what you can do with Docker, and while it may feel like there is some extra overhead to using it, you may find that the reproducibility and isolation of containerized environments are well worth the effort of learning a powerful new tool.