A container is similar to a virtual machine (VM) and can be used to contain and execute all the software required to run a particular program or set of programs. The container includes an operating system (typically some flavor of Linux), plus any required software installed on top of the OS. It can be run as a self-contained virtual environment, making it easier to reproduce the same analysis on any infrastructure that supports running the container without having to go through the pain of identifying and installing all the software dependencies on your own laptop, cluster, or cloud environment.
Docker: a branded container
Docker is one of several brands of container systems. There are other brands, such as Singularity, but Docker is the most popular and widely used. Sometimes we say "a docker" instead of "a container", much as "xerox" became a verb for "to copy" due to the dominance of the Xerox company. However, docker with a lowercase "d" is also the command-line program that you install on your machine to run Docker containers.
How to build and store containers with images and registries
A container is packaged as an image. Note: This has nothing to do with pictures; here the word "image" is used in its software-specific sense, meaning a special type of file. You know how sometimes when you need to install new software on your computer, the download file is called a "disk image"? That's because the file you download is in a format your operating system treats as if it were a physical disk on your machine. It's the same idea for a Docker image. Another way to distinguish between an image and a container is to think of the image as a snapshot of a container that isn't running.
An image can be distributed through one or more registries, which are repositories where users can store images privately or publicly in the cloud. Docker Hub is one such registry, and it is where Broad teams publish most of their Docker images. There are others, like Dockstore, which is specifically geared toward bioinformatics, and GCR, which is Google's general-purpose container registry for use on Google Cloud.
On a local machine
One way to use Docker is on your laptop. First, you tell the docker program to download a container image (= a file) from a registry (e.g. Docker Hub).
Then you tell it to initialize the container, which is equivalent to booting up a virtual machine. Once the container is running, you can run any software inside it that is installed on its system. For a concrete example, see this tutorial.
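The two steps above can be sketched as a small shell function. This is a minimal illustration, not an official recipe: run_in_container is a hypothetical helper name, and my_repo/my_image:tag is the same placeholder image reference used later in this article. The sketch assumes the docker program is installed and that the image contains the command you ask it to run.

```shell
# Hypothetical helper: download an image, then start a container from it
# and run one command inside that container.
run_in_container() {
  image="$1"   # image reference, e.g. my_repo/my_image:tag (placeholder)
  shift        # everything left in "$@" is the command to run inside

  # Step 1: download the image (a file) from the registry.
  docker pull "$image" || return 1

  # Step 2: start a container from the image and run the command.
  # --rm removes the container once the command exits.
  docker run --rm "$image" "$@"
}

# Example call (placeholder image and command):
#   run_in_container my_repo/my_image:tag echo hello
```

To explore the container interactively instead, a common variation is docker run -it --rm followed by the image reference and a shell such as /bin/bash, assuming the image ships one.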
On a cloud-based machine
The other way to use Docker is on a cloud-based platform, like FireCloud. Workflows in FireCloud use Docker to distribute tools and applications. By referencing Docker images in a workflow configuration, anyone in the workspace can launch the same analysis without worrying about whether they are using the exact same environment or downloading the right applications.
Ensuring security and privacy when working in the cloud
If you're concerned about privacy, you can control access to Docker images through the registry. For example, if you want to use private images hosted on Docker Hub, add "firecloud" as a Collaborator so that it can pull the private image.
Getting a Docker image digest
There are two ways to get the digest for my_repo/my_image:tag. In both cases, you'll work in the terminal app. The result you want will look something like sha256:something_long, where the something_long bit is the digest.
If the image is not on your computer
Run docker pull my_repo/my_image:tag at the prompt. The digest will be displayed in the output on a line of the form Digest: sha256:something_long.
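If you would rather not pick the digest out of the output by eye, the following sketch filters it for you. It relies on docker pull printing a line that starts with "Digest:"; pull_digest is a hypothetical helper name, and my_repo/my_image:tag is a placeholder.

```shell
# Hypothetical helper: pull an image and print only its digest.
pull_digest() {
  # Keep the second field of the line whose first field is "Digest:".
  docker pull "$1" | awk '$1 == "Digest:" { print $2 }'
}

# Example call (placeholder image):
#   pull_digest my_repo/my_image:tag
```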
If the image is on your computer
Run docker inspect at the prompt. Note: The output is more complicated.
~ $ docker inspect my_repo/my_image:tag
...and a lot of other details we don't care about right now.
Note: In the latter case, there are two things that look like sha256:something_long. The one you want is the "RepoDigests" one, not the "Id".
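To avoid scanning the full output, docker inspect accepts a --format option that takes a Go template. The sketch below asks for the first RepoDigests entry only. repo_digest is a hypothetical helper name, and the approach assumes the image was pulled from (or pushed to) a registry; an image that was only built locally may have no RepoDigests entry, in which case the command fails.

```shell
# Hypothetical helper: print the first RepoDigests entry of a local image.
# The value has the form my_repo/my_image@sha256:something_long.
repo_digest() {
  docker inspect --format '{{index .RepoDigests 0}}' "$1"
}

# Example call (placeholder image):
#   repo_digest my_repo/my_image:tag
```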
Once you have the RepoDigests value, you write my_repo/my_image@sha256:something_long in your WDL. Note: The tag isn't there at all; it's been replaced by the digest, which is a more specific identifier.
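The substitution of digest for tag is plain string manipulation, as this sketch shows. pin_image is a hypothetical helper name and the digest values are placeholders; the function only rewrites the reference and does not contact any registry.

```shell
# Hypothetical helper: turn a tagged reference plus a digest into a
# digest-pinned reference.
pin_image() {
  pinned_name="${1%@*}"            # drop any digest already on the reference
  case "${pinned_name##*/}" in     # inspect only the last path component,
    *:*)                           # so a registry port is never mistaken for a tag
      pinned_name="${pinned_name%:*}"   # strip the tag
      ;;
  esac
  printf '%s@%s\n' "$pinned_name" "$2"
}

# Example call (placeholder values):
#   pin_image my_repo/my_image:tag sha256:something_long
#   prints my_repo/my_image@sha256:something_long
```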