Docker tutorial: Custom cloud environments for Jupyter Notebooks

Anton Kovalsky

This is a step-by-step guide for 1) building and publishing a custom Docker image and 2) running a Jupyter Notebook on Terra using a Docker image modified to include additional packages.

Step 1. Clone the Git repository with the base images

First, you need to download all our base images by cloning the Terra GitHub repository. You can grab them all at once.

1.1. Select the green Code button on the GitHub repo page and copy the URL.

Screenshot of GitHub repo page with green Code button in upper right and URL https://github.com/DataBiosphere/terra-docker in the HTTPS tab with copy icon at right

1.2. Open a local terminal and execute the command git clone LINK using the link from the previous step.

Upon executing this command, you’ll see something like this.

Screenshot of terminal with output of the git clone command, including Cloning into 'terra-docker'... remote: Enumerating objects: 42, done. remote: Counting objects: 100% (42/42)

Now, you should have our entire collection of Docker base images on your local machine in a new directory called terra-docker. Inside this directory, you should see a folder terra-jupyter-r. This is the image we will modify in this tutorial.

Step 2. Modify a Docker file to meet your needs

The next step is to modify one of the base Docker files and “build an instance” of your desired Docker image (this involves just one command but can take your computer some time to accomplish).

2.1. Find the folder terra-jupyter-r (by typing in “terra-jupyter-r” in your Finder search bar) and open the Docker file (conveniently called Dockerfile) in your favorite text editor.

Screenshot of Finder window showing Dockerfile in the terra-jupyter-r directory with the option to open with Sublime selected.

If you scroll through this file, at the bottom, you should see a list of R packages, mostly installed with BiocManager. This is where you will add a new package to create your custom Docker image.

2.2. Under the line containing && R -e 'BiocManager::install(c( \ add a new line, "edgeR", \ (match the formatting of the existing entries: the package name is quoted, comma-separated, and the line ends with a backslash so the multi-line command stays valid). This will add the edgeR package - a popular Bioconductor package for the analysis of digital gene expression data - to your Docker image.

Screen capture of finding the line containing the code R -e 'BiocManager::install(c( using the search function in the Edit menu, adding a new line with the code edgeR, and saving the Dockerfile by selecting Save from the File menu.

2.3. Once you add this to the code, just click Save! No need to Save as - you shouldn't rename the file in any way.
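After the edit, the tail of your Dockerfile should look something like the sketch below. Note this is illustrative only: the other package names shown here stand in for whatever your base image actually installs, and only the "edgeR" line is the one you added.

```dockerfile
# ...earlier layers of the base image...
RUN R -e 'install.packages("BiocManager")' \
 && R -e 'BiocManager::install(c( \
      "edgeR", \
      "limma", \
      "GenomicFeatures"))'
```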

Step 3. Remove half-finished Docker builds on your machine

Before you build!

You are almost ready to build and push your custom Docker image! Before you execute the build command, you may need to remove any half-finished Docker builds from your machine and set up your DockerHub or Google Container Registry (GCR).

We'll walk you through these steps, but if you have some Docker experience, you might not need to worry about them and can skip to Step 5. Build and push your custom Docker image (assuming you already have a Docker repository with a name and tag matching the image you are about to build).

If you have never used Docker images on the machine you're using for this exercise, you probably don't need to do this part. But if you have experimented with Docker before, you may need to follow the pruning steps below. If you skip these steps and run into trouble down the road, come back to see if this helps when troubleshooting.

3.1. In your local terminal, execute the command docker image ls to see whether there are any other images on your machine. Conveniently, this command can be executed from any directory. If you come up with an empty list, skip to Step 4. Set a destination for your Docker image.

3.2. If your list ISN'T empty (and you don’t need the images listed), execute the following command.

docker system prune -a

3.3. Execute the docker image ls command again to check that the pruning worked. Now the list should be empty.

Step 4. Set a destination for your Docker image

You must set up a destination for your Docker image, so there is a place to push it to.

Where to store your image

Terra accepts Docker images stored in the following registries:
  - Google Cloud Container Registry (GCR)
  - GitHub Container Registry (GHCR)
  - DockerHub

The advantage of using GCR is the ability to use private buckets. DockerHub users are limited to public repositories, while GCR buckets can give Terra convenient access to private resources.

At this time, Quay is not a supported registry for custom cloud environments. You can, however, use Quay images for workflow submissions.

Note: It's important to put in the same image name (and tag) you intend to use in your build command.

Follow the instructions below for setting up the destination for your image using either DockerHub or Google Container Registry (GCR).

4.1. If you haven't already, sign up for a Docker account and install Docker locally by following the installation instructions for Mac, Windows, or Linux.

4.2. Go to your DockerHub account and Create Repository.

Screencapture of steps 4.2 through 4.4 in DockerHub.

4.3. Give your repository a Name and make sure the Visibility is set to Public so Terra can access your Docker image.

4.4. Create your repository by clicking on the blue Create Repository button.

4.5. If you're using GCR instead, create a bucket in Google Cloud Storage.

4.6. Give Terra access to a private GCR bucket by adding your individual, personal Terra group as a member of the bucket.

Screenshot of bucket in storage browser in GCP console with arrow pointing to the add members button at bottom right

Why we recommend using personal Terra groups

Terra groups are a way to harness Terra's security structure and avoid granting permission to an individual user ID while keeping members easy to identify. To learn how and why to make a personal Terra group to access external resources, see Best practices for accessing external resources.

Alternatively, if you want a group of collaborators to have access to your private Docker container, you can add the @firecloud.org email address for that group (found in the Groups section of your Terra profile).

Step 5. Build and push your custom Docker image

The build command must be executed from within the directory with the modified Docker file.

Before you start: Make note of these common mistakes

1. Make sure the repository name and image name match what you've set up in DockerHub.

2. Docker builds the image based on the Dockerfile in the current directory, so don't forget the period (".") at the end of the build command!

You MUST run your command from the directory containing the Dockerfile.

Docker only recognizes Dockerfiles named simply Dockerfile (no extension), so you can have as many Dockerfiles as you want on your computer, but they need to be in separate folders, with only one Dockerfile per folder. When you execute docker build, Docker looks for a Dockerfile in your terminal's current directory. There must be a single file named Dockerfile in that directory, or the command will fail.
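As a quick sanity check before building, you can confirm that a file named exactly Dockerfile is present in the directory you plan to build from. The sketch below uses a scratch directory (/tmp/build-demo) purely for illustration; in practice you'd run the check inside your real build directory.

```shell
# Illustrative scratch directory standing in for your real build directory
mkdir -p /tmp/build-demo && cd /tmp/build-demo
touch Dockerfile

# docker build fails without a file named exactly "Dockerfile" here
if [ -f Dockerfile ]; then
  echo "Dockerfile found - safe to run docker build ."
else
  echo "No Dockerfile here - docker build will fail"
fi
```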

Follow the instructions below to build and push your custom Docker image.

5.1. Change directory into your terra-jupyter-r directory using the following command.

cd terra-jupyter-r

If you're following this tutorial exactly, the contents of the folders you cloned from GitHub should already be set up correctly.

If you're adapting these instructions for your own Docker adventures, you may want to use the ls command to list the contents of the directory and make sure the necessary Dockerfile is present. If you made your own Dockerfile from scratch and are having trouble getting rid of an extension (such as .txt), you can remove it by renaming the file with this command.

mv Dockerfile.txt Dockerfile
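Here is the rename in action, run in a scratch directory (/tmp/docker-demo and the FROM line are placeholders) so nothing real is touched:

```shell
mkdir -p /tmp/docker-demo && cd /tmp/docker-demo
printf 'FROM ubuntu:22.04\n' > Dockerfile.txt   # a Dockerfile saved with an unwanted extension
mv Dockerfile.txt Dockerfile                    # drop the extension; Docker only recognizes "Dockerfile"
ls
```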

5.2. Execute the build command below.

docker build -t REPOSITORY_NAME/DOCKER_IMAGE_NAME:TAG_NAME .

The building process should take about 10 minutes.

5.3. Execute the push command to upload your custom image to your repo.

docker push REPOSITORY_NAME/DOCKER_IMAGE_NAME:TAG_NAME

This step may also take up to 10 minutes.

How to find your Docker container's digest

Sometimes you need to know a Docker container's digest - a unique content-addressable identifier - to be certain that all nodes are running the correct version of the container.

There are two ways to get the digest depending on where your image is stored. In both cases, you'll look for something with the format sha256:SOMETHING_LONG, where the SOMETHING_LONG bit is the digest.

Follow the instructions below, depending on whether your image is stored on your local machine or not.

  • In the terminal, type docker inspect MY_REPO/MY_IMAGE:TAG at the prompt. Note: The output is more complicated (there are two things that look like sha256:SOMETHING_LONG; the one you want is the "RepoDigests" one, not the "Id"):
    ~ $ docker inspect MY_REPO/MY_IMAGE:TAG
    [
        {
            "Id": "sha256:a98acb9802cbf46eb71e28c652f58026c027d9580ff390c6fa9ae4dec07ae13d",
            "RepoTags": [
                "MY_REPO/MY_IMAGE:TAG"
            ],
            "RepoDigests": [
                "MY_REPO/MY_IMAGE@sha256:96bf2261d3ac54c30f38935d46f541b16af7af6ee3232806a2910cf19f9611ce"
            ],
    
    ...and a lot of other details we don't care about right now.
  • In the terminal, type docker pull MY_REPO/MY_IMAGE:TAG at the prompt. The digest will be displayed in the output as:
    Digest: sha256:96bf2261d3ac54c30f38935d46f541b16af7af6ee3232806a2910cf19f9611ce
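If you want to extract just the digest in a script, shell parameter expansion does the job. This sketch parses the RepoDigests string from the sample output above (in practice you could also pipe docker inspect directly, or use its --format flag):

```shell
# A RepoDigests entry, as shown in the `docker inspect` output above
repo_digest='MY_REPO/MY_IMAGE@sha256:96bf2261d3ac54c30f38935d46f541b16af7af6ee3232806a2910cf19f9611ce'

# Strip everything up to and including the '@' to leave only the digest
digest="${repo_digest#*@}"
echo "$digest"
```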

Launching a Notebook with your custom Docker image

You should now be ready to launch a Notebook Cloud Environment based on your custom Docker image!

1. Navigate to the workspace Analyses page with the notebook you want to run using the custom Docker image.

2. Select the Environment Configuration button (cloud icon) in the side panel on the right side of the screen.

Screenshot of workspace Analyses page with cloud icon highlighted in right sidebar.

3. Select Environment Settings under the Jupyter section of the panel that opens up.

Screenshot of Cloud Environment Settings configuration pane with Jupyter gear icon at the top left circled.

4. If you already have a Jupyter Cloud Environment in the workspace, select the Custom Environment option at the bottom of the Application Configuration dropdown.

If you are creating a new Cloud Environment, select the option to Customize the Cloud Environment, and select the Custom Environment option at the bottom of the Application Configuration Dropdown.

Screenshot of Application Configuration dropdown with the custom environments under other environments circled

5. Fill in the required field with the name and location of the image in your repository.

Screenshot of the custom environment application configuration with repository_name/docker_image_name:tag1 in the container image field.

6. Select Create/Replace at the bottom right of the form to create or update the Environment, depending on whether one already existed for your workspace. It will take about 10 minutes for the new virtual machine (VM) to spin up.

7. Open any notebook (or create a new one) in the same workspace.

8. Test to see if the new packages have been installed on your virtual machine.

Screenshot of notebook code cell with command library(edgeR) successfully completed and output: Loading required package: limma

9. Don't forget to save the image identifier and URL right in your notebook to keep track of which image the notebook is intended to use.

Add a custom Docker to your WDL

In your WDL, you should include MY_REPO/MY_IMAGE@sha256:SOMETHING_LONG. Note: The tag isn't there at all; it's been replaced by the digest, which is a more specific identifier.
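For example, in a WDL task's runtime block (the task name and command here are placeholders, and the digest is the sample value from the inspect output above):

```wdl
task my_task {
  command {
    echo "running in a pinned container"
  }
  runtime {
    docker: "MY_REPO/MY_IMAGE@sha256:96bf2261d3ac54c30f38935d46f541b16af7af6ee3232806a2910cf19f9611ce"
  }
}
```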


Comments

7 comments

  • Anton Kovalsky

    Hi Denis Loginov,

    If you want to use RStudio, you should look into extending this base instead: https://github.com/anvilproject/anvil-docker/tree/master/anvil-rstudio-base

    The ports we use are 8000 for Jupyter and 8001 for RStudio. It's possible that launching an arbitrary image that listens on one of those ports would work; however, we can't guarantee it, since we have other configurations besides opening the ports.

  • Eugene Duff

    Hi - I'm getting 10-minute time-outs when I try to start up my (fairly extensive) custom Jupyter-R Docker image - is there any way around this, or a way to debug things? I'm currently trying to incrementally add elements to the original Dockerfile, but I have the feeling it is timing out simply because the additional packages slow things down.

    Thanks

  • Denis Loginov

    @Merve Dede I'd guess it's installed in the base image gcr.io/deeplearning-platform-release/tf-gpu.2-7, which is probably using old versions of everything. This image is provided by Google and has been deprecated. You might have better luck with a newer one, like gcr.io/deeplearning-platform-release/tf-gpu.2-10 listed here: https://cloud.google.com/deep-learning-containers/docs/choosing-container (but there might be some incompatibilities to resolve with other packages installed in that Dockerfile).

  • Merve Dede

    Hello, I am trying to modify the terra-jupyter-base environment in order to run Python 3.8 instead of 3.7. I can't see where in the Dockerfile the Python version is specified. Do you have any advice? Thanks

  • Denis Loginov

    And does it have to be based on the Terra notebook image, or could it be another image (e.g. RStudio) that listens on port 8080?

  • Anton Kovalsky

    Hi James, thanks for your questions! You can use gcr.io; the custom image field accepts images from both DockerHub and GCR.

  • jamesp

    Can we use a gcr.io repository instead of DockerHub?