Workflows in Terra
Terra allows you to execute predefined bioinformatics workflows against data in your workspace. Each workflow is a chain of individual software tasks. Each task comprises a set of software instructions to execute (code), as well as a reference to an environment in which to run that code, containing the required software tools. Software is packaged for portable execution as a Docker image, allowing the underlying compute environment to be precisely specified. The tasks comprising a workflow, along with details regarding how information is passed between tasks, are described in the Workflow Description Language (WDL). This description is intended to be both human- and machine-readable, and is used by our execution engine to run the actual software pipeline. A unique WDL file therefore represents a unique workflow. The version of the WDL file and the version of the docker images it references are both components contributing to the identity of an executed job.
Workflows written in WDL may be uploaded to the Broad Methods Repository, where they receive immutable versions. Docker images are stored in container repositories such as Docker Hub or Google Container Registry, where they also receive a version (a digest and a tag, see below). The WDL itself specifies which version of the docker image to use for each task.
Every time you run a workflow, Terra captures the version of the WDL, as well as the exact version of the docker image used for each task.
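As a rough illustration of why both the WDL version and the image versions contribute to the identity of a run, the sketch below combines them into a single fingerprint. The function name, the record layout, and the sample values are all hypothetical; this is not how Terra computes or stores provenance.

```python
import hashlib
import json


def run_fingerprint(wdl_text, task_image_digests):
    """Combine the WDL text and per-task image digests into one identifier.

    Hypothetical helper for illustration only.
    """
    record = {
        "wdl_sha256": hashlib.sha256(wdl_text.encode("utf-8")).hexdigest(),
        # Sort by task name so the result does not depend on dict ordering.
        "images": sorted(task_image_digests.items()),
    }
    canonical = json.dumps(record, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


wdl = "workflow example { call align }"  # placeholder text, not a complete WDL
digests_v1 = {"align": "sha256:" + "a" * 64}  # made-up digest values
digests_v2 = {"align": "sha256:" + "b" * 64}

# Same WDL and same images give the same fingerprint.
same = run_fingerprint(wdl, digests_v1) == run_fingerprint(wdl, dict(digests_v1))
# Changing only an image digest changes the fingerprint.
differs = run_fingerprint(wdl, digests_v1) != run_fingerprint(wdl, digests_v2)
```

Under these assumptions, two runs share a fingerprint only if neither the WDL nor any task image has changed.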
The relationship between the WDL, the docker images it references, and the software they encapsulate can be somewhat complicated. The remainder of this article discusses how Terra handles the versions of each of these components.
Version of the workflow description in WDL
The Broad Methods Repository creates a new snapshot every time you upload a modified version of your WDL workflow, or change any supplemental metadata (synopsis, description, etc). We store the date of each snapshot, along with who created it. All of this information is visible for each snapshot in the Methods Repository.
Snapshots cannot be modified or deleted. They can be redacted (hidden from all users), but they remain in the Methods Repository database for provenance purposes.
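The snapshot rules described above (append-only, immutable, redactable but never deleted) can be sketched as a small in-memory model. The class and method names are invented for illustration and do not reflect the Methods Repository's actual implementation or API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: a snapshot cannot be modified once created
class Snapshot:
    number: int
    wdl_text: str
    created_by: str
    created_at: datetime


@dataclass
class MethodHistory:
    """Hypothetical append-only history of one method's snapshots."""
    _snapshots: list = field(default_factory=list)
    _redacted: set = field(default_factory=set)

    def publish(self, wdl_text, created_by):
        # Every upload creates a new snapshot; nothing is overwritten.
        snap = Snapshot(len(self._snapshots) + 1, wdl_text, created_by,
                        datetime.now(timezone.utc))
        self._snapshots.append(snap)
        return snap

    def redact(self, number):
        # Redaction hides a snapshot from users but keeps the record.
        if not 1 <= number <= len(self._snapshots):
            raise KeyError(number)
        self._redacted.add(number)

    def visible(self):
        return [s for s in self._snapshots if s.number not in self._redacted]

    def audit_trail(self):
        # Provenance view: includes redacted snapshots.
        return tuple(self._snapshots)
```

There is deliberately no delete or update method: the only state change available after publishing is redaction.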
Further, when a WDL workflow is submitted for execution against some data, Terra stores a copy of the exact WDL in the execution engine database.
When a WDL is executed, a directory is generated in the workspace’s associated Google Cloud Storage (GCS) bucket. Each task receives a subdirectory, which is filled with several pieces of information. A shell file is written out detailing the exact commands used to run the task (i.e. all the arguments in the software call). Output logs including stdout and stderr as well as the execution engine log are placed there as well. These logs may be manually deleted by users with WRITER or OWNER access to the workspace, though this is not recommended.
Terra also accepts WDL workflows exported to workspaces from the Dockstore tool repository. Terra displays the Dockstore version for each workflow sent to a workspace, and allows you to toggle between versions. At present, versions in Dockstore are mutable: an author may repoint a version such as “1.15.1” to an updated WDL. Terra cannot prevent authors from altering existing workflow versions in Dockstore, so we cannot guarantee in advance that a given Dockstore version always refers to the same WDL. However, because Terra captures an exact copy of the WDL every time it is run, any such change is recorded.
Version of individual task docker images
The WDL itself specifies the docker image with which to run each task. These images are typically stored in container registries such as Docker Hub or Google Container Registry (GCR). An image reference consists of a repository name followed by a tag, for example a name ending in ‘:1.10.0’.
Here, ‘1.10.0’ is a semantic version applied to the image as a tag by the author. In most cases these tags are sufficient to indicate the version of the software the image encapsulates.
Unfortunately, Docker tags are mutable, and the same tag may be applied to different image builds.
We recommend that workflow authors not reuse tags on images they push to remote repositories, but this cannot be enforced. It is therefore hard to predict whether a WDL workflow will use exactly the same docker image as on previous runs, although a change can be detected at runtime.
Fortunately, docker images carry a content-addressable digest (a “hash”). Even if a tag such as “latest” is moved between different image builds, these images will carry distinct digests. Terra resolves each image to its digest prior to running. Primarily, we do this when making job-avoidance determinations, to ensure we only job-avoid when we are certain that the same image is being used. For each task, the digest is retained in our databases. It is not currently made visible to the user. We would be open to this feature request, or any other ideas around enhancing version traceability.
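To make the tag-versus-digest distinction concrete, here is a minimal sketch that splits an image reference into its parts and applies the conservative reuse rule described above. Both function names and the example image names are hypothetical, and the parsing covers only the common `name[:tag][@digest]` shape; it is not Terra's job-avoidance logic.

```python
def split_image_reference(ref):
    """Split 'name[:tag][@digest]' into (name, tag, digest)."""
    name, _, digest = ref.partition("@")
    # A colon after the last slash separates the tag; an earlier colon
    # belongs to a registry port such as 'localhost:5000'.
    last_slash = name.rfind("/")
    colon = name.rfind(":")
    tag = None
    if colon > last_slash:
        name, tag = name[:colon], name[colon + 1:]
    return name, tag, digest or None


def can_reuse_result(previous_digest, current_digest):
    """Reuse an earlier result only when both digests are known and equal."""
    return bool(previous_digest) and previous_digest == current_digest


# Hypothetical image name; the tag alone says nothing about the build.
name, tag, digest = split_image_reference("example.org/tools/aligner:1.10.0")
```

The key design choice is that the tag never enters the reuse decision: two runs that both say `1.10.0` are treated as different unless their resolved digests match.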
Version of the software within an image
Docker images capture an immutable software environment at the time they are built. Each distinct docker image (unique digest) assembles a collection of software tools, environment variables, etc. during the build process.
Docker images are built from a template called a Dockerfile. The Dockerfile can specify that software is to be downloaded from an external resource, pointing to a mutable reference such as the head of a git branch. For example:
git clone --branch master git://git.kernel.org/pub/scm/.../linux.git
This may seem like an opportunity for mutability. However, note that the clone is executed at the time the image is built. The referenced software repository may change between image builds. Yet each build will result in a unique digest. The software/environment will be consistent whenever that exact image is used to create containers. Therefore, changes in the underlying software and environment are reflected by changes in the digest.
While distinct digests indicate that an image was built multiple times, it is not easy to determine exactly what changed between builds. One can pull the specific image (by digest), run a container from it, and inspect the environment by checking the versions of the installed software.
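The manual inspection described above can be sketched as a pair of Docker commands addressed by digest rather than by tag. The helper below only builds the command lines; the repository name, the digest, and the probe command are placeholders, and the probe assumes the image's entrypoint allows an arbitrary command to be run.

```python
import shlex


def inspect_commands(repository, digest, probe):
    """Build commands to pull an image by digest and run a probe inside it.

    Hypothetical helper; `probe` is whatever version check suits the tool.
    """
    image = f"{repository}@{digest}"  # digest form pins one exact build
    return [
        ["docker", "pull", image],
        ["docker", "run", "--rm", image, *probe],
    ]


commands = inspect_commands(
    "example.org/tools/aligner",  # placeholder repository
    "sha256:" + "a" * 64,         # placeholder digest
    ["python", "--version"],      # placeholder probe command
)
for argv in commands:
    print(shlex.join(argv))
```

Each list can be passed to `subprocess.run` as-is on a machine with Docker installed. Repeating the probe against two digests is one way to compare what changed between builds.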
Version of the workflow inputs and outputs
Every time a workflow is run, the exact workflow inputs and outputs are captured. These are available to you for each workflow execution, as well as for every individual task. This execution history cannot be deleted. When a workflow input or output represents a reference to a file in GCS object storage, the gs:// path is displayed. A user with sufficient access to this data (WRITER access, if the data is in a workspace-associated GCS bucket) can modify or delete these files. However, Terra workspace buckets enable GCS object storage versioning, so that archived versions of files can be recovered.
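With object versioning enabled, `gsutil ls -a` lists archived versions as paths ending in `#` followed by a generation number, and copying such a path over the live object restores it. The sketch below parses that form and builds the restore command; the function names, bucket, and object are hypothetical, and it assumes the trailing `#<digits>` is always a generation number.

```python
def parse_gs_path(path):
    """Split 'gs://bucket/object[#generation]' into its parts."""
    if not path.startswith("gs://"):
        raise ValueError(f"not a gs:// path: {path}")
    rest = path[len("gs://"):]
    head, sep, tail = rest.rpartition("#")
    generation = None
    if sep and tail.isdigit():
        # Assumption: a trailing '#<digits>' names an archived generation.
        rest, generation = head, int(tail)
    bucket, _, name = rest.partition("/")
    return bucket, name, generation


def restore_command(versioned_path):
    """Build the gsutil command that copies an archived version back."""
    bucket, name, generation = parse_gs_path(versioned_path)
    if generation is None:
        raise ValueError("path does not name an archived generation")
    return ["gsutil", "cp", versioned_path, f"gs://{bucket}/{name}"]


# Placeholder bucket, object, and generation number.
archived = "gs://my-workspace-bucket/outputs/sample.bam#1360035307075000"
command = restore_command(archived)
```

Restoring this way creates a new live generation rather than rewriting history, so the archived versions remain available afterwards.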
Terra supports strong versioning of tools according to their workflow descriptions, and captures the exact software environment used for each task via docker digests. Where possible, Terra attempts to provide transparency regarding the underlying docker resources that each workflow references. In certain circumstances it may be difficult for you to identify the exact version of a software tool used with a particular run, given how it was distributed via docker image repositories. In these cases, this information can be recovered forensically by Terra administrators.