What is going on behind the curtain when you click "Launch Analysis" in the Terra UI? This article outlines what each backend component does (for example, what Cromwell does versus what the Google Pipelines API does) to help you understand where and why you might see lag (slow runtimes) or failures. This information is especially useful for people learning to develop WDLs and for anyone troubleshooting or optimizing large-scale batch workflow analyses on Terra.
For a deeper dive into the backend, and useful suggestions for making your workflow submissions go faster, see this blog post: Smarter workflow launching reduces latency and improves user experience.
Glossary of useful terms
If you're new to bioinformatics, and especially if you're new to cloud computing, you may find lots of unfamiliar terms. Understanding the ones below in particular is useful when discussing what Terra is doing behind the scenes.
See a more comprehensive list at Glossary of terms related to cloud-based genomics.
Read (or re-read) Terra architecture and where your files live in it.
When we talk about data in a Terra workspace, we're really talking about data that is linked in some way to your workspace, not data that's actually "in" your workspace. In many cases, when you analyze the data, you won't copy it at all to your workspace bucket - all the analysis is done in the cloud and only (some of) the generated data may be deposited in the workspace bucket. See Understanding Data in the Cloud for more details.
Docker container - A lightweight, standalone, executable package of software that includes everything needed to run an application on a virtual machine. A Docker container can specify the operating system, runtime configuration, and all dependencies needed for a given application. Packaging these components together is called “containerizing” them, and it’s an effective way of making VM configurations shareable and of making analyses that depend on consistent configurations reproducible.
Dockstore - A free and open source platform for sharing reusable and scalable analytical tools and workflows. It’s developed by the Cancer Genome Collaboratory and used by the GA4GH. See https://dockstore.org/.
JSON - JSON (JavaScript Object Notation) is an open-standard file and data interchange format that uses human-readable text to store and transmit data objects, including attribute–value pairs and arrays.
Virtual machine - A virtual machine (aka VM) is a virtual construct that is functionally equivalent to a computer, complete with processing power and storage capacity, whose technical specifications are determined by what a user requests rather than by the hardware where the computation and storage actually take place. This is what makes cloud computing so flexible: creating a virtual machine is just like setting up a new computer, but its power and configuration are whatever you choose when you create it, and you can create, delete, modify, and replace these virtual machines on demand.
WDL - Workflow Description Language is a community-driven programming language stewarded by the community at openwdl.org. It is designed for describing data-intensive computational workflows, with a focus on accessibility for scientists without deep programming expertise. As with CWL (Common Workflow Language), portability is a key factor in its design; what differentiates the two is that WDL is designed to be more human-readable, whereas CWL is primarily optimized for being machine-readable.
Under the hood, quite a lot is happening when you launch an analysis
Various system components kick into gear to ensure that your submission of one or more workflows gets properly assembled and, when that’s done, that each individual task is properly dispatched to Google Compute Engine for execution. If systems could talk, it would look something like this:
Meanwhile on the surface (i.e. the UI), Terra automatically takes you to the Job History page where you can view the status of your workflow(s) and monitor how the work is progressing (note that you need to refresh the browser window to update the status).
Read each section below to understand what's happening, and which bottlenecks to expect, at each stage.
Terra -> Cromwell (status: queued)
Terra takes the workflow specified in the WDL and asks Cromwell to run it.
Possible sources of lag
If there are many user-submitted jobs, especially from the same billing project, your submission will remain in "Submitted" as the Cromwell engine works its way through the queue.
Cromwell -> Google PAPI (status: submitted)
Cromwell asks the Google Pipelines API (PAPI) to launch each task in the workflow when the inputs become available. Cromwell is responsible for managing the sequence of the tasks/jobs.
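The idea of launching each task only once its inputs are available can be sketched in a few lines of Python. This is an illustrative simulation, not Cromwell's actual implementation: the function name `dispatch_order` and the task names are hypothetical, and the sketch starts tasks one at a time, whereas a real engine would run independent tasks concurrently.

```python
from collections import deque


def dispatch_order(task_inputs):
    """Return tasks in an order where each task starts only after every
    task it takes inputs from has finished.

    task_inputs maps a task name to the set of upstream task names whose
    outputs it consumes (hypothetical names, for illustration only).
    """
    remaining = {task: set(deps) for task, deps in task_inputs.items()}
    # Tasks with no upstream dependencies can be launched immediately.
    ready = deque(sorted(t for t, deps in remaining.items() if not deps))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)  # a real engine would submit the task here
        del remaining[task]
        newly_ready = []
        for other, deps in remaining.items():
            if task in deps:
                deps.discard(task)  # this input is now available
                if not deps:
                    newly_ready.append(other)
        ready.extend(sorted(newly_ready))
    if remaining:
        # Some tasks wait on inputs that can never be produced (e.g. a cycle).
        raise ValueError(f"unsatisfiable inputs for: {sorted(remaining)}")
    return order


# Hypothetical four-task workflow.
workflow = {
    "align": set(),
    "sort_bam": {"align"},
    "call_variants": {"sort_bam"},
    "qc_report": {"align"},
}
print(dispatch_order(workflow))
# ['align', 'qc_report', 'sort_bam', 'call_variants']
```

In this sketch, `qc_report` and `sort_bam` both become eligible as soon as `align` finishes, which is why a slow upstream task delays everything that consumes its outputs.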
Possible sources of lag
If you are using preemptible machines, each preemption adds a delay while the preempted machine is restarted. Note that you are not charged for this time.
Google PAPI sets up VM (status: submitted)
PAPI starts one virtual machine (VM) per task and provides the inputs. The WDL specifies what the VM should do and the environment to do it in (the Docker image), and PAPI requests the outputs when the task is done. Each VM's requirements (memory, disk space, number of CPUs) can be specified in the task. Once the task is done, PAPI shuts down the VM.
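As a rough illustration of how a task's machine requirements are written down, the Python helper below renders a WDL `runtime` block using attribute names that Cromwell recognizes on Google Cloud (`docker`, `memory`, `cpu`, `disks`, `preemptible`). The `TaskRuntime` class and every value in the example are assumptions made for this sketch, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class TaskRuntime:
    """Per-task VM requirements (hypothetical helper, for illustration)."""
    docker: str           # Docker image the task runs in
    memory_gb: int        # memory requested for the VM
    cpu: int              # number of CPUs
    disk_gb: int          # size of the VM's local disk
    preemptible: int = 0  # attempts on preemptible VMs before a regular VM

    def to_wdl(self) -> str:
        """Render the requirements as a WDL runtime block."""
        lines = [
            "runtime {",
            f'  docker: "{self.docker}"',
            f'  memory: "{self.memory_gb} GB"',
            f"  cpu: {self.cpu}",
            f'  disks: "local-disk {self.disk_gb} HDD"',
            f"  preemptible: {self.preemptible}",
            "}",
        ]
        return "\n".join(lines)


# Example values are placeholders chosen for the sketch.
example = TaskRuntime(docker="ubuntu:22.04", memory_gb=8, cpu=2, disk_gb=50, preemptible=3)
print(example.to_wdl())
```

The printed block is what would sit inside a WDL task definition; larger requests generally mean a more capable VM, at the cost of longer setup and higher charges.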
Possible sources of lag
More complex machines can take longer to set up. You aren't charged while the machine is being set up, only once it is running.
It can take time to localize large data files to the VM disk. You are charged by GCP for this time.
Execute WDL and Write Outputs (status: running)
The Docker image required for each task is pulled to the virtual machine, along with any inputs from Google buckets. When outputs are produced, they are written to the Google bucket of the workspace where the analysis was launched. Links to the outputs are written back to the workspace data table.
What could cause this part to run slowly?
If even one shard of a scattered task is still crunching away, the workflow will look like it is taking a long time, even though the other shards have finished.
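To see why a single slow shard dominates, note that a scattered step finishes only when its slowest shard does. The sketch below assumes all shards start at the same time; the shard durations are made up, and the "twice the median" threshold for flagging a straggler is an arbitrary choice for illustration.

```python
import statistics


def scatter_wall_clock(shard_minutes):
    """Elapsed time of a scattered step whose shards all start together:
    the step is done only when the slowest shard is done."""
    return max(shard_minutes, default=0)


def stragglers(shard_minutes, factor=2.0):
    """Indices of shards that ran longer than `factor` times the median."""
    if not shard_minutes:
        return []
    threshold = factor * statistics.median(shard_minutes)
    return [i for i, minutes in enumerate(shard_minutes) if minutes > threshold]


# Hypothetical shard runtimes, in minutes.
shards = [12, 11, 13, 95, 12]
print(scatter_wall_clock(shards))  # 95
print(stragglers(shards))          # [3]
```

Here four of the five shards finish in about 12 minutes, yet the step as a whole takes 95 minutes, so checking per-shard status in Job History is usually the quickest way to find the shard that is holding things up.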
Terra UI and corresponding backend functions
Here's a diagram view of the relationship between the Terra UI and what is happening behind the scenes.