What happens when you launch a workflow?

Allie Hajian

What is going on behind the curtain when you hit "Launch Analysis" in the Terra UI? This article explains the components working in the backend - for example, what Cromwell does versus what the Google Pipelines API does - so you can understand where and why you might see lag or failure. This information is especially useful if you are learning to develop WDLs, or troubleshooting or optimizing large-scale batch workflow analyses on Terra.

Back-end overview

Under the hood, quite a lot happens when you launch an analysis: various system components kick into gear to ensure that your submission of one or more workflows gets properly assembled and, when that’s done, that each individual task is properly dispatched to Google Compute Engine for execution. If systems could talk, the conversation would look something like this:

What-happens-behind-the-scenes-diagram.png

Read each section below to understand what's happening - and expected bottlenecks - at each stage.

Terra -> Cromwell (status: queued)

What's happening
Terra takes the workflow specified in the WDL and asks Cromwell to run it.

Possible sources of lag
If there are many user-submitted jobs, especially from the same billing project, your submission will remain in "Submitted" as the Cromwell engine works its way through the queue.

Cromwell -> Google PAPI (status: submitted)

What's happening
Cromwell asks the Google Pipelines API (PAPI) to launch each task in the workflow when the inputs become available. Cromwell is responsible for managing the sequence of the tasks/jobs.
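To make "when the inputs become available" concrete, here is a minimal sketch of dependency-driven dispatch. It is an illustration only, not Cromwell's actual implementation: the function name, the task names, and the idea of grouping tasks into "waves" are all assumptions made for this example.

```python
def dispatch_order(task_inputs):
    """Group tasks into waves that could be launched together.

    task_inputs maps each (hypothetical) task name to the set of tasks
    whose outputs it needs as inputs.
    """
    done = set()
    waves = []
    remaining = dict(task_inputs)
    while remaining:
        # A task is ready once every task it depends on has finished.
        ready = sorted(t for t, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("no task can start: circular or missing dependency")
        waves.append(ready)
        done.update(ready)
        for t in ready:
            del remaining[t]
    return waves


# Hypothetical four-task workflow: "sort" and "qc" both need the output of "align".
workflow = {
    "align": set(),
    "sort": {"align"},
    "index": {"sort"},
    "qc": {"align"},
}
print(dispatch_order(workflow))
```

In this sketch, "sort" and "qc" land in the same wave because both only wait on "align", while "index" has to wait one wave longer.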

Possible sources of lag
If you are using preemptible machines, there will be a delay each time a machine is preempted, while the preempted machine is restarted. Note that you are not charged for this time.
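The sketch below shows why preemption adds lag: every preempted attempt means another machine has to be started before the task can finish. It assumes a policy in which a task is tried a limited number of times on a preemptible machine and then falls back to a standard one. The function, its parameters, and the machine labels are hypothetical and only simulate the idea.

```python
def attempts_for_task(preemption_outcomes, max_preemptible_tries):
    """Return the machine types used, in order, until the task finishes.

    preemption_outcomes: booleans, True meaning that preemptible attempt
    was preempted (hypothetical input for the simulation).
    max_preemptible_tries: preemptible attempts allowed before falling
    back to a standard machine (assumed policy).
    """
    history = []
    outcomes = iter(preemption_outcomes)
    for _ in range(max_preemptible_tries):
        history.append("preemptible")
        if not next(outcomes, False):
            return history  # this attempt ran to completion
    # Every preemptible attempt was preempted: run on a standard machine.
    history.append("standard")
    return history


print(attempts_for_task([True, True, False], max_preemptible_tries=3))
print(attempts_for_task([True, True, True], max_preemptible_tries=3))
```

Each extra entry in the returned list stands for one more machine start-up, which is the delay described above.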

Google PAPI sets up VM (status: submitted)

What's happening
PAPI starts a virtual machine (VM) for each task and provides the inputs. The WDL specifies what the VM should do, the environment to do it in (the Docker image), and which outputs to return when it is done. Each VM's requirements (memory, disk space, number of CPUs) can be specified in the task. Once the task is done, PAPI shuts down the VM.

Possible sources of lag
It can take longer to set up a more complex machine. You are not charged while the machine is being set up, only while it is running.

It can take time to localize large data files to the VM disk. You are charged by GCP for this time.

Execute WDL and Write Outputs (status: running)

What's happening
The Docker image required for each task is pulled to the virtual machine, along with any inputs from Google buckets. When an output is produced, it is written to the Google bucket of the workspace where the analysis was launched. Links to the outputs are written back to the workspace data table.

What could cause this part to run slowly?
If even one shard of a scattered task is still crunching away, the workflow will look like it is taking a long time, even though the other shards have finished.
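As a rough illustration of the point about shards, the sketch below treats the apparent run time of a scattered step as the run time of its slowest shard. The shard names and durations are made-up numbers for the example, not measurements from Terra.

```python
def summarize_scatter(shard_minutes):
    """Summarize a scattered step from per-shard run times (in minutes)."""
    slowest = max(shard_minutes, key=shard_minutes.get)
    return {
        "slowest_shard": slowest,
        # The step is only complete when its slowest shard finishes.
        "wall_clock_minutes": shard_minutes[slowest],
        "finished_earlier": sorted(s for s in shard_minutes if s != slowest),
    }


# Hypothetical durations: two quick shards and one straggler.
shards = {"shard-0": 12, "shard-1": 14, "shard-2": 95}
print(summarize_scatter(shards))
```

Here the step appears to take 95 minutes even though two of the three shards were done in under 15.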

Meanwhile on the surface (i.e. the UI), Terra automatically takes you to the Job History page where you can view the status of your workflow(s) and monitor how the work is progressing (note that you need to refresh the browser window to update the status). 

Terra UI and corresponding backend functions

Here's a diagram view of the relationship between the Terra UI and what is happening behind the scenes. 

What-happens-behind-the-scenes-when-you-hit-launch_Diagram.png
