How to troubleshoot and fix stalled workflows

Jason Cerrato

Is your workflow taking a long time to move from submitted to queued and from queued to running? Read on for why your workflow may be "stuck" and best practices for resolving the issue. 

To learn more about what's happening behind the scenes when you launch a workflow (including places where lag happens), see How the workflow system works.

Stuck workflows overview

One of the benefits of working in the cloud is being able to call up the computational resources you need to complete an analysis, whether you're analyzing ten samples or ten thousand (or more!). However, even cloud resources aren't infinite. Sometimes, your workflow may appear to be stuck. 

Terra is already optimized to accommodate the needs of different users, from WDL developers running a single sample at a time to large consortia analyzing tens or hundreds of thousands. In other words, you're not likely to be waiting in line behind a big Terra submission. If your workflow is stalled, it might be a Google Cloud Platform (GCP) issue.

The top reason for stalled workflows: Resource quota issues

The most likely issue: You're up against a Google-imposed resource quota, something fairly straightforward to troubleshoot and fix.

What are resource quotas, and why do they matter?

To make sure resources are available to the community, Google limits the CPUs and disks a single Google Project can use at a time. If you exceed your resource quota (or sometimes even if you are just close), Terra cannot secure the CPUs, GPUs, or memory requested. 

Resource quotas affect your ability to spin up a large VM to run an analysis. They can also slow your analysis or stop it from running at all, since tasks pause or slow down as you run up against a compute or disk quota.

See What are resource quotas for more details.

Resource quota limit symptoms/error messages

  • Error message includes the word "quota" when trying to run a large workflow analysis
  • Error message "The server was not able to produce a timely response to your request. Please try again in a short while!"
  • Workflow tasks running or progressing very slowly, especially if there was no problem in the past
  • Multiple instances of worker assigned or worker released cycles in the timing diagram
  • Workflow fails to launch (workflow requested more resources than allowed)
  • Error messages including the message "error checking TOS" may also indicate quota limit issues
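
As a rough illustration of the message-based symptoms above, the sketch below flags error text that contains one of the listed phrases. The function name and the sample messages are hypothetical and are not part of Terra or Cromwell; a match only suggests a quota issue and does not confirm one.

```python
# Phrases taken from the symptom list above; all compared in lower case.
QUOTA_HINTS = (
    "quota",
    "error checking tos",
    "please try again in a short while",
)


def looks_like_quota_issue(message: str) -> bool:
    """Return True if the message contains any quota-related hint (case-insensitive)."""
    text = message.lower()
    return any(hint in text for hint in QUOTA_HINTS)


# Made-up sample messages, for illustration only.
print(looks_like_quota_issue("Task failed: QUOTA exceeded for resource"))
print(looks_like_quota_issue("Input file not found"))
```

The first call prints True and the second prints False. Slow progress and repeated worker assigned/released cycles do not show up as message text, so they still need to be checked in the timing diagram.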

Other reasons for stalled workflows

Other, much less common issues may require help from Terra Support.

Scroll down for a summary of constraints on speedy workflow submission, and solutions, in order from most to least common.

What to do if you suspect you're up against a resource quota

Step 1: Confirm you're at capacity limit (Google Cloud resource quota)

Step 2: Identify which resource is at or near quota

Step 3: Estimate how much quota you need

Step 4: Request a quota increase

When to request more resource quota

If you are seeing errors
If you see quota errors or messages in your logs (for example, when your workflow fails because a task requested more resources than your quota allows), you will need to increase your resource quota.

If you need to see results faster
In many cases, if you exceed your resource quota, your analysis will run more slowly. This may or may not be fine. If your workflow is stalled and you need to progress, you may want to request an increase.

Step 1: Confirm you're at capacity limit 

When your workspace Google project reaches a quota limit, Cromwell continues to submit jobs, and Google Life Sciences acknowledges them as created even if the physical VM cannot yet start.

Cromwell detects this condition in the backend and reports AwaitingCloudQuota in the Job History Dashboard. To confirm this is the case, follow these steps to access the quota flag. 

1.1. Go to the Job History tab of your workspace and click on the submission (in the left column) to get workflow details.

1.2. Click on the Dashboard icon for the workflow that is progressing slowly while it is running.
Screenshot of the submission details pane in the Job History page with two samples running. The dashboard icon in the middle under the links header for the second sample is circled

1.3. If you are running up against a cloud resource quota, you should see two alerts.

  1. An AwaitingCloudQuota message in the Total Call Status Counts section
  2. An orange icon and the message "Submitted. Awaiting Cloud Quota" in the status column

Step 2: Identify which resource is at or near quota

To understand which quota could be slowing your workflow submission, go to Google Cloud console for quota and resource details.

Before you start! You need to have Owner permission for the Terra billing project to view the resource quotas on the Google Cloud console. If you do not see the options in the screenshots below, you don't have enough permission to view the information on Google Cloud console.

2.1. Go to where project_id is the workspace Google project ID.

How to find the Google project ID

The workspace Google project is in the Cloud Information section of the Dashboard.

Screenshot of workspace dashboard with an arrow pointing to the Google project ID in the Cloud information section at the right

2.2. Identify quotas that are close to the limit, estimate how much you need, and request more if needed.

Screenshot of Google Cloud console showing quotas for Terra workspace Google project
Note: The CPU quota in region "us-central1" is maxed out (the orange bar near 100%)

Resource quotas are per region. To run your analysis across multiple regions (e.g., us-east1 and us-central1), you need to request a larger quota in both.
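
To make "close to the limit" concrete, here is a minimal sketch that flags quotas whose usage is at or above a chosen fraction of the limit, per region. The QuotaRecord type, the 80% threshold, and the sample numbers are assumptions for illustration; in practice you would copy the usage and limit values from the Google Cloud console quota page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuotaRecord:
    """One quota line as read off the console (hypothetical structure)."""
    region: str
    metric: str
    usage: float
    limit: float


def near_limit(records, threshold=0.8):
    """Return (region, metric, fraction_used) for quotas at or above threshold, most-used first."""
    flagged = []
    for record in records:
        if record.limit <= 0:
            continue  # skip quotas with no usable limit
        fraction = record.usage / record.limit
        if fraction >= threshold:
            flagged.append((record.region, record.metric, fraction))
    return sorted(flagged, key=lambda item: item[2], reverse=True)


# Illustrative values only: CPUs maxed out in us-central1, as in the note above.
example = [
    QuotaRecord("us-central1", "CPUS", 24, 24),
    QuotaRecord("us-central1", "DISKS_TOTAL_GB", 40, 100),
    QuotaRecord("us-east1", "CPUS", 2, 24),
]
print(near_limit(example))
```

With these sample values only the us-central1 CPU quota is flagged. Because quotas are tracked per region, the same metric appears once for each region you use.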

Step 3: Estimate how much resource quota you need

The right amount of quota is a function of the number of workflows Terra is launching, the number of concurrent tasks running within each workflow, and the resources requested by those tasks.

To calculate the quota needed for the workflows, you may need to dive into your WDL. 

Example resource quota estimation (three task WDL)

Consider a three-task WDL that will run on one or more samples. You need to look across all three tasks to find the maximum amount of CPU and persistent disk (PD) you expect to need at any given time.

  • Task 1: uses 10 CPUs and 10GB of PD
  • Task 2: uses 1 CPU and 1GB of PD, and scatters 10 ways wide
  • Task 3: uses 10 CPUs and 10GB of PD, and scatters 10 ways wide

Resources needed overview

In this example, Tasks 1 and 2 use the same amount of resources because Task 2 scatters 10 ways (10 × 1 CPU and 10 × 1GB of PD). Task 3, however, uses more resources than Task 1 or 2.

For a single sample, Task 3 requests a total of 100 CPUs and 100GB of PD (10 CPUs and 10GB of PD per shard, scattered 10 ways wide). If the workflow runs on 10 samples at once, it will need 10 times those resources at the same time: 1000 CPUs and 1TB of persistent disk. If the current resource quota is set at 24 CPUs and 100GB of PD and you want this workflow to run as quickly as possible, you will need to request at least 1000 CPUs and 1TB of PD.
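
The arithmetic in this example can be sketched in a few lines of Python. This is only an estimate under stated assumptions: the three tasks run one after another within a sample, every sample reaches its most demanding task at the same time, and the Task type and function names are hypothetical helpers rather than anything provided by Terra or WDL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Per-shard resources for one WDL task (hypothetical helper type)."""
    name: str
    cpus: int
    disk_gb: int
    scatter_width: int = 1


def peak_per_sample(tasks):
    """Peak CPUs and disk (GB) for one sample, assuming tasks run sequentially."""
    peak_cpus = max(task.cpus * task.scatter_width for task in tasks)
    peak_disk_gb = max(task.disk_gb * task.scatter_width for task in tasks)
    return peak_cpus, peak_disk_gb


def quota_needed(tasks, samples):
    """Worst case: every sample hits its peak task at the same time."""
    cpus, disk_gb = peak_per_sample(tasks)
    return cpus * samples, disk_gb * samples


tasks = [
    Task("task1", cpus=10, disk_gb=10),
    Task("task2", cpus=1, disk_gb=1, scatter_width=10),
    Task("task3", cpus=10, disk_gb=10, scatter_width=10),
]
print(peak_per_sample(tasks))           # (100, 100)
print(quota_needed(tasks, samples=10))  # (1000, 1000)
```

The result matches the figures above: 1000 CPUs and 1000GB (1TB) of persistent disk for ten samples. If tasks in your WDL can run concurrently within a sample, you would need to sum their requirements instead of taking the maximum.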

Step 4: Request a quota increase

Billing project owners can request a resource quota increase from Google directly following the directions below. 

4.1. Go to Google Cloud console > IAM & Admin > Quotas

4.2. Search for the Google project of the workspace where you suspect you are reaching a resource quota.
Screenshot of the quota request page on GCP console with the project jac-terra-billing-project highlighted - number 2 - in the dropdown at top left, the near the limit 8,999 highlighted with a number three and an arrow pointing to it, and an arrow and number four pointing to the edit quotas and pencil icon at the right.

  • Your workspace Google Project ID is on the right side in the Dashboard page under Cloud Information.
    Screenshot of workspace dashboard with an arrow pointing to the Workspace Google Project ID in the cloud information section at the right

4.3. Check the quota(s) you want to increase (e.g., CPUs, memory, etc.) by clicking the View quotas link under Near the limit.

4.4. Click the Edit quotas pencil icon.

4.5. Fill in the quota changes form.

Required fields and recommendations

    • New limit field
      What you want your new quota to be (see this section above for guidance on how to estimate)
    • Request description
      Why you are requesting the increase. The more detailed you can be, the better, including which regions you want the increase to apply to, if applicable (e.g., us-central1, us-east1, etc.)

      Terra uses us-central1 by default in most cases. If you don't specify your region for workflow submissions, this will be the region for which the quota(s) are increased.

What to expect

Depending on the quota and how much of an increase you are requesting, it may take two to three business days for Google to process and respond to the request. You should get an email from Google about your quota increase.

Other issues that can affect workflow progress

Google Cloud resource quotas are the most likely, but not the only, reason for workflow submissions to progress slowly. Read on for other things that can impact workflow submissions. Note: These are unusual, and mostly impact large submissions and submissions with a lot of parallelism.

A few caveats about slowly progressing workflows 

There is no need to “drip feed” workflows or otherwise manage capacity, with one exception 

If you want a newly submitted workflow to execute immediately and you have other workflows already running or waiting to run, you may abort (cancel) the older workflows to free up slots.

Most quota limits will make workflows run or progress slowly, not fail

Your work will run with as much parallelism as capacity allows, with no impact to success rate. Workflows beyond capacity simply wait to execute. Terra is designed to always prioritize workflow success regardless of other conditions.

What to do 

Stuck workflows are not always easy to diagnose! If you've checked that your submission is not stuck because of a resource quota, first wait to see if the issue resolves itself (especially if your job is large). If it does not, Terra Support is here to help!

  1. Submit a support request
  2. Include as many details as you can (e.g., failed submission ID, symptoms, etc.)
  3. Make sure to share the workspace with support 

Capacity limits other than resource quotas (deeper dive)

Although resource quotas are the most likely cause of stalled workflows, they are not the only thing that can be slowing your submission. Other causes (and suggestions for how to deal with them) are below. And if you can't figure it out, submit a support ticket (select Contact Us from the main navigation menu, under Support). We're happy to help!

A deeper dive into Terra's workflow execution system: see the blog post Smarter workflow launching reduces latency and improves user experience or the documentation on How the workflow system works.

Workflows submission limit (per user)

Each registered account (email) has 3,000 workflow slots globally to use across all of Terra. No distinction is made between human users and service accounts. The slot count is a sum of the user’s activity across all Cloud Billing accounts, Terra Billing projects, and Terra workspaces.

Workflows beyond the submission limit may wait to start, and parallelism may be reduced. Note: The submissions will not fail, but will progress slowly as Terra waits for resources to become available. 

Be patient! You can safely submit an unlimited number of workflows at any time. Terra will only start as many workflows as there are available slots.
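
As a back-of-the-envelope illustration of the slot limit, the sketch below splits a new submission into workflows that can start right away and workflows that wait. The function name is hypothetical and the model is deliberately simplified: it only counts slots for one account and ignores how Terra actually schedules waiting workflows.

```python
WORKFLOW_SLOTS_PER_USER = 3000  # per-account limit stated above


def split_submission(submitted, already_active=0, slots=WORKFLOW_SLOTS_PER_USER):
    """Return (starting_now, waiting) for a new submission under a simple slot count."""
    if submitted < 0 or already_active < 0:
        raise ValueError("workflow counts must be non-negative")
    free_slots = max(slots - already_active, 0)
    starting_now = min(submitted, free_slots)
    return starting_now, submitted - starting_now


# 5,000 workflows submitted while 500 are already active across Terra:
print(split_submission(5000, already_active=500))  # (2500, 2500)
```

In this simplified picture, 2,500 workflows start and the other 2,500 wait for slots to free up; none of them fail.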

Platform-wide simultaneous job limit 

A single Google project can only run a fixed number of jobs simultaneously (up to 28,800 jobs per Terra Billing project shared among all Billing project users). Submissions above this limit have to wait for jobs to complete. 

As of spring 2022, Terra detects when a project is at its maximum job capacity and pauses workflow starts until the project backs off from maximum. This reduces the incidence of workflows perceived as “stuck” with nonstarting jobs.

Other (region-specific) Google Cloud issues

The region doesn't support the requested virtual machine shape or resource type: A specific CPU type may not be available in a particular region, or there may be a global shortage (e.g., of GPUs). In these cases, provisioning anywhere may be difficult. 
