How to troubleshoot and fix stalled workflows

Jason Cerrato

Is your workflow progressing more slowly than you expect? Taking a long time to move from submitted to queued, and from queued to running, for example? Read on for reasons your workflow may be "stuck" and best practices for resolving the issue.

To learn more about what's happening behind the scenes when you launch a workflow (including where lag can happen), see How the workflow system works.

Stuck workflows overview

One of the benefits of working in the cloud is being able to call up the computational resources you need to complete an analysis, whether you're analyzing ten samples or ten thousand (or more!). However, even cloud resources are not infinite, and there may be times when your workflow seems to be stuck. 

The first and most likely issue is that you are bumping up against a Google-imposed resource quota, which is fairly straightforward to troubleshoot and fix. Other, much less common issues will likely require help from Terra Support. Below is a summary of constraints on speedy workflow submission, and their solutions, in order from most to least common.

1. Resource quota issues

GCP resource quotas are the most likely source of stuck workflows, since Terra is already optimized to accommodate the needs of users ranging from WDL developers running a single sample at a time to large consortia analyzing tens or hundreds of thousands of samples.

What are resource quotas and why do they matter?

To make sure resources are available to the community, Google limits the CPUs and disks a single Google Project can use at a time. If you exceed your resource quota (or sometimes even if you are just close), Terra will not be able to secure the CPUs, GPUs, or memory requested. 

Resource quotas affect your ability to spin up a large VM to run an analysis. They can also affect the speed of your analysis (your workflow will run slowly or not at all), since tasks pause or slow down as you run up against a compute or disk quota.

See What are resource quotas for more details.

Resource quota limit symptoms/error messages

  • Error message includes the word "quota" when trying to run a large workflow analysis
  • Error message: "The server was not able to produce a timely response to your request. Please try again in a short while!"
  • Workflow tasks running/progressing very slowly, especially if there was no problem in the past
  • Multiple instances of "worker assigned"/"worker released" cycles in the timing diagram
  • Workflow fails to launch (workflow requested more resources than allowed)

What to do if you suspect you're up against a resource quota

Step 1: Confirm you're at capacity limit (GCP resource quota)

Step 2: Identify which resource is at or near quota

Step 3: Estimate how much quota you need

Step 4: Request a quota increase

When to request more resource quota

If you are seeing errors
If you see quota errors or messages in your logs (for example, when your workflow fails because a task requested more resources than your quota allows), you will need to request a resource quota increase.

If you need to see results faster
In many cases, if you exceed your resource quota, your analysis will simply run more slowly. This may be fine, or it may not. If your workflow is stalled and you need it to progress, you may want to request an increase.

Step 1: Confirm you're at capacity limit (GCP resource quota) 

When your workspace Google project reaches a quota limit, Cromwell continues to submit jobs, and Google Life Sciences acknowledges them as created even if the physical VM cannot yet start.

Cromwell detects this condition in the backend and reports AwaitingCloudQuota in the Job History Dashboard. To confirm this is the case, follow the steps below to find the quota flag (a programmatic alternative is sketched after step 1.3).

1.1. Go to the Job History tab of your workspace.

1.2. Click on the Dashboard icon for the workflow that is progressing slowly while it is running.
[Screenshot: Dashboard icon in the Job History tab]

1.3. If you are running up against a cloud resource quota, you should see two alerts.

    • An AwaitingCloudQuota message in the Total Call Status Counts section
    • An orange icon and the message "Submitted. Awaiting Cloud Quota" in the status column
      [Screenshot: AwaitingCloudQuota alerts in the workflow Dashboard]
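
If you'd rather confirm this from a script or notebook than from the Job History page, the sketch below tallies the status of every call in the workflow metadata. It's a minimal sketch that assumes the FISS Python client (the firecloud package) and its get_workflow_metadata call; the workspace names and IDs are placeholders you would replace with your own.

```python
# Minimal sketch: count call statuses in a workflow's metadata via FISS.
# Assumes `pip install firecloud` and that you are authenticated to Terra.
from collections import Counter

from firecloud import api as fapi

# Placeholder identifiers: replace with your own (all are visible in Terra).
NAMESPACE = "my-billing-project"   # Terra Billing project (workspace namespace)
WORKSPACE = "my-workspace"         # workspace name
SUBMISSION_ID = "submission-uuid"  # from the Job History tab
WORKFLOW_ID = "workflow-uuid"      # from the Job History tab

resp = fapi.get_workflow_metadata(NAMESPACE, WORKSPACE, SUBMISSION_ID, WORKFLOW_ID)
resp.raise_for_status()
metadata = resp.json()

# Calls held back by quota report "AwaitingCloudQuota" as their backend status.
statuses = Counter(
    attempt.get("backendStatus", attempt.get("executionStatus", "Unknown"))
    for attempts in metadata.get("calls", {}).values()
    for attempt in attempts
)
print(statuses)
```

A large count of AwaitingCloudQuota (or of "Submitted. Awaiting Cloud Quota" in the UI) is a strong signal that a quota, rather than a workflow bug, is the bottleneck.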

Step 2: Identify which resource is at or near quota

To identify which quota could be slowing your workflow submission, go to the GCP console for quota and resource details. (If you prefer to check from a script, see the sketch at the end of this step.)

Before you start! You need Owner permission on the Terra Billing project to view resource quotas in the GCP console. If you do not see the options in the screenshots below, it is most likely because you don't have sufficient permission to view the information in the GCP console.

2.1. Go to https://console.cloud.google.com/iam-admin/quotas?project=project_id, where project_id is the workspace Google project ID.

The workspace Google project ID is in the Cloud Information section of the workspace Dashboard.
[Screenshot: workspace Google project ID under Cloud Information on the Dashboard]

2.2. Identify quotas that are close to the limit, estimate how much you need, and request more if needed.


[Screenshot: GCP console Quotas page for the workspace Google project. The CPU quota in region us-central1 is maxed out (the orange bar near 100%).]

Resource quotas are per region. To run your analysis across multiple regions (e.g., us-east1 and us-central1), you need to request a larger quota in both.
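
If you prefer to check quota usage from a script instead of the console, here is a minimal sketch that lists a region's quotas via the Compute Engine API. It assumes the google-api-python-client package and Application Default Credentials that can read the workspace Google project; the project ID and region values are placeholders.

```python
# Minimal sketch: flag Compute Engine quotas at or above 80% of their limit.
# Assumes `pip install google-api-python-client` and credentials with read
# access to the workspace Google project (gcloud auth application-default login).
from googleapiclient import discovery

PROJECT_ID = "terra-abcd1234"  # placeholder: your workspace Google project ID
REGION = "us-central1"         # Terra's default region

compute = discovery.build("compute", "v1")
region = compute.regions().get(project=PROJECT_ID, region=REGION).execute()

# Each quota entry reports a metric name, its limit, and current usage.
for quota in region.get("quotas", []):
    limit, usage = quota.get("limit", 0), quota.get("usage", 0)
    if limit and usage / limit >= 0.8:
        print(f"{quota['metric']}: {usage:.0f} / {limit:.0f} ({100 * usage / limit:.0f}%)")
```

Anything this prints (CPUS, DISKS_TOTAL_GB, and so on) is a candidate for the quota increase request in Step 4.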

Step 3: Estimate how much resource quota you need

The right amount of quota is a function of the number of workflows Terra is launching, the number of concurrent tasks running within each workflow, and the resources requested by those tasks.

To calculate the quota needed for your workflows, you may need to dig into your WDL to examine what it is doing.

Example resource quota estimation (three-task WDL)

Consider a three-task WDL that will run on one or more samples. You would need to look across all three tasks to determine the maximum amount of CPU and persistent disk (PD) you expect to need at any given time.

  • Task 1: uses 10 CPUs and 10GB of PD
  • Task 2: uses 1 CPU, 1GB of PD and scatters 10-ways wide
  • Task 3: uses 10 CPU, 10GB of PD and scatters 10-ways wide

Resources needed overview

In this example, tasks 1 and 2 use the same total amount of resources (10 CPUs and 10 GB of PD each), because task 2 scatters 10 ways wide. Task 3, however, uses more resources than task 1 or 2 (100 CPUs and 100 GB of PD per sample).

Task 3 requests a total of 100 CPUs and 100 GB of PD for a single sample (because it scatters 10 ways wide). Because it runs on ten samples at once, it will need ten times those resources: 1,000 CPUs and 1 TB of persistent disk. If your current resource quota is set at 24 CPUs and 100 GB of PD and you want this workflow to run as quickly as possible, you will need to request at least 1,000 CPUs and 1 TB of PD.
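
For a quick sanity check, the same estimate can be done as a small back-of-the-envelope calculation. The sketch below simply mirrors the numbers in the example above (it assumes every sample reaches the most expensive task at the same time, which is the worst case you should size quota for).

```python
# Back-of-the-envelope quota estimate for the three-task example above.
# Each entry records what a task requests per shard and how wide it scatters.
tasks = [
    {"name": "task1", "cpus": 10, "disk_gb": 10, "scatter": 1},
    {"name": "task2", "cpus": 1,  "disk_gb": 1,  "scatter": 10},
    {"name": "task3", "cpus": 10, "disk_gb": 10, "scatter": 10},
]
samples = 10  # number of samples run in a single submission

# Peak demand comes from the most expensive task, multiplied by the sample count
# (worst case: all samples are running that task at the same time).
peak_cpus = max(t["cpus"] * t["scatter"] for t in tasks) * samples
peak_disk_gb = max(t["disk_gb"] * t["scatter"] for t in tasks) * samples

print(f"Request at least {peak_cpus} CPUs and {peak_disk_gb} GB of persistent disk")
# Output: "Request at least 1000 CPUs and 1000 GB of persistent disk" (1000 GB = 1 TB)
```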

Step 4: Request a quota increase

You can submit a resource quota increase request here.

Information to include (required)

  1. Your workspace Google Project ID (found on the right of the Dashboard page under Cloud Information).
    [Screenshot: Google Project ID under Cloud Information on the Dashboard page]
  2. Which quota(s) you want to increase (e.g., CPUs, memory, persistent disk)
  3. What you want your new quota to be (see Step 3 above for guidance on how to estimate)
  4. Which regions you want the increase to apply to, if applicable (e.g., us-central1, us-east1)
    Terra uses us-central1 by default in most cases. If you don't specify a region for workflow submissions, this is likely the region you want the quota(s) increased for.
  5. Rationale for increase (research purpose)

We will create a request to Google on your behalf. Depending on the quota and how much of an increase you are requesting, it may take 2-3 business days for Google to process and respond to the request.

2. Other issues that can affect workflow progress

Google Cloud resource quotas are the most likely, but not the only, reason for workflow submissions to progress slowly. Read on for other things that can affect workflow submissions. Note that these are unusual, and mostly affect large submissions and submissions with a lot of parallelism.

A few caveats about slowly progressing workflows 

There is no need to “drip feed” workflows or otherwise manage capacity, with one exception: if you want a newly submitted workflow to execute immediately and you have other workflows already running or waiting to run, you may need to abort (cancel) the older workflows to free up slots.

Most quota limits will make workflows run or progress slowly, not fail. Your work will run with as much parallelism as capacity allows, with no impact on success rate. Workflows beyond capacity simply wait to execute. Terra is designed to always prioritize workflow success regardless of other conditions.

What to do 

Stuck workflows are not always easy to diagnose! If you've checked that your submission is not stuck because of a resource quota, first wait to see if the issue resolves itself (especially if your job is large). If it does not, Terra Support is here to help!

  1. Submit a support request
  2. Include as many details as you can (e.g., the affected submission ID, symptoms, error messages)
  3. Make sure to share the workspace with support 

Capacity limits other than resource quotas (deeper dive)

For a deeper dive into Terra's workflow execution system, see the blog post Smarter workflow launching reduces latency and improves user experience or the documentation on How the workflow system works.

Workflow submission limit (per user)

Each registered account (email) has 3000 workflow slots globally to use across all of Terra. No distinction is made between human users and service accounts. The slot count is a sum of the user’s activity across all Cloud Billing accounts, Terra Billing projects, and Terra workspaces.

Workflows beyond the submission limit may wait to start, and parallelism may be reduced. Note that the submissions will not fail, but will progress slowly as Terra waits for resources to become available. 

Be patient! You can safely submit an unlimited number of workflows at any time. Terra will only start as many workflows as there are available slots.

Platform-wide simultaneous job limit 

A single Google project can only run a fixed number of jobs simultaneously (up to 28,800 jobs per Terra Billing project shared amongst all Billing project users). Submissions above this limit will have to wait for jobs to complete. 

As of spring 2022, Terra detects when a project is at its maximum job capacity and pauses workflow starts until the project backs off from maximum. This reduces the incidence of workflows perceived as “stuck” with non-starting jobs.

Other (region-specific) GCP issues

The region doesn't support the requested virtual machine shape or resource type: a specific CPU type may not be available in a particular region, or there may be a global shortage (of GPUs, for example). In these cases, provisioning anywhere may be difficult. 
