Analysis running slow? Check your Google Cloud resource quota

Jason Cerrato

Is your analysis running slow? Have you gotten an error message that includes the word "quota" when trying to run a large analysis (workflow or interactive)? You may have exceeded your quota for a particular kind of resource. Read on to understand how GCP resource quotas can affect your work, and how to ask for more when you need it.

Overview: Why do quotas matter in Terra?

In the Google Cloud Platform (GCP), quotas limit how much of a particular GCP cloud resource you can use. Quotas prevent unforeseen spikes in usage, making sure resources are available to the community at all times. 

Resource quotas (VM compute and storage limits) impact your analyses (including speed)

These limit how many resources, such as CPUs, GPUs, and persistent disks (PDs), can be used by a single Google project at any given time. Resource quotas affect your ability to spin up a large VM to run a workflow or interactive analysis. They can also impact the speed of your analysis, since tasks will pause or slow as you run up against a compute or disk quota.

In Terra, these limits apply per workspace (for workspaces created after September 27, 2021) or per Terra billing project (for workspaces created before September 27, 2021).

Resource quota examples

All workflows (or "methods") and Cloud Environments that run in Terra are affected by GCP compute and disk quotas. Google enforces default resource quotas for Terra billing projects (for workspaces created before September 27, 2021) and for workspaces (created after September 27, 2021), based on a user's GCP billing reputation. The quotas you are most likely to run into are:

  • CPUs: how many CPUs you can use at once across all tasks
  • GPUs: how many GPUs you can use at once across all tasks
  • Preemptible CPUs: the pool of CPUs that would only be used by preemptible instances. You can learn more about this quota here and about preemptible instances here.
  • Persistent disk standard(GB): how much total disk (non-SSD) you can have attached at once to your task VMs
  • Persistent disk SSD(GB): how much total SSD disk you can have attached at once to your task VMs
  • Local SSD(GB): how much SSD is attached directly to the server running the task VMs. You can learn more in Google's documentation. This quota only applies if you are using local SSD in your task.

What happens when you reach a resource quota?

If you bump up against your resource quota, Terra will not be able to secure the CPUs, GPUs or PD requested, and your workflow or Cloud Environment analysis will run very slowly or not at all. 

Symptoms that you're bumping up against a resource quota

Quota limits are not always easy to diagnose! Below are some behaviors and error messages that indicate you may need a quota increase. 

Symptom/error message | What's happening | Action to resolve
The server was not able to produce a timely response to your request. Error message "Please try again in a short while!" | Resource quota exceeded | Ask support to request a resource quota increase
Workflow tasks running very slow, especially if they ran fine in the past | Resource quota exceeded | Ask support to request a resource quota increase
Multiple instances of "worker assigned"/"worker released" cycles in the timing diagram | Resource quota exceeded | Ask support to request a resource quota increase
Workflow fails to launch | Workflow requested more resources than allowed | Ask support to request a resource quota increase

Resource quota symptoms (analysis stalls or runs slowly)

You may experience one of the following after launching a workflow analysis if your quota does not cover the resources requested (i.e. VM compute or disk capacity):

  • Tasks within your workflow will run slowly while they wait on quota availability.
    For example, if you requested 1,000 tasks with eight CPUs each, and your quotas allow 24 CPUs at once, you can only run three tasks at a time. Each subsequent task is queued.
  • A task in your workflow may fail when it requests more resources than your quota allows.
    For example, if you requested 60 CPUs in your task and your quota is capped at 24 CPUs at once, your workflow may fail to launch.
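The arithmetic behind these two examples can be sketched in a few lines of Python. This is only an illustration, not anything Terra runs: the function names are hypothetical, and the "rounds" estimate assumes every task takes about the same time and needs the same number of CPUs.

```python
import math


def max_concurrent_tasks(cpu_quota: int, cpus_per_task: int) -> int:
    """How many tasks fit under the CPU quota at the same time."""
    if cpus_per_task <= 0:
        raise ValueError("cpus_per_task must be positive")
    return cpu_quota // cpus_per_task


def describe_run(total_tasks: int, cpus_per_task: int, cpu_quota: int) -> str:
    """Summarize how a batch of identical tasks behaves under a CPU quota."""
    slots = max_concurrent_tasks(cpu_quota, cpus_per_task)
    if slots == 0:
        # A single task already asks for more than the quota allows.
        return "cannot start: one task needs more CPUs than the quota allows"
    if slots >= total_tasks:
        return "all tasks can run at once"
    # Hypothetical simplification: tasks finish in lockstep "rounds".
    rounds = math.ceil(total_tasks / slots)
    return f"{slots} tasks at a time, about {rounds} rounds"


print(describe_run(1000, 8, 24))  # 3 tasks at a time, about 334 rounds
print(describe_run(1, 60, 24))    # cannot start: one task needs more CPUs than the quota allows
```

The first call matches the queued case (1,000 tasks of eight CPUs under a 24-CPU quota); the second matches the case where a single 60-CPU task can never be scheduled.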

When to request more resource quota

Please note that unless you are seeing errors, you do not need to update quotas; your analysis will simply run more slowly. If your analysis runs more slowly than you expect, or if you see errors/messages related to quota in your logs, you may want to request an increase.

Note that when your GCP project reaches a quota limit, Terra continues to create jobs in the cloud, but the physical VM cannot yet start. Terra detects this condition in the backend and reports AwaitingCloudQuota in the Job History Dashboard. The VM will start automatically when quota becomes available. 

How to check your resource quota

If your analysis is running slow or failing (see the list of symptoms above), you can check the resource quotas of that workspace in the GCP console.

Step 1: Check Dashboard

When a user's GCP project reaches a quota limit, Cromwell continues to submit jobs and Life Sciences acknowledges them as created even if the physical VM cannot yet start. Cromwell now detects this condition in the backend and reports AwaitingCloudQuota in the Job History Dashboard. 

1.1. Go to the Job History tab of your workspace.

1.2. While the slow workflow is still running, click on its Dashboard icon.

1.3. If you are running up against a resource cloud quota, you will see two notifications:

    • An AwaitingCloudQuota message in the Total call Status Counts section
    • An orange icon in the Status column and the message "Submitted. Awaiting Cloud Quota"


The status is informational and does not require any action. To maximize throughput, you can use AwaitingCloudQuota as an indication that you should check your quota in Cloud Console (step 2, below) and request a quota increase from GCP.

Step 2: Check GCP console

Once you know you're running up against a resource quota limit, you can check GCP console for the details and submit a quota increase request.

You need to have Owner permission for the Terra billing project in order to view the resource quotas on the GCP console.

2.1. Go to https://console.cloud.google.com/iam-admin/quotas?project=project_id where project_id is the workspace Google project ID.

The workspace Google project ID is shown on the workspace Dashboard page, in the Cloud Information section.
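If you check quotas for several workspaces, building the link from step 2.1 by hand gets tedious. A minimal Python sketch follows; the helper name `quota_console_url` and the project ID `my-terra-project-1234` are hypothetical placeholders, and only the URL pattern comes from step 2.1.

```python
from urllib.parse import urlencode

# Base of the quotas page given in step 2.1.
QUOTAS_PAGE = "https://console.cloud.google.com/iam-admin/quotas"


def quota_console_url(project_id: str) -> str:
    """Build the GCP console quotas link for one workspace Google project."""
    project_id = project_id.strip()
    if not project_id:
        raise ValueError("project_id must not be empty")
    return f"{QUOTAS_PAGE}?{urlencode({'project': project_id})}"


# "my-terra-project-1234" is a made-up ID; use the one from your Dashboard.
print(quota_console_url("my-terra-project-1234"))
```

Remember that the link only shows quotas if you have the Owner permission described above.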

2.2. Identify quotas that are close to the limit and request more (see steps below) if needed.

The GCP console shows a long list of quotas for the workspace. A quota is maxed out when its usage bar is at or near 100%; for example, an orange bar near 100% next to the CPU quota in region "us-central1" means that quota is exhausted.

Note that quotas are defined per region. To run your analysis across multiple regions (e.g. us-east1 and us-central1), you need to request a larger quota in both.
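If you would rather scan the numbers than the bars, the same check can be sketched in Python. The input shape here (a list of records with `metric`, `limit`, and `usage`) is an assumption modeled on the per-region quota list that `gcloud compute regions describe REGION --format=json` is expected to return under `quotas`; the sample values are invented for illustration, and the 90% threshold is an arbitrary choice.

```python
def quotas_near_limit(quotas, threshold=0.9):
    """Return (metric, fraction_used) pairs at or above the threshold,
    most heavily used first."""
    flagged = []
    for q in quotas:
        limit = q["limit"]
        if limit <= 0:
            continue  # skip quotas with no usable limit
        fraction = q["usage"] / limit
        if fraction >= threshold:
            flagged.append((q["metric"], fraction))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Invented sample data for one region (not real project values).
sample_us_central1 = [
    {"metric": "CPUS", "limit": 24.0, "usage": 24.0},
    {"metric": "DISKS_TOTAL_GB", "limit": 4096.0, "usage": 500.0},
    {"metric": "PREEMPTIBLE_CPUS", "limit": 100.0, "usage": 95.0},
]

for metric, fraction in quotas_near_limit(sample_us_central1):
    print(f"{metric}: {fraction:.0%} used")
```

Because quotas are per region, you would run a check like this once for each region your analysis uses.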

Scroll down to see how to request more resource quota

How much resource quota will I need?

The right amount of quota is a function of the number of workflows being launched, the number of concurrent tasks running within each workflow, and the resources being requested by those tasks.

To calculate the quota needed for the workflows, you need to do a bit of diving into your WDL to examine what it is doing.

Example resource quota estimation (three task WDL)

Consider a three-task WDL that will run on one to many samples. You need to look across all three tasks to determine the maximum amount of CPU and PD you expect to need at any given time.

  • Task 1: uses 10 CPUs and 10GB of PD
  • Task 2: uses 1 CPU, 1GB of PD and scatters 10-ways wide
  • Task 3: uses 10 CPUs, 10GB of PD and scatters 10-ways wide

In this example, tasks 1 and 2 use the same total amount of resources (10 CPUs and 10GB of PD), because task 2 scatters 10-ways wide. Task 3, however, uses more resources than task 1 or 2.

For a single sample, task 3 requests a total of 100 CPUs and 100GB of PD (10 CPUs and 10GB of PD per shard, scattered 10-ways wide). If you run ten samples at once, it will need ten times those resources at the same time: 1000 CPUs and 1TB of persistent disk. If the current resource quota is set at 24 CPUs and 100GB of PD and you want this workflow to run as quickly as possible, you will need to request at least 1000 CPUs and 1TB of PD.
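The same estimate can be written as a short Python sketch. It assumes, as the example does, that the tasks of one sample run one after another and that all samples reach the most demanding task at the same time; the `Task` class and `peak_quota` function are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    cpus: int               # CPUs requested by one shard of the task
    disk_gb: int            # persistent disk (GB) requested by one shard
    scatter_width: int = 1  # number of shards running in parallel


def peak_quota(tasks, samples):
    """Return (peak CPUs, peak PD in GB) needed to run every sample at once.

    Assumes tasks within a sample run sequentially, so the most demanding
    task sets the peak, multiplied by the number of concurrent samples.
    """
    peak_cpus = max(t.cpus * t.scatter_width for t in tasks) * samples
    peak_disk_gb = max(t.disk_gb * t.scatter_width for t in tasks) * samples
    return peak_cpus, peak_disk_gb


workflow = [
    Task("task1", cpus=10, disk_gb=10),
    Task("task2", cpus=1, disk_gb=1, scatter_width=10),
    Task("task3", cpus=10, disk_gb=10, scatter_width=10),
]

print(peak_quota(workflow, samples=1))   # (100, 100)
print(peak_quota(workflow, samples=10))  # (1000, 1000), i.e. 1000 CPUs and about 1TB of PD
```

If your tasks overlap in time within a sample, the true peak is higher than this estimate, so treat the result as a lower bound.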

How to request a resource (compute/disk) quota increase

You can make a resource quota increase request by sending us a message through the Contact Us module in the Terra UI, or by emailing support@terra.bio.

Please be sure to provide the following information:

  • Your Google Project ID, found on the right of the Dashboard page under Cloud Information.
  • Which quota(s) you want to increase (i.e. CPUs, PDs, etc.)
  • What you want your new quota(s) to be (see this section above for guidance on how to estimate)
  • Which regions you want the increase applied to, if applicable (e.g. us-central1, us-east1, etc.)
    • Terra uses us-central1 by default in most cases. If you don't specify a region for workflow submissions, this is likely the region you want the quota(s) increased for.
  • Rationale for increase (research purpose)

We will create a request to Google on your behalf. Depending on the quota and how much of an increase you are requesting, it may take 2-3 business days for Google to process and respond to the request.
