Monitoring GCP cloud resources used in a workflow

Allie Cliffe

When running workflows in Terra, each task in the workflow uses cloud resources such as CPU, memory, and disk. Learn how to monitor workflow costs and optimize how a workflow uses resources by tracking the resources used in each task.

Why use resource monitoring

  • Improve reliability
    Prevent resource bottlenecks that lead to crashes
  • Optimize cost
    Avoid over-allocating resources

How to monitor workflow resources

1. In a Terra on GCP workspace, go to the Workflows tab and click a workflow card (see screenshot below) to open the submission configuration page.

[Screenshot of the Workflows page highlighting the Ex-1a_calculateGPA workflow card]

You’ll see the Resource monitoring option below Step 2 (above the input value table, highlighted in the screenshot below).

[Screenshot of the workflow configuration page in Terra on GCP with the Resource monitoring option checked, highlighting the three fields for how to monitor]

2. Fill in one of the fields with the option you want to use. You can use a script, an image, or an image + script to monitor resources. 

Two options to consider: 1. use a script, or 2. use a URL to an image.
For option 2, you can additionally provide an image script, similar to passing a script when creating a Jupyter notebook. The image script runs inside the Docker image, so you have access to the image right before the monitoring tool executes. This isn't normally needed, but it can be useful in certain situations, for example:

1. To change an environment variable (such as the Google project) in the Docker image before it starts monitoring the workflow.

2. To override the monitoring tool's default settings. For example, if the tool records resource usage every second but you prefer longer intervals, the script can pass a parameter to record every 10 seconds instead.
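The two cases above can be sketched as a single hypothetical image script. Everything here is an assumption for illustration: the GOOGLE_PROJECT variable name, the /monitor path, and the --interval flag all depend on the particular monitoring image you use.

```shell
#!/bin/bash
# Hypothetical monitoring_image_script. It runs INSTEAD of the image's
# ENTRYPOINT, so after adjusting settings it must launch the monitoring
# tool itself. All names below are assumptions for illustration.
set -euo pipefail

# Use case 1: point the monitoring tool at a different Google project.
export GOOGLE_PROJECT="my-terra-billing-project"

# Use case 2: sample every 10 seconds instead of a 1-second default.
exec /monitor --interval 10s
```

Because the image script replaces the ENTRYPOINT, forgetting the final exec line would mean the monitoring tool never starts.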

What does the script do?

These user-provided tools surface Google-specific information for workflows running tasks on the Google API backend (see Google Pipelines API Workflow Options for more details).

  • Monitoring_script specifies a GCS URL to a script that will be invoked prior to the user command being run.

    For example, if the value for monitoring_script is gs://bucket/script.sh, it will be invoked as ./script.sh > monitoring.log &. The monitoring.log file will be automatically de-localized.

    Note that if you select “Delete intermediate outputs”, the monitoring logs (similar to other log files generated by Cromwell during execution) will not be deleted.

  • Monitoring_image specifies a Docker image to monitor the task.

    This image will run concurrently with the task container and provides an alternative mechanism to monitoring_script (the latter runs inside the task container).

    For example, one can use us.gcr.io/broad-dsp-lrma/cromwell-task-monitor-bq:bs-project_override, which reports CPU/memory/disk utilization metrics to a BigQuery table under the same Google project as the workspace.

    However, not all users of the workspace have permission to create a BigQuery table in the Google project. Therefore, in many cases an image script (next bullet point) is used in addition to the monitoring image to grant the permissions needed to create the BigQuery table.

  • Monitoring_image_script specifies a GCS URL to a script that will be invoked on the container running the monitoring_image.

    This script will be invoked instead of the ENTRYPOINT defined in the monitoring_image. Unlike the monitoring_script, no files are automatically de-localized.
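To give a concrete sense of the monitoring_script mechanism, here is a minimal hypothetical sketch (not the GATK-SV script referenced below). It samples memory and disk usage at a fixed interval; in Terra, Cromwell would supply the `> monitoring.log` redirect itself, but it is written out here so the sketch is self-contained.

```shell
#!/bin/bash
# Hypothetical monitoring script. Cromwell invokes it as:
#   ./script.sh > monitoring.log &
# The redirect is written explicitly below to keep the sketch runnable.

SAMPLES="${MONITOR_SAMPLES:-3}"    # a real script would loop until the task exits
INTERVAL="${MONITOR_INTERVAL:-1}"  # seconds between samples

{
  for _ in $(seq "$SAMPLES"); do
    ts=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
    # Memory in MiB from /proc/meminfo (Linux-only)
    mem_total=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)
    mem_avail=$(awk '/MemAvailable/ {print int($2/1024)}' /proc/meminfo)
    # Percent of the working directory's disk in use
    disk_pct=$(df . | awk 'NR==2 {print $5}')
    echo "${ts} mem_used_mib=$((mem_total - mem_avail)) disk_used=${disk_pct}"
    sleep "$INTERVAL"
  done
} > monitoring.log
```

Because the script writes plain text to stdout, the de-localized monitoring.log can be parsed with standard tools afterward.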

Example: Monitoring script

See this example script developed by the GATK-SV team.

For additional downstream analysis, you can use step one of the script to summarize the log files from the workflow into one file, and step two to generate tables and plots that reveal where resource allocation can be reduced.
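The summarize-then-analyze pattern can be sketched in a few lines of shell. This is a hypothetical illustration, not the GATK-SV script: the per-task directory layout and the `mem_used_mib=` log format are assumptions about what your monitoring script emits.

```shell
#!/bin/bash
set -euo pipefail

# Create two example per-task logs (stand-ins for de-localized monitoring.log
# files; the format is an assumed "key=value" layout for illustration).
mkdir -p task1 task2
printf '2024-01-01T00:00:00Z mem_used_mib=512\n2024-01-01T00:00:10Z mem_used_mib=900\n' > task1/monitoring.log
printf '2024-01-01T00:00:00Z mem_used_mib=300\n' > task2/monitoring.log

# Step one: combine the per-task logs into a single file.
cat task1/monitoring.log task2/monitoring.log > all_monitoring.log

# Step two: report peak memory across all tasks, to compare against the
# memory requested in the workflow configuration.
awk -F'mem_used_mib=' 'NF>1 && $2+0 > peak {peak=$2+0} END {print "peak_mem_used_mib=" peak}' all_monitoring.log
# prints: peak_mem_used_mib=900
```

If the peak is far below what the task requested, that is a candidate for reducing the memory allocation.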

Example: Monitoring image + image script

For an example of a combination of “monitoring image” and “monitoring image script” see this featured workspace.

What it does

The image feeds resource usage data for each task directly to a BigQuery database.
The image is paired with an image script that grants the user permission to create a BigQuery data table in the workspace Google project.
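Once metrics are flowing into BigQuery, you can inspect them from a terminal. The sketch below uses the standard bq CLI; the project, dataset, table, and column names are assumptions for illustration, and the actual schema depends on how the monitoring image writes its tables.

```shell
# Hypothetical query: peak memory per task call. Replace my-project,
# monitoring_dataset, metrics, and the column names with the ones your
# monitoring image actually writes.
bq query --project_id=my-project --use_legacy_sql=false '
  SELECT task_call_name, MAX(mem_used_gb) AS peak_mem_gb
  FROM `my-project.monitoring_dataset.metrics`
  GROUP BY task_call_name
  ORDER BY peak_mem_gb DESC'
```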

Step-by-step instructions

All the instructions are included in the featured workspace dashboard. You can then use this data to get insights into resource usage. The featured workspace also contains code to analyze the collected monitoring data and cost.

The monitoring image used in this featured workspace is built from the code at https://github.com/broadinstitute/cromwell-task-monitor-bq.

Troubleshooting

If you have questions, please reach out in the #dsp-workflows Slack channel (Broadies), submit a Terra Support ticket (Main menu > Support > Contact Us from within Terra), or post in the Community forum.
