When you run workflows in Terra, each task in the workflow uses cloud resources such as CPU, memory, and disk. Learn how to monitor workflow costs and optimize how a workflow uses resources by tracking the resources used in each task.
Why use resource monitoring
- Improve reliability: prevent resource bottlenecks that lead to crashes.
- Optimize cost: avoid over-allocating resources.
How to monitor workflow resources
1. In a Terra on GCP workspace, go to the Workflows tab and click a workflow card (see screenshot below) to open the Submission configuration page.
You’ll see the Resource monitoring option below Step 2 (above the input value table - highlighted in the screenshot below).
2. Fill in one of the fields with the option you want to use. You can use a script, an image, or an image + script to monitor resources.
Two options to consider: 1. use a script or 2. use a URL to an image
For Option 2, you can also provide an image script, similar to passing a script when creating a Jupyter notebook. The image script runs inside the Docker image, so you have access to the image right before the monitoring tool executes. This isn't normally needed, but it can be useful in situations such as the following:
1. You want to change an environment variable (for example, the Google project) in the Docker image before it starts monitoring the workflow.
2. The monitoring tool in the Docker image runs with default settings, such as recording resource usage every second, and you prefer a longer interval. In that case, you can pass a parameter in the script to record usage every 10 seconds instead.
What does the script do?
These user-provided tools surface Google-specific information for workflows running tasks on the Google API backend (see Google Pipelines API Workflow Options for more details).
- monitoring_script specifies a GCS URL to a script that will be invoked prior to the user command being run. For example, if the value for monitoring_script is gs://bucket/script.sh, it will be invoked as ./script.sh > monitoring.log &. The monitoring.log file will be automatically de-localized. Note that if you select “Delete intermediate outputs”, the monitoring logs (like other log files generated by Cromwell during execution) will not be deleted.
- monitoring_image specifies a Docker image to monitor the task. This image will run concurrently with the task container and provides an alternative mechanism to monitoring_script (the latter runs inside the task container). For example, one can use us.gcr.io/broad-dsp-lrma/cromwell-task-monitor-bq:bs-project_override, which reports CPU, memory, and disk utilization metrics to a BigQuery table under the same Google project as the workspace. However, not all users of the workspace have permission to create a BigQuery table in that Google project. Therefore, in many cases an image script (next bullet point) is used in addition to the monitoring image to grant the permissions needed to create the BigQuery table.
- monitoring_image_script specifies a GCS URL to a script that will be invoked on the container running the monitoring_image. This script will be invoked instead of the ENTRYPOINT defined in the monitoring_image. Unlike the monitoring_script, no files are automatically de-localized.
Example: Monitoring script
See this example script developed by the GATK-SV team.
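As a rough illustration of what a monitoring script can look like, the sketch below prints one tab-separated line of load, memory, and disk usage per interval to standard output, which is what ends up in monitoring.log when the script is invoked as ./script.sh > monitoring.log &. It is not the GATK-SV script. The function names, the column layout, and the choice of / as the disk to watch are all assumptions, and it relies on Linux /proc files being available in the task container.

```shell
#!/bin/bash
# Illustrative monitoring script sketch; all names and columns are hypothetical.

# Print one tab-separated sample: UTC timestamp, 1-minute load average,
# memory in use (MB), and percent of disk used at the given mount point.
sample_usage() {
  local mount="${1:-/}"
  local ts load mem_used disk_pct
  ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  # /proc is Linux-specific; report NA when it is not available.
  if [ -r /proc/loadavg ]; then
    load="$(cut -d' ' -f1 /proc/loadavg)"
  else
    load="NA"
  fi
  if [ -r /proc/meminfo ]; then
    mem_used="$(awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2} END {printf "%d", (t - a) / 1024}' /proc/meminfo)"
  else
    mem_used="NA"
  fi
  # df -P keeps each filesystem on one line; column 5 is the used percentage.
  disk_pct="$(df -P "$mount" | awk 'NR == 2 {gsub("%", "", $5); print $5}')"
  printf '%s\t%s\t%s\t%s\n' "$ts" "$load" "$mem_used" "$disk_pct"
}

# Print a header, then sample every INTERVAL seconds.
# With no COUNT, the loop runs until the process is stopped.
monitor_loop() {
  local interval="$1" count="${2:-}" n=0
  printf 'timestamp\tload_1min\tmem_used_mb\tdisk_used_pct\n'
  while [ -z "$count" ] || [ "$n" -lt "$count" ]; do
    sample_usage /
    n=$((n + 1))
    sleep "$interval"
  done
}

# Bounded demo: two samples, one second apart.
monitor_loop 1 2
```

The last line is a bounded demo. In an actual monitoring script, calling monitor_loop 10 with no count would keep sampling every 10 seconds until the task container exits.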
For additional downstream analysis you can use Step one to summarize the log files from the workflow into one file, and Step two to generate some tables and plots to find places to reduce resource allocation.
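The summarizing step can be pictured with a small sketch like the one below, which reads a tab-separated monitoring log and prints the peak value in each column. The log layout (a header row followed by timestamp, load, memory, and disk columns), the function name peak_usage, and the sample values are hypothetical; logs produced by the GATK-SV script or other monitoring scripts may be formatted differently.

```shell
#!/bin/bash
# Illustrative log summarizer sketch; the log format and names are hypothetical.

# peak_usage FILE: print the maximum of every column after the first
# (the first column is assumed to be a timestamp).
peak_usage() {
  awk -F'\t' '
    NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; cols = NF; next }
    { for (i = 2; i <= NF; i++) if (NR == 2 || $i + 0 > max[i]) max[i] = $i + 0 }
    END { for (i = 2; i <= cols; i++) printf "%s\tpeak=%s\n", name[i], max[i] }
  ' "$1"
}

# Made-up sample log, for demonstration only.
log="$(mktemp)"
printf 'timestamp\tload_1min\tmem_used_mb\tdisk_used_pct\n' > "$log"
printf '2024-01-01T00:00:00Z\t0.50\t1200\t40\n' >> "$log"
printf '2024-01-01T00:00:10Z\t1.75\t3100\t42\n' >> "$log"
printf '2024-01-01T00:00:20Z\t0.90\t2800\t45\n' >> "$log"

peak_usage "$log"
rm -f "$log"
```

Comparing each peak against what the task requested (for example, peak memory against the task's memory runtime attribute) points to tasks where the allocation could be lowered.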
Example: Monitoring image + image script
For an example of a combination of “monitoring image” and “monitoring image script” see this featured workspace.
What it does
The image feeds resource usage data for each task directly to a BigQuery database.
The image is used with an image script that grants permission to the user to create a BigQuery data table in the workspace Google project.
Step-by-step instructions
All the instructions are included in the featured workspace dashboard. You can then use this data to get insights into resource usage. The featured workspace also contains code to analyze the collected monitoring data and cost.
The monitoring image used in this featured workspace is https://github.com/broadinstitute/cromwell-task-monitor-bq.
Troubleshooting
If you have questions, reach out in the #dsp-workflows Slack channel (Broadies), submit a Terra Support ticket (Main menu > Support > Contact Us from within Terra), or post in the Community forum.