The full power of working in the cloud is the ability to scale. Many researchers take advantage of Terra and Google Cloud to submit large numbers of large, resource-intensive workflows. Whether you're new to working in the cloud or have some experience, here are tips to help you scale your analysis successfully on the Terra platform.
Resources and quotas
To scale effectively, make sure you have the resources you need to run your workflows. If your workflow submissions progress very slowly or stall for long periods of time, your Billing project may have hit a Google resource quota. Jobs may stall as they wait for resources to free up again. Working with limited resources can greatly increase your submission runtime. And while you are not billed for time spent waiting for available resources, you may be interested in seeing your work progress more quickly.
You can get around this issue by filing a quota increase request for any quotas you hit. CPUs, In-use IP Addresses, Persistent Disks (PDs), and Local SSDs are among the most commonly hit resource quotas.
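To see why a single quota can throttle a whole submission, it can help to estimate how many tasks fit at once. The sketch below is a back-of-the-envelope calculation only: the quota values, per-task requests, and the function name `max_concurrent_tasks` are hypothetical and are not read from Google Cloud or Terra.

```python
import math

def max_concurrent_tasks(quota: dict[str, float], per_task: dict[str, float]) -> tuple[int, str]:
    """Return how many tasks fit at once and which resource limits them."""
    best = None
    limiting = ""
    for resource, need in per_task.items():
        if need <= 0:
            continue  # a task that requests none of this resource is never limited by it
        fit = int(quota[resource] // need)
        if best is None or fit < best:
            best, limiting = fit, resource
    if best is None:
        raise ValueError("per_task must request at least one resource")
    return best, limiting

# Hypothetical quota and per-task requests, for illustration only.
quota = {"cpus": 24, "in_use_ips": 8, "persistent_disk_gb": 4096}
per_task = {"cpus": 4, "in_use_ips": 1, "persistent_disk_gb": 100}

concurrent, limiting = max_concurrent_tasks(quota, per_task)
waves = math.ceil(500 / concurrent)  # rounds of execution needed for 500 tasks

print(f"{concurrent} tasks at a time, limited by {limiting}; about {waves} waves for 500 tasks")
```

With these made-up numbers the CPU quota is the bottleneck, so it is the one to raise first; the other quotas would only start to matter once the CPU limit is higher.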
For more details about resource quotas and instructions on how to request an increase, see CPUs and persistent disk quotas: What are they and how do you request more?
Make the most of call caching
Time and money are two of the most valuable resources when it comes to scaling. Call caching saves on both fronts by letting you reuse the results from previous successful runs, so you don’t have to rerun the same tasks from scratch every time. As long as a task's inputs are the same as those of a previous successful run, Cromwell (our workflow engine) can automatically use those results again.
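The idea behind call caching can be illustrated with a small conceptual sketch. This is not Cromwell's actual implementation: the `CallCache` class, the `call_key` function, and the choice to hash only the task name and inputs are simplifying assumptions made for illustration.

```python
import hashlib
import json

def call_key(task_name: str, inputs: dict) -> str:
    """Build a stable fingerprint from the task name and its inputs."""
    payload = json.dumps({"task": task_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class CallCache:
    """Toy cache: reuse the result of a previous successful call with identical inputs."""

    def __init__(self):
        self._results = {}
        self.hits = 0
        self.misses = 0

    def run(self, task_name, inputs, execute):
        key = call_key(task_name, inputs)
        if key in self._results:
            self.hits += 1
            return self._results[key]
        self.misses += 1
        result = execute(inputs)      # if this raises, nothing is cached
        self._results[key] = result   # only successful results are stored
        return result

def count_lines(inputs):
    return {"n": len(inputs["text"].splitlines())}

cache = CallCache()
first = cache.run("count_lines", {"text": "a\nb\nc"}, count_lines)   # computed
second = cache.run("count_lines", {"text": "a\nb\nc"}, count_lines)  # reused
third = cache.run("count_lines", {"text": "a\nb"}, count_lines)      # new inputs, computed

print(cache.hits, cache.misses)
```

The second call returns the stored result without running the task again, while the third call has different inputs and therefore runs from scratch.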
For more information on call caching, see Call caching: How it works and when to use it.
Craft your WDL for scale
To reduce the likelihood that Terra hangs while retrieving your results, write your WDL so that it keeps the overall number of calls low.
One way to reduce the number of calls is to avoid nested scatters, which lead to a lot of duplicated metadata. When scatters are nested, Cromwell (our workflow execution engine) prints out the entire array of metadata as inputs for every index in the scatter. This generates huge, unwieldy amounts of metadata, which can result in long wait times to return job results. It may even prevent Terra from serving the metadata at all if the number of metadata rows exceeds the current platform threshold.
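A rough model shows how quickly this duplication grows. The sketch below assumes, as a simplification of the behavior described above, that every shard of a nested scatter records the full inner array among its inputs, while a single flattened scatter records only one item per shard. The function names and sizes are hypothetical, and the counts are illustrative rather than measurements of Cromwell's metadata.

```python
def nested_scatter_metadata_values(outer: int, inner: int) -> int:
    """outer * inner shards, each assumed to record the whole inner array (inner values)."""
    return outer * inner * inner

def flat_scatter_metadata_values(outer: int, inner: int) -> int:
    """One scatter over outer * inner precomputed items, each shard recording only its own item."""
    return outer * inner

outer, inner = 100, 200  # hypothetical scatter widths
nested = nested_scatter_metadata_values(outer, inner)
flat = flat_scatter_metadata_values(outer, inner)

print(f"nested: {nested:,} recorded values")
print(f"flat:   {flat:,} recorded values")
print(f"ratio:  {nested // flat}x")
```

Under these assumptions the nested layout records 200 times as many values as the flattened one, and the gap widens as the inner scatter gets wider.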
The Cromwell team constantly looks for ways to improve reliability in handling large submissions, so a metadata row limit that applies today could be higher, or gone entirely, tomorrow! If you do run into a situation where your metadata can't be served up in Terra Job Manager, please reach out to support@terra.bio for assistance.
Another way to configure your workflow to run efficiently is to use smaller images when possible. Containers from images with unnecessary dependencies take longer to start up, so it's best to make sure that the image used for each task call only contains what you need to run that task.
For more tips on how to craft your WDL for scale, see the WARP pipelines' best practices.
Tools to help keep an eye on costs
As your workflows scale larger and larger, so will the associated costs. To avoid runaway costs, we recommend taking advantage of built-in Terra and Google Cloud functionality (see below).
Delete Intermediate Outputs
Terra allows you to automatically delete the output files generated by intermediate tasks at the end of your workflow run, keeping only the final workflow outputs. This is a great way to save on storage costs, especially if you know you won’t need those intermediate task files in the future.
Read more in Saving storage costs by deleting intermediate files.
Call Caching
Call caching allows Terra's execution engine (aka Cromwell) to detect when a job was run in the past so that it doesn't have to recompute results. The call-caching feature in Terra can save you time and money when you repeat all or parts of a workflow analysis.
Read more in Call caching: How it works and when to use it.
You cannot use both Delete intermediate outputs and call caching
If you plan on rerunning any workflows with call caching enabled, you will need to make sure Delete intermediate outputs is disabled for your initial run(s). Deleting the intermediate files means those tasks can no longer be used for call caching, so you will need to generate the task results from scratch on your next run.
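The interaction between the two options can be pictured with a small sketch. It assumes a simplified rule: a task can be reused on a later run only if all of its output files still exist. The task names, file names, and the `reusable_tasks` function are hypothetical and do not reflect how Terra tracks files internally.

```python
def reusable_tasks(task_outputs: dict[str, set[str]],
                   final_outputs: set[str],
                   delete_intermediate: bool) -> set[str]:
    """Return the tasks whose outputs all survive the run and so could be reused later."""
    if delete_intermediate:
        surviving = set(final_outputs)                 # only final workflow outputs are kept
    else:
        surviving = set().union(*task_outputs.values())  # every task output is kept
    return {task for task, outs in task_outputs.items() if outs <= surviving}

# Hypothetical three-step workflow.
task_outputs = {
    "align": {"aligned.bam"},
    "sort": {"sorted.bam"},
    "call_variants": {"calls.vcf"},
}
final_outputs = {"calls.vcf"}

kept_all = reusable_tasks(task_outputs, final_outputs, delete_intermediate=False)
kept_final_only = reusable_tasks(task_outputs, final_outputs, delete_intermediate=True)

print(sorted(kept_all))
print(sorted(kept_final_only))
```

In this toy example, deleting intermediate outputs leaves only the last task reusable, so a rerun would have to repeat the alignment and sorting steps.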
Google Cloud Budget Alerts
Google Cloud gives you the option of setting budgets for your projects. Setting a budget triggers email alerts when spending reaches certain thresholds, so you can keep tabs on how large a bill you are running up as you work in the cloud.
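The threshold logic behind such alerts is simple arithmetic, sketched below. The threshold fractions (50%, 90%, 100%), the dollar amounts, and the `newly_crossed` function are example assumptions; this is not the Google Cloud API, and your own budget may use different thresholds.

```python
def newly_crossed(budget: float, previous_spend: float, current_spend: float,
                  thresholds=(0.5, 0.9, 1.0)) -> list[float]:
    """Return the threshold fractions crossed between two spend readings."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    before = previous_spend / budget
    after = current_spend / budget
    return [t for t in sorted(thresholds) if before < t <= after]

budget = 1000.0  # hypothetical monthly budget in dollars

alerts_a = newly_crossed(budget, previous_spend=450.0, current_spend=920.0)
alerts_b = newly_crossed(budget, previous_spend=600.0, current_spend=920.0)
alerts_c = newly_crossed(budget, previous_spend=920.0, current_spend=950.0)

print(alerts_a)  # crossed 50% and 90% since the last reading
print(alerts_b)  # only 90% is new
print(alerts_c)  # nothing new between 92% and 95%
```

A large submission can jump past several thresholds between readings, as in the first case, which is one reason to set thresholds well below the amount you are actually willing to spend.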
For more details, see How to set up and use Google Cloud budget alerts.