Running a complicated workflow can produce many large intermediate outputs that increase a project's storage costs. As an example, a large-scale project recently discovered that as much as 85% of its storage cost was for intermediate files that were never accessed or used. To (potentially) capture significant savings when working in Terra, you can choose to delete intermediate files upon successful completion of the workflow.
Set up workflows to automatically clean up intermediate files
Follow the instructions below to automatically delete intermediate files when you run a workflow.
Note that this only applies to new workflow submissions. To clean up intermediate files from a previous submission, scroll down to How to manually delete intermediate files from prior workflows.
Intermediate files are kept by default unless you select the Delete intermediate outputs option in the workflow configuration (see screenshot below).
This is because intermediate files are required to use call caching, which is selected by default.
Intermediate files during unsuccessful workflows
If a workflow fails to complete, the intermediate files are not deleted. This lets you use call caching to start the workflow again right before the failed step.
Call-caching versus "delete intermediate files" option
A workflow run with the delete intermediates option enabled can always READ from the call cache, but it will not WRITE its own results to the call cache.
Say, for example, you previously ran workflow X with delete intermediates enabled and now want to run it again with the same inputs and call caching turned on. The new run won't reuse the earlier run's results, because the intermediate files no longer exist (and so cannot be call cached). When Cromwell deletes the intermediate files, it also invalidates the corresponding call-cache entries.
When to delete intermediate files
- When you're running a well-tested workflow
- When you won't be running the same workflow on the same data for further analysis
When to use call caching
- Before running downstream analysis on the same data
- When you're working in a different workspace and want to reproduce earlier results
- When you are testing or troubleshooting a partially failed workflow, or are otherwise not sure if a workflow will complete. Call caching lets you start the workflow again at the beginning of the task that failed, rather than rerunning the entire workflow from the beginning.
How to manually delete intermediate files from prior workflows
When you may need to manually delete intermediate files
- If you don’t want to delete intermediate files until after you complete all your analysis.
- If you want to retroactively delete intermediate files from a workflow you ran previously.
Option 1: Mop up notebook
You can manually delete intermediate files at your discretion with this notebook script in this workspace. Make sure you read the instructions carefully as we cannot recover any data that you accidentally delete using this tool.
Option 2: Manually delete intermediate files using lifecycle rules
Terra on GCP now supports the ability to set lifecycle rules to delete files in a defined location in a workspace storage Bucket after a defined time. This lets you clean up intermediate files generated from workflows or any other files you may want to delete without having to delete the entire workspace.
Prerequisites
You must be a workspace owner to set lifecycle rules on a workspace.
Why set up lifecycle rules?
Lifecycle rules will save money on storage costs by managing files produced as a part of your analysis. This feature is available as a workspace setting and can be managed right in your Terra workspace. For more details, see our roadmap article.
Enabling lifecycle rules caveats
- Enabling lifecycle rules changes the directory structure for future workflow submissions by separating files into submissions/intermediates/ and submissions/final-outputs/ directories.
- This lets you set up a lifecycle rule to automatically delete intermediate files in the submissions/intermediates/ directory on a timeline of your choosing.
- You may also choose to delete files in other directories by entering your own values.
This setting is not retroactive
To clean up old workspaces with historical workflow submissions, you will need to manually create lifecycle rules for the individual submission directories or clean up your bucket with existing tools like the FISS mop function (instead of lifecycle rules).
Make sure you read the instructions carefully as we cannot recover any data that you accidentally delete using this tool.
Step-by-step instructions for setting lifecycle rules with APIs
You can use the Workspace Settings API to programmatically clean up older workspaces that have many prior workflow files.
Avoid deleting files you want to keep! We recommend you carefully consider what existing submission files might be referenced in your data tables or other files you might need to keep and move them to a new folder prior to adding any lifecycle rules.
1. Go to the workspaces_v2 updateWorkspaceSettings API on the Rawls Swagger page.
2. Find the folders you want to delete in your workspace Bucket.
How to find the workspace Bucket directory IDs to delete
- Workspaces with the lifecycle setting enabled separate the directories into submissions/intermediates and submissions/final-outputs.
- Workspaces created before October 2022 don’t have a submission directory. Each submission is in a folder named by submission ID in the Google Bucket (top two folders in the screenshot below).
- Workspaces created since then have a submission directory, with each submission in its own subfolder (three subfolders under "submissions" in the screenshot from Google Cloud Console below).
3. Enter your workspace namespace (billing project) and workspace name, then add your desired lifecycle rule to the request body.
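If a workspace has many submissions, collecting the folder prefixes for step 2 by hand can be tedious. The following is a minimal sketch of one way to derive them from a list of object paths. It assumes submission IDs are UUIDs (as in the example prefixes in this article) and that you have already obtained the object paths some other way; `submission_prefixes` and the sample paths are hypothetical names introduced for illustration, not part of any Terra tool.

```python
import re

# Assumption: submission IDs look like the UUIDs in the example request body.
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)


def submission_prefixes(object_names):
    """Return sorted folder prefixes (each ending in '/') for submission folders.

    Handles the two layouts described above:
      <submission-id>/...               (older workspaces)
      submissions/<submission-id>/...   (newer workspaces)
    Any other path is ignored.
    """
    prefixes = set()
    for name in object_names:
        parts = name.split("/")
        if len(parts) > 1 and UUID_RE.match(parts[0]):
            prefixes.add(parts[0] + "/")
        elif (
            len(parts) > 2
            and parts[0] == "submissions"
            and UUID_RE.match(parts[1])
        ):
            prefixes.add("submissions/" + parts[1] + "/")
    return sorted(prefixes)


# Hypothetical object paths, hard-coded for the example.
example_objects = [
    "1317e3ff-3091-477e-8162-e156a5c0e5f6/wf/call-a/out.txt",
    "submissions/13319fb2-b648-4147-9b44-111b64dd5c94/wf/call-b/out.bam",
    "submissions/intermediates/x/y.txt",
    "notebooks/analysis.ipynb",
]
print(submission_prefixes(example_objects))
```

The sketch deliberately skips anything that is not a submission folder (notebooks, the submissions/intermediates directory), so the resulting list should still be reviewed before it goes into a lifecycle rule.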
Example request body
[
  {
    "settingType": "GcpBucketLifecycle",
    "config": {
      "rules": [
        {
          "action": {
            "actionType": "Delete"
          },
          "conditions": {
            "age": 1,
            "matchesPrefix": [
              "1317e3ff-3091-477e-8162-e156a5c0e5f6/",
              "62d8cbf7-008b-4c14-9424-c5773014caa0/",
              "8661f9bd-5a26-471e-9e97-91bd3d57744b/",
              "submissions/13319fb2-b648-4147-9b44-111b64dd5c94/",
              "submissions/1920254d-f846-480a-af4b-2b215ce43b2b/",
              "submissions/1ea9adf5-699e-405b-bb86-d24f4fd4a8cd/"
            ]
          }
        }
      ]
    }
  }
]
What to expect
This rule will delete any file older than 1 day in the specified folders.
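If you prefer to generate the request body rather than edit JSON by hand, a sketch along these lines may help. It mirrors the structure of the example request body above; the function name `lifecycle_delete_setting` is hypothetical, and the code only builds and prints the JSON (it does not call any API).

```python
import json


def lifecycle_delete_setting(prefixes, age_days):
    """Build a request body shaped like the example above.

    prefixes: folder prefixes for the matchesPrefix condition.
    age_days: minimum object age, in days, before deletion.
    """
    return [
        {
            "settingType": "GcpBucketLifecycle",
            "config": {
                "rules": [
                    {
                        "action": {"actionType": "Delete"},
                        "conditions": {
                            "age": age_days,
                            "matchesPrefix": list(prefixes),
                        },
                    }
                ]
            },
        }
    ]


body = lifecycle_delete_setting(
    ["submissions/1ea9adf5-699e-405b-bb86-d24f4fd4a8cd/"], age_days=1
)
# Paste the printed JSON into the request body field on the Swagger page.
print(json.dumps(body, indent=2))
```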
Important Considerations
- The matchesPrefix condition is a naive match. If you provide “submissions” as your prefix, it will delete any folder that starts with the text “submissions”. If you want it to match the folder name exactly, you should include a / character, for example “submissions/”.
- A maximum of 50 matchesPrefix conditions are allowed per workspace. If you are cleaning up a lot of individual submission folders, you can rotate the rules once the folders have been deleted.
- Lifecycle rules can take up to 24 hours to take effect. This means that even if you omit the age condition, or set age to 0 to delete files immediately, it may take up to 24 hours before the lifecycle rule actually deletes the files.
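The first two considerations can be illustrated with a short sketch. It models the naive prefix match as a plain string "starts with" check (an approximation for illustration only), and splits a long prefix list into batches of at most 50 so that rules can be rotated. The names `normalize_prefix`, `rotation_batches`, and `naive_match` are hypothetical helpers, and the folder names are made up.

```python
MAX_PREFIXES = 50  # limit stated in the considerations above


def normalize_prefix(prefix):
    """Append a trailing '/' so the prefix matches one folder name exactly."""
    return prefix if prefix.endswith("/") else prefix + "/"


def rotation_batches(prefixes, batch_size=MAX_PREFIXES):
    """Deduplicate and normalize prefixes, then split them into batches.

    Each batch fits in one round of lifecycle rules; apply the next batch
    once the folders from the previous one have been deleted.
    """
    unique = sorted({normalize_prefix(p) for p in prefixes})
    return [unique[i:i + batch_size] for i in range(0, len(unique), batch_size)]


def naive_match(object_name, prefix):
    """Approximate the naive prefix match as a string starts-with check."""
    return object_name.startswith(prefix)


# Without the trailing slash, an unrelated folder is matched too.
print(naive_match("submissions-archive/keep.txt", "submissions"))   # True
print(naive_match("submissions-archive/keep.txt", "submissions/"))  # False

batches = rotation_batches(f"submissions/run-{i:03d}" for i in range(120))
print([len(b) for b in batches])  # [50, 50, 20]
```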
How to remove lifecycle rules from a workspace
1. Go to the workspaces_v2 updateWorkspaceSettings API on the Rawls Swagger page.
2. Enter your workspace namespace (billing project) and workspace name, and pass an empty list for rules in the request body.
Example request body
[
  {
    "settingType": "GcpBucketLifecycle",
    "config": {
      "rules": []
    }
  }
]
What to expect
This will remove all lifecycle rules from the workspace.