Saving storage costs by deleting intermediate files

Allie Cliffe

Running a complicated workflow can generate many large intermediate outputs that increase the storage costs of a project. For example, one large-scale project recently discovered that as much as 85% of its storage cost was for intermediate files that were never accessed again. To capture potentially significant savings when working in Terra, you can choose to delete intermediate files upon successful completion of a workflow.

Set up workflows to automatically clean up intermediate files

Follow the instructions below to automatically delete intermediate files when you run a workflow.

Note that this only applies to new workflow submissions. To clean up intermediate files from a previous submission, scroll down to How to manually delete intermediate files from prior workflows below.

Intermediate files are kept by default unless you select the Delete intermediate outputs option in the workflow configuration (see screenshot below).

delete-intermediate-outputs.png

This is because intermediate files are required for call caching, which is enabled by default.

Intermediate files during unsuccessful workflows
If a workflow fails to complete, the intermediate files are not deleted. This lets you use call caching to restart the workflow just before the failed step.

Call-caching versus "delete intermediate files" option

A workflow run with the delete intermediates option enabled can always READ from the call cache, but it will not WRITE its own results to the call cache.

Say, for example, you previously ran workflow X with the delete intermediates option and now want to run it again with the same inputs and call caching turned on. The new run won't use the cached results of the earlier run, because the intermediate files no longer exist (and cannot be call cached): when Cromwell deletes the intermediate files, it also invalidates those call-cache entries.

When to delete intermediate files

  • When you're running a well-tested workflow
  • When you won't be running the same workflow on the same data for further analysis 

When to use call caching

  • Before running downstream analysis on the same data
  • When you're working in a different workspace and want to reproduce earlier results
  • When you are testing or troubleshooting a partially failed workflow, or are otherwise not sure a workflow will complete. Call caching lets you restart the workflow at the task that failed, rather than rerunning the entire workflow from the beginning.

How to manually delete intermediate files from prior workflows

When you may need to manually delete intermediate files

  • If you don’t want to delete intermediate files until after you complete all your analysis.
  • If you want to retroactively delete intermediate files from a workflow you ran previously.

Option 1: Mop up notebook

You can manually delete intermediate files at your discretion with the notebook script in this workspace. Read the instructions carefully, as we cannot recover any data that you accidentally delete using this tool.

Option 2: Manually delete intermediate files using lifecycle rules

Terra on GCP now supports setting lifecycle rules that delete files in a defined location in a workspace storage bucket after a defined time. This lets you clean up intermediate files generated by workflows, or any other files you want to remove, without deleting the entire workspace.

Prerequisites
You must be a workspace owner to set lifecycle rules on a workspace.

Why set up lifecycle rules?
Lifecycle rules save money on storage costs by managing files produced as part of your analysis. This feature is available as a workspace setting and can be managed right in your Terra workspace. For more details, see our roadmap article.

Enabling lifecycle rules caveats

  • Enabling lifecycle rules changes the directory structure for future workflow submissions by separating files into submissions/intermediates/ and submissions/final-outputs/ directories.
  • This lets you set up a lifecycle rule to automatically delete intermediate files in the submissions/intermediates/ directory on a timeline of your choosing.
  • You may also choose to delete files in other directories by entering your own values.

This setting is not retroactive
To clean up old workspaces with historical workflow submissions, you will need to manually create lifecycle rules for the individual submission directories, or clean up your bucket with existing tools like the FISS mop function (instead of lifecycle rules).

Make sure you read the instructions carefully as we cannot recover any data that you accidentally delete using this tool.

Step-by-step instructions for setting lifecycle rules with APIs

You can use the Workspace Settings API to programmatically clean up older workspaces that have many prior workflow files.

Avoid deleting files you want to keep! We recommend you carefully consider what existing submission files might be referenced in your data tables or other files you might need to keep and move them to a new folder prior to adding any lifecycle rules.

1. Go to the workspaces_v2 updateWorkspaceSettings API on the Rawls Swagger page.

2. Find the folders you want to delete in your workspace Bucket.

How to find the workspace Bucket directory IDs to delete

    • Workspaces with the lifecycle setting enabled separate the directories into submissions/intermediates and submissions/final-outputs.
    • Workspaces created before October 2022 don’t have a submissions directory. Each submission is in its own folder, named by submission ID, at the top level of the Google bucket (top two folders in the screenshot below).
    • Workspaces created since then have a submission directory, with each submission in its own subfolder (three subfolders under "submissions" in the screenshot from Google Cloud Console below).
      Screenshot-of-Google-Bucket-directory-with-old-submissions-in-separate-folders-and-new-submissions-in-the-submission-folder.png
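If your bucket contains many folders, the two layouts above can be sorted apart in code. The sketch below is a hypothetical helper (not part of Terra; the folder listing itself would come from `gsutil ls` or the Google Cloud Storage client, which are not shown here) that separates top-level UUID-named submission folders from new-style submissions/ subfolders:

```python
import re

# Old-style submission folders are named by a UUID, e.g.
# 1317e3ff-3091-477e-8162-e156a5c0e5f6/
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/$"
)

def classify_prefixes(prefixes):
    """Split top-level bucket folder prefixes into old-style
    (UUID-named) and new-style (under submissions/) directories."""
    old_style = [p for p in prefixes if UUID_RE.match(p)]
    new_style = [p for p in prefixes if p.startswith("submissions/")]
    return old_style, new_style
```

Anything the helper leaves out (for example a notebooks/ folder) is not a submission directory and should not be targeted by the delete rule.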

3. Enter your workspace namespace (billing project), workspace name, and your desired lifecycle rule in the request body.

Example request body

[
  {
    "settingType": "GcpBucketLifecycle",
    "config": {
      "rules": [
        {
          "action": {
            "actionType": "Delete"
          },
          "conditions": {
            "age": 1,
            "matchesPrefix": [
              "1317e3ff-3091-477e-8162-e156a5c0e5f6/",
              "62d8cbf7-008b-4c14-9424-c5773014caa0/",
              "8661f9bd-5a26-471e-9e97-91bd3d57744b/",
              "submissions/13319fb2-b648-4147-9b44-111b64dd5c94/",
              "submissions/1920254d-f846-480a-af4b-2b215ce43b2b/",
              "submissions/1ea9adf5-699e-405b-bb86-d24f4fd4a8cd/"
            ]
          }
        }
      ]
    }
  }
]
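If you are assembling this request body for many submission folders, it can help to generate it in code. The sketch below is a hypothetical Python helper (not part of Terra or the Rawls API) that builds the settings payload, normalizes trailing slashes, and enforces the 50-prefix limit:

```python
def build_lifecycle_setting(prefixes, age_days=1):
    """Build an updateWorkspaceSettings request body containing one
    delete lifecycle rule over the given folder prefixes."""
    # A trailing "/" makes matchesPrefix match the folder exactly,
    # rather than any object path that merely starts with the text.
    normalized = [p if p.endswith("/") else p + "/" for p in prefixes]
    if len(normalized) > 50:
        raise ValueError("at most 50 matchesPrefix conditions per workspace")
    return [{
        "settingType": "GcpBucketLifecycle",
        "config": {"rules": [{
            "action": {"actionType": "Delete"},
            "conditions": {"age": age_days, "matchesPrefix": normalized},
        }]},
    }]
```

You would serialize the returned list with `json.dumps` and paste (or POST) it as the request body in the Swagger UI.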

What to expect

This rule will delete any file older than 1 day in the specified folders.

Important Considerations

  • The matchesPrefix condition is a naive string match. If you provide "submissions" as your prefix, it will match any object path that starts with the text "submissions" (including, say, a folder named submissions-archive). To match the folder name exactly, include the trailing / character, for example "submissions/".
  • A maximum of 50 matchesPrefix conditions are allowed per workspace. If you are cleaning up a lot of individual submission folders, you can rotate the rules once the folders have been deleted.
  • Lifecycle rules can take up to 24 hours to take effect. Even if you omit the age condition, or set age to 0 so files are eligible for deletion immediately, it may take up to 24 hours before the lifecycle rule actually deletes the files.
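The naive matching behavior can be illustrated with a tiny hypothetical helper that mimics the matchesPrefix comparison as a plain string-prefix check (for illustration only; the real evaluation happens inside Cloud Storage):

```python
def lifecycle_prefix_matches(object_name, prefixes):
    """Mimic GCS matchesPrefix: a plain string-prefix comparison,
    with no notion of folder boundaries."""
    return any(object_name.startswith(p) for p in prefixes)
```

Note that "submissions" (no slash) matches objects in a hypothetical submissions-archive/ folder as well, while "submissions/" matches only objects inside submissions/ itself.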

How to remove lifecycle rules from a workspace

1. Go to the workspaces_v2 updateWorkspaceSettings API on the Rawls Swagger page.

2. Enter your workspace namespace (billing project), workspace name, and an empty list for rules in the request body.

Example request body

[
  {
    "settingType": "GcpBucketLifecycle",
    "config": {
      "rules": []
    }
  }
]

What to expect

This will remove all lifecycle rules from the workspace.


Comments

8 comments

  • Andrew Davidson

    What is the definition of an 'intermediate file'?

    I have a workflow that has 2 tasks. The first task converts a bam to a fastq. The URL to the fastq created by the first task is written to my sample table. After my second task completes, I no longer need the fastq file. Is it considered an intermediate file?

    I am spending a lot of time trying to figure out what I can delete, and when, other than the output of the final task.

    What are best practices?

    Kind regards
    Andy

  • Allie Hajian

    Hello Andrew Davidson. I'm part of the Terra User Ed team responsible for support documentation and saw your comment. I don't know the answer to your specific question, but I submitted a ticket on your behalf to our Frontline support, and you should hear back from them soon. I'll be watching the ticket to see if I can add any additional information to the support article to help other users in the same boat. 

  • Anika Das

    Hi Andy, 

    Thanks for reaching out with your question! 

    An intermediate file is essentially what you described, a file generated by a workflow that you never use again once the pipeline has run to completion. If those files are fairly small, it’s a minor nuisance that you can probably just ignore. However, if the files are large, or if there are very many of them, you can end up incurring significant storage costs for no reason. 

    There are two options for removing intermediate files without having to manually trawl through the execution directories where they are stored. There’s a “proactive” option, in which you check a box in the workflow configuration before you launch it, that tells Terra “go ahead and delete intermediate files when the workflow has run successfully”. If you already ran the workflow without checking that box, there’s a “reactive” option, in which you run some custom functions in a notebook to delete intermediates in bulk after the fact. 

    For more information on intermediate files as well as a concrete example, you can check out this article:

    https://terra.bio/deleting-intermediate-workflow-outputs/

    Please let us know if you have any other questions!

    Kind Regards, 
    Anika

  • Andrew Davidson

    Hi Anika,

    I read the description of 'what is left behind' in https://app.terra.bio/#workspaces/help-terra/Terra-Tools (this is the Jupyter notebook Remove_Workflow_Intermediates). Seems like a lot of junk gets left over. I am processing 10s of thousands of bam files. Over the course of time, it's just going to clutter up my workspace. I plan to use Terra for many years, and all the junk will eventually cost a lot in storage.

    It would be great if there was some sort of summary file that got created and saved. Imagine in a year or two I need to understand how an output file was created, that is to say the 'pedigree' of an output file. Having a summary file with the name and version of the wdl file, the URL to that version of the wdl file, and a list of all the input parameters would be very helpful. This is basically the information displayed on the job manager input tab.

    It would be even better if this was somehow tied back to the sample table row for this sample. I.e., having the URL to the BAM input file is useful but very hard to use. If I had the sample table row, I would get additional metadata like the sample id, ...

    If something like this ever gets implemented, please choose a format that is easy to parse

    Kind regards
    Andy

  • Geraldine Van der Auwera

    Hi Andrew Davidson, I created a ticket for our support team to consider this as a feature request. If you're not familiar with it, the Feature Requests section of the forum is a great place to post suggestions like this, in part because people can upvote each other's ideas. This helps our product design team gauge which requests are more popular and therefore potentially more impactful to address. 

  • Mark Godek

    In the Remove_Workflow_Intermediates notebook linked above, it states "What gets deleted? Workflow output files minus logs are deleted except any outputs that are bound to the Data Model."

    What is "the Data Model", specifically? Are inputs and outputs bound to the data model via the Tables or are they determined programmatically? Is the intermediate removal function only looking in the Cromwell execution directory or will it also clean up after itself if the inputs and outputs are files from external Google buckets?

  • Mark Godek

    I just read Geraldine's blog post which Anika linked.

    https://terra.bio/deleting-intermediate-workflow-outputs/

    You must use the Tables if you are removing intermediates afterwards using the Jupyter notebook and the removal API call is only looking in the execution directory.

    It was a good article and I think it answered most of my questions.

  • Jean Monlong

    We've had to do some cleaning up of temporary files in our workspaces manually, and the provided notebook didn't always find files to "mop". We also wanted a bit more control, e.g. protecting certain submissions, or keeping recent files so we could still benefit from call caching. That sort of thing.

    I tried to summarize this approach in a notebook here in case it might be useful for other people: https://github.com/jmonlong/terra-utils#eraser-notebook

    Of course, please read carefully before pushing the erase button. I recommend double-checking what kind of files will be deleted/kept, and potentially inspecting the full list (dry-run mode).

