Running a complicated workflow can lead to a high number of large intermediate outputs that increase the storage costs of a project. As an example, a large-scale project recently discovered that as much as 85% of their storage cost was for intermediate files that were never accessed or used. To (potentially) capture significant savings when working in Terra, you can choose to delete intermediate files upon successful completion of the workflow.
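As a rough illustration of what that share can mean in practice, here is a minimal sketch of the arithmetic. It assumes a single flat storage price, so that a file's share of the bytes equals its share of the cost. The function names, the 100,000 GiB total, and the 0.02 price are placeholders for illustration, not figures from Terra or any cloud provider.

```python
def monthly_storage_cost(size_gib: float, price_per_gib_month: float) -> float:
    """Monthly cost of keeping size_gib GiB at a flat price (assumed pricing model)."""
    return size_gib * price_per_gib_month


def savings_from_deleting(total_gib: float,
                          intermediate_fraction: float,
                          price_per_gib_month: float):
    """Return (cost before, cost after, monthly savings) if intermediates are deleted."""
    if not 0.0 <= intermediate_fraction <= 1.0:
        raise ValueError("intermediate_fraction must be between 0 and 1")
    before = monthly_storage_cost(total_gib, price_per_gib_month)
    kept_gib = total_gib * (1.0 - intermediate_fraction)
    after = monthly_storage_cost(kept_gib, price_per_gib_month)
    return before, after, before - after


if __name__ == "__main__":
    # Placeholder numbers: 100,000 GiB stored, 85% intermediates, price 0.02 per GiB-month.
    before, after, saved = savings_from_deleting(100_000, 0.85, 0.02)
    print(f"before: {before:.2f}  after: {after:.2f}  saved: {saved:.2f} per month")
```

Swap in your own bucket size and the current price for your storage class to get a usable estimate.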
Delete Intermediate Files options explained
Intermediate files are kept unless you select the Delete intermediate outputs option in the workflow configuration (see screenshot below).
This is because intermediate files are required to use call caching, which is selected by default.
When to delete intermediate files
- When you're running a well-tested workflow
- When you won't be running the same workflow on the same data for further analysis
When to use call caching
- Before running downstream analysis on the same data
- When you're working in a different workspace and want to reproduce earlier results
- When you are testing or troubleshooting a partially failed workflow, or are otherwise not sure if a workflow will complete. Call caching lets you start the workflow again at the beginning of the task that failed, rather than rerunning the entire workflow from the beginning.
Intermediate files during unsuccessful workflows
If a workflow fails to complete, the intermediate files are not deleted. This lets you use call caching to start the workflow again right before a failed step.
Call-caching and "delete intermediate files" option
A workflow run with the Delete intermediate outputs option enabled can always READ from the call cache, but it will not WRITE its own results to the call cache.
Say, for example, you previously ran workflow X with Delete intermediate outputs enabled and now want to run it again with the same inputs and call caching turned on. The new run won't get cache hits from the earlier run, because the earlier run's intermediate files no longer exist (and so cannot be call cached). When Cromwell deletes the intermediate files, it also invalidates those call-cache entries.
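The read-but-not-write behavior can be pictured with a toy model. This is only a sketch of the rule described above, not Cromwell's implementation: the class `ToyCallCache`, its method, and the string key (a stand-in for "same task, same inputs") are all hypothetical, and the model skips the invalidation step because it never writes the entry in the first place.

```python
class ToyCallCache:
    """Toy model: runs that delete intermediates read the cache but never write to it."""

    def __init__(self):
        self._entries = {}  # cache key -> location of the cached outputs

    def run_task(self, key: str, delete_intermediates: bool) -> str:
        # Every run may READ from the cache, whatever its delete setting.
        if key in self._entries:
            return "cache hit"
        # Cache miss: the task has to execute.
        if not delete_intermediates:
            # Only runs that keep their intermediates WRITE a cache entry.
            self._entries[key] = f"outputs/{key}"
        return "executed"


if __name__ == "__main__":
    cache = ToyCallCache()
    key = "workflowX:task1:sampleA"  # hypothetical key for one task and its inputs
    print(cache.run_task(key, delete_intermediates=True))   # first run
    print(cache.run_task(key, delete_intermediates=False))  # rerun, intermediates kept
    print(cache.run_task(key, delete_intermediates=True))   # third run
```

In this model the second run still executes, because the first run left nothing in the cache; only the third run can reuse cached results.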
Manually deleting intermediate files
Many researchers don’t like to delete intermediate files until after they complete all their research. You can manually delete intermediate files at your discretion with this notebook script in this workspace. Make sure you read the instructions carefully as we cannot recover any data that you accidentally delete using this tool.
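The cleanup notebook describes its rule as deleting workflow output files, minus logs, except outputs bound to the data model. The sketch below shows that selection rule on plain lists of paths so you can reason about what would be kept. It is not the notebook's code: the function names, the log-naming convention (`stdout`, `stderr`, `*.log`), and the `gs://example-bucket/...` paths are assumptions for illustration, and nothing here touches real storage.

```python
from typing import Iterable, List

# Assumed log naming convention; adjust to match your own execution directories.
LOG_NAMES = ("stdout", "stderr")
LOG_SUFFIX = ".log"


def is_log(path: str) -> bool:
    """True if the file name looks like a log under the assumed convention."""
    name = path.rsplit("/", 1)[-1]
    return name in LOG_NAMES or name.endswith(LOG_SUFFIX)


def select_intermediates(execution_paths: Iterable[str],
                         table_bound_paths: Iterable[str]) -> List[str]:
    """Return the paths that are neither logs nor referenced from a data table."""
    bound = set(table_bound_paths)
    return sorted(p for p in execution_paths if p not in bound and not is_log(p))


if __name__ == "__main__":
    base = "gs://example-bucket/sub1/wf1"  # hypothetical execution directory
    paths = [
        f"{base}/call-bam_to_fastq/sample.fastq",
        f"{base}/call-bam_to_fastq/stderr",
        f"{base}/call-align/sample.aligned.bam",
        f"{base}/call-align/align.log",
    ]
    bound = [f"{base}/call-align/sample.aligned.bam"]  # written to a data table
    for path in select_intermediates(paths, bound):
        print("candidate for deletion:", path)
```

Reviewing a candidate list like this before deleting anything is a cheap safeguard, since deleted data cannot be recovered.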
What is the definition of an 'intermediate file'?
I have a workflow that has 2 tasks. The first task converts a bam to a fastq. The URL to the fastq created by the first task is written to my sample table. After my second task completes I no longer need the fastq file. Is it considered an intermediate file?
I am spending a lot of time trying to figure out what I can delete, and when, other than the output of the final task.
What are best practices?
Hello Andrew Davidson. I'm part of the Terra User Ed team responsible for support documentation and saw your comment. I don't know the answer to your specific question, but I submitted a ticket on your behalf to our Frontline support, and you should hear back from them soon. I'll be watching the ticket to see if I can add any additional information to the support article to help other users in the same boat.
Thanks for reaching out with your question!
An intermediate file is essentially what you described, a file generated by a workflow that you never use again once the pipeline has run to completion. If those files are fairly small, it’s a minor nuisance that you can probably just ignore. However, if the files are large, or if there are very many of them, you can end up incurring significant storage costs for no reason.
There are two options for removing intermediate files without having to manually trawl through the execution directories where they are stored. There’s a “proactive” option, in which you check a box in the workflow configuration before you launch it, that tells Terra “go ahead and delete intermediate files when the workflow has run successfully”. If you already ran the workflow without checking that box, there’s a “reactive” option, in which you run some custom functions in a notebook to delete intermediates in bulk after the fact.
For more information on intermediate files as well as a concrete example, you can check out this article:
Please let us know if you have any other questions!
I read the description of 'what is left behind' in https://app.terra.bio/#workspaces/help-terra/Terra-Tools (this is the Jupyter Notebook: Remove_Workflow_Intermediates). It seems like a lot of junk gets left over. I am processing tens of thousands of bam files. Over the course of time, it's just going to clutter up my workspace. I plan to use Terra for many years, and all the junk will eventually cost a lot in storage.
It would be great if there was some sort of summary file that got created and saved. Imagine that in a year or two I need to understand how an output file was created, that is to say, the 'pedigree' of an output file. Having a summary file with the name and version of the wdl file, the URL to that version of the wdl file, and a list of all the input parameters would be very helpful. This is basically the information displayed on the Job Manager input tab.
It would be even better if this was somehow tied back to the sample table row for this sample. That is, having the URL to the BAM input file is useful but very hard to use. If I had the sample table row, I would get additional metadata like the sample id, ...
If something like this ever gets implemented, please choose a format that is easy to parse.
Hi Andrew Davidson, I created a ticket for our support team to consider this as a feature request. If you're not familiar with it, the Feature Requests section of the forum is a great place to post suggestions like this, in part because people can upvote each other's ideas. This helps our product design team gauge which requests are more popular and therefore potentially more impactful to address.
In the Remove_Workflow_Intermediates notebook linked above, it states "What gets deleted? Workflow output files minus logs are deleted except any outputs that are bound to the Data Model."
What is "the Data Model", specifically? Are inputs and outputs bound to the data model via the Tables or are they determined programmatically? Is the intermediate removal function only looking in the Cromwell execution directory or will it also clean up after itself if the inputs and outputs are files from external Google buckets?
I just read Geraldine's blog post which Anika linked.
You must use the Tables if you are removing intermediates afterwards with the Jupyter notebook, and the removal API call only looks in the execution directory.
It was a good article and I think it answered most of my questions.
We've had to clean up temporary files in our workspaces manually, and the provided notebook didn't always find files to "mop". We also wanted a bit more control, for example protecting specific submissions or keeping recent files so that we can still benefit from call caching. That sort of thing.
I tried to summarize this approach in a notebook here in case it might be useful for other people: https://github.com/jmonlong/terra-utils#eraser-notebook
Of course, please read carefully before pushing the erase button. I recommend double-checking what kind of files will be deleted or kept, and potentially inspecting the full list (dry-run mode).
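For readers who want to see the general shape of such a cleanup, here is a minimal dry-run sketch that protects chosen submissions and recently updated files. It is not the code from the eraser notebook linked above: `StoredFile`, `plan_erase`, `dry_run`, and the example paths are hypothetical, and the sketch only prints a plan without deleting anything.

```python
from datetime import datetime, timedelta, timezone
from typing import FrozenSet, Iterable, List, NamedTuple, Tuple


class StoredFile(NamedTuple):
    path: str
    submission_id: str
    updated: datetime


def plan_erase(files: Iterable[StoredFile],
               now: datetime,
               keep_days: int,
               protected_submissions: FrozenSet[str] = frozenset()
               ) -> Tuple[List[str], List[str]]:
    """Split files into (to_delete, to_keep) without touching storage."""
    cutoff = now - timedelta(days=keep_days)
    to_delete: List[str] = []
    to_keep: List[str] = []
    for f in files:
        # Keep files from protected submissions and files updated since the cutoff.
        if f.submission_id in protected_submissions or f.updated >= cutoff:
            to_keep.append(f.path)
        else:
            to_delete.append(f.path)
    return to_delete, to_keep


def dry_run(files, now, keep_days, protected_submissions=frozenset()) -> List[str]:
    """Print the plan and return the paths that would be deleted."""
    to_delete, to_keep = plan_erase(files, now, keep_days, protected_submissions)
    for path in to_delete:
        print("WOULD DELETE", path)
    for path in to_keep:
        print("KEEP        ", path)
    return to_delete


if __name__ == "__main__":
    now = datetime(2024, 1, 31, tzinfo=timezone.utc)
    files = [
        StoredFile("gs://example-bucket/subA/old.fastq", "subA", now - timedelta(days=90)),
        StoredFile("gs://example-bucket/subB/old.fastq", "subB", now - timedelta(days=90)),
        StoredFile("gs://example-bucket/subC/new.fastq", "subC", now - timedelta(days=2)),
    ]
    dry_run(files, now, keep_days=30, protected_submissions=frozenset({"subB"}))
```

Keeping the planning step separate from any deletion step makes it easy to inspect the full list first.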