Saving storage costs by deleting intermediate files

Robert Majovski

When you run a workflow, intermediate steps often produce outputs that matter far less than the final results. Complicated workflows can generate many large intermediate files, which increases a project's storage costs. For example, one large-scale project recently discovered that as much as 85% of its storage cost went to intermediate files that no one ever accessed or used. Terra now offers the option to delete intermediate files upon successful completion of a workflow, enabling significant savings.

The Delete Intermediate Files option explained

Intermediate files are kept unless the "Delete intermediate outputs" option in the workflow configuration (see screenshot below) is selected.  

[Screenshot: the "Delete intermediate outputs" option in the workflow configuration]

Intermediate files during unsuccessful workflows

If a workflow fails to complete, the intermediate files are not deleted. This allows you to use call caching to restart the workflow and pick up right before the failed step.

Call caching and the "Delete intermediate outputs" option

A workflow run with the delete intermediates option enabled can always READ from the call cache, but it will not WRITE its own results to the call cache.

Say, for example, you previously ran workflow X with delete intermediates enabled and now want to run it again with the same inputs and call caching turned on. The new run will not reuse results from the earlier run, because the intermediate files no longer exist (and so cannot be call cached). When Cromwell deletes the intermediate files, it also invalidates the corresponding call cache entries.
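The read-but-not-write behavior described above can be illustrated with a toy model. This is a minimal sketch in Python, not Cromwell's implementation: the ToyCallCache class, the run_task function, the task name, and the output path format are all hypothetical and exist only to show the logic.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCallCache:
    """Toy stand-in for a call cache: maps (task, inputs) to an output path."""
    entries: dict = field(default_factory=dict)

    def lookup(self, task, inputs):
        return self.entries.get((task, inputs))

    def store(self, task, inputs, output):
        self.entries[(task, inputs)] = output


def run_task(cache, task, inputs, delete_intermediates):
    """Return "cache hit" or "executed" for one task call in this toy model."""
    # Reading from the cache is allowed whether or not the option is enabled.
    if cache.lookup(task, inputs) is not None:
        return "cache hit"
    # Cache miss: the task runs. Only runs that keep their intermediates
    # write an entry, because a deleted file cannot be reused later.
    if not delete_intermediates:
        cache.store(task, inputs, f"outputs/{task}/{inputs}.out")
    return "executed"


cache = ToyCallCache()
# Hypothetical task and input names, for illustration only.
first = run_task(cache, "bam_to_fastq", "sample1", delete_intermediates=True)
second = run_task(cache, "bam_to_fastq", "sample1", delete_intermediates=True)
third = run_task(cache, "bam_to_fastq", "sample1", delete_intermediates=False)
fourth = run_task(cache, "bam_to_fastq", "sample1", delete_intermediates=True)
print(first, second, third, fourth)
```

In this model, the first two runs both execute the task because neither writes to the cache. Only after the third run, which keeps its intermediates, does a later run get a cache hit.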

Manually deleting intermediate files

Many researchers prefer not to delete intermediate files until they have completed all their research. You can manually delete intermediate files at your discretion with the Remove_Workflow_Intermediates notebook we've made available in the Terra-Tools workspace. Make sure you read the instructions carefully, as we cannot recover any data that you accidentally delete using this tool.
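Before deleting anything by hand, it can help to do a dry run that only lists what would be removed. The sketch below is not the notebook's code. It assumes you already have a list of file paths from a submission's execution directory and a list of the final outputs you want to keep. The bucket name, directory layout, and the set of log file names to preserve are all hypothetical.

```python
def select_intermediates(all_paths, final_outputs,
                         keep_names=("stdout", "stderr", "rc", "script")):
    """Return paths that are neither final outputs nor small log files.

    This only builds a list of deletion candidates; it deletes nothing.
    """
    keep = set(final_outputs)
    candidates = []
    for path in all_paths:
        name = path.rsplit("/", 1)[-1]  # file name without its directory
        if path in keep or name in keep_names:
            continue
        candidates.append(path)
    return candidates


# Hypothetical paths, for illustration only.
base = "gs://example-bucket/submission-1/workflow-1"
all_paths = [
    f"{base}/call-bam_to_fastq/sample1.fastq",
    f"{base}/call-bam_to_fastq/stderr",
    f"{base}/call-align/sample1.aligned.bam",
]
final_outputs = [f"{base}/call-align/sample1.aligned.bam"]

to_delete = select_intermediates(all_paths, final_outputs)
for path in to_delete:
    print("would delete:", path)
```

Review the printed list against your data table before removing any files, since deleted data cannot be recovered.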

 

 

 


Comments

5 comments

  • Andrew Davidson

    What is the definition of an 'intermediate file'?

    I have a workflow that has 2 tasks. The first task converts a bam to a fastq. The URL to the fastq created by the first task is written to my sample table. After my second task completes I no longer need the fastq file. Is it considered an intermediate file?

    I am spending a lot of time trying to figure out what I can delete, and when, other than the output of the final task.

    What are best practices?

     

    Kind regards

    Andy

  • Allie Hajian

    Hello Andrew Davidson. I'm part of the Terra User Ed team responsible for support documentation and saw your comment. I don't know the answer to your specific question, but I submitted a ticket on your behalf to our Frontline support, and you should hear back from them soon. I'll be watching the ticket to see if I can add any additional information to the support article to help other users in the same boat. 

  • Anika Das

    Hi Andy, 

     

    Thanks for reaching out with your question! 

     

    An intermediate file is essentially what you described, a file generated by a workflow that you never use again once the pipeline has run to completion. If those files are fairly small, it’s a minor nuisance that you can probably just ignore. However, if the files are large, or if there are very many of them, you can end up incurring significant storage costs for no reason. 

     

    There are two options for removing intermediate files without having to manually trawl through the execution directories where they are stored. There’s a “proactive” option, in which you check a box in the workflow configuration before you launch it, that tells Terra “go ahead and delete intermediate files when the workflow has run successfully”. If you already ran the workflow without checking that box, there’s a “reactive” option, in which you run some custom functions in a notebook to delete intermediates in bulk after the fact. 

     

    For more information on intermediate files as well as a concrete example, you can check out this article:

    https://terra.bio/deleting-intermediate-workflow-outputs/

     

    Please let us know if you have any other questions!

     

    Kind Regards, 

    Anika

     

  • Andrew Davidson

    Hi Anika

    I read the description of 'what is left behind' in https://app.terra.bio/#workspaces/help-terra/Terra-Tools (this is the Jupyter Notebook: Remove_Workflow_Intermediates). It seems like a lot of junk gets left over. I am processing tens of thousands of bam files. Over the course of time, it is just going to clutter up my workspace. I plan to use Terra for many years, and all the junk will eventually cost a lot in storage.

    It would be great if there was some sort of summary file that got created and saved. Imagine in a year or two I need to understand how an output file was created, that is to say, the 'pedigree' of an output file. Having a summary file with the name and version of the wdl file, the URL to that version of the wdl file, and a list of all the input parameters would be very helpful. This is basically the information displayed on the job manager input tab.

    It would be even better if this was somehow tied back to the sample table row for this sample. Having the URL to the BAM input file is useful but very hard to use. If I had the sample table row, I would get additional metadata like the sample id, ... 

     

    If something like this ever gets implemented, please choose a format that is easy to parse

    Kind regards

    Andy

     

     

     

  • Geraldine Van der Auwera

    Hi Andrew Davidson, I created a ticket for our support team to consider this as a feature request. If you're not familiar with it, the Feature Requests section of the forum is a great place to post suggestions like this, in part because people can upvote each other's ideas. This helps our product design team gauge which requests are more popular and therefore potentially more impactful to address. 

