Terra is a cloud-native platform for biomedical researchers to access data, run analysis tools, and collaborate.
Terra powers important scientific projects like FireCloud, AnVIL, and BioData Catalyst.

Saving storage costs by deleting intermediate files

Comments

5 comments

  • Andrew Davidson

    What is the definition of an 'intermediate file'?

    I have a workflow with two tasks. The first task converts a BAM to a FASTQ. The URL of the FASTQ created by the first task is written to my sample table. After my second task completes, I no longer need the FASTQ file. Is it considered an intermediate file?

    I am spending a lot of time trying to figure out what I can delete, and when, so that only the output of the final task remains.

    What are best practices?

    Kind regards

    Andy

  • Allie Hajian

    Hello Andrew Davidson. I'm part of the Terra User Ed team responsible for support documentation and saw your comment. I don't know the answer to your specific question, but I submitted a ticket on your behalf to our Frontline support, and you should hear back from them soon. I'll be watching the ticket to see if I can add any additional information to the support article to help other users in the same boat. 

  • Anika Das

    Hi Andy,

    Thanks for reaching out with your question!

    An intermediate file is essentially what you described: a file generated by a workflow that you never use again once the pipeline has run to completion. If those files are fairly small, they are a minor nuisance that you can probably just ignore. However, if the files are large, or if there are very many of them, you can end up incurring significant storage costs for no reason.

    There are two options for removing intermediate files without having to manually trawl through the execution directories where they are stored. The “proactive” option is to check a box in the workflow configuration before you launch it, which tells Terra “go ahead and delete intermediate files when the workflow has run successfully”. If you already ran the workflow without checking that box, the “reactive” option is to run some custom functions in a notebook to delete intermediates in bulk after the fact.

    For more information on intermediate files as well as a concrete example, you can check out this article:

    https://terra.bio/deleting-intermediate-workflow-outputs/

    Please let us know if you have any other questions!

    Kind Regards,

    Anika

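To make the definition of an intermediate file in the reply above concrete, here is a minimal Python sketch that treats "intermediate" as "every file a workflow wrote, minus the files you still want". It is only an illustration and does not describe how Terra's checkbox or the cleanup notebook decide what to delete. The function name `find_intermediates`, the `keep_suffixes` parameter, the bucket name, and all file paths are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def find_intermediates(
    all_files: Iterable[str],
    final_outputs: Set[str],
    keep_suffixes: Tuple[str, ...] = (".log",),
) -> List[str]:
    """Return files that are neither final outputs nor protected by suffix.

    Assumption: the caller already has a listing of everything the workflow
    wrote and a set of the paths that must be kept.
    """
    return sorted(
        path
        for path in all_files
        if path not in final_outputs and not path.endswith(tuple(keep_suffixes))
    )


# Hypothetical listing of one submission's execution directory.
all_files = [
    "gs://example-bucket/submission-1/bam_to_fastq/sample1.fastq",
    "gs://example-bucket/submission-1/bam_to_fastq/stderr.log",
    "gs://example-bucket/submission-1/align/sample1.aligned.bam",
]

# Hypothetical set of files that should survive the cleanup.
final_outputs = {"gs://example-bucket/submission-1/align/sample1.aligned.bam"}

candidates = find_intermediates(all_files, final_outputs)
for path in candidates:
    print("candidate for deletion:", path)
```

In this sketch, a file counts as intermediate only if it is absent from `final_outputs`. A FASTQ whose URL is recorded in the sample table would have to be deliberately left out of that set before it would be flagged.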
  • Andrew Davidson

    Hi Anika,

    I read the description of 'what is left behind' in the Remove_Workflow_Intermediates Jupyter notebook at https://app.terra.bio/#workspaces/help-terra/Terra-Tools. It seems like a lot of junk gets left over. I am processing tens of thousands of BAM files. Over time, it is just going to clutter up my workspace. I plan to use Terra for many years, and all the junk will eventually cost a lot in storage.

    It would be great if some sort of summary file were created and saved. Imagine that in a year or two I need to understand how an output file was created, that is to say, the 'pedigree' of an output file. A summary file with the name and version of the WDL file, the URL to that version of the WDL file, and a list of all the input parameters would be very helpful. This is basically the information displayed on the Job Manager input tab.

    It would be even better if this were somehow tied back to the sample table row for this sample. Having the URL to the BAM input file is useful but very hard to use; if I had the sample table row, I would get additional metadata like the sample id, ...

    If something like this ever gets implemented, please choose a format that is easy to parse.

    Kind regards

    Andy

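As an illustration of the kind of summary file requested above, here is a minimal Python sketch that writes the "pedigree" of an output as JSON, one format that is easy to parse. Nothing here is an existing Terra feature: the function `build_provenance`, every field name, the workflow name and version, the WDL URL, and the sample values are hypothetical.

```python
import json
from typing import Any, Dict


def build_provenance(
    workflow_name: str,
    workflow_version: str,
    wdl_url: str,
    inputs: Dict[str, Any],
    sample_id: str,
) -> Dict[str, Any]:
    """Collect the facts needed to trace how an output file was produced."""
    return {
        "workflow_name": workflow_name,
        "workflow_version": workflow_version,
        "wdl_url": wdl_url,
        "inputs": inputs,
        # Ties the record back to a row in the sample table.
        "sample_id": sample_id,
    }


# Hypothetical values for a single workflow run.
record = build_provenance(
    workflow_name="bam_to_fastq",
    workflow_version="1.0.0",
    wdl_url="https://example.org/workflows/bam_to_fastq/1.0.0/bam_to_fastq.wdl",
    inputs={"bam_to_fastq.input_bam": "gs://example-bucket/bams/sample1.bam"},
    sample_id="sample1",
)

# Serialize with sorted keys so records are stable and easy to compare.
text = json.dumps(record, indent=2, sort_keys=True)
print(text)

# Reading the record back needs only the standard library.
parsed = json.loads(text)
```

Storing the sample id alongside the inputs would let a later lookup start from the sample table row rather than from a bare BAM URL.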
  • Geraldine Van der Auwera

    Hi Andrew Davidson, I created a ticket for our support team to consider this as a feature request. If you're not familiar with it, the Feature Requests section of the forum is a great place to post suggestions like this, in part because people can upvote each other's ideas. This helps our product design team gauge which requests are more popular and therefore potentially more impactful to address. 
