Efficient use of a large cohort VCF from an external bucket in a workflow
I am trying to run some code on different samples present in a very large cohort VCF that is stored in an external bucket (with specific permission requirements to access it, since it is sensitive data). I wanted to do this using a customized workflow. However, from my understanding of how Terra works, every time I launch the workflow for a different sample, Terra will spin up a VM and localize the large VCF file into it before running the workflow. Because this VCF is huge (around 1.1 TB), localizing the data takes forever and forces me to provision a VM with a very large disk, which is costly.
I have the impression this is not an efficient way to do it, and I was wondering what is recommended in this case. My code does not need the entire VCF, only the variants present in a specific patient, so it would only need to fetch the relevant slices of the VCF rather than copy the full file.
Is there a way to do this? What are the best practices in this case?
Comments
Hi Barbara,
Thanks for writing in! You're correct that workflows on Terra run their code in a VM, and this means that input files are normally localized into the VM. However, this behavior can be turned off if need be. Here is a link to our Cromwell documentation that explains how to configure a WDL workflow so that an input file is not localized: https://cromwell.readthedocs.io/en/stable/optimizations/FileLocalization/
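For illustration, here is a minimal sketch based on that documentation: marking a File input with localization_optional in parameter_meta tells Cromwell to pass the gs:// path into the command instead of copying the file onto the disk, so a tool that can read directly from Google Cloud Storage (such as GATK via NIO) only streams the records it needs. The bucket path, sample name, Docker tag, and runtime values below are placeholders, not anything specific to your workspace.

```wdl
version 1.0

# Sketch: extract one sample's variants from a large cohort VCF
# without localizing it. The VCF stays in the bucket; GATK reads it over NIO.
task SelectSampleVariants {
  input {
    File cohort_vcf        # e.g. gs://external-bucket/cohort.vcf.gz (placeholder)
    File cohort_vcf_index  # matching .tbi index
    String sample_id       # e.g. "PATIENT_001" (placeholder)
  }

  parameter_meta {
    # Tell Cromwell not to copy these files onto the VM's disk;
    # the command receives the gs:// URLs instead.
    cohort_vcf: { localization_optional: true }
    cohort_vcf_index: { localization_optional: true }
  }

  command <<<
    gatk SelectVariants \
      -V ~{cohort_vcf} \
      --sample-name ~{sample_id} \
      --exclude-non-variants \
      -O ~{sample_id}.vcf.gz
  >>>

  output {
    File sample_vcf = "~{sample_id}.vcf.gz"
    File sample_vcf_index = "~{sample_id}.vcf.gz.tbi"
  }

  runtime {
    docker: "broadinstitute/gatk:4.5.0.0"  # placeholder version
    disks: "local-disk 20 HDD"             # only needs room for the small output
    memory: "8 GB"
  }
}
```

Keep in mind this only helps when the tool in the command can read gs:// paths directly, and the VM still needs permission to read the external bucket (for example via your Terra proxy/pet service account). The output VCF for a single sample is small, so the disk request can stay modest.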
Please let me know if that information was helpful or if you have any questions.
Best,
Josh