Large WGS cohort genotyping with 'GATK Best Practices Germline SNPs and INDELS' Answered
Hello Beri,
I am looking to perform joint genotyping on 4300 WGS samples with the 1-4-JointGenotyping-hg38 workflow from the GATK Best Practices Germline SNPs and INDELs workspace. I have run test batches of 100 samples across the entire genome, and of all 4300 samples across a few intervals of the genome, without issue, and am getting ready to scale up. Two questions came up as I prepare to do this:
1) The workspace description mentions a guideline to help determine disk space in the JSON for joint genotyping. I did not see any such guidelines in the documentation, and any advice would be helpful.
2) I noticed that the maximum number of shards the SplitIntervals step generates is 210 with default settings, despite inputting a scale factor of 2.5 times the number of gVCF files. I am able to change this by switching scatter_mode to 'INTERVAL_SUBDIVISION', but am wondering if there is any guidance on the ideal number of shards per sample to optimize for cost with a cohort this size.
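For reference, the scatter_mode switch described above corresponds to the SplitIntervals subdivision mode. A hypothetical sketch of invoking SplitIntervals directly with that mode (reference, interval list, output directory, and scatter count here are placeholders, not values from the workflow):

```shell
# Sketch only: SplitIntervals with INTERVAL_SUBDIVISION allows the interval
# list to be subdivided, so the requested scatter count is not capped by the
# number of input intervals. Paths and --scatter-count are placeholders.
gatk SplitIntervals \
  -R Homo_sapiens_assembly38.fasta \
  -L wgs_calling_regions.hg38.interval_list \
  --scatter-count 1000 \
  --subdivision-mode INTERVAL_SUBDIVISION \
  -O split_intervals_out/
```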
I've forwarded your question to the workflow authors and we'll get back to you with an answer.
The workflow authors suggested that if you are planning to work with a large sample set, you should try enabling GnarlyGenotyper instead of GenotypeGVCFs in the workflow. This would require reblocking the GVCFs before feeding them to the pipeline, which is mentioned in the workspace notes.
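The reblocking step mentioned above can be sketched as a per-sample ReblockGVCF run. This is an illustrative sketch with placeholder file names, not the workflow's exact task:

```shell
# Sketch only: reblock a single GVCF before joint calling with GnarlyGenotyper.
# -do-qual-approx adds the QUALapprox annotation that GnarlyGenotyper relies on.
# File names are placeholders.
gatk ReblockGVCF \
  -R Homo_sapiens_assembly38.fasta \
  -V sample.g.vcf.gz \
  -do-qual-approx \
  -O sample.reblocked.g.vcf.gz
```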
1) Please excuse the note about disk size guidelines; this was leftover wording from a previous workflow (the wording has now been removed). There aren't any suggested changes to the disk size. If you come across any problems, simply rerun the workflow with a larger disk size for the particular task that failed. Rerunning workflows with Terra's call caching enabled will allow you to resume from the failed task in the previous submission.
2) There isn't any guidance on the ideal number of shards per sample to optimize for cost at this cohort size; this isn't something the authors have benchmarked.
I am running the same workflow with 4400 samples and I have encountered some problems.
Did you successfully run this workflow? If so how did you do it?
I was able to run this workflow on ~4300 WGS cram files using the workflow as described. I elected to use GenotypeGVCFs (contrary to the recommendations above) in order to keep the pipeline similar to what we have previously run on our local cluster. I don't think I would do this again in the future on a cohort this large as the cost did not scale well, and would explore using the GnarlyGenotyper as recommended.
Thank you very much for the answer.
When I try to run my workflow with 4400 samples, I get a persistent disk error. Do you recall the values you used for disk space, and how much time it took?
Also, did you change the number of shards to optimize the workflow?
And lastly, if it is not a secret, can you tell me the cost of the whole project?
I did switch the SplitIntervals scatter_mode to 'INTERVAL_SUBDIVISION' which resulted in thousands of shards. I initially tried to get an estimate of cost for the joint genotyping step by running GenotypeGVCFs on a few of these shards. This cost did not scale linearly and the final cost to generate a joint genotyped VCF file from CRAMs was just under ~$4 per sample.
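For scale, a back-of-envelope total from that per-sample figure (treating $4/sample as an upper bound on the ~4300-sample cohort; the actual per-sample cost was slightly under this):

```shell
# Rough upper bound on total cost in USD: ~4300 samples at just under $4 each.
samples=4300
cost_per_sample=4
echo $((samples * cost_per_sample))   # prints 17200, i.e. just under ~$17,200 total
```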
Thanks for the answer.
I have one more question to ask.
Do you remember the values you used for disk space, and how much time the whole project took?