Hello Beri,
I am looking to perform joint genotyping on 4300 WGS samples with the 1-4-JointGenotyping-hg38 workflow from the GATK Best Practices Germline SNPs and INDELs workspace. I have run test batches of 100 samples across the entire genome, and of all 4300 samples across a few intervals of the genome, without issue, and I am getting ready to scale up. Two questions came up while preparing for this:
1) The workspace description mentions a guideline to help determine disk space for the JSON for joint genotyping. I could not find any such guidelines in the documentation, so any advice would be helpful.
2) I noticed that the maximum number of shards the SplitIntervals step generates with default settings is 210, even though I set the scale factor to 2.5 times the number of gVCF files. I can get past this limit by switching scatter_mode to 'INTERVAL_SUBDIVISION', but I am wondering whether there is any guidance on the ideal number of shards per sample to optimize cost for a cohort of this size.
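
For context, here is a rough sketch of the arithmetic behind the shard count I was expecting. It assumes the requested count is simply the scale factor multiplied by the number of gVCFs and then rounded; the function name is hypothetical and is not taken from the workflow.

```python
def expected_scatter_count(num_gvcfs: int, scale_factor: float) -> int:
    """Requested shard count under the assumption described above.

    Assumption (not confirmed against the workflow source): the request is
    scale_factor * num_gvcfs, rounded to the nearest integer, at least 1.
    """
    return max(1, round(num_gvcfs * scale_factor))


if __name__ == "__main__":
    requested = expected_scatter_count(4300, 2.5)  # full cohort
    observed = 210                                 # shards SplitIntervals actually produced
    print(f"requested shards: {requested}")
    print(f"observed shards:  {observed}")
    print(f"requested / observed: {requested / observed:.1f}x")
```

With 4300 gVCFs and a factor of 2.5, this comes to 10750 requested shards, compared with the 210 I actually get by default.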