
Large WGS cohort genotyping with 'GATK Best Practices Germline SNPs and INDELS'

Answered

7 comments

  • Beri

    Hi dsilencio,

    I've forwarded your question to the workflow authors and we'll get back to you with an answer.

  • Beri

    The workflow authors suggest that, if you are planning to work with a large sample set, you enable GnarlyGenotyper instead of GenotypeGVCFs in the workflow. This requires reblocking the GVCFs before feeding them to the pipeline, which is mentioned in the workspace notes.
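
    If it helps to picture the reblocking step, below is a minimal, hypothetical sketch that shells out to GATK's ReblockGVCF for each sample before GnarlyGenotyper runs. The reference path, file names, and loop are placeholders for illustration only; the real workflow does this inside a WDL task, and flags should be checked against your GATK version.

    ```python
    """Illustrative sketch: reblock per-sample GVCFs with GATK ReblockGVCF.

    Assumptions (not from the workspace): GATK 4.x is on PATH, ref.fasta is the
    reference the GVCFs were called against, and inputs are bgzipped and indexed.
    """
    import subprocess
    from pathlib import Path

    REFERENCE = Path("ref.fasta")              # placeholder reference
    GVCFS = [Path("sample1.g.vcf.gz"),         # placeholder inputs
             Path("sample2.g.vcf.gz")]

    for gvcf in GVCFS:
        reblocked = gvcf.with_name(gvcf.name.replace(".g.vcf.gz", ".rb.g.vcf.gz"))
        subprocess.run(
            ["gatk", "ReblockGVCF",
             "-R", str(REFERENCE),
             "-V", str(gvcf),
             "-O", str(reblocked)],
            check=True,
        )
        print(f"Reblocked {gvcf.name} -> {reblocked.name}")
    ```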

    1) Please excuse the note about disk size guidelines; it was leftover wording from a previous workflow (the wording has now been removed). There are no suggested changes to the disk size: if you come across any problems, simply rerun the workflow with a larger disk size for the particular task that failed. Rerunning with Terra's call caching enabled will let you start from the failed task in the previous submission.
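
    If it helps to picture what "a larger disk size" might mean in practice, the heuristic below is one hypothetical way to pick a value: pad the total size of the failed task's inputs by a safety factor. The factor, minimum, and file paths are invented for the example; neither Terra nor the workflow prescribes this.

    ```python
    import math
    from pathlib import Path

    def suggest_disk_gb(input_files, padding_factor=2.5, minimum_gb=50):
        """Hypothetical heuristic: size a task's disk to a padded multiple of its inputs."""
        total_gb = sum(Path(f).stat().st_size for f in input_files) / 1e9
        return max(minimum_gb, math.ceil(total_gb * padding_factor))

    # Example with placeholder shard inputs for a joint-genotyping task.
    print(suggest_disk_gb(["sample1.rb.g.vcf.gz", "sample2.rb.g.vcf.gz"]), "GB")
    ```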

    2) There is no guidance on the ideal number of shards per sample to optimize cost for a cohort this size; this is not something the authors have benchmarked.

  • Irakli Trankashvili

    Hello dsilencio!

    I am running the same workflow with 4400 samples and have encountered some problems.

    Did you successfully run this workflow? If so, how did you do it?

    Best regards,

    Irakli

  • dsilencio

    Hi Irakli,

    I was able to run this workflow on ~4300 WGS CRAM files using the workflow as described. I elected to use GenotypeGVCFs (contrary to the recommendations above) in order to keep the pipeline similar to what we had previously run on our local cluster. I don't think I would do this again on a cohort this large, as the cost did not scale well; I would explore using GnarlyGenotyper as recommended.

  • Irakli Trankashvili

    Hi dsilencio,

    Thank you very much for the answer.

    When I try to run my workflow with 4400 samples, I get a persistent disk error. Do you recall the disk space values you used, and how long the run took?

    Also, did you change the number of shards to optimize the workflow?

    And lastly, if it is not a secret, can you tell me the cost of the whole project?

    Best regards,

    Irakli

  • dsilencio

    I did switch the SplitIntervals scatter_mode to 'INTERVAL_SUBDIVISION', which resulted in thousands of shards. I initially tried to estimate the cost of the joint genotyping step by running GenotypeGVCFs on a few of these shards. That cost did not scale linearly, and the final cost to generate a joint-genotyped VCF file from CRAMs was just under ~$4 per sample.
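
    For context, SplitIntervals is the GATK tool that produces those scattered interval lists. The sketch below shows a minimal, hypothetical invocation using the subdivision mode mentioned above (the reference, interval list, output directory, and scatter count are placeholders, and flag spellings should be checked against your GATK version), followed by a back-of-the-envelope extrapolation of the ~$4-per-sample figure to a ~4300-sample cohort.

    ```python
    import subprocess

    # Placeholder inputs; the workflow normally sets these through its WDL inputs.
    subprocess.run(
        ["gatk", "SplitIntervals",
         "-R", "ref.fasta",
         "-L", "wgs_calling_regions.interval_list",
         "--scatter-count", "2500",
         "--subdivision-mode", "INTERVAL_SUBDIVISION",
         "-O", "scattered_intervals/"],
        check=True,
    )

    # Back-of-the-envelope cost extrapolation from the figure quoted above.
    per_sample_cost = 4.0   # just under ~$4 per sample
    samples = 4300
    print(f"Estimated CRAM-to-joint-VCF cost: ~${per_sample_cost * samples:,.0f}")
    ```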

  • Irakli Trankashvili

    Hi dsilencio,

    Thanks for the answer.

    I have one more question to ask.

    Do you remember the disk space values, and how long the whole project took?

    Best regards,

    Irakli

