Resource unavailable (DISKS_TOTAL_GB and CPUS) when running 5000 samples

Post author
Yossi Farjoun

I've been trying to process 5000 samples (CRAM->BAM for now) and it's been taking very long: I submitted on Friday (3 days ago) and I still have >1000 "running". When the long-running jobs complete, I see that they spent several days waiting for a resource (I'd copy-paste the message, but it only shows up in hover text in the timing diagram...), either DISKS or CPUS. I'm not requesting an obscene amount of disk per job (only 200 GB) nor a particularly large CPU count (only 3).

Could you please let me know what I need to do in order to be able to actually run 5000 jobs concurrently? 
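
For reference, these per-job requests are set in the WDL task's runtime block, roughly as in the sketch below (the task name, image, and command are illustrative placeholders, not the actual workflow):

```wdl
version 1.0

# Illustrative CRAM->BAM task; names, image, and command are placeholders.
task cram_to_bam {
  input {
    File cram
    File ref_fasta
  }
  command <<<
    samtools view -b -T ~{ref_fasta} -o sample.bam ~{cram}
  >>>
  runtime {
    docker: "us.gcr.io/my-project/samtools:latest"  # placeholder image
    cpu: 3                        # the CPU count mentioned above
    disks: "local-disk 200 HDD"   # the 200 GB disk request
  }
  output {
    File bam = "sample.bam"
  }
}
```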

Comments

18 comments

  • Comment author
    Tiffany Miller

    Hi Yossi,

    The team has started looking into this and we will get back to you tomorrow morning. If you have any other questions, let us know! Thanks for posting.

    0
  • Comment author
    Adelaide Rhodes

    Yossi - Could you share the workspace with GROUP_Firecloud-Support@firecloud.org?

    If it is under authorization protocols, let us know.

    Adelaide

    0
  • Comment author
    Yossi Farjoun

    done.

    0
  • Comment author
    Doug Voet

    Which workspace is it?

    0
  • Comment author
    Adelaide Rhodes
    • Edited

    Hi Yossi -

    I heard back from the Cromwell team that this might be a quota issue. There are currently 2,344 CPUs in use; the quota is 2,400.

    Have you requested an increase in your quota to accommodate this job?  

    Is there a way to subdivide the jobs to not be so close to the quota limit?

    Or, you could request more quota, I suppose.

    CPUs and persistent disk quotas: what are they and how do you request more?
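
    For reference, one way to check regional usage against limits from the command line, assuming the Cloud SDK is installed and you have viewer access on the project (a sketch, using the project name from this thread):

    ```sh
    # Show each regional quota metric with current usage and limit
    # (includes CPUS, DISKS_TOTAL_GB, and IN_USE_ADDRESSES).
    gcloud compute regions describe us-central1 \
      --project=broad-dsp-prod-special-cases \
      --flatten="quotas[]" \
      --format="table(quotas.metric,quotas.usage,quotas.limit)"
    ```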

    Please let me know if this answers the question.

    Adelaide

    0
  • Comment author
    Yossi Farjoun

    Hmmm, reading that page, I am unclear about the definition of "project":

    In Terra I can only find the word "project" next to "billing", as in "billing project". My billing project (for this workspace) is "broad-dsp-prod-special-cases", but putting that into the URL doesn't work.

    Perhaps there should be a link to the quotas page directly in Terra?

    0
  • Comment author
    Adelaide Rhodes

    I think this has been sorted out now that you have found the project name in the Google console, is that correct?

    0
  • Comment author
    Dan Billings

    We can request a quota increase.

    You are limited to 2,400; what would you like the limit to be? I.e., ([number of CPUs] x [jobs]), maybe 10K?
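
    For scale: at 3 CPUs per job, 5000 fully concurrent jobs would come to 15,000 CPUs, so even a 10K limit would still throttle a bit at peak, though far less than 2,400 does.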

    0
  • Comment author
    Dan Billings

    The relevant URL is
    https://console.cloud.google.com/iam-admin/quotas?project=broad-dsp-prod-special-cases&service=compute.googleapis.com

    Look for the Compute Engine CPUs quota in us-central1.

    0
  • Comment author
    Adelaide Rhodes
    • Edited

    FYI, billing project owners and those with firecloud.org accounts can view project quotas at https://console.cloud.google.com/iam-admin/quotas?project=[PROJECTNAME]

    0
  • Comment author
    Dan Billings

    I've gotten quota increases for the number of CPUs to unblock that bottleneck.

    The current bottleneck is IP addresses.

    I currently have 2 more requests in review, and I've done my best to escalate them:

    • IP addresses: 2,300 -> 10K
    • persistent disk: 200 TB -> 1 PB
    0
  • Comment author
    Yossi Farjoun

    In order to avoid asking for so many IP addresses, I moved the Docker image to GCR and set "noAddress=true". After attempting to run the workflows, nothing seems to be happening... the scattered task claims that it has been running for 1 hour, but the shards themselves are not moving.

    Since this is the first time I've tried "noAddress=true", I'm worried that it isn't supported by Terra... could someone let me know if I need to be concerned?
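
    For reference, the runtime section now looks roughly like this (a sketch; the image path is a placeholder, and the attribute is spelled per the Cromwell runtime-attribute docs):

    ```wdl
    runtime {
      docker: "us.gcr.io/my-project/samtools:latest"  # moved to GCR so it's reachable without a public IP
      cpu: 3
      disks: "local-disk 200 HDD"
      noAddress: true  # ask the Pipelines API for workers with no external IP address
    }
    ```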

    0
  • Comment author
    Adelaide Rhodes
    • Edited

    Dan - This seems similar to another ticket I had recently. The only difference between the workflow that worked and the one that did not was setting noAddress to true.

    Yossi - I have elevated this issue to a Jira ticket for a potential bug fix.

    The ticket can be tracked here:  https://broadworkbench.atlassian.net/browse/BA-5718

     

    0
  • Comment author
    Adelaide Rhodes

    Hi Yossi -

    Based on a quick scan of the Slack channel, I saw a few issues mentioned about noAddress=true in the gnomAD development group.

    There is a potential workaround, according to @markw, but it may not help with Terra specifically.

    Their solution was to run a script to configure subnet settings in our Google Compute Engine projects.

    However, according to @ferrara, this may not work: the subnets that Terra creates in Google projects may not have the flag enabled that allows private-IP-only instances to talk to Google APIs (Private Google Access).
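
    For reference, that flag can in principle be set on a subnet with something like the command below. This is a sketch only; the subnet name is a placeholder, and it assumes you have edit permissions on the underlying Google project, which Terra-managed projects may not grant:

    ```sh
    # Enable Private Google Access so instances with no external IP
    # can still reach Google APIs (GCR, GCS, etc.). Names are placeholders.
    gcloud compute networks subnets update default \
      --region=us-central1 \
      --project=broad-dsp-prod-special-cases \
      --enable-private-ip-google-access
    ```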

    0
  • Comment author
    Yossi Farjoun

    OK, so I'm giving up on noAddress=true for now. But FYI, it would be helpful for users like me if it were possible to use private IPs.

    0
  • Comment author
    Khalid Shakir

    Hi Yossi,

    Can we get a brief status update: are your jobs running in general? If not, can you give us as much detail as possible to further debug the latest run? Ultimately we'll be looking for Google Genomics Pipelines API Operations IDs, but we can start with Cromwell Workflow IDs if you have those available.

    Thanks,
    -k

    0
  • Comment author
    Yossi Farjoun

    UPDATE:

    I'm running, using up my 3,000 IPs, and it looks like jobs are starting to trickle out. Given that I can only see in hindsight when jobs are being throttled, I can't really tell you what's going on until tomorrow.

    Being able to run with noAddress=true would be a very nice addition... but right now I'm working without it.

    0
  • Comment author
    Tiffany Miller

    Thanks Yossi. Let us know if you are unblocked today. 

    Also, the `noAddress` attribute in Terra is not working right now. We are working with Google on a plan to resolve this. We will respond back when we have that sorted.

    0
