Resource unavailable (DISKS_TOTAL_GBS and CPUS) when running 5000 samples

Comments

18 comments

  • Tiffany Miller

    Hi Yossi,

    The team has started looking into this and we will get back to you tomorrow morning. If you have any other questions, let us know! Thanks for posting.

  • Adelaide Rhodes

    Yossi - Could you share the workspace with GROUP_Firecloud-Support@firecloud.org?

    If it is under authorization protocols, let us know.

    Adelaide

  • Yossi Farjoun

    Done.

  • Doug Voet

    Which workspace is it?

  • Adelaide Rhodes

    Hi Yossi -

    I heard back from the Cromwell team that this might be a quota issue. There are currently 2,344 CPUs in use, and the quota is 2,400.

    Have you requested an increase in your quota to accommodate this job?

    Is there a way to subdivide the jobs so they don't run so close to the quota limit?

    Or, you could request more quota, I suppose.

    CPUs and persistent disk quotas: what are they and how do you request more?

    Please let me know if this answers the question.

    Adelaide

  • Yossi Farjoun

    Hmmm. Reading that page, I am unclear about the definition of "project".

    In Terra I can only find the word "project" next to "billing", as in "billing project". My billing project (for this workspace) is "broad-dsp-prod-special-cases", but putting that into the URL doesn't work.

    Perhaps there should be a link to the quotas page directly in Terra?

  • Adelaide Rhodes

    I think this has been sorted out now that you have found the project name in the Google console, is that correct?

  • Dan Billings

    We can request a quota increase.

    You are limited to 2,400; what would you like the limit to be? I.e., ([number of CPUs] x [jobs]), maybe 10K?

  • Dan Billings

    The relevant URL is
    https://console.cloud.google.com/iam-admin/quotas?project=broad-dsp-prod-special-cases&service=compute.googleapis.com

    Look for the Compute Engine CPUs quota in us-central1.
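
    For anyone who prefers the command line, roughly the same information can be read with the gcloud CLI. This is only a sketch: it assumes gcloud is installed and that you have at least viewer access on the billing project.

        # The output includes a "quotas:" section listing each regional metric
        # (CPUS, IN_USE_ADDRESSES, disk totals, ...) with its current usage and limit.
        gcloud compute regions describe us-central1 \
            --project=broad-dsp-prod-special-cases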

  • Adelaide Rhodes

    FYI billing project owners and those with firecloud.org accounts can view project quotas at https://console.cloud.google.com/iam-admin/quotas?project=[PROJECTNAME]

  • Dan Billings

    I've gotten quota increases for the number of CPUs to unblock that bottleneck.

    The current bottleneck is IP addresses.

    I currently have two more requests in review, and I've done my best to escalate them:

    • IP addresses: 2,300 -> 10,000
    • persistent disk: 200 TB -> 1 PB
  • Yossi Farjoun

    In order to avoid asking for so many IP addresses, I moved the Docker image to GCR and set "noAddress=true". After attempting to run the workflows, nothing seems to be happening... the scattered task claims that it has been running for 1 hour, but the shards themselves are not moving.

    Since this is the first time I have tried using "noAddress=true", I'm concerned that it isn't supported by Terra. Could someone let me know if I need to be concerned?
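
    For reference, this is a minimal sketch of where that runtime attribute goes in a WDL task running on Cromwell's Google (PAPI) backend; the task name, command, and image path below are made up, not Yossi's actual workflow.

        task count_lines {
          File input_file

          command {
            wc -l < ${input_file}
          }

          runtime {
            docker: "us.gcr.io/my-project/my-image:latest"  # hypothetical GCR image
            cpu: 1
            memory: "2 GB"
            noAddress: true  # ask for a worker VM with no external IP address
          }

          output {
            Int line_count = read_int(stdout())
          }
        }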

  • Adelaide Rhodes

    Dan - This seems similar to another ticket I had recently. The only difference between the workflow that worked and the one that did not was setting noAddress to true.

    Yossi - I have escalated this issue to a Jira ticket for a potential bug fix.

    The ticket can be tracked here:  https://broadworkbench.atlassian.net/browse/BA-5718

     

  • Adelaide Rhodes

    Hi Yossi -

    Based on a quick scan of the Slack channel, I saw a few issues mentioned about noAddress=true in the gnomAD development group.

    There is a potential workaround, according to @markw, but it may not help with Terra specifically.

    Their solution was to run a script to configure subnet settings in our Google compute projects.

    However, according to @ferrara, this may not work: the subnets that Terra creates in Google projects may not have the flag enabled that allows private-IP-only instances to talk to Google APIs.
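
    For context, the subnet flag being referred to is Private Google Access. Below is a hedged sketch of checking and enabling it with gcloud; the subnet, region, and project names are placeholders, and whether Terra-managed projects permit this change is exactly the open question above.

        # Check whether Private Google Access is enabled on the subnet
        gcloud compute networks subnets describe default \
            --region=us-central1 --project=my-terra-project \
            --format="value(privateIpGoogleAccess)"

        # Enable it so that instances without external IPs can still reach Google APIs
        gcloud compute networks subnets update default \
            --region=us-central1 --project=my-terra-project \
            --enable-private-ip-google-access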

  • Yossi Farjoun

    OK, so I'm giving up on noAddress=true for now. But FYI, it would be helpful for users like me if it were possible to use private IPs.

  • Khalid Shakir

    Hi Yossi,

    Can we get a brief status update: are your jobs running in general? If not, can you give us as much detail as possible so we can debug further? Ultimately we'll be looking for Google Genomics Pipelines API operation IDs, but we can start with Cromwell workflow IDs if you have those available.

    Thanks,
    -k

  • Yossi Farjoun

    UPDATE:

    I'm running, using up my 3,000 IPs. It looks like jobs are starting to trickle out. Given that I can only see in hindsight when jobs are being throttled, I can't really tell you what's going on until tomorrow.

    Being able to run with noAddress=true would be a very nice addition, but right now I'm working without it.

  • Tiffany Miller

    Thanks Yossi. Let us know if you are unblocked today. 

    Also, the `noAddress` attribute in Terra is not working right now. We are working with Google on a plan to resolve this and will respond back when we have that sorted.

