Resource unavailable (DISKS_TOTAL_GBS and CPUS) when running 5000 samples
I've been trying to process 5000 samples (CRAM->BAM for now) and it's been taking very long: I submitted on Friday (3 days ago) and I still have >1000 "running". When the long-running jobs complete, I see that they spent several days waiting for a resource, either DISKS or CPU (I'd copy-paste the message, but it only shows up in hover-text in the timing diagram...). I'm not requesting an obscene amount of disk per job (only 200 GB) nor a particularly large CPU count (3 gb).
Could you please let me know what I need to do in order to be able to actually run 5000 jobs concurrently?
Comments
Hi Yossi,
The team has started looking into this and we will get back to you tomorrow morning. If you have any other questions, let us know! Thanks for posting.
Yossi - Could you share the workspace with GROUP_Firecloud-Support@firecloud.org?
If it is under authorization protocols, let us know.
Adelaide
done.
which workspace is it?
Hi Yossi -
I heard back from the Cromwell team that this might be a quota issue. There are currently 2,344 CPUs in use, and the quota is 2,400.
Have you requested an increase in your quota to accommodate this job?
Is there a way to subdivide the jobs to not be so close to the quota limit?
Or, you could request more quota, I suppose.
CPUs and persistent disk quotas: what are they and how do you request more?
Please let me know if this answers the question.
Adelaide
Hmmm. Reading that page, I am unclear about the definition of "project":
In Terra I can only find the word "project" next to "billing", as in "billing project". My billing project (for this workspace) is "broad-dsp-prod-special-cases", but putting that into the URL doesn't work.
Perhaps there should be a link to the quotas page directly in Terra?
I think this has been sorted out now that you have found the project name in the Google console, is that correct?
We can request quota increase.
You are limited to 2400, what would you like the limit to be? I.e. ( [number of cpus] x [jobs] ), maybe 10K?
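The ( [number of cpus] x [jobs] ) arithmetic can be sketched in a few lines of Python. This is only a rough illustration: the 5000 jobs, the 200 GB of disk per job, and the 2400 CPU quota come from this thread, while `CPUS_PER_JOB = 2` is an assumed value chosen because it lands on the 10K figure floated above. The function names are hypothetical.

```python
def required_quota(jobs: int, per_job: int) -> int:
    """Quota needed to run every job at the same time."""
    return jobs * per_job


def max_concurrent_jobs(quota: int, per_job: int) -> int:
    """How many jobs fit under a quota at once."""
    return quota // per_job


JOBS = 5000            # from the original post
DISK_GB_PER_JOB = 200  # from the original post
CPU_QUOTA = 2400       # current quota reported in this thread
CPUS_PER_JOB = 2       # ASSUMPTION: not stated clearly in the thread

print("CPU quota needed:", required_quota(JOBS, CPUS_PER_JOB))
print("Disk quota needed (GB):", required_quota(JOBS, DISK_GB_PER_JOB))
print("Jobs that fit today:", max_concurrent_jobs(CPU_QUOTA, CPUS_PER_JOB))
```

Under these assumptions, only a fraction of the 5000 jobs can hold CPUs at any one time, which would be consistent with jobs spending days waiting for a resource.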
The relevant url is
https://console.cloud.google.com/iam-admin/quotas?project=broad-dsp-prod-special-cases&service=compute.googleapis.com
Look for the compute service CPUs quota in us-central1.
FYI billing project owners and those with firecloud.org accounts can view project quotas at https://console.cloud.google.com/iam-admin/quotas?project=[PROJECTNAME]
I've gotten quota increases for the number of CPUs to unblock that bottleneck.
The current bottleneck is ip addresses.
I currently have 2 more requests in review, and I've done my best to escalate them:
In order to avoid asking for so many IP addresses, I moved the docker image to GCR and set "noAddress=true". After attempting to run the workflows, nothing seems to be happening: the scattered task claims that it has been running for 1 hour, but the shards themselves are not moving.
Since this is the first time I've tried using "noAddress=true", I'm concerned that it isn't supported by Terra. Could someone let me know if I need to be concerned?
Dan - This seems similar to another ticket I had recently. The only difference between the workflow that worked and the one that did not was setting noAddress to true.
Yossi - I have elevated this issue to a Jira ticket for a potential bug fix.
The ticket can be tracked here: https://broadworkbench.atlassian.net/browse/BA-5718
Hi Yossi -
Based on a quick scan of the Slack channel, I saw that a few issues with noAddress=true were mentioned in the gnomad development group.
There is a potential workaround, according to @markw, but it may not help with Terra specifically.
Their solution was to run a script to configure subnet settings in our Google compute projects.
However, according to @ferrara, this may not work, because the subnets that Terra creates in Google projects may not have the flag enabled that allows private-IP-only instances to talk to Google APIs.
OK, so I'm giving up on noAddress=true for now. But FYI, it would be helpful for users like me if it were possible to use private IPs.
Hi Yossi,
Can we get a brief status update: Are your jobs running in general? If not, can you give us as much detail as possible to further debug the latest? Ultimately we'll be looking for Google Genomics Pipelines API Operations IDs, but we can start with Cromwell Workflow IDs if you have those available.
Thanks,
-k
UPDATE:
I'm running, using up my 3000 IPs. It looks like jobs are starting to trickle out. Given that I can only see in hindsight when jobs are being throttled, I can't really tell you what's going on until tomorrow.
Being able to run with noAddress=true would be a very nice addition, but right now I'm working without it.
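For anyone following along, here is a minimal Python sketch of why the bottleneck moves from CPUs to IP addresses once the CPU quota goes up. The 3000 IP figure is from this thread; the 10000 CPU quota and 2 CPUs per job are assumed values for illustration, and it assumes each job's VM takes one external IP when noAddress is not set. The `limiting_resource` helper and the resource names are hypothetical.

```python
def limiting_resource(quotas: dict, per_job: dict) -> tuple:
    """Return (resource, max concurrent jobs) for the tightest quota."""
    # For each resource, how many jobs fit under its quota at once.
    limits = {name: quotas[name] // per_job[name] for name in per_job}
    tightest = min(limits, key=limits.get)
    return tightest, limits[tightest]


quotas = {
    "cpus": 10000,                # ASSUMPTION: quota after the increase
    "in_use_ip_addresses": 3000,  # from this thread
}
per_job = {
    "cpus": 2,                    # ASSUMPTION: not stated in the thread
    "in_use_ip_addresses": 1,     # one external IP per VM without noAddress
}

resource, jobs = limiting_resource(quotas, per_job)
print(f"Bottleneck: {resource}, max concurrent jobs: {jobs}")
```

With these numbers the IP quota caps concurrency below what the CPU quota would allow, which is why private IPs (noAddress=true) would help if they were supported.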
Thanks Yossi. Let us know if you are unblocked today.
Also, the `noAddress` attribute in Terra is not working right now. We are working with Google on a plan to resolve this, and we will respond here once we have that sorted.