Resource unavailable (DISKS_TOTAL_GB and CPUS) when running 5000 samples

Post author
Yossi Farjoun

I've been trying to process 5000 samples (CRAM->BAM for now) and it's been taking very long: I submitted on Friday (3 days ago) and I still have >1000 "running". When the long-running jobs complete, I see that they spent several days waiting for a resource (I'd copy-paste the message, but it only shows up in hover text in the timing diagram...), either DISKS or CPUS. I'm not requesting an obscene amount of disk per job (only 200 GB) nor a particularly large CPU count (only 3).

Could you please let me know what I need to do in order to be able to actually run 5000 jobs concurrently? 
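
For reference, these per-job requests are set in the WDL task's runtime block, roughly as in the sketch below (the task name, image, and command are illustrative placeholders, not the actual workflow):

```wdl
version 1.0

# Illustrative CRAM->BAM task; names, image, and command are placeholders.
task cram_to_bam {
  input {
    File cram
    File ref_fasta
  }
  command <<<
    samtools view -b -T ~{ref_fasta} -o sample.bam ~{cram}
  >>>
  runtime {
    docker: "us.gcr.io/my-project/samtools:latest"  # placeholder image
    cpu: 3                        # the CPU count mentioned above
    disks: "local-disk 200 HDD"   # the 200 GB disk request
  }
  output {
    File bam = "sample.bam"
  }
}
```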

Comments

18 comments

  • Comment author
    Tiffany Miller

    Hi Yossi,

    The team has started looking into this and we will get back to you tomorrow morning. If you have any other questions, let us know! Thanks for posting.

    0
  • Comment author
    Adelaide Rhodes

    Yossi - Could you share the workspace with GROUP_Firecloud-Support@firecloud.org?

    If it is under authorization protocols, let us know.

    Adelaide

    0
  • Comment author
    Yossi Farjoun

    done.

    0
  • Comment author
    Doug Voet

    Which workspace is it?

    0
  • Comment author
    Adelaide Rhodes
    • Edited

    Hi Yossi -

    I heard back from the Cromwell team that this might be a quota issue. There are currently 2,344 CPUs in use; the quota is 2,400.

    Have you requested an increase in your quota to accommodate this job?  

    Is there a way to subdivide the jobs to not be so close to the quota limit?

    Or, you could request more quota, I suppose.

    CPUs and persistent disk quotas: what are they and how do you request more?
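
    For reference, one way to check regional usage against limits from the command line, assuming the Cloud SDK is installed and you have viewer access on the project (a sketch, using the project name from this thread):

    ```sh
    # Show each regional quota metric with current usage and limit
    # (includes CPUS, DISKS_TOTAL_GB, and IN_USE_ADDRESSES).
    gcloud compute regions describe us-central1 \
      --project=broad-dsp-prod-special-cases \
      --flatten="quotas[]" \
      --format="table(quotas.metric,quotas.usage,quotas.limit)"
    ```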

    Please let me know if this answers the question.

    Adelaide

    0
  • Comment author
    Yossi Farjoun

    Hmmm, reading that page, I am unclear about the definition of "project":

    In Terra I can only find the word "project" next to "billing", as in "billing project". My billing project (for this workspace) is "broad-dsp-prod-special-cases", but putting that into the URL doesn't work.

    Perhaps there should be a link to the quotas page directly in Terra?

    0
  • Comment author
    Adelaide Rhodes

    I think this has been sorted out now that you have found the project name in the Google console, is that correct?

    0
  • Comment author
    Dan Billings

    We can request a quota increase.

    You are limited to 2,400; what would you like the limit to be? I.e., ([number of CPUs] x [jobs]), maybe 10K?
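
    For scale: at 3 CPUs per job, 5000 fully concurrent jobs would come to 15,000 CPUs, so even a 10K limit would still throttle a bit at peak, though far less than 2,400 does.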

    0
  • Comment author
    Dan Billings

    The relevant URL is
    https://console.cloud.google.com/iam-admin/quotas?project=broad-dsp-prod-special-cases&service=compute.googleapis.com

    Look for the Compute Engine CPUs quota in us-central1.

    0
  • Comment author
    Adelaide Rhodes
    • Edited

    FYI, billing project owners and those with firecloud.org accounts can view project quotas at https://console.cloud.google.com/iam-admin/quotas?project=[PROJECTNAME]

    0
  • Comment author
    Dan Billings

    I've gotten quota increases for the number of CPUs to unblock that bottleneck.

    The current bottleneck is IP addresses.

    I currently have 2 more requests in review, and I've done my best to escalate them:

    • IP addresses: 2,300 -> 10K
    • persistent disk: 200 TB -> 1 PB
    0
  • Comment author
    Yossi Farjoun

    In order to avoid asking for so many IP addresses, I moved the Docker image to GCR and set "noAddress=true". After attempting to run the workflows, nothing seems to be happening... the scattered task claims that it has been running for 1 hour, but the shards themselves are not moving.

    Since this is the first time I've tried "noAddress=true", I'm worried that it isn't supported by Terra... could someone let me know if I need to be concerned?
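
    For reference, the runtime section now looks roughly like this (a sketch; the image path is a placeholder, and the attribute is spelled per the Cromwell runtime-attribute docs):

    ```wdl
    runtime {
      docker: "us.gcr.io/my-project/samtools:latest"  # moved to GCR so it's reachable without a public IP
      cpu: 3
      disks: "local-disk 200 HDD"
      noAddress: true  # ask the Pipelines API for workers with no external IP address
    }
    ```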

    0
  • Comment author
    Adelaide Rhodes
    • Edited

    Dan - This seems similar to another ticket I had recently. The only difference between the workflow that worked and the one that did not was setting noAddress to true.

    Yossi - I have elevated this issue to a Jira ticket for a potential bug fix.

    The ticket can be tracked here:  https://broadworkbench.atlassian.net/browse/BA-5718

     

    0
  • Comment author
    Adelaide Rhodes

    Hi Yossi -

    Based on a quick scan of the Slack channel, I saw a few issues mentioned about noAddress=true in the gnomAD development group.

    There is a potential workaround, according to @markw, but it may not help with Terra specifically.

    Their solution was to run a script to configure subnet settings in our Google Compute Engine projects.

    However, according to @ferrara, this may not work: the subnets that Terra creates in Google projects may not have the flag enabled that allows private-IP-only instances to talk to Google APIs (Private Google Access).
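
    For reference, that flag can in principle be set on a subnet with something like the command below. This is a sketch only; the subnet name is a placeholder, and it assumes you have edit permissions on the underlying Google project, which Terra-managed projects may not grant:

    ```sh
    # Enable Private Google Access so instances with no external IP
    # can still reach Google APIs (GCR, GCS, etc.). Names are placeholders.
    gcloud compute networks subnets update default \
      --region=us-central1 \
      --project=broad-dsp-prod-special-cases \
      --enable-private-ip-google-access
    ```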

    0
  • Comment author
    Yossi Farjoun

    OK, so I'm giving up on noAddress=true for now. But FYI, it would be helpful for users like me if it were possible to use private IPs.

    0
  • Comment author
    Khalid Shakir

    Hi Yossi,

    Can we get a brief status update: are your jobs running in general? If not, can you give us as much detail as possible to further debug the latest run? Ultimately we'll be looking for Google Genomics Pipelines API Operations IDs, but we can start with Cromwell Workflow IDs if you have those available.

    Thanks,
    -k

    0
  • Comment author
    Yossi Farjoun

    UPDATE:

    I'm running, using up my 3,000 IPs, and it looks like jobs are starting to trickle out. Given that I can only see in hindsight when jobs are being throttled, I can't really tell you what's going on until tomorrow.

    Being able to run with noAddress=true would be a very nice addition... but right now I'm working without it.

    0
  • Comment author
    Tiffany Miller

    Thanks Yossi. Let us know if you are unblocked today. 

    Also, the `noAddress` attribute in Terra is not working right now. We are working with Google on a plan to resolve this. We will respond back when we have that sorted.

    0
