Issue with accessing the external bucket for ImportGVCFs task in GATK joint-genotyping workflow
Hi,
I am running the Joint Genotyping workflow on Terra. The ImportGVCFs task was run in parallel across shards. However, not all the shards ran successfully; I got a Google Cloud permission error (see attached pic). Since it should be the same sample set (GVCFs) across all the shards, I don't understand why certain shards encountered the permission issue while others did not. I have previously run the same sample set with the same workflow and did not encounter this permission issue.
I am not sure what the issue is here or how to solve it.
Comments
Hi Zih-Hua,
Thanks for reaching out. The errors you are receiving ("pet-account@cardterra.iam.gserviceaccount.com does not have storage.buckets.get access to the Google Cloud Storage bucket") indicate that you do not have access to "get" or retrieve the files from a particular Google bucket. It is certainly strange that this only happened for a few of the shards in your workflow.
If you re-run your workflow do the same shards fail?
Kind regards,
Emil
Hi Emil,
Thank you for following up on this issue. I understand that the error is about "get" access to the Google bucket. However, the ImportGVCFs task runs on the same sample map with just different intervals (shards), so I don't understand why only some of the shards failed if it is a permission issue.
I re-ran the same workflow with the same sample map, and this time different shards failed. It seems to be random behaviour.
I ran the same sample map a while ago and did not encounter this issue, so I do not think it's a permission issue.
Best wishes,
Zih-Hua
Hi Zih-Hua,
Sorry for the delay getting back to you. Can you share the workspace where you are seeing this issue with GROUP_FireCloud-Support@firecloud.org by clicking the Share button in your workspace? The Share option is in the three-dots menu at the top-right.
Let us know the workspace name, as well as the relevant submission and workflow IDs. We’ll be happy to take a closer look as soon as we can.
Kind regards,
Emil
Hi Emil,
I have shared the workspace (name: GP2 Sequencing).
Here are the relevant submission IDs (I tested the same workflow with the same input three times):
ee826f08-f778-484c-9f96-efc4fac63906 (workflow ID: daf1c6e4-96cb-4439-b9dc-4a0fed9c78c3)
8d7c0fdd-3ae6-4bfe-b435-8d8ef14bc7d9 (workflow ID: 57457293-2d1b-4817-a667-df4491a6fd9e)
abc9ccbf-600a-4828-8369-2151507a50d8 (workflow ID: 55cf78d6-580d-4015-b7c6-5505d50e7dde)
All three submissions are with the comment: "test amp_pd bucket access" in job history.
Thanks a lot for looking into this. Please let me know if you need more info.
Best wishes,
Zih-Hua
Hi Zih-Hua,
We've taken a closer look at your submissions and immediately noticed that the number of samples for each job was very large (10,000+). We haven't been able to pinpoint exactly what the issue is yet, but we may have a possible workaround. Since different shards fail each time, we suspect that the issue is not permissions-related, despite the 403 error you have been receiving.
We noticed that you have call-caching enabled and that for each successive job the number of shards that passed increased. Due to the large number of samples in your workflow, you might be reaching some kind of Google-enforced limit. If you continue to submit your workflow with call-caching enabled, I'm curious whether the number of shards passing would continue to increase - eventually completing the job. Let us know if you think this might be a viable solution.
You can check how many shards passed by looking at the workflow dashboard as pictured below; we were unable to get the Job Manager to load.
We will continue looking for a definitive solution; in the meantime, I thought you might like to try this and see if it works.
Kind regards,
Emil
Dear Emil,
I understand your point about the temporary solution. I will continue to submit the workflow with call-caching enabled while waiting for a definitive solution.
I just want to point out that I ran this workflow successfully without any issue back in September (Submission ID: 219a2fff-5e91-4f97-99fe-e6ff64e0e1af and workflow ID: 614c02c7-1c58-4e8e-8245-c717212a3796) with the same number of jobs. Unfortunately, I deleted the outputs from the workspace bucket, so I cannot look up those logs in detail if you need them.
Best wishes,
Zih-Hua
Hi Emil,
I want to report that when I increased the number of samples in the sample map, the permission issue occurred across all the shards.
In my previous tests (the ones whose submission IDs I reported above), I only included 5 samples in the sample map to test access to the Google bucket. Now I have included all the samples I want to run (n=582) in the sample map, and the permission issue occurs across all the shards (Submission dd5fc58b-dc7f-4df2-a862-fd500f3a90ac, Workflow 1da50fde-9cc8-4b41-ae90-4ff2a4e3e860). So enabling call-caching does not work for me. I also checked my access to all the files in the Google bucket one by one using `gsutil ls $path`, and I did not have any issue accessing the files.
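For reference, here is a sketch of the check I ran, assuming the sample map is a tab-separated file with the sample name in the first column and the gs:// path in the second (the file name below is a placeholder):

```bash
#!/usr/bin/env bash
# Check read access to every GVCF path listed in the sample name map.
# Assumed layout: sample_name<TAB>gs://bucket/path/to/sample.g.vcf.gz
while IFS=$'\t' read -r sample path; do
  if gsutil ls "$path" > /dev/null 2>&1; then
    echo "OK    $sample"
  else
    echo "FAIL  $sample  $path"
  fi
done < sample_name_map.tsv
```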
I am not sure what to do to be able to run the workflow with my dataset.
Best wishes,
Zih-Hua
Hi Zih-Hua,
No worries, let us see if we can find an alternate solution to your problem.
A member of our engineering team was able to pull the metadata script from the successful run of your workflow that you shared with us (Submission ID: 219a2fff-5e91-4f97-99fe-e6ff64e0e1af) and compare it to the metadata script for one of your failed "test amp_pd bucket access" jobs. Please see the following link to Diffchecker, which allows us to easily compare your successful and failed workflow scripts: https://www.diffchecker.com/FLiByBTR . The successful metadata script is on the left, and the failed metadata script is on the right.
Would you be able to compare the differences in these two scripts to help with troubleshooting your workflow?
We will continue to look into your submissions to see if we can figure out an alternate solution. If you have any other questions please let us know.
Kind regards,
Emil
Hi Emil,
I have checked the difference, and there was no difference for the script block of this particular task (task ImportGVCFs).
Best wishes,
Zih-Hua
Hi Zih-Hua,
Thank you for confirming that for me.
Your workflows have been failing on the following command in your ImportGVCFs task:
The file it's attempting to stream is likely one of the paths listed in sample_name_map.
Despite having run this workflow on the same samples before, I now believe that the cause of the errors you are receiving is that your pet service account does not have access to all of the files in the Google bucket you are using (gs://amp-pd-genomics).
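For reference, the ImportGVCFs task in the public GATK joint-genotyping WDL invokes GenomicsDBImport along these lines (a sketch of the public workflow - your workspace's copy and exact argument values may differ):

```bash
# Sketch of the GenomicsDBImport call from the public joint-genotyping WDL;
# argument values here are illustrative. GATK streams each gs:// path listed
# in the sample name map directly, which is where the 403 errors surface.
gatk --java-options "-Xms8g" \
  GenomicsDBImport \
    --genomicsdb-workspace-path genomicsdb \
    --batch-size 50 \
    -L "${interval}" \
    --sample-name-map "${sample_name_map}" \
    --reader-threads 5
```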
Proposed solution:
First - I would recommend that you go to your Google bucket and make sure that your pet-service account has the proper permissions to access all of your files.
Is the pet-service account you are using one that was created outside of Terra?
If so, you will need to register your service account with Terra - details for which can be found at the bottom of the following article under "Sharing a workspace with a service account": How to share a workspace
Essentially: you will want to navigate to the following GitHub repo and follow the instructions provided for registering a service account: https://github.com/broadinstitute/terra-tools/tree/master/scripts/register_service_account. Note that service accounts cannot be created in FireCloud-created Google projects; the service account will need to be created in a different Google project.
After registering your service account you can then share your workspace with the service account. You will also need to add the service account to the authorization domain if the workspace has one.
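As a rough illustration of the registration step (the script name and flag names below are assumptions based on the repo's README - please verify them there before running):

```bash
# Hypothetical example: register an externally created service account with Terra.
# Run from a machine that has the service account's JSON key. The flag names are
# assumed from the terra-tools README; confirm them in the repo before running.
git clone https://github.com/broadinstitute/terra-tools.git
cd terra-tools/scripts/register_service_account
# -j: path to the service account's JSON key; -e: contact email to register
python3 register_service_account.py -j /path/to/service-account-key.json -e your.email@example.com
# Afterwards, share the workspace with the service account's email via the Share
# button, and add it to the authorization domain if the workspace has one.
```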
If you have any other questions please let us know.
Kind regards,
Emil
Hi Emil,
I would like to understand more about the pet service account's access to the Google bucket (gs://amp-pd-genomics).
I ran a test on Terra terminal to check if I have access to all the files in sample_name_map (see attached pic).
I did not encounter any error message telling me that I did not have access to the files. As I understand it, this means that my pet service account can access all the files in that bucket. Is this correct?
I did not set the Authorization Domain to my workspace. Is this the issue?
Thanks.
Zih-Hua
Hi Zih-Hua,
The command you used shows that you have access to all of the files listed in sample_name_map - I am wondering if you might still be missing access to the other folders in the gs://amp-pd-genomics bucket. Would it be possible for you to check this?
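For example, you could run the following from the Terra terminal (the second command fetches the bucket's metadata, which is what the storage.buckets.get permission in your error covers):

```bash
# List the top-level prefixes in the external bucket (requires list access).
gsutil ls gs://amp-pd-genomics/

# Fetch the bucket's metadata; this exercises the storage.buckets.get permission
# that appears in your 403 error.
gsutil ls -L -b gs://amp-pd-genomics
```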
The following article in our documentation explains how to access data from an external bucket: Accessing data from an external bucket.
You will need to: (1) find the proxy group email (or pet service account email) associated with your Terra account, and (2) have the owner of the external bucket grant that email read access to gs://amp-pd-genomics (an illustrative example follows below).
The complete instructions for both of these steps can be found in the article linked above.
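If it turns out the access is missing, here is an illustrative example of what the owner of the external bucket would run to grant it (the proxy group email is a placeholder - this bucket is controlled by the data owners, so this only shows the shape of the change):

```bash
# Illustrative only: run by an administrator of the external bucket, not by the
# workflow user. The group email below is a placeholder for your Terra proxy group.
gsutil iam ch "group:PROXY_GROUP_xxxx@firecloud.org:objectViewer" gs://amp-pd-genomics

# If the storage.buckets.get error persists, a bucket-level read role such as
# legacyBucketReader also covers bucket metadata access.
gsutil iam ch "group:PROXY_GROUP_xxxx@firecloud.org:legacyBucketReader" gs://amp-pd-genomics
```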
I apologize for not getting to the bottom of this more quickly; the solution that I proposed in my last message applies if you are using an external pet service account to access files in your Terra Google bucket.
If you have any other questions please let us know!
Kind regards,
Emil