Best practices for sharing and protecting data resources

Allie Hajian

In the Data Biosphere's cloud-based bioinformatics model, researchers access data shared in a central location rather than each making a separate copy for their own analysis. Hosted datasets in Terra's Data Library, both public- and restricted-access, are an example of this model. This article covers best practices for sharing data across teams and for handling access to datasets that aren't hosted in Terra's Data Library but must be accessible to a group of Terra users.

Managing access to shared assets with groups 

If you have large or changing groups, you can streamline permissions across resources (e.g. billing projects, workflow collections, workspaces and their data assets, even external buckets) by using managed groups. Assigning permissions and access for all resources to the group, instead of to individuals, means you need only make changes in one place (the group) instead of on each individual resource.

Admins can add or remove people from the group at any time, and all workspaces and resources shared with a group will adjust appropriately. When you remove people from the group, they will no longer have access to the resources. 

To learn more, see Managing shared resources with groups and permissions.

How to create a managed group (step-by-step instructions)

1. Create a group by going to your groups page from the main navigation menu at the top left:
Your Name > Groups

2. Here you can add or remove members and assign the admin role to members who should also be able to add or remove people.

3. Once it's set up, you can share workspaces with the group in Terra. Group members will then have access to the workspace, including data in its bucket. If a person is removed from the group, such as when someone leaves a lab, they will no longer have access to the data.

Note that you can also share external buckets with groups created in Terra (since Terra is built on GCP infrastructure). To share with a Terra group, use <group-name>@firecloud.org when assigning roles in the GCP console.
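As an alternative to the GCP console, the same role assignment can be scripted. The following is a minimal sketch, not an official Terra procedure: it only builds a `gsutil iam ch` command that grants a Terra group read access to the objects in an external bucket. The group name `my-lab-group` and bucket name `my-shared-data-bucket` are hypothetical, and the choice of the `objectViewer` role is an assumption about the access you want to give.

```python
import shlex
from typing import List

# From the article: a Terra group is addressed as <group-name>@firecloud.org.
TERRA_GROUP_DOMAIN = "firecloud.org"


def terra_group_email(group_name: str) -> str:
    """Return the email address that identifies a Terra managed group in GCP."""
    return f"{group_name}@{TERRA_GROUP_DOMAIN}"


def grant_group_read_command(group_name: str, bucket: str) -> List[str]:
    """Build (but do not run) a gsutil command granting the group read access."""
    # objectViewer is an assumed role choice; pick the role that fits your case.
    member = f"group:{terra_group_email(group_name)}:objectViewer"
    return ["gsutil", "iam", "ch", member, f"gs://{bucket}"]


# Hypothetical group and bucket names, for illustration only.
cmd = grant_group_read_command("my-lab-group", "my-shared-data-bucket")
print(shlex.join(cmd))
```

The printed command can be reviewed and then run in a terminal by someone who administers the external bucket.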

Sharing smaller datasets, fewer people (workspace storage) 

Terra is designed to protect your data, and you can take advantage of the platform's built-in security by storing shared data in a workspace bucket. Data is protected, meaning you must have the right credentials to access data, regardless of where it lives on the platform. And access, once you set it up, applies across the entire platform.

For example, if data are in a workspace bucket and you have reader, writer, or owner permissions to that workspace, you can access the data from any of your workspaces in Terra. Note that if the data are restricted-access, you will also need to be in all of the required Authorization Domains.

How to make the data accessible to your collaborators (step-by-step)

1. Store the data in the workspace bucket (see Moving data between local storage and the workspace bucket for more details).

2. Share the workspace with collaborators or a group (see the section on managed groups above to learn more about sharing with a group).

3. Grant the individual users, or the group, reader, writer, or owner permission.

How to share data in a workspace bucket

Sharing the workspace where the data are stored with other Terra users is the only way to share data stored in workspace buckets in Terra.

Users cannot themselves make a workspace bucket public to people outside the workspace or outside the Terra platform. Terra admins, however, are able to make a bucket public to registered users for access on Terra.

Who can access shared data in a workspace bucket and who pays?

Workspace owners control access to data by choosing with whom, and at what level, to share the workspace that contains the data. Anyone with reader, writer, or owner permission for a workspace can access the data in the associated bucket from any of their Terra workspaces. Anything that references the full path to a data file, including data tables and workflow configurations, will work seamlessly (the data will appear to be local). If needed, you can use gsutil in a terminal or a notebook to copy data from the original bucket to a different workspace bucket. By default, any generated data are stored in the workspace where the user runs the analysis.
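For the gsutil copy mentioned above, here is a small sketch that assembles the command in a notebook so it can be inspected before running. The bucket names `fc-source-bucket` and `fc-destination-bucket` are placeholders rather than real workspace buckets, and the helper `copy_command` is introduced here purely for illustration.

```python
import shlex
from typing import List


def copy_command(source: str, destination: str, recursive: bool = False) -> List[str]:
    """Build (but do not run) a gsutil command copying data between buckets."""
    cmd = ["gsutil", "cp"]
    if recursive:
        # -r copies a whole "directory" of objects rather than a single file.
        cmd.append("-r")
    return cmd + [source, destination]


# Placeholder paths; substitute the full gs:// paths of your own buckets.
cmd = copy_command(
    "gs://fc-source-bucket/sample1.cram",
    "gs://fc-destination-bucket/inputs/",
)
print(shlex.join(cmd))
```

Keep in mind that a copy creates a second set of files that are stored, and paid for, in the destination workspace bucket.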

Storage costs for shared data in a workspace bucket

The Google Billing Account associated with the Terra Billing Project of the workspace holding the data pays for data storage, as well as for egress charges if someone else downloads the data to their own bucket or local machine.

To avoid egress charges, use an external requester-pays bucket. For more detail, see Configure Google Cloud Storage to prevent egress charges.

Ensuring access to shared data in a workspace bucket

To access shared data, a user must be included in the permissions of the original workspace where the data are stored. Note that access is tracked by a person's Terra user ID, so problems can affect anyone who has multiple user IDs, for example a personal Gmail account as well as an institutional account. The scenario below shows where problems can arise when a user doesn't have access to the original workspace.

  1. User A creates a workspace (Workspace 1) and stores data to be shared with the group in the Workspace 1 bucket.
  2. User A gives their research group (Group X) writer permission for Workspace 1. User B is in the research group and so has access to Workspace 1 and the shared data.
  3. User B makes a copy of Workspace 1 (Workspace 2), whose data table references the data in Workspace 1. The shared input data are not in the Workspace 2 bucket, however.
  4. User B runs the workflows cloned from Workspace 1 in Workspace 2. They run successfully because User B has access to the input data in Workspace 1. The outputs are stored in the Workspace 2 bucket.
  5. User B shares Workspace 2 with User C, who is not in Group X (i.e. does not have access to Workspace 1 with the original shared data).
  6. User C tries running the workflows in Workspace 2, but the workflows fail because User C doesn't have access to Workspace 1 where the data are kept. 
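The scenario above can be traced with a toy model. This sketch is not how Terra implements permissions: it ignores permission levels (reader, writer, owner) and simply treats a workflow run as requiring access both to the workspace itself and to the workspace whose bucket holds the input data. All class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Workspace:
    name: str
    # Users or groups the workspace is shared with (permission levels ignored).
    shared_with: Set[str] = field(default_factory=set)
    # Name of the workspace whose bucket actually holds the input data.
    data_lives_in: str = ""


def has_access(user: str, ws: Workspace, groups: Dict[str, Set[str]]) -> bool:
    """True if the workspace is shared with the user directly or via a group."""
    return any(p == user or user in groups.get(p, set()) for p in ws.shared_with)


def can_run_workflow(user: str, ws: Workspace,
                     workspaces: Dict[str, Workspace],
                     groups: Dict[str, Set[str]]) -> bool:
    """A run needs access to the workspace and to the workspace holding the inputs."""
    source = workspaces[ws.data_lives_in]
    return has_access(user, ws, groups) and has_access(user, source, groups)


groups = {"Group X": {"User A", "User B"}}
workspaces = {
    "Workspace 1": Workspace("Workspace 1", {"User A", "Group X"}, "Workspace 1"),
    # The clone is shared with User C, but its inputs still live in Workspace 1.
    "Workspace 2": Workspace("Workspace 2", {"User B", "User C"}, "Workspace 1"),
}

for user in ("User B", "User C"):
    ok = can_run_workflow(user, workspaces["Workspace 2"], workspaces, groups)
    print(f"{user} can run workflows in Workspace 2: {ok}")
```

In this model User B succeeds and User C fails, matching steps 4 and 6: sharing Workspace 2 does nothing to User C's access to Workspace 1.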

Protecting controlled-access data (Authorization Domains)

Relying upon workspace sharing permissions alone can lead to unauthorized access! This can happen when someone with access makes a copy of the workspace and shares it with someone who is not authorized for the primary workspace.

For additional protection around restricted-access data, you can store shared data assets in the dedicated bucket of a workspace under an Authorization Domain. To enable members of a group to access the data, you must first include the group in the Authorization Domain, then share the workspace. Note that the Authorization Domain requirement is inherited by all clones of the original workspace, ensuring that data in a workspace bucket under an Authorization Domain remain under the same restrictions as the workspace is shared and copied.
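To see why inheritance closes the gap described above, consider this simplified model. It is a sketch under stated assumptions rather than Terra's implementation: access requires that the workspace is shared with the user and that the user belongs to every Authorization Domain on the workspace, and a clone always carries over the domains of its original. The names `Workspace`, `clone`, `share`, `can_access`, and `study-domain` are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class Workspace:
    name: str
    shared_with: FrozenSet[str]
    auth_domains: FrozenSet[str] = frozenset()


def clone(ws: Workspace, new_name: str, owner: str) -> Workspace:
    """A clone starts with only its owner but inherits the Authorization Domains."""
    return Workspace(new_name, frozenset({owner}), ws.auth_domains)


def share(ws: Workspace, user: str) -> Workspace:
    """Return a copy of the workspace that is also shared with the given user."""
    return replace(ws, shared_with=ws.shared_with | {user})


def can_access(user: str, ws: Workspace,
               domain_members: Dict[str, FrozenSet[str]]) -> bool:
    """Sharing alone is not enough: the user must be in every Authorization Domain."""
    in_every_domain = all(
        user in domain_members.get(d, frozenset()) for d in ws.auth_domains
    )
    return user in ws.shared_with and in_every_domain


domain_members = {"study-domain": frozenset({"User A", "User B"})}
original = Workspace("Workspace 1", frozenset({"User A", "User B"}),
                     frozenset({"study-domain"}))

# User B clones the protected workspace and shares the clone with User C.
copy = share(clone(original, "Workspace 2", "User B"), "User C")
print("User B:", can_access("User B", copy, domain_members))
print("User C:", can_access("User C", copy, domain_members))
```

User C is refused even though the clone was shared with them, because the clone still requires membership in the original workspace's Authorization Domain.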

To learn more about protecting controlled-access data, see Managing data privacy and access with Authorization Domains.

Limitations of sharing data in workspace storage

Cloning a workspace doesn’t copy data to the new workspace bucket

Though the data tables in the cloned workspace look populated, they still point to the files' location in the original workspace. Users of cloned workspaces will only be able to run an analysis on the shared data if they have (at minimum) reader permission on the original workspace where the data are stored.

Troubleshooting workflows with shared data can be challenging

Because workspace bucket names are random strings, it can be hard to identify which workspace includes the actual data files from the full path (in a data table or workflow configuration). For example, could you determine which workspace contains the following file: fc-secure-7124e053-c020-4a76-a372-f1bb9272a32d/sample1.cram?
Without an easy way to identify the original workspace, it can be challenging to troubleshoot workflows that are failing because of permissions issues (if the permissions in the two workspaces are different, for example). Permissions could be different even for the same user if they use one login (credential) in one workspace, and another login (with a second user ID and credentials) in the second. The user may be the same, but the lack of the proper credentials in the second workspace could mean workflows that use data in the first for input would fail, apparently for no reason. 
We recommend documenting shared data locations: include the name of, or a link to, the workspace where shared data are stored in the dashboard, so that users of any cloned workspace can trace the data back to the original.
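When troubleshooting, it can help to list which buckets a data table actually points to. The sketch below extracts the bucket name from each file path and groups the paths by bucket. The lookup table `KNOWN_BUCKETS` stands in for the documentation recommended above and has to be maintained by hand; the workspace name in it is hypothetical, and nothing here queries Terra.

```python
from collections import defaultdict
from typing import Dict, Iterable, List
from urllib.parse import urlparse


def bucket_of(path: str) -> str:
    """Return the bucket name from a gs:// URI or a bare bucket/object path."""
    if path.startswith("gs://"):
        return urlparse(path).netloc
    return path.split("/", 1)[0]


def group_by_bucket(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group file paths by the bucket that stores them."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        grouped[bucket_of(path)].append(path)
    return dict(grouped)


# Hand-maintained record of where shared data live (workspace name is hypothetical).
KNOWN_BUCKETS = {
    "fc-secure-7124e053-c020-4a76-a372-f1bb9272a32d": "Workspace 1",
}

paths = [
    "fc-secure-7124e053-c020-4a76-a372-f1bb9272a32d/sample1.cram",
    "gs://fc-secure-7124e053-c020-4a76-a372-f1bb9272a32d/sample2.cram",
]
for bucket, files in group_by_bucket(paths).items():
    workspace = KNOWN_BUCKETS.get(bucket, "unknown workspace")
    print(f"{bucket} ({workspace}): {len(files)} file(s)")
```

Any bucket reported as an unknown workspace is a candidate for the permissions problems described in this section.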

Users cannot make a workspace bucket public without a request to the Terra team

This means all collaborators must have reader, writer, or owner permissions on the workspace where the data are stored for access. Period. 

Sharing large datasets, multiple studies (external buckets)

To share large data resources without collaborators copying the files into hundreds of workspaces, and without the host paying for other people's downloads, store the data in an external, requester-pays bucket.

Advantages of using external buckets for sharing large datasets

For large datasets used across multiple studies, we recommend using external Google buckets, which can be set up as requester-pays buckets. You control access to the buckets by assigning individual or group permissions, much like you do in Terra. For external buckets, however, you do this in the Google Cloud Platform console instead of directly in Terra.

Using external (requester-pays) buckets:

  • Keeps you from losing shared data if someone in the group inadvertently deletes the workspace.
  • Helps you keep track of data, because you can name the external GCP bucket that stores the data rather than relying on the random-string names of workspace buckets.
  • Makes the data accessible while minimizing cost. The host pays only storage costs; people who want to download the data pay for that themselves.
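As a rough illustration of the requester-pays arrangement, the sketch below builds two gsutil commands: one the bucket owner could use to turn requester pays on, and one a collaborator could use to download a file while naming the Google project to be billed. The bucket, object, and project names are hypothetical, and the commands are only printed, not run.

```python
import shlex
from typing import List


def enable_requester_pays_command(bucket: str) -> List[str]:
    """Build the owner's command to enable requester pays on a bucket."""
    return ["gsutil", "requesterpays", "set", "on", f"gs://{bucket}"]


def download_command(bucket: str, object_path: str, destination: str,
                     billing_project: str) -> List[str]:
    """Build a collaborator's download command; -u names the project to bill."""
    source = f"gs://{bucket}/{object_path}"
    return ["gsutil", "-u", billing_project, "cp", source, destination]


# Hypothetical names, for illustration only.
print(shlex.join(enable_requester_pays_command("my-shared-data-bucket")))
print(shlex.join(download_command(
    "my-shared-data-bucket", "study1/sample1.cram", ".", "collaborator-project"
)))
```

The split mirrors the cost model described above: the owner's project pays for storage, while the project passed with `-u` pays for the download.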

A downside to using external buckets is that this approach may circumvent Terra's built-in security around accessing data, and put responsibility for security solely on the data owner.

For help with setting up external and requester-pays buckets, please contact support (Support > Contact Us in the main menu dropdown at the top left of any page in Terra).

Additional resources

See Configure Google Cloud Storage to prevent egress charges and other articles in the Data Submitters Resources section.

 
