Best practices for sharing and protecting data resources

Allie Hajian

In the Data Biosphere's cloud-based bioinformatics model, researchers access data shared in a central location, rather than each making a separate copy for their own analysis. Hosted datasets in Terra's Data Library, both public- and restricted-access, are an example of this model. This article covers best practices for sharing data across teams and for handling access to datasets that aren't hosted in Terra's Data Library but must be accessible to a group of Terra users.

Managing access to shared assets with groups 

If you have large or changing groups, you can streamline permissions across resources (e.g. billing projects, workflow collections, workspaces and their data assets, even external buckets) by using managed groups. When you assign permissions to the group instead of to individuals, you only need to make changes in one place (the group) rather than on each individual resource.

Admins can add or remove people from the group at any time, and all workspaces and resources shared with a group will adjust appropriately. When you remove people from the group, they will no longer have access to the resources. 
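The idea that a single group change propagates to every shared resource can be sketched with a small in-memory model. This is only an illustration: `ManagedGroup`, `Resource`, and the example email addresses are hypothetical names invented here, not part of Terra's API.

```python
# Hypothetical in-memory model of group-based sharing (not the Terra API).
class ManagedGroup:
    def __init__(self, name):
        self.name = name
        self.members = set()  # user emails


class Resource:
    """A workspace, billing project, or external bucket shared with groups."""

    def __init__(self, name):
        self.name = name
        self.shared_with = []  # groups that have been granted access

    def can_access(self, user):
        # Access is resolved through group membership at lookup time.
        return any(user in group.members for group in self.shared_with)


lab = ManagedGroup("example-lab")
lab.members.update({"alice@example.org", "bob@example.org"})

workspace = Resource("analysis-workspace")
bucket = Resource("external-bucket")
for resource in (workspace, bucket):
    resource.shared_with.append(lab)

# One change on the group revokes access on every resource shared with it.
lab.members.discard("bob@example.org")
```

Because each resource stores a reference to the group rather than a list of individuals, removing a member from the group is the only change needed.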

To learn more about the structure of permissions, groups and billing in Terra, see this article

How to create a managed group

1. Create a group by going to your groups page from the main navigation menu at the top left:
"Your Name" --> "Groups" 
S48_Creating_groups_Screen_Shot.png

2. On the "Groups" page in Terra, add or remove members, and assign admin roles so that others in the group can also add or remove people.

3. Once it's set up, you can share workspaces with the group in Terra. The members of the group will then have access to the workspace, including data in its bucket. If a person is removed from the group, such as when someone leaves a lab, they will no longer have access to the data. Note that you can also share external buckets with Groups created in Terra (since Terra is built on the GCP infrastructure). 

Sharing data resources - smaller datasets, fewer people (workspace storage) 

Terra is designed to protect your data, and you can take advantage of the platform's built-in security by storing shared data in a workspace bucket. Data is protected - you must have the right credentials to access data, regardless of where it lives on the platform. And access, once you set it up, applies across the entire platform.

For example, if data are in a workspace bucket and you have reader, writer, or owner permissions to that workspace, you can access the data from any of your workspaces in Terra. Note that if the data are restricted-access, you will also need to be in all of the required Authorization Domain(s). 

To make the data accessible to your collaborators, follow these steps:

  1. Store the data in a workspace bucket (see this article for more details)
  2. Share the workspace with your collaborators or with a group (see above to learn more about sharing with a managed group)
  3. Grant the individual users, or the group, reader, writer, or owner permission

Sharing the workspace where the data are stored with other Terra users is the only way to share data stored in workspace buckets in Terra.

It's not possible, in the interface, for Terra users to make a workspace bucket public for people outside the workspace or outside the Terra platform. Note that Terra admins are able to make a bucket public to registered users for access on Terra.

Who can access shared data in a workspace bucket and who pays?

You can control access to your data when you choose with whom and at what level to share the workspace that contains the data. Anyone with reader, writer or owner permission for a workspace will be able to access the data in the associated bucket from any of their Terra workspaces. That means they can analyze the data with a workflow or in a notebook. Any generated data will be stored by default in the workspace where they are doing the analysis.

Anything that references a full path to a data file, including data tables and workflow configurations, will work seamlessly (the data will appear to be local). You can use gsutil in a terminal or a notebook to copy data from the original bucket to a different workspace bucket, if needed.
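As a rough sketch of the gsutil copy mentioned above, the snippet below builds a `gsutil cp` command from Python, for example in a notebook. The source bucket name is the example used later in this article; the destination bucket name and the helper `build_gsutil_copy` are hypothetical. The command is only printed here, not executed.

```python
import shlex


def build_gsutil_copy(src_bucket, object_path, dest_bucket):
    """Return the argv for copying one object between two buckets with gsutil."""
    src = f"gs://{src_bucket}/{object_path}"
    dst = f"gs://{dest_bucket}/{object_path}"
    return ["gsutil", "cp", src, dst]


cmd = build_gsutil_copy(
    "fc-secure-7124e053-c020-4a76-a372-f1bb9272a32d",  # bucket holding the shared data
    "sample1.cram",
    "fc-destination-bucket",  # hypothetical destination workspace bucket
)
print(shlex.join(cmd))

# To actually run the copy (requires gsutil and access to both buckets):
# import subprocess
# subprocess.run(cmd, check=True)
```

The copy only succeeds if the account running it can read the source bucket and write to the destination bucket.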

Protecting controlled-access data (Authorization Domains)

See this diagram for an example of sharing data files (who has access and who doesn't) as well as how relying upon workspace sharing permissions alone can lead to unauthorized access.
Unauthorized_access_with_sharing_permissions.png

For additional protection around restricted-access data, you can store shared data assets in the dedicated bucket of a workspace under an Authorization Domain. To enable members of your group to access the data you must first include the group in the Authorization Domain, then share the workspace. Note that the Authorization Domain requirement is inherited by all clones of the original workspace, ensuring that data in a workspace bucket under an Authorization Domain remain under the same restrictions as the workspace is shared and copied.
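The two-part check described above (workspace sharing plus membership in every Authorization Domain) and the inheritance by clones can be illustrated with a toy model. All names here (`Workspace`, `can_access`, the group and user names) are hypothetical and simplified; for brevity the workspace is shared with individual users rather than with a group, and this is not how Terra implements the check internally.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    name: str
    auth_domains: frozenset  # groups a user must ALL belong to
    shared_with: set = field(default_factory=set)  # users the workspace is shared with

    def clone(self, new_name):
        # A clone inherits the Authorization Domain; sharing starts empty.
        return Workspace(new_name, self.auth_domains)


def can_access(user, workspace, memberships):
    """True only if the workspace is shared with the user AND they are in every domain."""
    in_all_domains = workspace.auth_domains <= memberships.get(user, set())
    return user in workspace.shared_with and in_all_domains


memberships = {"dana": {"controlled-data-group"}, "carol": set()}

original = Workspace("original", frozenset({"controlled-data-group"}))
original.shared_with.update({"dana", "carol"})

copy = original.clone("copy")
copy.shared_with.add("carol")  # sharing a clone does not bypass the domain
```

In this sketch, `carol` is denied on both the original and the clone, because sharing alone never satisfies the Authorization Domain requirement.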

To learn more about setting up Authorization Domains to protect controlled-access data, see this article.
O2b_May30_2019.png

Storage costs for shared data in a workspace bucket 
The Google Billing Account associated with the Terra Billing Project of the workspace holding the data pays for data storage. It also pays egress charges if someone else downloads the data to their own bucket or local machine.

Ensuring access to shared data in a workspace bucket

To access shared data, a user must be included in the permissions of the original workspace where the data are stored. Note that access is tracked by a person's Terra user ID, so problems can affect anyone who has multiple user IDs, for example a personal Gmail account as well as an institutional account. See the scenario below to understand where problems can arise when a user doesn't have access to the original workspace.

O10b_Shared_data_scenario_in_Terra.png

  1. User A creates a workspace (Workspace 1) and stores data to be shared with the group in the Workspace 1 bucket.
  2. User A gives their research group (Group X) writer permission for Workspace 1. User B is in the research group and has access to Workspace 1 and the shared data.
  3. User B makes a copy of Workspace 1 (Workspace 2), which references the data in Workspace 1 in its data table. The shared input data are not in the Workspace 2 bucket, however.
  4. User B runs the workflows cloned from Workspace 1 in Workspace 2. They run successfully because User B has access to the input data in Workspace 1. The outputs are stored in the Workspace 2 bucket.
  5. User B shares Workspace 2 with User C, who is not in Group X (i.e. does not have access to Workspace 1 with the original shared data).
  6. User C tries running the workflows in Workspace 2, but the workflows fail because User C doesn't have access to Workspace 1 where the data are kept. 
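The scenario above can be condensed into a few lines of Python. This is a toy model of the outcome only; the dictionaries and the `can_run_workflow` helper are hypothetical and do not reflect how Terra checks permissions internally.

```python
# Who can read each workspace (User B reaches Workspace 1 through Group X).
workspace_readers = {
    "Workspace 1": {"User A", "User B"},
    "Workspace 2": {"User B", "User C"},
}

# Workspace 2's data table still points at files in the Workspace 1 bucket.
input_location = {"Workspace 2": "Workspace 1"}


def can_run_workflow(user, workspace):
    """A run needs access to the workspace AND to wherever its inputs live."""
    data_home = input_location.get(workspace, workspace)
    return (
        user in workspace_readers[workspace]
        and user in workspace_readers[data_home]
    )


print(can_run_workflow("User B", "Workspace 2"))
print(can_run_workflow("User C", "Workspace 2"))
```

User B passes both checks, while User C passes only the first, which is why the same workflow succeeds for one and fails for the other.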

Limitations of sharing data in workspace storage

Cloning a workspace doesn’t copy data to the new workspace bucket

Though the data tables in the cloned workspace look populated, they will still be pointing to the original workspace location of the files. Users of cloned workspaces will only be able to run an analysis on the shared data if they have (at minimum) read permission on the original workspace where the data are stored.

Troubleshooting workflows with shared data can be challenging

Because workspace bucket names are random strings, it can be hard to identify which workspace includes the actual data files from the full path (in a data table or workflow configuration). For example, could you determine which workspace contains the following file: fc-secure-7124e053-c020-4a76-a372-f1bb9272a32d/sample1.cram?
Without an easy way to identify the original workspace, it can be challenging to troubleshoot workflows that fail because of permissions issues (if the permissions in the two workspaces are different, for example). Permissions can differ even for the same person if they use one login in one workspace and a different login (with a second user ID and credentials) in the other. Without the proper credentials in the second workspace, workflows that take input data from the first would fail, apparently for no reason.
We recommend documenting shared data locations: in the dashboard, include the name of, or a link to, the workspace where the shared data are stored, so that any cloned workspace can be traced back to the original.
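One way to support this kind of documentation is to list the distinct buckets that a set of file paths refers to, so each can be matched to a workspace and noted in the dashboard. The sketch below assumes paths of the form `gs://bucket/object` (or the same without the `gs://` prefix); the helper names and the second and third example paths are hypothetical.

```python
def bucket_of(path):
    """Return the bucket name from a path like gs://bucket/dir/file."""
    prefix = "gs://"
    without_scheme = path[len(prefix):] if path.startswith(prefix) else path
    return without_scheme.split("/", 1)[0]


def buckets_referenced(paths):
    """Return the sorted, de-duplicated bucket names used by a list of paths."""
    return sorted({bucket_of(p) for p in paths})


paths = [
    "gs://fc-secure-7124e053-c020-4a76-a372-f1bb9272a32d/sample1.cram",
    "gs://fc-secure-7124e053-c020-4a76-a372-f1bb9272a32d/sample2.cram",  # hypothetical
    "gs://fc-hypothetical-bucket/outputs/sample1.vcf",  # hypothetical
]
for bucket in buckets_referenced(paths):
    print(bucket)
```

The resulting list still has to be mapped to workspace names by hand, which is why recording the source workspace up front is the more reliable habit.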

Users cannot make a workspace bucket public without a request to the Terra team

This means all collaborators must have reader, writer, or owner permissions on the workspace where the data are stored for access. Period. 

Sharing data resources - large datasets, multiple studies (external buckets)

One way to share data resources that others can use without copying large data files to hundreds of workspaces, while avoiding hosting (i.e. paying for) large data files for other users or projects, is to use an external, requester-pays bucket for the data.

Advantages of using external buckets for sharing large datasets

For large datasets used across multiple studies, we recommend using external Google buckets (which may be requester-pays buckets). You control access to the buckets by assigning individual or group permissions, much like you do in Terra. For external buckets, however, you would do this in the Google Cloud Platform console instead of directly in Terra.
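For orientation, a group permission on a Google bucket takes the form of an IAM binding that pairs a role with a list of members. The sketch below only builds and prints such a binding as a plain dictionary; it does not call any Google API. The group email is hypothetical, and `roles/storage.objectViewer` (read-only access to objects) is just one role you might choose.

```python
import json


def viewer_binding(group_email):
    """Build an IAM binding granting a group read-only access to bucket objects."""
    return {
        "role": "roles/storage.objectViewer",
        "members": [f"group:{group_email}"],
    }


# Hypothetical group email; use the address shown for your own group.
binding = viewer_binding("example-lab@example.org")
print(json.dumps(binding, indent=2))
```

Granting the role to a group rather than to individuals keeps the bucket's permissions in step with group membership, in the same way as sharing a workspace with a group.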

Using external buckets 

  • Keeps you from losing shared data if someone in the group inadvertently deletes the workspace
  • Helps keep track of data because you can name the external GCP bucket storing the data, rather than using the random-string names of workspace buckets. 
  • Makes the data accessible while minimizing cost. The host pays only the storage costs; people who download the data pay for that themselves.

A downside to using external buckets is that this approach may circumvent Terra's built-in security around accessing data, and put responsibility for security solely on the data owner.

For help with setting up external and requester pays workspace buckets, please contact support ("Support" --> "Contact Us" in the main menu dropdown at the top left of any page in Terra).

 
