How to create snapshots in TDR (GCP)

Anton Kovalsky

Step-by-step instructions for creating a Snapshot of a TDR dataset: a subset of the data at a specific point in time, which can be analyzed on the Cloud and shared with other researchers. 

This article explains how to create Snapshots of Google-backed TDR datasets. To learn how to create Snapshots of Azure-backed TDR datasets, see How to create snapshots in TDR (Azure).

Snapshots overview

Snapshots in TDR streamline data delivery and sharing. Previously, data that was not available through Terra's data library or featured workspaces could only be shared with users who had access to the workspace where that data was already staged.

The Terra Data Repository (data.terra.bio) allows data custodians (e.g. PIs, sequencing centers, research organizations, etc.) to grant access to large datasets without worrying about workspace permissions. Users with access to the dataset can create subsets - called snapshots - of precisely the data of interest, as long as custodians have set up the datasets to include assets specifying which data columns may be included in snapshots. 

Diagram schematizing a TDR dataset with an asset and a snapshot. The dataset consists of one data table with 5 columns. Blue rectangles highlight 3 of these columns, which comprise an asset. Orange rectangles highlight individual rows (queries). Green rectangles highlight individual cells at the overlap between the asset and the queries -- these represent the contents of the snapshot.
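
The relationship shown in the diagram can be sketched in a few lines of Python. This is only an illustration of the concept, not how TDR stores or computes snapshots: the table contents, the column names, and the take_snapshot helper are all hypothetical.

```python
# Toy model of the diagram: a one-table dataset, an asset, and a row query.
# All names and values here are made up for illustration.

dataset_table = [
    {"sample_id": "s1", "donor": "d1", "tissue": "liver", "file_ref": "f1", "notes": "a"},
    {"sample_id": "s2", "donor": "d2", "tissue": "heart", "file_ref": "f2", "notes": "b"},
    {"sample_id": "s3", "donor": "d3", "tissue": "liver", "file_ref": "f3", "notes": "c"},
]

# The asset names the columns that may appear in a snapshot (3 of the 5).
asset_columns = ["sample_id", "tissue", "file_ref"]


def take_snapshot(rows, columns, row_filter):
    """Keep only the rows matching row_filter, and only the asset's columns."""
    return [{col: row[col] for col in columns} for row in rows if row_filter(row)]


# The "query" selects rows; the asset selects columns; the snapshot is the overlap.
snapshot = take_snapshot(dataset_table, asset_columns, lambda r: r["tissue"] == "liver")
for row in snapshot:
    print(row)
```

The snapshot contains only the cells where the selected rows and the asset's columns overlap, which corresponds to the green rectangles in the diagram.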

Snapshot permissions

You can control who has access to a TDR snapshot, and what kind of access they have, by assigning users (or groups of users) a specific role.

  • Snapshot Steward: can view the snapshot, view and edit who has access to the snapshot and their permissions, and export the snapshot's data to Terra.
  • Snapshot Reader: can view the snapshot and export the snapshot's data to Terra. Snapshot readers cannot view or edit who has access to the snapshot.
  • Snapshot Discoverer: can see that the snapshot exists, and therefore can identify data snapshots that might be relevant to their work and request access to them. However, snapshot discoverers cannot view the data in the snapshot, export it to Terra, or view or edit who has access to the snapshot.
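
The three roles can also be summarized as a small lookup table. The sketch below only restates the list above; the action labels (discover, view_data, export_to_terra, manage_access) are hypothetical names chosen for this example, not TDR API identifiers.

```python
# Capabilities per snapshot role, transcribed from the list above.
# The action labels are hypothetical, chosen only for this illustration.
ROLE_CAPABILITIES = {
    "steward": {"discover", "view_data", "export_to_terra", "manage_access"},
    "reader": {"discover", "view_data", "export_to_terra"},
    "discoverer": {"discover"},
}


def can(role, action):
    """Return True if the given role includes the given capability."""
    return action in ROLE_CAPABILITIES.get(role, set())


print(can("reader", "export_to_terra"))   # readers can export to Terra
print(can("reader", "manage_access"))     # but cannot manage access
print(can("discoverer", "view_data"))     # discoverers cannot view the data
```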

Three ways to create Snapshots

To create Snapshots, you must first have access to a dataset in the Data Repo. You can see a full list of datasets to which you have access in the Datasets tab on the Terra Data Repo website.

Option 1. Create Snapshots on the Data Repository's Website

If you're not very familiar with using API endpoints, you can create snapshots through the Data Repository's website. 

1.1. Select a dataset

After you log in to data.terra.bio, click on the Datasets tab to see all of the datasets that you have access to. Click on the name of a dataset to view its contents.

Then, click on View Dataset Data to see the dataset's data tables:

Screenshot of an example TDR dataset on the TDR website. An orange rectangle highlights the 'view dataset data' button at the top left of the screen.

You can toggle between tables using the dropdown menu near the top left of your screen:

Screenshot of an example TDR dataset on the TDR website. An orange rectangle highlights the drop-down menu used to select a specific table in the dataset to view.

1.2. Filter data for your Snapshot

Once you've clicked View Dataset Data to see your dataset's tables, you will see buttons on the right-hand panel of your screen that open the dataset information and Snapshot creation menus.

Screenshot of the view of a table in an example dataset on the TDR website. An orange rectangle and arrow highlight the filter button on the right-hand panel of the screen, which looks like an inverted triangle made up of three horizontal lines.

Clicking the dashed triangle icon will open the Snapshot creation widget, where you can select your desired data based on the available filters. You can use the filter search bar to locate a specific column name (note that the search is case-sensitive).

Screenshot of the menu used to select the data rows to include in a TDR Snapshot.

1.3. Select assets for your Snapshot

The asset specifies which columns from the dataset to include in your Snapshot.

Once you've filtered for your desired rows, click "Create Snapshot" at the bottom of the widget. You'll name your Snapshot and select an asset in the Add Details pane (screenshot below).

Screenshot of the menu used to select the Snapshot's asset.

  • If you're not sure which assets include which data, you can look it up using the retrieveDataset API endpoint in Swagger. The dataset's assets will be stored in an "assets" field for each table, which will list the columns that belong to that asset.

    Remember to authorize Swagger every time you use it. See How to authenticate/troubleshoot Swagger for TDR for step-by-step instructions.

    1. Once you've authenticated yourself on the Swagger page, click "Try it out" in the top right corner of the API endpoint.

    2. Once it's active, use the UUID for the dataset you're interested in as the input for the UUID field - you can find this UUID in the dataset's summary tab on the TDR website - and select SCHEMA from the menu right beneath:

    Screenshot of the dataset summary page for an example TDR dataset. An orange rectangle highlights the Dataset ID field.

    3. Scroll down to where you can click execute.

    4. Then scroll down to the response body and scroll through it until you see the "assets" section of the schema. You'll see a list of assets; each asset in turn lists the names of the columns it includes.

    Screenshot of the retrieveDataset Swagger API endpoint. A red arrow highlights the 'Execute' button used to submit a retrieveDataset API job. Another red arrow highlights the 'assets' field in the response body, and a red rectangle highlights the contents of an example asset.
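
If you save the response body from retrieveDataset, a short script can list each asset's columns for you instead of scrolling through the JSON. The sketch below assumes the response nests assets under schema.assets, with each asset holding a "name" and a list of "tables" that each carry "name" and "columns". That layout is an assumption based on the description above, so compare it against your own response body before relying on it. The columns_by_asset helper and the sample response are hypothetical.

```python
import json


def columns_by_asset(dataset):
    """Map each asset name to {table name: [column names]}.

    Assumes the layout schema -> assets -> tables -> columns; adjust the
    keys if your retrieveDataset response is shaped differently.
    """
    assets = dataset.get("schema", {}).get("assets", [])
    return {
        asset["name"]: {
            table["name"]: list(table.get("columns", []))
            for table in asset.get("tables", [])
        }
        for asset in assets
    }


# Hypothetical, heavily trimmed response body for illustration only.
response_body = json.loads("""
{
  "name": "tdr_example_dataset",
  "schema": {
    "assets": [
      {
        "name": "default",
        "tables": [
          {"name": "sample", "columns": ["sample_id", "tissue"]}
        ]
      }
    ]
  }
}
""")

for asset_name, tables in columns_by_asset(response_body).items():
    for table_name, columns in tables.items():
        print(f"{asset_name}: {table_name} -> {', '.join(columns)}")
```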

1.4. Share the Snapshot

Once you've selected your asset and named your Snapshot, clicking "Next" will take you to the Data Release view, where you can add other Terra users (including groups).

Screenshot of the screen used to share a TDR snapshot with other users.

To add a user (or group of users) to the snapshot, enter their Terra ID (the email associated with their Terra account) in the People field and select their role from the Permissions drop-down menu.

Once you've added a user to a snapshot, they will be able to see it under the "Snapshots" tab when they log into data.terra.bio.

1.5. Create the Snapshot

Click Create snapshot to create the snapshot. 

Option 2. Create a Snapshot using the Swagger API

While the UI option is nice - especially for exploratory purposes - using the API can be an efficient way to build exactly the Snapshot you want. You can also use API calls programmatically to automatically generate Snapshots on a regular cadence (for example, as new data come in). 

There are three flavors of Snapshot creation when using API requests: full view (i.e., the entire dataset), by row ID, or by inclusion criteria expressed as a SQL query. All three use the createSnapshot API endpoint.

Remember to authorize Swagger every time you use it. See How to authenticate/troubleshoot Swagger for TDR for step-by-step instructions.

Prerequisites

  • You may need a Profile ID, which you get by generating a Spend Profile. If you do not specify a Profile ID, TDR will use your default Profile ID.
  • If you want to create and share your snapshot in the same step, you need to have a list of stewards, readers, and/or discoverers ready - this should be in the form of an array of Terra identities (emails). You can also share the snapshot after it has been created, using the addSnapshotPolicyMember endpoint.
  • Often, the simplest and most convenient Snapshot to share is one that contains the entire dataset. This is known as "full view" mode. To create a Snapshot in this mode, adapt the JSON below.

    createSnapshot API request body ("full view" snapshot)

    {
      "name": "full_view_example_snapshot",
      "description": "full view snapshot of example DR Dataset",
      "profileId": "/*your preferred Spend Profile ID*/",
      "contents": [
        {
          "datasetName": "tdr_example_dataset",
          "mode": "byFullView"
        }
      ],
      "policies": {
        "readers": [
          "example@email.com",
          "example2@email.com"
        ]
      }
    }

    Once executed, anyone listed as a reader or steward in the "policies" field will be able to see the snapshot under their "Snapshots" tab in the TDR web interface.

  • Another way to create a Snapshot is to provide the Data Repo row IDs and column names to include in the Snapshot, for every table that should be included.

    The row IDs can be obtained by querying BigQuery:

    SELECT datarepo_row_id
    FROM `my-project.my-dataset.my-table`
    WHERE column1 = "value"

    createSnapshot API request body (by row ID)

    {
      "name": "my_row_id_snapshot",
      "profileId": "/*your Spend Profile ID*/",
      "contents": [
        {
          "mode": "byRowId",
          "datasetName": "my_dataset",
          "rowIdSpec": {
            "tables": [
              {
                "tableName": "my_first_table",
                "columns": ["column1", "column2"],
                "rowIds": ["1111-2222-3333", "333-2222-1111"]
              },
              {
                "tableName": "my_second_table",
                "columns": ["column_a", "column_b"],
                "rowIds": ["AAAA-BBBB-CCCC", "CCCC-BBBB-AAAA"]
              }
            ]
          }
        }
      ]
    }

  • A third mode for creating a Snapshot is to define inclusion criteria. To do this, you convert a BigQuery-supported SQL query directly into a Snapshot.

    createSnapshot API request body (SQL query)

    {
      "contents": [
        {
          "datasetName": "encode",
          "mode": "byQuery",
          "querySpec": {
            "assetName": "default",
            "query": "SELECT encode.bams.datarepo_row_id FROM encode.bams WHERE encode.bams.create_date > '2021-05-06T04:00:00'"
          }
        }
      ],
      "description": "Encode Aug 2021 release",
      "name": "encode_Aug2021_release",
      "profileId": "<uuid>",
      "policies": {
        "readers": ["example1@email.com", "example2@email.com"]
      }
    }
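
If you plan to generate Snapshots programmatically, the three request bodies above can be assembled from a few small helpers rather than edited by hand. The following is a minimal sketch, not an official client. The helper names are hypothetical, and both the endpoint URL and the use of a bearer token in the Authorization header are assumptions that you should confirm against the createSnapshot entry in Swagger. The script builds the request but does not send it.

```python
import json
import urllib.request

# Assumed URL for the createSnapshot endpoint; confirm it in Swagger.
CREATE_SNAPSHOT_URL = "https://data.terra.bio/api/repository/v1/snapshots"


def snapshot_body(name, description, profile_id, content, readers=()):
    """Wrap one 'contents' entry in the fields shared by all three modes."""
    body = {
        "name": name,
        "description": description,
        "profileId": profile_id,
        "contents": [content],
    }
    if readers:
        body["policies"] = {"readers": list(readers)}
    return body


def full_view(dataset_name):
    """'contents' entry for a full view Snapshot."""
    return {"datasetName": dataset_name, "mode": "byFullView"}


def by_row_id(dataset_name, tables):
    """'contents' entry for byRowId; tables is a list of dicts with
    tableName, columns, and rowIds keys."""
    return {
        "datasetName": dataset_name,
        "mode": "byRowId",
        "rowIdSpec": {"tables": list(tables)},
    }


def by_query(dataset_name, asset_name, query):
    """'contents' entry for byQuery."""
    return {
        "datasetName": dataset_name,
        "mode": "byQuery",
        "querySpec": {"assetName": asset_name, "query": query},
    }


def create_snapshot_request(body, access_token):
    """Build (but do not send) the POST request. Bearer auth is assumed."""
    return urllib.request.Request(
        CREATE_SNAPSHOT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {access_token}",
        },
        method="POST",
    )


body = snapshot_body(
    name="full_view_example_snapshot",
    description="full view snapshot of example dataset",
    profile_id="<your Spend Profile ID>",
    content=full_view("tdr_example_dataset"),
    readers=["example@email.com"],
)
request = create_snapshot_request(body, access_token="<your access token>")
print(json.dumps(body, indent=2))
# To submit, pass the request to urllib.request.urlopen(request).
```

Swapping full_view(...) for by_row_id(...) or by_query(...) produces the other two request bodies with the same shared fields.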

Option 3. Create a Snapshot using Zebrafish

Zebrafish is only available for Google-backed data. If your data are staged in a Google bucket and your TDR dataset's cloud provider is Google, you can use Zebrafish to create, edit, and snapshot your data. Otherwise, you must use either the TDR web interface or the Swagger API endpoints.

When uploading data to your repository using the Zebrafish interface, you can choose to create a full view snapshot at the same time. A full view snapshot includes the entire dataset, rather than allowing you to share specific columns or rows. Note that Zebrafish can only create a snapshot as part of an upload; you cannot use it to create a snapshot of data that has already been uploaded.

1. Log into Zebrafish

2. Click on the pipeline monitoring icon at the top left of the page:

Screenshot of the pipeline monitoring dashboard on Zebrafish. An orange box highlights a wavy line at the top left of the page, which is the button used to navigate to the pipeline monitoring dashboard.

3. Click on New Ingestion at the top right of the page:

Screenshot of the pipeline monitoring dashboard on Zebrafish. An orange box highlights the New Ingestion button at the top right of the page.

4. When filling out the ingestion details for the new ingestion, click on the toggle next to CREATE SNAPSHOT.

Screenshot of the form used to create a new ingestion on Zebrafish. An orange box highlights the Create Snapshot toggle, which is used to specify whether you wish to create a snapshot of the entire dataset when you upload the data.

5. Continue creating your dataset by following the instructions in How to create a TDR dataset and ingest data with Zebrafish.

What to expect

Once you've successfully created your snapshot, you should be able to see it under the Snapshots tab on the Terra Data Repository website, and under the specific dataset's own Snapshots tab.

Screenshot showing the contents of an example 'Snapshots' tab on the Terra Data Repository website. An orange box highlights the 'Snapshots' tab at the upper left of the screen.

 

Screenshot showing the contents of the 'Snapshots' tab for an example dataset on the Terra Data Repository website. An orange box highlights the 'Snapshots' tab for this specific dataset.

Note that Snapshots are immutable: they cannot be subset, and Terra cannot overwrite them or add new data to them during an analysis.
