Step-by-step instructions for creating a Snapshot of a TDR dataset: a subset of the data at a specific point in time, which can be analyzed on the Cloud and shared with other researchers.
This article explains how to create Snapshots of Azure-backed TDR datasets. To learn how to create Snapshots of Google-backed TDR datasets, see How to create snapshots in TDR (GCP).
Snapshots overview
Snapshots in TDR streamline data delivery and sharing. Previously, data inaccessible through Terra's data library or featured workspaces could only be shared if you had access to the workspace where that data was already staged.
The Terra Data Repository (data.terra.bio) allows data custodians (e.g. PIs, sequencing centers, research organizations, etc.) to grant access to large datasets without worrying about workspace permissions. Users with access to the dataset can create subsets - called snapshots - of precisely the data of interest, as long as custodians have set up the datasets to include assets specifying which data columns may be included in snapshots.
Snapshot permissions
You can control who has access to a TDR snapshot, and what kind of access they have, by assigning users (or groups of users) a specific role.
- Snapshot Steward: can view the snapshot, view and edit who has access to the snapshot and their permissions, and export the snapshot's data to Terra.
- Snapshot Reader: can view the snapshot and export the snapshot's data to Terra. Snapshot readers cannot view or edit who has access to the snapshot.
- Snapshot Discoverer: can see that the snapshot exists, and therefore can identify data snapshots that might be relevant to their work and request access to them. However, snapshot discoverers cannot view the data in the snapshot, export it to Terra, or view or edit who has access to the snapshot.
Three ways to create Snapshots
- Option 1: On the Terra Data Repo website (in a browser)
- Option 2: Using the Terra Data Repo's Swagger API
- Option 3: Using Zebrafish (only available for datasets on the Google Cloud, and only for full-view Snapshots)
To create Snapshots, you must first have access to a dataset in the Data Repo. You can see a full list of datasets to which you have access in the Datasets tab on the Terra Data Repo website.
Option 1. Create Snapshots on the Data Repository's Website
If you're not very familiar with using API endpoints, you can create snapshots through the Data Repository's website.
1.1. Select a dataset
After you log into data.terra.bio, click on the Datasets tab to see all of the datasets that you have access to. Click on the name of a dataset to view the contents.
Then, click on View Dataset Data to see the dataset's data tables:
You can toggle between tables using the dropdown menu near the top left of your screen:
1.2. Filter data for your Snapshot
Once you've clicked View Dataset Data to see your dataset's tables, buttons that open the dataset information and Snapshot creation menus appear in the right-hand panel of your screen.
Clicking the dashed triangle icon opens the Snapshot creation widget, where you can select your desired data using the available filters. Use the filter search bar to locate a specific column name (note that the search bar is case sensitive).
1.3. Select assets for your Snapshot
The asset specifies which columns from the dataset to include in your Snapshot.
Once you've filtered for your desired rows, click "Create Snapshot" at the bottom of the widget. You'll name your Snapshot and select an asset in the Add Details pane (screenshot below).
If you're not sure which assets include which data, you can look it up using the retrieveDataset API endpoint in Swagger. The dataset's assets will be stored in an "assets" field for each table, which will list the columns that belong to that asset.
Remember to authorize Swagger every time you use it. See How to authenticate/troubleshoot Swagger for TDR for step-by-step instructions.
1. Once you've authenticated yourself on the Swagger page, click "Try it out" in the top right corner of the API endpoint.
2. Once it's active, use the UUID for the dataset you're interested in as the input for the UUID field - you can find this UUID in the dataset's summary tab on the TDR website - and select SCHEMA from the menu right beneath:
3. Scroll down to where you can click execute.
4. Then scroll down to the response body and scroll through it until you see the "assets" section of the schema. You'll see a list of assets and within that list each asset will also have a list of the names of the columns included in that asset.
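If you save the response body, the lookup described in the steps above can also be done in a few lines of Python. This is a minimal sketch, not the repository's own tooling. It assumes the response has a "schema" object containing an "assets" list, where each asset has a "name" and a list of "tables", each with a "name" and "columns". That shape is an assumption, so check it against the response you actually see in Swagger. The function name and the sample data are hypothetical.

```python
from typing import Dict, List


def columns_by_asset(dataset: dict) -> Dict[str, Dict[str, List[str]]]:
    """Map each asset name to {table name: [column names]}.

    ASSUMPTION: the response looks like
    {"schema": {"assets": [{"name": ..., "tables": [{"name": ..., "columns": [...]}]}]}}
    """
    result: Dict[str, Dict[str, List[str]]] = {}
    for asset in dataset.get("schema", {}).get("assets", []):
        tables: Dict[str, List[str]] = {}
        for table in asset.get("tables", []):
            tables[table["name"]] = list(table.get("columns", []))
        result[asset["name"]] = tables
    return result


# Hypothetical, hand-written sample standing in for a saved response body.
sample_response = {
    "schema": {
        "assets": [
            {
                "name": "default",
                "tables": [{"name": "sample", "columns": ["sample_id", "tissue"]}],
            }
        ]
    }
}

lookup = columns_by_asset(sample_response)
for asset_name, tables in lookup.items():
    for table_name, columns in tables.items():
        print(f"{asset_name}: {table_name} -> {', '.join(columns)}")
```

Printing one line per asset and table makes it easier to pick the asset whose columns match the data you want in your Snapshot.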
1.4. Share the Snapshot
Once you've selected your asset and named your Snapshot, clicking "Next" will take you to the Data Release view, where you can add other Terra users (including groups).
To add a user (or group of users) to the snapshot, enter their Terra id (the email associated with their Terra account) in the People field and select their role from the Permissions drop-down menu.
Once you've added a user to a snapshot, they will be able to see it under the "Snapshots" tab when they log into data.terra.bio.
1.5. Create the Snapshot
Click Create snapshot to create the snapshot.
Option 2. Create a Snapshot using the Swagger API
While the UI option is nice - especially for exploratory purposes - using the API can be an efficient way to build exactly the Snapshot you want. You can also use API calls programmatically to automatically generate Snapshots on a regular cadence (for example, as new data come in).
There are several flavors of Snapshot creation when using API requests: a full-view Snapshot (i.e., entire dataset), by row ID, or by inclusion criteria using a SQL query. All three use the createSnapshot API endpoint.
Remember to authorize Swagger every time you use it. See How to authenticate/troubleshoot Swagger for TDR for step-by-step instructions.
Prerequisites
- You may need a Profile ID, which you get by generating a Spend Profile. If you do not specify a profile id, TDR will use your default profile id.
- If you want to create and share your snapshot in the same step, you need to have a list of stewards, readers, and/or discoverers ready - this should be in the form of an array of Terra identities (emails). You can also share the snapshot after it has been created, using the addSnapshotPolicyMember endpoint.
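If you keep your sharing lists in a script, a small helper can assemble the "policies" object before you paste it into the request body. The sketch below is illustrative only. The "readers" key appears in the request bodies in this article; the "stewards" and "discoverers" keys are assumed by analogy and should be checked against the createSnapshot schema in Swagger. The function name is hypothetical.

```python
from typing import Dict, Iterable, List, Optional


def build_policies(
    stewards: Optional[Iterable[str]] = None,
    readers: Optional[Iterable[str]] = None,
    discoverers: Optional[Iterable[str]] = None,
) -> Dict[str, List[str]]:
    """Return a "policies" object containing only the roles that have members."""
    policies: Dict[str, List[str]] = {}
    # ASSUMPTION: "stewards" and "discoverers" mirror the documented "readers" key.
    roles = (("stewards", stewards), ("readers", readers), ("discoverers", discoverers))
    for role, emails in roles:
        cleaned: List[str] = []
        for email in emails or []:
            email = email.strip()
            if "@" not in email:
                raise ValueError(f"{email!r} does not look like a Terra identity (email)")
            if email not in cleaned:  # drop duplicates, keep the original order
                cleaned.append(email)
        if cleaned:
            policies[role] = cleaned
    return policies


policies = build_policies(readers=["example@email.com", "example2@email.com"])
print(policies)
```

Roles with no members are left out, so the result can be used as the "policies" value whether you share with one role or several.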
Often, the simplest and most convenient Snapshot to share is one that contains the entire dataset. This is known as "full view" mode. To create a Snapshot in this mode, adapt the JSON below.
createSnapshot API request body ("full view" snapshot)
{
  "name": "full_view_example_snapshot",
  "description": "full view snapshot of example DR Dataset",
  "profileId": "/*your preferred Spend Profile ID*/",
  "contents": [
    {
      "datasetName": "tdr_example_dataset",
      "mode": "byFullView"
    }
  ],
  "policies": {
    "readers": [
      "example@email.com",
      "example2@email.com"
    ]
  }
}
Once executed, anyone listed as a reader or steward in the "policies" field will be able to see the snapshot under their "Snapshots" tab in the TDR web interface.
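To submit a full-view request from a script rather than from the Swagger page, you could build the same body in Python. This is a minimal sketch under stated assumptions: the helper names are hypothetical, and the base URL and the /api/repository/v1/snapshots path are assumptions that should be confirmed against the createSnapshot entry in Swagger. The request is only constructed here, not sent, and the access token is a placeholder.

```python
import json
import urllib.request
from typing import Iterable, Optional


def full_view_body(
    name: str,
    description: str,
    dataset_name: str,
    profile_id: Optional[str] = None,
    readers: Optional[Iterable[str]] = None,
) -> dict:
    """Build a createSnapshot request body in "byFullView" mode."""
    body = {
        "name": name,
        "description": description,
        "contents": [{"datasetName": dataset_name, "mode": "byFullView"}],
    }
    if profile_id:  # omitted so that TDR falls back to your default profile
        body["profileId"] = profile_id
    if readers:
        body["policies"] = {"readers": list(readers)}
    return body


def build_request(body: dict, token: str, base_url: str = "https://data.terra.bio"):
    """Prepare (but do not send) the POST request."""
    # ASSUMPTION: endpoint path; verify it on the Swagger page before use.
    return urllib.request.Request(
        url=f"{base_url}/api/repository/v1/snapshots",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


body = full_view_body(
    name="full_view_example_snapshot",
    description="full view snapshot of example DR Dataset",
    dataset_name="tdr_example_dataset",
    readers=["example@email.com", "example2@email.com"],
)
request = build_request(body, token="<your access token>")
print(request.get_method(), request.full_url)
print(json.dumps(body, indent=2))
```

Passing the prepared request to urllib.request.urlopen would send it. Keeping construction and sending separate lets you inspect the body first.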
Another way to create a Snapshot is to provide the Data Repo row IDs and column names to include in the Snapshot, for every table that should be included.
The row IDs can be obtained by querying BigQuery:
SELECT datarepo_row_id
FROM `my-project.my-dataset.my-table`
WHERE column1 = "value"
createSnapshot API request body (by row ID)
{
  "name": "my_row_id_snapshot",
  "profileId": "/*your Spend Profile ID*/",
  "contents": [
    {
      "mode": "byRowId",
      "datasetName": "my_dataset",
      "rowIdSpec": {
        "tables": [
          {
            "tableName": "my_first_table",
            "columns": ["column1", "column2"],
            "rowIds": ["1111-2222-3333", "333-2222-1111"]
          },
          {
            "tableName": "my_second_table",
            "columns": ["column_a", "column_b"],
            "rowIds": ["AAAA-BBBB-CCCC", "CCCC-BBBB-AAAA"]
          }
        ]
      }
    }
  ]
}
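When the row IDs come from query results, typing the "rowIdSpec" by hand gets tedious. The sketch below assembles the by-row-ID body from a plain dictionary of tables. The function name and input layout are hypothetical; the field names match the request body above.

```python
import json
from typing import Dict, List, Optional


def row_id_body(
    name: str,
    dataset_name: str,
    tables: Dict[str, Dict[str, List[str]]],
    profile_id: Optional[str] = None,
) -> dict:
    """Build a createSnapshot request body in "byRowId" mode.

    `tables` maps table name -> {"columns": [...], "rowIds": [...]}.
    """
    table_specs = []
    for table_name, spec in tables.items():
        if not spec["columns"] or not spec["rowIds"]:
            raise ValueError(f"table {table_name!r} needs columns and row IDs")
        table_specs.append(
            {
                "tableName": table_name,
                "columns": list(spec["columns"]),
                "rowIds": list(spec["rowIds"]),
            }
        )
    body = {
        "name": name,
        "contents": [
            {
                "mode": "byRowId",
                "datasetName": dataset_name,
                "rowIdSpec": {"tables": table_specs},
            }
        ],
    }
    if profile_id:  # left out when you want TDR to use your default profile
        body["profileId"] = profile_id
    return body


row_body = row_id_body(
    name="my_row_id_snapshot",
    dataset_name="my_dataset",
    tables={
        "my_first_table": {
            "columns": ["column1", "column2"],
            "rowIds": ["1111-2222-3333", "333-2222-1111"],
        },
    },
)
print(json.dumps(row_body, indent=2))
```

The printed JSON can be pasted into the createSnapshot request body field in Swagger.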
Another mode for creating a Snapshot is to define its contents using inclusion criteria. To do this, you will convert a BigQuery-supported SQL query directly into a Snapshot.
createSnapshot API request body (SQL query)
{
  "contents": [
    {
      "datasetName": "encode",
      "mode": "byQuery",
      "querySpec": {
        "assetName": "default",
        "query": "SELECT encode.bams.datarepo_row_id FROM encode.bams WHERE encode.bams.create_date > '2021-05-06T04:00:00'"
      }
    }
  ],
  "description": "Encode Aug 2021 release",
  "name": "encode_Aug2021_release",
  "profileId": "<uuid>",
  "policies": {
    "readers": ["example1@email.com", "example2@email.com"]
  }
}
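For Snapshots generated on a regular cadence, the query's cutoff date is usually the only part that changes. The sketch below builds a by-query body from a cutoff timestamp. It is illustrative only: the function name is hypothetical, the "create_date" column and "default" asset are carried over from the example above, and the naming scheme for the Snapshot is an assumption.

```python
import json
import re
from datetime import datetime

# Accept only simple names so that nothing unexpected ends up in the SQL string.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def query_body(
    dataset_name: str,
    table_name: str,
    cutoff: datetime,
    asset_name: str = "default",
    date_column: str = "create_date",
) -> dict:
    """Build a createSnapshot request body in "byQuery" mode for rows newer than `cutoff`."""
    for identifier in (dataset_name, table_name, date_column):
        if not _IDENTIFIER.match(identifier):
            raise ValueError(f"unexpected identifier: {identifier!r}")
    table = f"{dataset_name}.{table_name}"
    query = (
        f"SELECT {table}.datarepo_row_id FROM {table} "
        f"WHERE {table}.{date_column} > '{cutoff.isoformat()}'"
    )
    return {
        # ASSUMPTION: a date-stamped name; use whatever convention suits your project.
        "name": f"{dataset_name}_since_{cutoff:%Y%m%d}",
        "description": f"{dataset_name} rows created after {cutoff.isoformat()}",
        "contents": [
            {
                "datasetName": dataset_name,
                "mode": "byQuery",
                "querySpec": {"assetName": asset_name, "query": query},
            }
        ],
    }


query_request_body = query_body("encode", "bams", datetime(2021, 5, 6, 4, 0, 0))
print(json.dumps(query_request_body, indent=2))
```

Add "profileId" and "policies" to the returned dictionary as in the request body above if you want to set a Spend Profile or share the Snapshot in the same call.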
Option 3. Create a Snapshot using Zebrafish
Zebrafish is only available for Google-backed data
If your data are staged in a Google bucket and your TDR dataset's cloud provider is Google, you can use Zebrafish to create, edit, and snapshot your data. Otherwise, you must use either the TDR web interface or the Swagger API endpoints.
To learn how to use Zebrafish to create a full-view Snapshot of a TDR dataset stored on the Google Cloud, read Create a Snapshot using Zebrafish.
What to expect
Once you've successfully created your snapshot, you should be able to see it under the Snapshots tab on the Terra Data Repository website, and under the specific dataset's own Snapshots tab.
Note that Snapshots are immutable: they cannot be subsetted, and Terra cannot overwrite them or add new data to them during an analysis.