Step-by-step instructions for selecting the subset of data (a snapshot) in a TDR asset for analysis.
Snapshots overview
Snapshots in TDR streamline data delivery and sharing. Previously, data inaccessible through Terra's data library or a featured workspace could only be shared if you had access to the workspace where that data was already staged, by uploading metadata-containing spreadsheets to a workspace data tab.
The Terra Data Repository (data.terra.bio) allows data custodians (e.g. PIs, sequencing centers, and research organizations) to grant access to large datasets without worrying about workspace permissions. Users with access to the dataset can create subsets - called snapshots - of precisely the data of interest, as long as custodians have set up the datasets to include assets specifying which data columns may be included in snapshots.
Two ways to create Snapshots
- Snapshot by copy: In the Terra Data Repo (in a browser)
- Snapshot by reference: Using the Terra Data Repo's Swagger API
Snapshot by copy (in TDR in a browser) caveats
- The snapshot shows up in a table in the workspace with the same name as the original dataset.
- The table has an additional column import:snapshot_id.
- These tables work like any other workspace data table; data can be used as input for workflows, output files can be written back to the table, etc.
Snapshots by reference (using the Swagger API) caveats
- Snapshots show up in the SNAPSHOTS section at the left in the Data page.
- The snapshots are immutable: they cannot be subset further, and Terra cannot write output to them. This is a data custodian requirement (for example, to handle the case where a patient requests to withdraw).
- The API approach gives you more granular control over how your Snapshot is created.
- You can use API calls programmatically to automate regular/periodic Snapshot creation.
- Because the Terra Data Repo's UI is a beta-stage product, using the API may help avoid downtime as we work out the kinks.
Option 1. Create Snapshots in the Data Repository
To create Snapshots, you must first have access to a dataset in the Data Repo. The Data Repo homepage (data.terra.bio) shows some of the datasets to which you've recently been granted access. For a full list of datasets to which you have access, navigate to the "Datasets" tab.
1.1. Browse datasets
Click on the name of a dataset to view the contents. You can toggle between whatever separate tables the dataset contains using the dropdown menu bar near the top left of your screen.
1.2. Filter data for your Snapshot
On the right edge of the screen are buttons that open dataset information, Snapshot creation, and sharing widgets.
Clicking the dashed triangle icon will open the Snapshot creation widget, where you'll be able to select your desired data based on the available filters. Note that the search bar is case sensitive, and only searches among the column names, not the metadata in the rest of the table.
1.3. Select assets for your Snapshot
The asset specifies which columns from the dataset to include in your Snapshot.
Once you've filtered for your desired rows, click "Create Snapshot" at the bottom of the widget. You'll name your Snapshot and select an asset in the Add Details pane.
Checking which assets include which data
If you're not sure which assets include which data, you can look it up using the retrieveDataset API endpoint in Swagger.
Remember to authorize Swagger every time you use it
This article includes instructions on using API commands through the Swagger UI. All instructions related to Swagger require you to first authenticate yourself whenever you’ve opened a window with the Swagger UI.
Instructions
Click “Authorize” near the top of the page, check all of the boxes in the pop up and hit “Authorize” again, and then input the appropriate credentials to authenticate. Make sure you close the subsequent pop up without clicking the “Sign Out” button.
For a more detailed description of this authentication step, see this article on Authenticating in Swagger.
1. Once you've authenticated yourself on the Swagger page, click "Try it out" in the top right corner of the API endpoint.
2. Once it's active, enter the UUID of the dataset you're interested in into the UUID field (you can find this UUID in the URL bar when you're looking at the dataset in the Data Repo UI), and select SCHEMA from the menu right beneath.
3. Scroll down and click "Execute".
4. Then scroll down to the response body and scroll through it until you see the "assets" section of the schema. You'll see a list of assets and within that list each asset will also have a list of the names of the columns included in that asset.
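If the response body is long, the asset-to-column lookup in step 4 can also be done in code. The following is a minimal sketch, not an official client: it assumes the response has a "schema" object with an "assets" list, and that each asset lists its tables with "name" and "columns" fields. The `sample_response` dictionary is a hypothetical, trimmed-down stand-in for a real response; check the field names against what Swagger actually returns.

```python
from typing import Dict, List


def columns_by_asset(dataset: dict) -> Dict[str, Dict[str, List[str]]]:
    """Map each asset name to {table name: [column names]}.

    Assumes the layout schema -> assets -> tables -> columns; missing
    keys are treated as empty rather than raising.
    """
    result: Dict[str, Dict[str, List[str]]] = {}
    for asset in dataset.get("schema", {}).get("assets", []):
        tables: Dict[str, List[str]] = {}
        for table in asset.get("tables", []):
            tables[table["name"]] = list(table.get("columns", []))
        result[asset["name"]] = tables
    return result


# Hypothetical, trimmed-down response body used only for illustration.
sample_response = {
    "name": "tdr_example_dataset",
    "schema": {
        "assets": [
            {
                "name": "default",
                "tables": [
                    {"name": "sample", "columns": ["sample_id", "participant_id"]},
                    {"name": "file", "columns": ["file_id", "path"]},
                ],
            }
        ]
    },
}

# Print one line per asset/table pair with its columns.
for asset_name, tables in columns_by_asset(sample_response).items():
    for table_name, columns in tables.items():
        print(f"{asset_name}: {table_name} -> {', '.join(columns)}")
```

To use it on real data, copy the response body from Swagger into a file, load it with `json.load`, and pass the resulting dictionary to `columns_by_asset`.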
1.4. Create the Snapshot
Once you've selected your asset and named your Snapshot, clicking "Next" will take you to the Data Release view, where you can add other Terra users (including groups) so that they'll be able to see the Snapshot under the "Snapshots" tab when they go to the Data Repo UI. Clicking "Release Dataset" will create the Snapshot.
What to expect
Once you've successfully completed this step, the Snapshot has been created, and you should be able to see it under the Snapshots tab (https://data.terra.bio/snapshots).
Option 2. Create a Snapshot using the Swagger API
While the UI option is nice - especially for exploratory purposes - using the API can be an efficient way to build exactly the Snapshot you want. There are three flavors of Snapshot creation when using API requests: full view (i.e., the entire dataset), by row ID, or by inclusion criteria (a SQL query).
All three use the createSnapshot API endpoint.
Remember to authorize Swagger every time you use it
This article includes instructions on using API commands through the Swagger UI. All instructions related to Swagger require you to first authenticate yourself whenever you’ve opened a window with the Swagger UI.
Instructions
Click “Authorize” near the top of the page, check all of the boxes in the pop up and hit “Authorize” again, and then input the appropriate credentials to authenticate. Make sure you close the subsequent pop up without clicking the “Sign Out” button.
You should now be able to execute the commands below by clicking the "Try it out" button next to the command of your choice. For a more detailed description of this authentication step, see this article on Authenticating in Swagger.
Prerequisites
- You need a Profile ID, which you get by generating a Spend Profile.
- You need to have a list of readers ready. This should be an array of Terra identities (emails) that will have read access to the Snapshot. The array can contain just your own email; you can always add more readers later.
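Outside the Swagger UI, a createSnapshot request can in principle be sent from a script. The sketch below only builds the HTTP request object and does not send it. The URL path, the bearer-token authentication scheme, and the function name `build_create_snapshot_request` are assumptions for illustration; confirm the actual path and authentication requirements on the Swagger page before relying on them.

```python
import json
import urllib.request

# Assumed endpoint URL; verify it against the createSnapshot entry in Swagger.
TDR_SNAPSHOTS_URL = "https://data.terra.bio/api/repository/v1/snapshots"


def build_create_snapshot_request(
    body: dict, access_token: str, url: str = TDR_SNAPSHOTS_URL
) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a createSnapshot body.

    access_token is assumed to be an OAuth bearer token for the same
    identity you would use to authorize Swagger.
    """
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder body and token, for illustration only.
request = build_create_snapshot_request({"name": "example"}, "fake-token")
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would actually send it; that step is left out here so the sketch can be run without network access or real credentials.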
Option 2.1. Create a "full view" Snapshot
Often, the simplest and most convenient Snapshot to share is one that contains the entire dataset. This is known as "full view" mode. To create a Snapshot in this mode, use the JSON request body below.
createSnapshot API request body ("full view" snapshot)
{
  "name": "full_view_example_snapshot",
  "description": "full view snapshot of example DR Dataset",
  "profileId": "/*your Spend Profile ID*/",
  "readers": ["<reader-email>"],
  "contents": [
    {
      "datasetName": "tdr_example_dataset",
      "mode": "byFullView"
    }
  ]
}
Once executed, whoever was listed under the "readers" parameter should be able to see that Snapshot under the "Snapshots" tab.
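When creating full-view Snapshots for several datasets, it can help to generate the request body rather than edit JSON by hand. This is a minimal sketch that mirrors the fields of the example body above; the function name `full_view_snapshot_body`, the profile ID, and the email address are placeholders, not real values.

```python
import json
from typing import List


def full_view_snapshot_body(
    name: str,
    description: str,
    profile_id: str,
    dataset_name: str,
    readers: List[str],
) -> dict:
    """Build a createSnapshot request body in "byFullView" mode."""
    if isinstance(readers, str):
        # The prerequisites describe readers as an array of emails, so a
        # bare string is rejected here instead of being passed through.
        raise TypeError("readers must be a list of email strings")
    return {
        "name": name,
        "description": description,
        "profileId": profile_id,
        "readers": list(readers),
        "contents": [{"datasetName": dataset_name, "mode": "byFullView"}],
    }


body = full_view_snapshot_body(
    name="full_view_example_snapshot",
    description="full view snapshot of example DR Dataset",
    profile_id="00000000-0000-0000-0000-000000000000",  # placeholder ID
    dataset_name="tdr_example_dataset",
    readers=["reader@example.org"],  # placeholder email
)
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted into the request body field of the createSnapshot endpoint in Swagger.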
Option 2.2. Create a Snapshot by row ID
Another way to create a Snapshot is to provide the Data Repo row IDs and column names to include in the Snapshot, for every table that should be included.
The row IDs can be obtained by querying BigQuery:
SELECT datarepo_row_id
FROM `my-project.my-dataset.my-table`
WHERE column1 = "value"
createSnapshot API request body (by row ID)
{
  "name": "my_row_id_snapshot",
  "profileId": "/*your Spend Profile ID*/",
  "contents": [
    {
      "mode": "byRowId",
      "datasetName": "my_dataset",
      "rowIdSpec": {
        "tables": [
          {
            "tableName": "my_first_table",
            "columns": ["column1", "column2"],
            "rowIds": ["1111-2222-3333", "333-2222-1111"]
          },
          {
            "tableName": "my_second_table",
            "columns": ["column_a", "column_b"],
            "rowIds": ["AAAA-BBBB-CCCC", "CCCC-BBBB-AAAA"]
          }
        ]
      }
    }
  ]
}
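Because row ID lists usually come from a query result rather than being typed by hand, the by-row-ID body is a natural candidate for generation in code. The sketch below assembles the same structure as the example above from a list of per-table selections. `TableSelection` and `row_id_snapshot_body` are hypothetical helper names, and the check for empty row ID lists is a precaution added here, not a documented API rule.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class TableSelection:
    """Columns and row IDs to include from one dataset table."""
    table_name: str
    columns: List[str]
    row_ids: List[str]


def row_id_snapshot_body(
    name: str, profile_id: str, dataset_name: str, selections: List[TableSelection]
) -> dict:
    """Build a createSnapshot request body in "byRowId" mode."""
    tables = []
    for sel in selections:
        if not sel.row_ids:
            # Precaution only: a table with no rows is most likely a mistake.
            raise ValueError(f"table {sel.table_name!r} has no row IDs")
        tables.append(
            {
                "tableName": sel.table_name,
                "columns": list(sel.columns),
                "rowIds": list(sel.row_ids),
            }
        )
    return {
        "name": name,
        "profileId": profile_id,
        "contents": [
            {
                "mode": "byRowId",
                "datasetName": dataset_name,
                "rowIdSpec": {"tables": tables},
            }
        ],
    }


row_id_body = row_id_snapshot_body(
    name="my_row_id_snapshot",
    profile_id="00000000-0000-0000-0000-000000000000",  # placeholder ID
    dataset_name="my_dataset",
    selections=[
        TableSelection("my_first_table", ["column1", "column2"], ["1111-2222-3333"]),
        TableSelection("my_second_table", ["column_a", "column_b"], ["AAAA-BBBB-CCCC"]),
    ],
)
print(json.dumps(row_id_body, indent=2))
```

In practice, the `row_ids` lists would be filled from the datarepo_row_id values returned by the BigQuery query shown earlier in this section.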
Option 2.3. Create a Snapshot by SQL query
Using inclusion criteria to define the contents is another mode for creating a Snapshot. To do this, you convert a BigQuery-supported SQL query directly into a Snapshot.
createSnapshot API request body (SQL query)
{
  "contents": [
    {
      "datasetName": "encode",
      "mode": "byQuery",
      "querySpec": {
        "assetName": "default",
        "query": "SELECT encode.bams.datarepo_row_id FROM encode.bams WHERE encode.bams.create_date > '2021-05-06T04:00:00'"
      }
    }
  ],
  "description": "Encode Aug 2021 release",
  "name": "encode_Aug2021_release",
  "profileId": "<uuid>",
  "readers": ["email1", "email2"]
}
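For regular releases, the query-based body can be generated so that only the query and the Snapshot name change between runs. This is a minimal sketch under stated assumptions: `query_snapshot_body` is a hypothetical helper, the profile ID and emails are placeholders, and the check that the query mentions datarepo_row_id is a heuristic inferred from the example in this article rather than a documented validation rule.

```python
import json
from typing import List


def query_snapshot_body(
    name: str,
    description: str,
    profile_id: str,
    readers: List[str],
    dataset_name: str,
    asset_name: str,
    query: str,
) -> dict:
    """Build a createSnapshot request body in "byQuery" mode."""
    if "datarepo_row_id" not in query:
        # Heuristic based on this article's example, which selects row IDs.
        raise ValueError("query is expected to select datarepo_row_id")
    return {
        "contents": [
            {
                "datasetName": dataset_name,
                "mode": "byQuery",
                "querySpec": {"assetName": asset_name, "query": query},
            }
        ],
        "description": description,
        "name": name,
        "profileId": profile_id,
        "readers": list(readers),
    }


query_body = query_snapshot_body(
    name="encode_Aug2021_release",
    description="Encode Aug 2021 release",
    profile_id="00000000-0000-0000-0000-000000000000",  # placeholder ID
    readers=["email1@example.org", "email2@example.org"],  # placeholder emails
    dataset_name="encode",
    asset_name="default",
    query=(
        "SELECT encode.bams.datarepo_row_id FROM encode.bams "
        "WHERE encode.bams.create_date > '2021-05-06T04:00:00'"
    ),
)
print(json.dumps(query_body, indent=2))
```

Building the body with `json.dumps` also guarantees straight double quotes throughout, which avoids the JSON parse errors that curly quotes pasted from a word processor tend to cause.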