Step-by-step instructions for creating dataset assets, the last step before subsetting data into snapshots.
Overview
Once a dataset has been created and populated with ingested data, the last step before the data can be subsetted into snapshots is asset creation. The asset creation step is an access control that enables the data custodian to specify which columns from their dataset are available to individuals creating snapshots.
Snapshots can only be created from datasets with at least one asset, and each snapshot contains one asset.
What to expect after creating assets
You'll be able to see a list of existing assets in the Terra Data Repo UI by going to a dataset, clicking the three-dash triangle logo near the top right of the screen, and then clicking "Create Snapshot". This will call up an "Assets" dropdown that lists all the assets available for that dataset. If a dataset has no assets, the "Create Snapshot" button will be greyed out.
When creating snapshots (i.e., subsets of data), anyone added to a dataset will see every asset they are authorized to access.
Sharing data with snapshots
Once a dataset has an asset, a user who has access to that dataset can create a snapshot by selecting an asset and specifying which rows they want included in their snapshot.
To share one asset but not another, create a snapshot from the asset you want to share. Including all of the rows shares the full asset without sharing any other assets. The only columns included in the snapshot will be those specified in the asset used to create the snapshot.
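As a rough mental model only (this is not how TDR is implemented), a snapshot can be thought of as a row filter combined with the asset's column list. The sketch below uses a hypothetical helper, project_rows, and made-up sample rows to show that a column left out of the asset never reaches the snapshot, whether all rows or only some are selected.

```python
def project_rows(rows, asset_columns, keep_row):
    """Toy model of a snapshot: keep the selected rows, restricted to the asset's columns."""
    return [
        {col: row[col] for col in asset_columns if col in row}
        for row in rows
        if keep_row(row)
    ]


# Made-up table contents, for illustration only.
table1 = [
    {"column_1": "s1", "column_2": "file_a", "column_3": "file_x"},
    {"column_1": "s2", "column_2": "file_b", "column_3": "file_y"},
]

# An asset that exposes column_1 and column_2 but not column_3.
asset1_columns = ["column_1", "column_2"]

# Including every row shares the full asset and nothing else.
full_share = project_rows(table1, asset1_columns, lambda row: True)

# Selecting specific rows narrows the snapshot further.
subset = project_rows(table1, asset1_columns, lambda row: row["column_1"] == "s1")
```

In both results, column_3 is absent because it is not part of the asset.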
Two ways to create assets
- Option 1: Use the addDatasetAssetSpecification API endpoint
- Option 2: Include the assets in your schema when creating a dataset
Option 1. Use the addDatasetAssetSpecification API endpoint
Use the addDatasetAssetSpecification API endpoint with the JSON below as the request body. The UUID of the dataset to which you wish to add the asset is specified in a separate field of the same API call.
Remember to authorize Swagger every time you use it. See How to authenticate/troubleshoot Swagger for TDR for step-by-step instructions.
addDatasetAssetSpecification request body
{
  "name": "asset3",
  "rootTable": "table1",
  "rootColumn": "column_1",
  "follow": [],
  "tables": [
    {
      "columns": [],
      "name": "table1"
    }
  ]
}
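If you would rather script this call than use Swagger, the request can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the base URL (https://data.terra.bio), the path (/api/repository/v1/datasets/{id}/assets), and bearer-token authentication are assumptions that you should confirm against the Swagger page for your TDR instance, and the dataset UUID and token are placeholders. The code only builds the request; it does not send it.

```python
import json
import urllib.request

# Assumption: production TDR base URL. Verify for your environment.
TDR_BASE_URL = "https://data.terra.bio"


def build_add_asset_request(dataset_id, access_token, asset_spec, base_url=TDR_BASE_URL):
    """Build (but do not send) a POST request that adds an asset to a dataset."""
    # Assumption: path used by the addDatasetAssetSpecification endpoint.
    url = f"{base_url}/api/repository/v1/datasets/{dataset_id}/assets"
    body = json.dumps(asset_spec).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


# Same request body as the example above.
asset_spec = {
    "name": "asset3",
    "rootTable": "table1",
    "rootColumn": "column_1",
    "follow": [],
    "tables": [{"columns": [], "name": "table1"}],
}

request = build_add_asset_request(
    "00000000-0000-0000-0000-000000000000",  # placeholder dataset UUID
    "PLACEHOLDER_TOKEN",                      # placeholder access token
    asset_spec,
)
# To actually send it (needs a real token and dataset UUID):
# urllib.request.urlopen(request)
```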
addDatasetAssetSpecification required parameters
- name: A name to identify the asset.
- rootTable: You must select a table from your dataset as the root table, even if you include multiple tables in the asset. The root table should be a table with data from all of the rows (for example, samples or subjects) that you plan to include in a data snapshot to share the data with other researchers. In addition, if you plan to include data from any other tables in a data snapshot, the root table should be connected to those tables via relationships specified in your dataset's schema.
- rootColumn: The root column should be a column in the root table that you can use to filter your data when creating a data snapshot to share with other researchers. For example, if you plan to create a data snapshot with all data from a specific disease, your root column might be a disease_id column.
- follow (optional): A list of the relationships that link the tables that contribute to the asset. If your asset only includes one table, you can set it to [] as shown above.
  If your asset includes more than one table, you must specify the relationship(s) between the tables in this field. Otherwise, data snapshots made using this asset will only include data from the root table.
  List relationships using the names specified in the dataset's schema, in order. For example, if your root table is table_1 and table_1 is linked to table_2, which is linked to table_3, your API call might include the line "follow": ["table_1_to_table_2", "table_2_to_table_3"].
- tables: Indicates which tables and which columns to include in the asset. To include all columns in a table, set the columns field to [].
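The parameter rules above can be captured in a small helper that assembles the request body and catches common mistakes before you call the API. This is a sketch, not part of TDR: build_asset_spec is a hypothetical function, and the check that the root table appears in the tables list is an assumption rather than a documented rule. The multi-table check reflects the note that an asset with more than one table needs a non-empty follow list.

```python
def build_asset_spec(name, root_table, root_column, tables, follow=None):
    """Assemble an asset specification dict.

    tables maps each table name to a list of column names;
    an empty list means "include all columns".
    """
    follow = list(follow or [])
    if not name or not root_table or not root_column:
        raise ValueError("name, rootTable, and rootColumn are all required")
    # Assumption: the root table should itself be listed in the asset's tables.
    if root_table not in tables:
        raise ValueError(f"root table {root_table!r} is not among the asset's tables")
    # Without relationships, snapshots would only include root-table data.
    if len(tables) > 1 and not follow:
        raise ValueError("assets with more than one table need a non-empty 'follow' list")
    return {
        "name": name,
        "rootTable": root_table,
        "rootColumn": root_column,
        "follow": follow,
        "tables": [{"name": t, "columns": list(cols)} for t, cols in tables.items()],
    }


# Single-table asset, equivalent to the request body shown earlier.
single = build_asset_spec("asset3", "table1", "column_1", {"table1": []})

# Three-table asset that follows two relationships, in order.
multi = build_asset_spec(
    "chained_asset",
    "table_1",
    "disease_id",
    {"table_1": [], "table_2": [], "table_3": []},
    follow=["table_1_to_table_2", "table_2_to_table_3"],
)
```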
Option 2. Include assets in your schema
You can create your datasets with assets already present. The article How to create a dataset in TDR outlines how to use the createDataset API, and the article How to create a dataset schema in TDR shows what the JSON for a schema looks like.
To create your dataset with the assets already present, include an "assets" array (as in the example below) in your schema at the same level as your "tables" and "relationships" objects.
Example schema JSON
"schema": {
  "tables": [{
    "name": "table1",
    "columns": [{
        "name": "column_1",
        "datatype": "string"
      },
      {
        "name": "column_2",
        "datatype": "fileref"
      },
      {
        "name": "column_3",
        "datatype": "fileref"
      }
    ]
  }],
  "assets": [{
      "name": "asset1",
      "tables": [{
        "name": "table1",
        "columns": [
          "column_1",
          "column_2"
        ]
      }],
      "rootTable": "table1",
      "rootColumn": "column_1"
    },
    {
      "name": "asset2",
      "tables": [{
        "name": "table1",
        "columns": [
          "column_1",
          "column_3"
        ]
      }],
      "rootTable": "table1",
      "rootColumn": "column_1"
    }]
}
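Before submitting a schema like the one above, it can help to confirm that every column named in an asset actually exists in the corresponding table. The following is a minimal local check, not a TDR feature; find_unknown_asset_columns is a hypothetical helper, and the schema dict is a condensed copy of the example.

```python
def find_unknown_asset_columns(schema):
    """Return messages for asset tables or columns that the schema does not define."""
    known = {
        table["name"]: {column["name"] for column in table["columns"]}
        for table in schema.get("tables", [])
    }
    problems = []
    for asset in schema.get("assets", []):
        for table in asset.get("tables", []):
            if table["name"] not in known:
                problems.append(f'{asset["name"]}: unknown table {table["name"]}')
                continue
            # An empty columns list means "all columns", so there is nothing to check.
            for column in table.get("columns", []):
                if column not in known[table["name"]]:
                    problems.append(
                        f'{asset["name"]}: unknown column {table["name"]}.{column}'
                    )
    return problems


schema = {
    "tables": [{
        "name": "table1",
        "columns": [
            {"name": "column_1", "datatype": "string"},
            {"name": "column_2", "datatype": "fileref"},
            {"name": "column_3", "datatype": "fileref"},
        ],
    }],
    "assets": [
        {"name": "asset1", "rootTable": "table1", "rootColumn": "column_1",
         "tables": [{"name": "table1", "columns": ["column_1", "column_2"]}]},
        {"name": "asset2", "rootTable": "table1", "rootColumn": "column_1",
         "tables": [{"name": "table1", "columns": ["column_1", "column_3"]}]},
    ],
}

problems = find_unknown_asset_columns(schema)  # empty list for the example schema
```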
When creating datasets with predefined assets, remember that each asset needs a non-null value for the "rootTable" and "rootColumn" parameters. The "follow" parameter is not required in this case, but if your schema includes relationships, you'll want your assets to follow any relationships between the tables included in those assets.
To do that, add the "follow" parameter at the same level as the "rootTable" parameter, and set it to a list of relationship names in square brackets, as in the example below:
"assets": [{
  "name": "asset1",
  "tables": [{
    "name": "table1",
    "columns": []
  }],
  "rootTable": "table1",
  "rootColumn": "col1",
  "follow": ["relation1", "relation2"]
}],
"relationships": [{
    "name": "relation1",
    "from": {
      "table": "table1",
      "column": "col1"
    },
    "to": {
      "table": "table2",
      "column": "col1"
    }
  },
  {
    "name": "relation2",
    "from": {
      "table": "table1",
      "column": "col2"
    },
    "to": {
      "table": "table2",
      "column": "col2"
    }
  }
]
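Because "follow" entries must match relationship names exactly, a stray space or typo in a name is easy to miss. The sketch below is a local sanity check, not a TDR feature: check_assets is a hypothetical helper that verifies each asset has non-null "rootTable" and "rootColumn" values and that every "follow" entry names a relationship defined in the schema. The sample schema deliberately contains a mismatched name to show what the check reports.

```python
def check_assets(schema):
    """Return messages for missing root fields and 'follow' names with no matching relationship."""
    relationship_names = {rel["name"] for rel in schema.get("relationships", [])}
    problems = []
    for asset in schema.get("assets", []):
        for key in ("rootTable", "rootColumn"):
            if not asset.get(key):
                problems.append(f'{asset.get("name")}: missing {key}')
        for rel_name in asset.get("follow", []):
            if rel_name not in relationship_names:
                problems.append(f'{asset.get("name")}: unknown relationship {rel_name!r}')
    return problems


# Illustrative schema fragment: "relation 2" (with a space) does not match "relation2".
schema_fragment = {
    "assets": [{
        "name": "asset1",
        "tables": [{"name": "table1", "columns": []}],
        "rootTable": "table1",
        "rootColumn": "col1",
        "follow": ["relation1", "relation2"],
    }],
    "relationships": [
        {"name": "relation1",
         "from": {"table": "table1", "column": "col1"},
         "to": {"table": "table2", "column": "col1"}},
        {"name": "relation 2",
         "from": {"table": "table1", "column": "col2"},
         "to": {"table": "table2", "column": "col2"}},
    ],
}

follow_problems = check_assets(schema_fragment)
```

Renaming the second relationship to relation2 makes the returned list empty.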