How to create dataset assets in TDR

Step-by-step instructions for creating dataset assets, the last step before subsetting data into snapshots.

Overview

Once a dataset has been created and populated with ingested data, the last step before the data can be subsetted into snapshots is asset creation. The asset creation step is an access control that enables the data custodian to specify which columns from their dataset are available to individuals creating snapshots.

Snapshots can only be created from datasets with at least one asset, and each snapshot contains one asset.

Diagram schematizing a TDR dataset asset. The dataset contains one data table with 5 columns. Blue rectangles highlight three of the columns, which make up the asset.

What to expect after creating assets

You'll be able to see a list of existing assets in the Terra Data Repo UI by going to a dataset, clicking the three-dash triangle logo near the top right of the screen, and then clicking "Create Snapshot". This will call up an "Assets" dropdown that lists all the assets available for that dataset. If a dataset has no assets, the "Create Snapshot" button will be greyed out.

Screen recording showing how to begin creating a snapshot. Once you've created an asset, it will be visible from the 'assets' dropdown menu in the snapshot creation process.

When creating snapshots (i.e., subsets of data), anyone added to a dataset will see every asset that they are authorized to access.

Sharing data with snapshots

Once a dataset has an asset, a user who has access to that dataset can create a snapshot by selecting an asset and specifying which rows they want included in their snapshot.

Diagram schematizing a TDR dataset with an asset and a snapshot. The dataset consists of one data table with 5 columns. Blue rectangles highlight 3 of these columns, which comprise an asset. Orange rectangles highlight individual rows (queries). Green rectangles highlight individual cells at the overlap between the asset and the queries; these represent the contents of the snapshot.

To allow access to one asset but not another, you would create a snapshot with that asset. Including all of the rows shares the full asset without sharing any other assets. The only columns included in the snapshot will be those specified in the asset used to create the snapshot.
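As a rough mental model (this is not TDR code), a snapshot can be pictured as the overlap between an asset's columns and the rows selected by the snapshot creator. The table contents, the column names, and the make_snapshot helper below are all hypothetical and exist only for illustration.

```python
# Conceptual illustration only: TDR does not expose a function like this.
def make_snapshot(rows, asset_columns, row_filter):
    """Keep only the selected rows and, within them, only the asset's columns."""
    return [
        {col: row[col] for col in asset_columns}
        for row in rows
        if row_filter(row)
    ]

# A hypothetical dataset table with five columns.
table = [
    {"sample_id": "s1", "disease_id": "d1", "age": 34, "bam": "f1", "notes": "a"},
    {"sample_id": "s2", "disease_id": "d2", "age": 51, "bam": "f2", "notes": "b"},
    {"sample_id": "s3", "disease_id": "d1", "age": 47, "bam": "f3", "notes": "c"},
]

# The asset exposes three of the five columns.
asset_columns = ["sample_id", "disease_id", "bam"]

# The snapshot creator chooses the rows; here, every row for disease d1.
snapshot = make_snapshot(table, asset_columns, lambda r: r["disease_id"] == "d1")
```

Columns outside the asset (age and notes here) never reach the snapshot, regardless of which rows are selected.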

Two ways to create assets

Option 1: Use the addDatasetAssetSpecification API endpoint
Option 2: Include the assets in your schema when creating a dataset

Option 1. Use the addDatasetAssetSpecification API endpoint

Call the addDatasetAssetSpecification API endpoint with JSON like the example below as the request body. The UUID of the dataset to which you want to add the asset goes in a separate field of the same API call.

Remember to authorize Swagger every time you use it. See How to authenticate/troubleshoot Swagger for TDR for step-by-step instructions.

addDatasetAssetSpecification request body

{
    "name": "asset3",
    "rootTable": "table1",
    "rootColumn": "column_1",
    "follow": [],
    "tables": [
        {
            "columns": [],
            "name": "table1"
        }
    ]
}

addDatasetAssetSpecification parameters

name: A name to identify the asset.
rootTable: You must select a table from your dataset as the root table, even if you include multiple tables in the asset. The root table should be a table with data from all of the rows (for example, samples or subjects) that you plan to include in a data snapshot to share the data with other researchers. In addition, if you plan to include data from any other tables in a data snapshot, the root table should be connected to those tables via relationships specified in your dataset's schema.
rootColumn: The root column should be a column in the root table that you can use to filter your data when creating a data snapshot to share with other researchers. For example, if you plan to create a data snapshot with all data from a specific disease, your root column might be a disease_id column.
follow (optional): A list of the relationships that link the tables that contribute to the asset. If your asset only includes one table, you can set it to [] as shown above.

If your asset includes more than one table, you must specify the relationship(s) between the tables in this field. Otherwise, data snapshots made using this asset will only include data from the root table.

List relationships using the names specified in the dataset's schema, in order. For example, if your root table is table_1 and table_1 is linked to table_2, which is linked to table_3, your API call might include the line "follow": ["table_1_to_table_2", "table_2_to_table_3"].
tables: Indicates which tables and which columns to include in the asset. To include all columns in a table, set the columns field to [].
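The parameters above can be assembled programmatically. The following is a minimal sketch rather than an official client: the helper names build_asset_spec and prepare_request are hypothetical, the base URL and the /api/repository/v1/datasets/{id}/assets path are assumptions that should be checked against the Swagger page, and the dataset UUID and token are placeholders. The request is built but not sent.

```python
import json
import urllib.request

def build_asset_spec(name, root_table, root_column, tables, follow=None):
    """Assemble an asset specification with the fields described above.

    `tables` maps each table name to the list of columns to include;
    an empty list means "all columns in that table".
    """
    return {
        "name": name,
        "rootTable": root_table,
        "rootColumn": root_column,
        "follow": list(follow or []),
        "tables": [
            {"name": table, "columns": list(columns)}
            for table, columns in tables.items()
        ],
    }

def prepare_request(base_url, dataset_id, token, spec):
    """Build (but do not send) the POST request. The URL path is an assumption."""
    url = f"{base_url}/api/repository/v1/datasets/{dataset_id}/assets"
    return urllib.request.Request(
        url,
        data=json.dumps(spec).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# A three-table asset: relationships are listed in order, starting at the root table.
spec = build_asset_spec(
    name="asset3",
    root_table="table_1",
    root_column="disease_id",
    tables={"table_1": [], "table_2": [], "table_3": []},
    follow=["table_1_to_table_2", "table_2_to_table_3"],
)

# Placeholder UUID and token; substitute real values before sending.
request = prepare_request(
    "https://data.terra.bio",
    "00000000-0000-0000-0000-000000000000",
    "ACCESS_TOKEN_PLACEHOLDER",
    spec,
)
```

With a real dataset UUID and access token, passing request to urllib.request.urlopen would submit the call. Printing json.dumps(spec, indent=4) instead gives a body that can be pasted into Swagger.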

Option 2. Include assets in your schema

You can create your datasets with assets already present. The article How to create a dataset in TDR outlines how to use the createDataset API, and the article How to create a dataset schema in TDR shows what the JSON for a schema looks like.

To create your dataset with the assets already present, include an "assets" JSON object (shown in the example below) as part of your schema, at the same level as your "tables" and "relationships" objects.

Example schema JSON

"schema": {
    "tables": [{
        "name": "table1",
        "columns": [{
            "name": "column_1",
            "datatype": "string"
        },
        {
            "name": "column_2",
            "datatype": "fileref"
        },
        {
            "name": "column_3",
            "datatype": "fileref"
        }]
    }],
    "assets": [{
        "name": "asset1",
        "tables": [{
            "name": "table1",
            "columns": [
                "column_1",
                "column_2"
            ]
        }],
        "rootTable": "table1",
        "rootColumn": "column_1"
    },
    {
        "name": "asset2",
        "tables": [{
            "name": "table1",
            "columns": [
                "column_1",
                "column_3"
            ]
        }],
        "rootTable": "table1",
        "rootColumn": "column_1"
    }]
}

When creating datasets with predefined assets, remember that each asset needs a non-null value for the "rootTable" and "rootColumn" parameters. The "follow" parameter is not required in this case, but if your schema includes relationships, your assets should follow any relationships between the tables they include.

To do that, add the "follow" parameter at the same level as the "rootTable" parameter, and set it to a list of relationship names in square brackets, as in the example below:
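Before submitting a schema, it can help to sanity-check the assets locally. The sketch below is a hypothetical helper, not part of TDR, and the checks it performs are a guess at useful pre-flight tests rather than a description of TDR's own validation: it flags assets with a missing rootTable or rootColumn and assets that reference tables or columns absent from the schema.

```python
def check_assets(schema):
    """Return a list of problems found in the schema's "assets" entries."""
    columns_by_table = {
        t["name"]: {c["name"] for c in t["columns"]}
        for t in schema.get("tables", [])
    }
    problems = []
    for asset in schema.get("assets", []):
        label = asset.get("name", "<unnamed>")
        root_table = asset.get("rootTable")
        root_column = asset.get("rootColumn")
        if not root_table or not root_column:
            problems.append(f"{label}: rootTable and rootColumn must be non-null")
        elif root_column not in columns_by_table.get(root_table, set()):
            problems.append(f"{label}: {root_table}.{root_column} is not in the schema")
        for table in asset.get("tables", []):
            known = columns_by_table.get(table["name"])
            if known is None:
                problems.append(f"{label}: unknown table {table['name']}")
                continue
            for column in table["columns"]:
                if column not in known:
                    problems.append(f"{label}: unknown column {table['name']}.{column}")
    return problems

# Hypothetical schema: asset2 is deliberately missing its rootColumn.
schema = {
    "tables": [{
        "name": "table1",
        "columns": [
            {"name": "column_1", "datatype": "string"},
            {"name": "column_2", "datatype": "fileref"},
        ],
    }],
    "assets": [
        {"name": "asset1", "rootTable": "table1", "rootColumn": "column_1",
         "tables": [{"name": "table1", "columns": ["column_1", "column_2"]}]},
        {"name": "asset2", "rootTable": "table1", "rootColumn": None,
         "tables": [{"name": "table1", "columns": ["column_1"]}]},
    ],
}

problems = check_assets(schema)
```

An empty list means none of these particular checks failed; it does not guarantee that the createDataset call will accept the schema.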

"assets": [{
    "name": "asset1",
    "tables": [{
        "name": "table1",
        "columns": []
    }],
    "rootTable": "table1",
    "rootColumn": "col1",
    "follow": ["relation1", "relation2"]
}],
"relationships": [{
    "name": "relation1",
    "from": {
        "table": "table1",
        "column": "col1"
    },
    "to": {
        "table": "table2",
        "column": "col1"
    }
},
{
    "name": "relation2",
    "from": {
        "table": "table1",
        "column": "col2"
    },
    "to": {
        "table": "table2",
        "column": "col2"
    }
}]
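Because "follow" entries must match relationship names exactly, a stray space or typo in either place can leave a relationship unfollowed. The sketch below is a hypothetical local check (the unknown_follow_names helper is not part of TDR) that reports any "follow" entry with no matching relationship in the schema.

```python
def unknown_follow_names(schema):
    """Map each asset name to its "follow" entries that match no relationship."""
    relationship_names = {r["name"] for r in schema.get("relationships", [])}
    missing = {}
    for asset in schema.get("assets", []):
        unmatched = [
            name for name in asset.get("follow", [])
            if name not in relationship_names
        ]
        if unmatched:
            missing[asset["name"]] = unmatched
    return missing

# Hypothetical schema fragment: the second relationship name contains a stray space.
schema = {
    "assets": [{
        "name": "asset1",
        "tables": [{"name": "table1", "columns": []}],
        "rootTable": "table1",
        "rootColumn": "col1",
        "follow": ["relation1", "relation2"],
    }],
    "relationships": [
        {"name": "relation1",
         "from": {"table": "table1", "column": "col1"},
         "to": {"table": "table2", "column": "col1"}},
        {"name": "relation 2",
         "from": {"table": "table1", "column": "col2"},
         "to": {"table": "table2", "column": "col2"}},
    ],
}

missing = unknown_follow_names(schema)
```

Here the check reports relation2 for asset1, because the schema only defines a relationship named "relation 2" (with a space).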