(Option 2) Create a TDR dataset with APIs

Anton Kovalsky

As an alternative to using the TDR GUI, once you have your dataset schema (see Defining your TDR dataset schema), you can specify the schema and create the dataset in TDR using the Swagger APIs by following the step-by-step directions below.

Before you start

Before you start this process, you should have an outline of your data model in mind. See Defining your TDR dataset schema for more details and examples.

Things to establish

  • What tables are needed to contain your data, and how are they related?
  • What is the “root entity” table?
    The root entity table is the table that includes the primary input data for your dataset. In Terra, this often refers to data that will be used as inputs for a workflow.

Making data findable with a standardized schema

The more alike each dataset's schema is, the easier it will be for other people to find useful data in the data repo. Below are some references to help with common ontology and organization.

See GA4GH Data Use Ontology

Step 1. Generate the schema JSON

The steps and examples below show how to write out a nested JavaScript Object Notation (JSON) file that describes the schema for your dataset - the tables and columns - as well as any relationships between those columns.

JSON components

  1. Tables: Names and types for BigQuery tables and columns
  2. Relationships: Links between columns

Troubleshooting complex JSON formatting

Getting all of the brackets just right in a nested JSON like this can be a little tricky! To avoid messing up the placement of a comma or a bracket, try a free online JSON validator.

Remember to authorize Swagger every time you use it

This article includes instructions on using API commands through the Swagger UI. All instructions related to Swagger require you to first authenticate yourself whenever you’ve opened a window with the Swagger UI.

Instructions
Click “Authorize” near the top of the page, check all of the boxes in the pop-up, hit “Authorize” again, and then input the appropriate credentials to authenticate. Make sure you close the subsequent pop-up without clicking the “Sign Out” button.

You should now be able to execute the commands below by clicking the “Try it out” button next to the command of your choice. For a more detailed description of this authentication step, see this article on Authenticating in Swagger.

1. Tables

Tables are declared as JSON objects with keys for

  • A name with a max length of 63 characters (matching the regex '^[a-zA-Z0-9][_a-zA-Z0-9]*$')
  • An optional partitioning “mode” with corresponding settings
  • A list of columns
  • An optional list of primary-key column names
    • If a column’s name is included in the “primaryKey” list of its parent table, it will be mapped to a BigQuery column with a REQUIRED mode. Otherwise, the column will be mapped to a NULLABLE mode.
    • Important note: Setting a column name as a "primaryKey" entry will not work if the column's "array_of" is set to true, since primary keys can't be arrays.

The specified table name will be the name of the corresponding BigQuery table.
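To illustrate the keys above, a minimal table block might look like the sketch below (the table and column names are placeholders; partitioning and primary keys are covered in more detail in the following sections):

Example JSON (minimal table object - names are placeholders)

{
  "name": "sample",
  "columns": [
    {
      "name": "sample_id",
      "datatype": "string",
      "array_of": false
    }
  ],
  "primaryKey": ["sample_id"],
  "partitionMode": "none"
}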

Reduce query runtime and cost by partitioning large tables

Tables can be partitioned in BigQuery to reduce query runtime and cost. The top-level “partitionMode” key in each table model specifies one of:

  • “none”, to create an unpartitioned table
  • “date”, to create a table partitioned by either ingest date or the values in a date column
  • “int”, to create a table partitioned by the values in an integer column

BigQuery best practices suggest always partitioning your tables, using the ingest date if there is no other meaningful choice in the table’s columns. These options map directly to corresponding BigQuery partition settings; see the BigQuery docs for more details.

  • If the “date” mode is specified, “datePartitionOptions” must also be specified. The options are a JSON object with a single key for “column”. The value of that key must be either “datarepo_ingest_date” (to partition by ingest date), or the name of a column in the table with datatype DATE or TIMESTAMP.
  • If the “int” mode is specified, “intPartitionOptions” must also be specified (see the example after this list). The options are a JSON object with four keys:
    • “column”: The name of an INT64 or INTEGER column in the table
    • “min”: The smallest value in the column that should be partitioned
    • “max”: The largest value in the column that should be partitioned
    • “interval”: The range-size to use when dividing values between “min” and “max” into partitions
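As a sketch (the column name and the min/max/interval values here are purely illustrative), a table partitioned by an integer column would combine these keys like this:

Example JSON (integer partitioning - values are illustrative)

{
  "name": "sample",
  "columns": [
    {
      "name": "read_count",
      "datatype": "int64",
      "array_of": false
    }
  ],
  "partitionMode": "int",
  "intPartitionOptions": {
    "column": "read_count",
    "min": 0,
    "max": 1000000,
    "interval": 10000
  }
}

A date-partitioned table would instead set "partitionMode": "date" and supply "datePartitionOptions": { "column": "datarepo_ingest_date" } (or the name of a DATE or TIMESTAMP column in the table).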

2. Columns

Table columns are also represented as JSON objects, with keys for the following.

JSON object keys

  • A name, with the same restrictions as table names
  • A data-type (e.g., string, number, boolean)
  • An “array_of” boolean
  • A "required" boolean option

Any column with a true value for “array_of” will be mapped to a BigQuery column with a REPEATED mode.
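For example, a hypothetical column holding a list of string values could be declared as shown below; because "array_of" is true, it becomes a REPEATED BigQuery column and cannot be used as a primary key:

Example JSON (array column - the column name is illustrative)

{
  "name": "aliases",
  "datatype": "string",
  "array_of": true,
  "required": false
}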

3. Data Types

Data types are set per column (i.e., separately for each column block).

Example JSON (define table columns)

"columns": [{
"name": "sample_id",
"datatype": "string",
"array_of": false
},
{
"name": "BAM_File_Path",
"datatype": "fileref",
"array_of": false
}
]

Setting the data type of strings that are links to files (URIs)

To render a column's cells as links to cloud file paths when you export the snapshot to the data tab of your Terra workspace, you must set the data type to fileref in the schema.

Data types in TDR, BigQuery, and Azure Synapse

When creating a dataset in TDR, you will need to supply the data type for each column. Most TDR types pass through to BigQuery types of the same name. A few extra types are supported by TDR, either as a convenience or to add more semantic information to the table metadata.

Use the table below to help guide your choices. 


TDR Datatype | BigQuery Type | Synapse Type | Examples/Warnings
BOOLEAN | BOOLEAN | BIT | TRUE and FALSE
BYTES | BYTES | VARBINARY | Variable-length binary data
DATE | DATE | DATE | 'YYYY-[M]M-[D]D' (4-digit year, 1- or 2-digit month, and 1- or 2-digit day)
DATETIME | DATETIME | DATETIME2 | YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.F]]. Note: Datetime and Time data types do not care about timezone; BQ stores and returns them in the format provided.
TIME | TIME | TIME | [H]H:[M]M:[S]S[.DDDDDD|.F]. Note: TDR currently only accepts timestamps in timezone UTC. BQ stores this value as a long; the UI converts it to a UTC timestamp, but the data endpoint returns the long value, so if you use the endpoint directly you will have to perform this conversion to get an understandable value.
TIMESTAMP | TIMESTAMP | DATETIME2 | YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.F]][time zone]
FLOAT | FLOAT | FLOAT | Float and Float64 point to the same underlying data type, so they are equivalent.
FLOAT64 | FLOAT | FLOAT |
INTEGER | INTEGER | INT |
INT64 | INTEGER | BIGINT |
NUMERIC | NUMERIC | REAL | For very large float data or for data where calculations will be performed on the data.
STRING | STRING | varchar(8000) |
TEXT | STRING | varchar(8000) |
FILEREF | STRING | varchar(36) | Stores UUIDs that map to an ingested file; these are translated to DRS URLs on snapshot creation.
DIRREF | STRING | varchar(36) |

Helpful hints for creating a valid JSON

Using a free online JSON validator can be quite helpful when writing out the full JSON. If you're struggling to create a valid JSON, it may help to copy-paste the example code in the Swagger UI request body for that particular API. You can then make changes to the template incrementally while validating each change.

Note: certain parameters - such as "tables", "relationships", and "assets" - are expected to be lists, so make sure you include square brackets: [ ]

Example JSON object

The JSON object below generates a dataset schema with two tables: a "subject" table with two columns and a "sample" table with three columns. The tables and columns are defined in the "tables" section of the JSON. The relationship between the matching columns is set in the "relationships" section.

[Image: Relationship between the subject table and the sample table]

Example JSON - Generate two tables connected by matching columns

{
  "schema": {
    "tables": [
      {
        "name": "sample",
        "columns": [
          {
            "name": "sample_id",
            "datatype": "string",
            "array_of": false
          },
          {
            "name": "BAM_File_path",
            "datatype": "fileref",
            "array_of": false
          },
          {
            "name": "subject_id",
            "datatype": "string",
            "array_of": false
          }
        ],
        "primaryKey": [],
        "partitionMode": "none",
        "datePartitionOptions": null,
        "intPartitionOptions": null,
        "rowCount": null
      },
      {
        "name": "subject",
        "columns": [
          {
            "name": "subject_id",
            "datatype": "string",
            "array_of": false
          },
          {
            "name": "phenotype",
            "datatype": "string",
            "array_of": false
          }
        ],
        "primaryKey": [],
        "partitionMode": "none",
        "datePartitionOptions": null,
        "intPartitionOptions": null,
        "rowCount": null
      }
    ],
    "relationships": [
      {
        "name": "subject",
        "from": {
          "table": "subject",
          "column": "subject_id"
        },
        "to": {
          "table": "sample",
          "column": "subject_id"
        }
      }
    ]
  }
}

The contents of the "schema" object above can be pasted into the "schema" parameter field of the JSON used for dataset creation (see Step 2 below).

Step 2. Create the dataset in Swagger (APIs)

Dataset creation

Use the createDataset API endpoint.

Remember to authorize Swagger every time you use it

Click “Authorize” near the top of the page, check all four boxes (including the last one about billing, which may not be checked by default) in the pop-up, and hit “Authorize” again. Then input the appropriate credentials to authenticate. Make sure you close the subsequent pop-up without clicking the “Sign Out” button.

You should now be able to execute the commands below by clicking the “Try it out” button next to the command of your choice. For a more detailed description of this authentication step, see How to authenticate/troubleshoot Swagger.

createDataset parameters

  • You'll need at least one Billing profile ID, but you can include additional Billing profile IDs if you want to allow for sharding file storage across billing accounts.
  • You can include the storage region for the dataset if you want the data and metadata stored somewhere other than the default region.
  • You'll need the schema JSON you worked out in Step 1 so that you can nest it in the "schema" parameter.

createDataset request body

{
  "cloudPlatform": "gcp",
  "name": "dataset_name",
  "region": "us-central1",
  "description": "string",
  "defaultProfileId": "/* the profile ID you generated when you created your billing profile */",
  "schema": { /* a schema model, such as the schema shown in this article */ }
}

Tracking your Dataset creation and retrieving its information

Successfully submitting your request to create the dataset is also called successfully submitting a "job".

Successful submissions: What to expect

You'll see a response code below the "Execute" button (successful response codes are 200-202), and the response body will contain an "id" field. This is the job's ID, and you can use it to track the completion of this API request. The same is true for many other types of tasks done via the API - they launch jobs, and those jobs have their own job IDs. The progress of any such job can be tracked using the retrieveJob API endpoint in the Jobs section of the Swagger page.
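For reference, the response body of a successful submission looks something like the sketch below. The exact set of fields may vary; treat everything other than the "id" field as illustrative here:

Example response body (abbreviated - values are illustrative)

{
  "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "job_status": "running"
}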


Once the job has finished running, you can use the retrieveJobResult endpoint in the repository section to retrieve the job’s information. If the job failed, the returned result will describe the errors that caused the failure. If the job succeeded, the result will describe the new TDR dataset. The “id” field of this result is the UUID of the dataset, and it is a required parameter in all future API calls affecting the new dataset.
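If the dataset was created successfully, the result includes the dataset's UUID in its "id" field, along the lines of this abbreviated sketch (fields other than "id" are illustrative):

Example result (abbreviated - values are illustrative)

{
  "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "name": "dataset_name",
  "description": "string"
}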

Finding the dataset's unique UUID

You may find it convenient that the UUID, which is unique to any given dataset, can also be found in the URL bar when you're viewing the dataset through the Data Repo UI at data.terra.bio.


How to retrieve the schema of an existing dataset

You can view the JSON for the schema of any dataset to which you have access using the retrieveDataset API endpoint. If you select "SCHEMA" in the include menu, the response body will contain only the schema for the dataset.


How to update a dataset's schema

You can update the schema of an existing dataset using the updateSchema API endpoint. The endpoint currently supports adding tables, adding non-required columns to an existing table, and adding relationships to an existing dataset.

Currently, you cannot delete or rename a table; you can only add.

The endpoint requires a description of the change and a list of changes to apply.

updateSchema request body (add new table and column)

{  
  "description": "Adding a table and column",
  "changes": {
    "addTables": [...],
    "addColumns": [...]
  }
}

Add new tables

The following is an example API request payload to add a new table. Note the items in "addTables" follow the same format as the "tables" in the dataset schema definition.

updateSchema request body (add new table)

{
  "description": "Adding a table",
  "changes": {
    "addTables": [
      {
        "name": "project",
        "columns": [
          {
            "name": "id",
            "datatype": "string",
            "required": true
          },
          {
            "name": "collaborators",
            "datatype": "string",
            "array_of": true
          }
        ],
       "primaryKey": ["id"]
      }
    ]
  }
}

Adding columns to existing tables

The following is an example API request payload to add new columns to existing tables. Note that the new columns cannot be set to required. Multiple tables can be updated in the same request:

updateSchema request body (add columns to an existing table)

{
  "description": "Adding columns to existing tables",
  "changes": {
    "addColumns": [
      {
        "tableName": "bam_file",
        "columns": [
          {
            "name": "size",
            "datatype": "integer"
          }
        ]
      },
      {
        "tableName": "participant",
        "columns": [
          {
            "name": "age",
            "datatype": "integer"
          },
          {
            "name": "weight",
            "datatype": "integer"
          }
        ]
      }
    ]
  }
}

Adding relationships to an existing dataset

The following is an example API request payload to add relationships between existing tables in a dataset. Multiple relationships can be added in the same request.

updateSchema request body (add relationships to an existing dataset)

{
  "description": "Adding relationships to existing tables",
  "changes": {
    "addRelationships": [
      {
        "name": "string",
        "from": {
          "table": "string",
          "column": "string"
        },
        "to": {
          "table": "string",
          "column": "string"
        }
      }
    ]
  }
}
