How to create a TDR dataset with APIs

Leyla Tarhan

If you're interested in using Terra on Azure, please email terra-enterprise@broadinstitute.org.

As an alternative to using the TDR GUI, once you have your dataset schema (see Defining your TDR dataset schema), you can specify the schema and create the dataset in TDR using the Swagger APIs by following the step-by-step directions below.

If you prefer to use the TDR website, see How to create a dataset on the TDR website. This might be a good option if you are not comfortable working with APIs and complex JSON.

Step 1. Create a schema

Before you start this process, you should write your dataset's schema in JSON. See Overview: Defining your TDR dataset schema and How to write a TDR Dataset Schema for more details and examples.
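
For orientation, below is a minimal sketch of the kind of schema JSON you'll nest in the request body in Step 2. The table and column names are hypothetical; see the articles above for the full format and more realistic examples.

Minimal schema sketch

{
  "tables": [
    {
      "name": "participant",
      "columns": [
        {
          "name": "id",
          "datatype": "string",
          "required": true
        }
      ],
      "primaryKey": ["id"]
    }
  ]
}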

Step 2. Create the dataset in Swagger (APIs)

Remember to authorize Swagger every time you use it. See How to authenticate/troubleshoot Swagger for TDR for step-by-step instructions.

createDataset parameters

  • cloudPlatform: You can set your cloudPlatform to "gcp" or "azure." If you're using a Google-backed TDR billing profile, set your dataset's platform to gcp. If you're using an Azure-backed TDR billing profile, set your platform to azure.
  • defaultProfileId: You'll need at least one Billing profile ID, but you can include additional Billing profile IDs, if you want to allow for sharding file storage across billing accounts.
  • schema: You'll need to write out your schema in JSON so that you can nest it in the "schema" parameter.
  • region: You'll need to include the dataset's storage region to specify where the files and metadata are stored. 
    • We recommend using one of the following regions for Azure-backed TDR datasets: "brazilsouth", "australiaeast", "eastus", "swedencentral", "westeurope", "koreacentral", "eastus2", "westus2", "qatarcentral", "southcentralus", "germanywestcentral", "francecentral", "norwayeast", "southafricanorth", "japaneast", "westus3", "uksouth", "northeurope", "eastasia", "southeastasia", "canadacentral", "centralus", "centralindia", "switzerlandnorth", "uaenorth"
  • enableSecureMonitoring: You can optionally set up secure monitoring for your dataset, to log all data access requests. Logs will be saved to wherever your data are staged in the cloud (e.g., an Azure storage container).
  • See the "schema" section of the Swagger documentation for a complete list of parameters.

createDataset request body

{
  "cloudPlatform": "azure", /* or "gcp" */
  "name": "dataset_name",
  "region": "centralus",
  "description": "string",
  "defaultProfileId": "/* the profile ID you generated when you created your billing profile */",
  "enableSecureMonitoring": false, /* set to true if you want to log all requests to access your dataset */
  "schema": { /* a schema model such as the schema shown in this article */ }
}
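
If you'd rather script the request than click "Execute" in the Swagger UI, the Python sketch below submits the same request body. It assumes the createDataset path shown on the Swagger page (POST /api/repository/v1/datasets at data.terra.bio) and a valid bearer token; verify both against your Swagger UI before running.

createDataset in Python (sketch)

import requests

TDR = "https://data.terra.bio"
TOKEN = "ya29...."  # your OAuth access token; see How to authenticate/troubleshoot Swagger for TDR

payload = {
    "cloudPlatform": "azure",  # or "gcp"
    "name": "dataset_name",
    "region": "centralus",
    "description": "Example dataset",
    "defaultProfileId": "11111111-2222-3333-4444-555555555555",  # hypothetical billing profile ID
    "enableSecureMonitoring": False,
    "schema": {  # the schema you wrote in Step 1
        "tables": [
            {
                "name": "participant",
                "columns": [{"name": "id", "datatype": "string", "required": True}],
                "primaryKey": ["id"],
            }
        ]
    },
}

resp = requests.post(
    f"{TDR}/api/repository/v1/datasets",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
job_id = resp.json()["id"]  # the job ID used to track this request (see below)
print("createDataset job submitted:", job_id)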

Tracking your dataset creation and retrieving its information

Successfully submitting your request to create the dataset is also called successfully submitting a "job".

Successful submissions: What to expect

You'll see a response code below the "Execute" button (successful response codes are 200-202), and the response body will contain an "id" field. This is the job's ID, and you can use it to track the completion of this API request. The same is true for many other types of tasks done via the API - they launch jobs, and those jobs have their own job IDs. The progress of any such job can be tracked using the retrieveJob API endpoint in the Jobs section of the Swagger page.

[Screenshot: the Jobs section of TDR's API endpoints in the Swagger UI, with the retrieveJob and retrieveJobResult endpoints highlighted.]

Once the job has finished running, you can use the retrieveJobResult endpoint in the repository section to retrieve the job's information. If the job failed, the returned result will describe the errors that caused the failure. If the job succeeded, the result will describe the new TDR dataset. The "id" field of this result is the UUID of the dataset, and it is a required parameter in all future API calls affecting the new dataset. You can also find this UUID in the dataset's summary tab on the Data Repo UI at data.terra.bio.
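
As a sketch, the tracking loop might look like this in Python. The endpoint paths match the Swagger page at the time of writing, and the "job_status" field name comes from TDR's job model; verify both in your Swagger UI.

Tracking a job in Python (sketch)

import time

import requests

TDR = "https://data.terra.bio"
TOKEN = "ya29...."  # same bearer token as above
headers = {"Authorization": f"Bearer {TOKEN}"}
job_id = "..."  # the "id" returned when you submitted the job

# Poll retrieveJob until the job leaves the "running" state
while True:
    job = requests.get(f"{TDR}/api/repository/v1/jobs/{job_id}", headers=headers).json()
    if job["job_status"] != "running":
        break
    time.sleep(10)

# retrieveJobResult returns the new dataset's description on success,
# or the errors that caused the failure
result = requests.get(f"{TDR}/api/repository/v1/jobs/{job_id}/result", headers=headers).json()
print("Dataset UUID:", result.get("id"))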

How to update a dataset's schema

You can update the schema of an existing dataset using the updateSchema API endpoint. The endpoint currently supports adding tables, adding non-required columns to an existing table, and adding relationships to an existing dataset.

Currently, you cannot delete or rename a table; you can only add.

The endpoint requires a description of the change and a list of changes to apply.

updateSchema request body (add new table and column)

{  
  "description": "Adding a table and column",
  "changes": {
    "addTables": [...],
    "addColumns": [...]
  }
}
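
The call itself can be scripted the same way as createDataset. The sketch below assumes the updateSchema path shown on the Swagger page (POST /api/repository/v1/datasets/{id}/updateSchema); like createDataset, it launches a job whose ID you can track as described above.

updateSchema in Python (sketch)

import requests

TDR = "https://data.terra.bio"
TOKEN = "ya29...."  # bearer token, as before
dataset_id = "..."  # the dataset UUID from the createDataset job result

changes = {
    "description": "Adding a table and column",
    "changes": {
        "addTables": [],   # fill in as in the examples below
        "addColumns": [],  # fill in as in the examples below
    },
}

resp = requests.post(
    f"{TDR}/api/repository/v1/datasets/{dataset_id}/updateSchema",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=changes,
)
resp.raise_for_status()
print("updateSchema job submitted:", resp.json()["id"])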

Add new tables

The following is an example API request payload to add a new table. Note the items in "addTables" follow the same format as the "tables" in the dataset schema definition.

updateSchema request body (add new table)

{
  "description": "Adding a table",
  "changes": {
    "addTables": [
      {
        "name": "project",
        "columns": [
          {
            "name": "id",
            "datatype": "string",
            "required": true
          },
          {
            "name": "collaborators",
            "datatype": "string",
            "array_of": true
          }
        ],
       "primaryKey": ["id"]
      }
    ]
  }
}

Adding columns to existing tables

The following is an example API request payload to add new columns to existing tables. Note that the new columns cannot be set to required. Multiple tables can be updated in the same request:

updateSchema request body (add columns to an existing table)

{
  "description": "Adding columns to existing tables",
  "changes": {
    "addColumns": [
      {
        "tableName": "bam_file",
        "columns": [
          {
            "name": "size",
            "datatype": "integer"
          }
        ]
      },
      {
        "tableName": "participant",
        "columns": [
          {
            "name": "age",
            "datatype": "integer"
          },
          {
            "name": "weight",
            "datatype": "integer"
          }
        ]
      }
    ]
  }
}

Adding relationships to an existing dataset

The following is an example API request payload to add relationships between existing tables in a dataset. Multiple relationships can be added in the same request; a filled-in version follows the generic payload below.

updateSchema request body (add relationships to an existing dataset)

{
  "description": "Adding relationships to existing tables",
  "changes": {
    "addRelationships": [
      {
        "name": "string",
        "from": {
          "table": "string",
          "column": "string"
        },
        "to": {
          "table": "string",
          "column": "string"
        }
      }
    ]
  }
}
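
For instance, a filled-in version of this payload might look like the following, using the hypothetical participant and bam_file tables from the earlier examples (the relationship and column names here are illustrative):

updateSchema request body (concrete relationship example)

{
  "description": "Link bam_file rows to participants",
  "changes": {
    "addRelationships": [
      {
        "name": "bam_file_to_participant",
        "from": {
          "table": "bam_file",
          "column": "participant_id"
        },
        "to": {
          "table": "participant",
          "column": "id"
        }
      }
    ]
  }
}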
