How to create a TDR dataset with APIs

Anton Kovalsky
  • Updated

As an alternative to using the TDR GUI, once you have your dataset schema (see Defining your TDR dataset schema), you can specify the schema and create the dataset in TDR using the Swagger APIs by following the step-by-step directions below.

If you prefer to use the TDR website, see How to create a dataset on the TDR website. This may be a better option if you are not comfortable working with APIs and complex JSON.

Step 1. Create a schema

Before you start this process, you should write your dataset's schema in JSON. See Overview: Defining your TDR dataset schema and How to write a TDR Dataset Schema for more details and examples.

Step 2. Create the dataset in Swagger (APIs)

Remember to authorize Swagger every time you use it:
  • Click Authorize near the top of the page.
  • Choose an authentication method: googleoauth or oidc.
  • If you're running the createProfile endpoint to create a Google-backed billing profile, use googleoauth authentication and check all four boxes (including the last one about billing, which may not be checked by default).
  • Otherwise, choose either authentication method, but not both.
  • Click Authorize again.
  • Input the appropriate credentials.
  • When you close the pop-up window, do not click Sign Out.

Use the createDataset API endpoint to create a new TDR dataset.

createDataset parameters

  • cloudPlatform: You can set cloudPlatform to "gcp" or "azure". If you're using a Google billing account for your TDR billing profile, set your dataset's platform to "gcp". If you're using an Azure-backed TDR billing profile, set it to "azure".
  • defaultProfileId: You'll need at least one billing profile ID, but you can include additional billing profile IDs if you want to shard file storage across billing accounts.
  • schema: You'll need to write your schema as JSON so that you can nest it in the "schema" parameter.
  • region: You can optionally include the storage region for the dataset, if you want the data and metadata stored somewhere other than the default region.
  • enableSecureMonitoring: You can optionally set up secure monitoring for your dataset, to log all data access requests. Logs will be saved to wherever your data are staged in the cloud (e.g., a google bucket).
  • See the "schema" section of the Swagger documentation for a complete list of parameters.

createDataset request body

{
  "cloudPlatform": "gcp", /* or "azure" */
  "name": "dataset_name",
  "region": "us-central1",
  "description": "string",
  "defaultProfileId": "profile-id", /* the profile ID you generated when you created your billing profile */
  "enableSecureMonitoring": false, /* set to true if you want to log all requests to access your dataset */
  "schema": { /* a schema model such as the schema shown in this article */ }
}
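The request body above can also be assembled programmatically before you paste it into Swagger or send it with an HTTP client. The helper below is a minimal sketch; the function name, the placeholder profile ID, and the empty schema are illustrative, not part of the TDR API itself.

```python
import json

def build_create_dataset_request(name, profile_id, schema,
                                 cloud_platform="gcp",
                                 region="us-central1",
                                 description="",
                                 enable_secure_monitoring=False):
    """Assemble the JSON body for the createDataset endpoint."""
    return {
        "cloudPlatform": cloud_platform,
        "name": name,
        "region": region,
        "description": description,
        "defaultProfileId": profile_id,
        "enableSecureMonitoring": enable_secure_monitoring,
        "schema": schema,
    }

body = build_create_dataset_request(
    name="my_dataset",
    profile_id="00000000-0000-0000-0000-000000000000",  # placeholder profile ID
    schema={"tables": []},  # replace with your real schema model
)
print(json.dumps(body, indent=2))
```

Keeping the body in a dict and serializing it with `json.dumps` avoids the hand-edited-comma errors that are easy to make in raw JSON.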

Tracking your Dataset creation and retrieving its information

Successfully submitting your request to create the dataset is also called successfully submitting a "job".

Successful submissions: What to expect

You'll see a response code below the "Execute" button (successful response codes are 200-202), and the response will contain an "id" field. This is the job's ID, and you can use it to track the completion of this API request. The same is true for many other types of tasks done via the API - they launch jobs, and those jobs have their own job IDs. The progress of any such job can be tracked using the retrieveJob API endpoint in the Jobs section of the Swagger page.


Once the job has finished running, you can use the retrieveJobResult endpoint in the repository section to retrieve the job’s information. If the job failed, the returned result will describe the errors that caused the failure. If the job succeeded, the result will describe the new TDR dataset. The “id” field of this result is the UUID of the dataset and this is a required parameter in all future API calls affecting the new dataset.
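If you script this instead of using the Swagger page, the retrieveJob call becomes a polling loop. The sketch below keeps the HTTP call behind an injectable `fetch_status` callable (for example, a wrapper around the retrieveJob endpoint); the status values "running", "succeeded", and "failed" are assumptions about the response shape, so check the Swagger documentation for the exact field names.

```python
import time

def wait_for_job(fetch_status, poll_seconds=5, max_polls=120):
    """Poll until the job leaves the 'running' state, then return the status."""
    for _ in range(max_polls):
        status = fetch_status()
        if status != "running":
            return status  # e.g. "succeeded" or "failed"
        time.sleep(poll_seconds)
    raise TimeoutError("job did not finish within the polling window")
```

Injecting the fetcher keeps the loop testable offline; in practice you would pass a lambda that calls the retrieveJob endpoint with your job ID and extracts the status field.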

Finding the dataset's unique UUID

You may find it convenient that the UUID, which is unique to each dataset, can also be found in the URL bar when you're viewing the dataset through the Data Repo UI at data.terra.bio.
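Assuming the Data Repo UI URL contains the dataset UUID as a path segment (an assumption about the URL format, so verify against your own browser's address bar), a standard UUID regex can pull it out:

```python
import re

# Matches a standard 8-4-4-4-12 hex UUID anywhere in a string.
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)

def dataset_uuid_from_url(url):
    """Return the first UUID found in the URL, or None if there isn't one."""
    match = UUID_RE.search(url)
    return match.group(0) if match else None
```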

How to update a dataset's schema

You can update the schema of an existing dataset using the updateSchema API endpoint. The endpoint currently supports adding tables, adding non-required columns to an existing table, and adding relationships to an existing dataset.

Currently, you cannot delete or rename a table; you can only add.

The endpoint requires a description of the change and a list of changes to apply.

updateSchema request body (add new table and column)

{  
  "description": "Adding a table and column",
  "changes": {
    "addTables": [...],
    "addColumns": [...]
  }
}

Add new tables

The following is an example API request payload to add a new table. Note the items in "addTables" follow the same format as the "tables" in the dataset schema definition.

updateSchema request body (add new table)

{
  "description": "Adding a table",
  "changes": {
    "addTables": [
      {
        "name": "project",
        "columns": [
          {
            "name": "id",
            "datatype": "string",
            "required": true
          },
          {
            "name": "collaborators",
            "datatype": "string",
            "array_of": true
          }
        ],
        "primaryKey": ["id"]
      }
    ]
  }
}

Adding columns to existing tables

The following is an example API request payload to add new columns to existing tables. Note that the new columns cannot be set to required. Multiple tables can be updated in the same request:

updateSchema request body (add columns to an existing table)

{
  "description": "Adding columns to existing tables",
  "changes": {
    "addColumns": [
      {
        "tableName": "bam_file",
        "columns": [
          {
            "name": "size",
            "datatype": "integer"
          }
        ]
      },
      {
        "tableName": "participant",
        "columns": [
          {
            "name": "age",
            "datatype": "integer"
          },
          {
            "name": "weight",
            "datatype": "integer"
          }
        ]
      }
    ]
  }
}
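Since new columns cannot be marked required, it can be worth checking a change list before submitting it. The helper below is a hypothetical pre-flight check (not part of the TDR API) that enforces that rule on an "addColumns" payload:

```python
def check_add_columns(changes):
    """Raise ValueError if any column being added is marked required."""
    for table_change in changes.get("addColumns", []):
        for column in table_change["columns"]:
            if column.get("required"):
                raise ValueError(
                    f"column {column['name']!r} in table "
                    f"{table_change['tableName']!r} cannot be required"
                )

changes = {
    "addColumns": [
        {"tableName": "bam_file",
         "columns": [{"name": "size", "datatype": "integer"}]},
    ]
}
check_add_columns(changes)  # passes silently for a valid change list
```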

Adding relationships to an existing dataset

The following is an example API request payload to add relationships between existing tables in a dataset. Multiple relationships can be added in the same request.

updateSchema request body (add relationships to an existing dataset)

{
  "description": "Adding relationships to existing tables",
  "changes": {
    "addRelationships": [
      {
        "name": "string",
        "from": {
          "table": "string",
          "column": "string"
        },
        "to": {
          "table": "string",
          "column": "string"
        }
      }
    ]
  }
}
