(Option 2) Create a TDR dataset with APIs

Anton Kovalsky

As an alternative to using the TDR GUI, once you have your dataset schema (see Defining your TDR dataset schema), you can specify the schema and create the dataset in TDR using the Swagger APIs by following the step-by-step directions below.

Before you start

Before you start this process, you should have an outline of your data model in mind. See Defining your TDR dataset schema for more details and examples.

Things to establish

  • What tables are needed to contain your data, and how are they related?
  • What is the “root entity” table?
    The root entity table is the table that includes the primary input data for your dataset. In Terra, this often refers to data that will be used as inputs for a workflow.

Making data findable with a standardized schema

The more alike each dataset's schema is, the easier it will be for other people to find useful data in the data repo. Below are some references to help with common ontology and organization.

See GA4GH Data Use Ontology

Step 1. Generate the schema JSON

The steps and examples below show how to write out a nested JavaScript Object Notation (JSON) file that describes the schema for your dataset - the tables and columns - as well as any relationships between those columns.

JSON components

  1. Tables: Names and types for BigQuery tables and columns
  2. Relationships: Links between columns

Troubleshooting complex JSON formatting

Getting all of the brackets just right in a nested JSON like this can be a little tricky! To avoid messing up the placement of a comma or a bracket, try a free online JSON validator.

Remember to authorize Swagger every time you use it

This article includes instructions on using API commands through the Swagger UI. All instructions related to Swagger require you to first authenticate yourself whenever you’ve opened a window with the Swagger UI.

Instructions
Click “Authorize” near the top of the page, check all of the boxes in the pop-up, hit “Authorize” again, and then input the appropriate credentials to authenticate. Make sure you close the subsequent pop-up without clicking the “Sign Out” button.

You should now be able to execute the commands below by clicking the “Try it out” button next to the command of your choice. For a more detailed description of this authentication step, see this article on Authenticating in Swagger.

1. Tables

Tables are declared as JSON objects with keys for

  • A name with a max length of 63 characters (matching the regex '^[a-zA-Z0-9][_a-zA-Z0-9]*$')
  • An optional partitioning “mode” with corresponding settings
  • A list of columns
  • An optional list of primary-key column names
    • If a column’s name is included in the “primaryKey” list of its parent table, it will be mapped to a BigQuery column with a REQUIRED mode. Otherwise, the column will be mapped to a NULLABLE mode.
    • Important note: Setting a column name as a "primaryKey" entry will not work if the column's "array_of" is set to true, since primary keys can't be arrays.

The specified table name will be the name of the corresponding BigQuery table.
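To illustrate the keys above, a minimal table block might look like the sketch below (the table and column names are placeholders; partitioning and primary keys are covered in more detail in the following sections):

Example JSON (minimal table object - names are placeholders)

{
  "name": "sample",
  "columns": [
    {
      "name": "sample_id",
      "datatype": "string",
      "array_of": false
    }
  ],
  "primaryKey": ["sample_id"],
  "partitionMode": "none"
}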

Reduce query runtime and cost by partitioning large tables

Tables can be partitioned in BigQuery to reduce query runtime and cost. The top-level “partitionMode” key in each table model specifies one of:

  • “none”, to create an unpartitioned table
  • “date”, to create a table partitioned by either ingest date or the values in a date column
  • “int”, to create a table partitioned by the values in an integer column

BigQuery best practices suggest always partitioning your tables, using the ingest date if there is no other meaningful choice in the table’s columns. These options map directly to corresponding BigQuery partition settings; see the BigQuery docs for more details.

  • If the “date” mode is specified, “datePartitionOptions” must also be specified. The options are a JSON object with a single key for “column”. The value of that key must be either “datarepo_ingest_date” (to partition by ingest date), or the name of a column in the table with datatype DATE or TIMESTAMP.
  • If the “int” mode is specified, “intPartitionOptions” must also be specified (see the example after this list). The options are a JSON object with four keys:
    • “column”: The name of an INT64 or INTEGER column in the table
    • “min”: The smallest value in the column that should be partitioned
    • “max”: The largest value in the column that should be partitioned
    • “interval”: The range-size to use when dividing values between “min” and “max” into partitions
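As a sketch (the column name and the min/max/interval values here are purely illustrative), a table partitioned by an integer column would combine these keys like this:

Example JSON (integer partitioning - values are illustrative)

{
  "name": "sample",
  "columns": [
    {
      "name": "read_count",
      "datatype": "int64",
      "array_of": false
    }
  ],
  "partitionMode": "int",
  "intPartitionOptions": {
    "column": "read_count",
    "min": 0,
    "max": 1000000,
    "interval": 10000
  }
}

A date-partitioned table would instead set "partitionMode": "date" and supply "datePartitionOptions": { "column": "datarepo_ingest_date" } (or the name of a DATE or TIMESTAMP column in the table).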

2. Columns

Table columns are also represented as JSON objects, with keys for the following.

JSON object keys

  • A name, with the same restrictions as table names
  • A data-type (e.g., string, number, boolean)
  • An “array_of” boolean
  • A "required" boolean option

Any column with a true value for “array_of” will be mapped to a BigQuery column with a REPEATED mode.
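For example, a hypothetical column holding a list of string values could be declared as shown below; because "array_of" is true, it becomes a REPEATED BigQuery column and cannot be used as a primary key:

Example JSON (array column - the column name is illustrative)

{
  "name": "aliases",
  "datatype": "string",
  "array_of": true,
  "required": false
}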

3. Data Types

Data types are set per column (i.e., separately for each column block).

Example JSON (define table columns)

"columns": [{
"name": "sample_id",
"datatype": "string",
"array_of": false
},
{
"name": "BAM_File_Path",
"datatype": "fileref",
"array_of": false
}
]

Setting the data type of strings that are links to files (URIs)

To render a column's cells as links to cloud file paths when you export the snapshot to the data tab of your Terra workspace, you must set the data type to fileref in the schema.

Data types in TDR, BigQuery, and Azure Synapse

When creating a dataset in TDR, you will need to supply the data type for each column. Most TDR types pass through to BigQuery types of the same name. A few extra types are supported by TDR, either as a convenience or to add more semantic information to the table metadata.

Use the table below to help guide your choices. 


TDR Datatype | BigQuery Type | Synapse Type | Examples/Warnings
BOOLEAN | BOOLEAN | BIT | TRUE and FALSE
BYTES | BYTES | VARBINARY | Variable-length binary data
DATE | DATE | DATE | 'YYYY-[M]M-[D]D' (4-digit year, 1- or 2-digit month, and 1- or 2-digit day)
DATETIME | DATETIME | DATETIME2 | YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.F]]. Note: Datetime and Time data types do not care about timezone; BQ stores and returns them in the format provided.
TIME | TIME | TIME | [H]H:[M]M:[S]S[.DDDDDD|.F]. Note: TDR currently only accepts timestamps in timezone UTC. BQ stores this value as a long; the UI converts it to a UTC timestamp, but the data endpoint returns the long value, so if you use the endpoint directly you will have to perform this conversion to get an understandable value.
TIMESTAMP | TIMESTAMP | DATETIME2 | YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.F]][time zone]
FLOAT | FLOAT | FLOAT | Float and Float64 point to the same underlying data type, so they are equivalent.
FLOAT64 | FLOAT | FLOAT |
INTEGER | INTEGER | INT |
INT64 | INTEGER | BIGINT |
NUMERIC | NUMERIC | REAL | For very large float data or for data where calculations will be performed on the data.
STRING | STRING | varchar(8000) |
TEXT | STRING | varchar(8000) |
FILEREF | STRING | varchar(36) | Stores UUIDs that map to an ingested file; these are translated to DRS URLs on snapshot creation.
DIRREF | STRING | varchar(36) |

Helpful hints for creating a valid JSON

Using a free online JSON validator can be quite helpful when writing out the full JSON. If you're struggling to create a valid JSON, it may help to copy-paste the example code in the Swagger UI request body for that particular API. You can then make changes to the template incrementally while validating each change.

Note: certain parameters - such as "tables", "relationships", and "assets" - are expected to be lists, so make sure you include square brackets: [ ]

Example JSON object

The JSON object below generates a dataset schema with two tables: a "subject" table with two columns and a "sample" table with three columns. The tables and columns are defined in the "tables" section of the JSON. The relationship between the matching columns is set in the "relationships" section.

[Image: Relationship between the subject table and the sample table]

Example JSON - Generate two tables connected by matching columns

{
  "schema": {
    "tables": [
      {
        "name": "sample",
        "columns": [
          {
            "name": "sample_id",
            "datatype": "string",
            "array_of": false
          },
          {
            "name": "BAM_File_path",
            "datatype": "fileref",
            "array_of": false
          },
          {
            "name": "subject_id",
            "datatype": "string",
            "array_of": false
          }
        ],
        "primaryKey": [],
        "partitionMode": "none",
        "datePartitionOptions": null,
        "intPartitionOptions": null,
        "rowCount": null
      },
      {
        "name": "subject",
        "columns": [
          {
            "name": "subject_id",
            "datatype": "string",
            "array_of": false
          },
          {
            "name": "phenotype",
            "datatype": "string",
            "array_of": false
          }
        ],
        "primaryKey": [],
        "partitionMode": "none",
        "datePartitionOptions": null,
        "intPartitionOptions": null,
        "rowCount": null
      }
    ],
    "relationships": [
      {
        "name": "subject",
        "from": {
          "table": "subject",
          "column": "subject_id"
        },
        "to": {
          "table": "sample",
          "column": "subject_id"
        }
      }
    ]
  }
}

The contents of the "schema" object above can be pasted into the "schema" parameter field of the JSON used for dataset creation (see Step 2 below).

Step 2. Create the dataset in Swagger (APIs)

Dataset creation

Use the createDataset API endpoint.

Remember to authorize Swagger every time you use it

Click “Authorize” near the top of the page, check all four boxes (including the last one about billing, which may not be checked by default) in the pop-up, and hit “Authorize” again. Then input the appropriate credentials to authenticate. Make sure you close the subsequent pop-up without clicking the “Sign Out” button.

You should now be able to execute the commands below by clicking the “Try it out” button next to the command of your choice. For a more detailed description of this authentication step, see How to authenticate/troubleshoot Swagger.

createDataset parameters

  • You'll need at least one Billing profile ID, but you can include additional Billing profile IDs if you want to allow for sharding file storage across billing accounts.
  • You can include the storage region for the dataset if you want the data and metadata stored somewhere other than the default region.
  • You'll need the schema JSON you worked out in Step 1 so that you can nest it in the "schema" parameter.

createDataset request body

{
  "cloudPlatform": "gcp",
  "name": "dataset_name",
  "region": "us-central1",
  "description": "string",
  "defaultProfileId": "/* the profile ID you generated when you created your billing profile */",
  "schema": { /* a schema model, such as the schema shown in this article */ }
}

Tracking your Dataset creation and retrieving its information

Successfully submitting your request to create the dataset is also called successfully submitting a "job".

Successful submissions: What to expect

You'll see a response code below the "Execute" button (successful response codes are 200-202), and the response body will contain an "id" field. This is the job's ID, and you can use it to track the completion of this API request. The same is true for many other types of tasks done via the API - they launch jobs, and those jobs have their own job IDs. The progress of any such job can be tracked using the retrieveJob API endpoint in the Jobs section of the Swagger page.
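For reference, the response body of a successful submission looks something like the sketch below. The exact set of fields may vary; treat everything other than the "id" field as illustrative here:

Example response body (abbreviated - values are illustrative)

{
  "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "job_status": "running"
}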


Once the job has finished running, you can use the retrieveJobResult endpoint in the repository section to retrieve the job’s information. If the job failed, the returned result will describe the errors that caused the failure. If the job succeeded, the result will describe the new TDR dataset. The “id” field of this result is the UUID of the dataset, and it is a required parameter in all future API calls affecting the new dataset.
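If the dataset was created successfully, the result includes the dataset's UUID in its "id" field, along the lines of this abbreviated sketch (fields other than "id" are illustrative):

Example result (abbreviated - values are illustrative)

{
  "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "name": "dataset_name",
  "description": "string"
}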

Finding the dataset's unique UUID

You may find it convenient that the UUID, which is unique to any given dataset, can also be found in the URL bar when you're viewing the dataset through the Data Repo UI at data.terra.bio.


How to retrieve the schema of an existing dataset

You can view the JSON for the schema of any dataset to which you have access using the retrieveDataset API endpoint. If you select "SCHEMA" in the include menu, the response body will contain only the schema for the dataset.


How to update a dataset's schema

You can update the schema of an existing dataset using the updateSchema API endpoint. The endpoint currently supports adding tables, adding non-required columns to an existing table, and adding relationships to an existing dataset.

Currently, you cannot delete or rename a table; you can only add.

The endpoint requires a description of the change and a list of changes to apply.

updateSchema request body (add new table and column)

{  
  "description": "Adding a table and column",
  "changes": {
    "addTables": [...],
    "addColumns": [...]
  }
}

Add new tables

The following is an example API request payload to add a new table. Note the items in "addTables" follow the same format as the "tables" in the dataset schema definition.

updateSchema request body (add new table)

{
  "description": "Adding a table",
  "changes": {
    "addTables": [
      {
        "name": "project",
        "columns": [
          {
            "name": "id",
            "datatype": "string",
            "required": true
          },
          {
            "name": "collaborators",
            "datatype": "string",
            "array_of": true
          }
        ],
       "primaryKey": ["id"]
      }
    ]
  }
}

Adding columns to existing tables

The following is an example API request payload to add new columns to existing tables. Note that the new columns cannot be set to required. Multiple tables can be updated in the same request:

updateSchema request body (add columns to an existing table)

{
  "description": "Adding columns to existing tables",
  "changes": {
    "addColumns": [
      {
        "tableName": "bam_file",
        "columns": [
          {
            "name": "size",
            "datatype": "integer"
          }
        ]
      },
      {
        "tableName": "participant",
        "columns": [
          {
            "name": "age",
            "datatype": "integer"
          },
          {
            "name": "weight",
            "datatype": "integer"
          }
        ]
      }
    ]
  }
}

Adding relationships to an existing dataset

The following is an example API request payload to add relationships between existing tables in a dataset. Multiple relationships can be added in the same request.

updateSchema request body (add relationships to an existing dataset)

{
  "description": "Adding relationships to existing tables",
  "changes": {
    "addRelationships": [
      {
        "name": "string",
        "from": {
          "table": "string",
          "column": "string"
        },
        "to": {
          "table": "string",
          "column": "string"
        }
      }
    ]
  }
}
