Setting schema for a dataset


This article includes instructions for using API commands through the Swagger UI. Whenever you open a new window with the Swagger UI, you must first authenticate: click “Authorize” near the top of the page, check all of the boxes in the pop-up, hit “Authorize” again, and then input the appropriate credentials. Make sure you close the subsequent pop-up without clicking the “Sign Out” button. You should then be able to execute the commands below by clicking the “Try it out” button next to the command of your choice. For a more detailed description of this authentication step, see the article on Authenticating in Swagger.



 

1. Background

Every dataset in the Terra Data Repo must declare a schema for its tabular data. Schemas consist of two main components:

  1. Tables: Names and types for BigQuery tables and columns
  2. Relationships: Links between columns

Creating a dataset requires writing out a JSON object that describes the schema for that dataset. This is the part of the createDataset request that describes what tables will be present in the dataset, what data categories (i.e., what columns) will be present in each table, and any relationships between those columns.

This means inputting a nested JSON object into the parameters field of the createDataset Swagger API endpoint, with your schema JSON object nested in the “schema” section of the request body, as shown in the article on creating datasets (a sketch of this nesting follows below). Getting all of the brackets just right in a nested JSON like this can be tricky; a single misplaced comma or bracket will invalidate the request. You might find that a free online JSON validator makes the task simpler.
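For orientation, here is a minimal sketch of how the schema nests inside the createDataset request body. Only the “schema” key is the subject of this article; the surrounding fields and their values shown here are illustrative placeholders, so copy the exact template from the Swagger UI request body when building your own request:

{
  "name": "my_dataset",
  "description": "An example dataset",
  "defaultProfileId": "<your-billing-profile-id>",
  "schema": {
    "tables": [ ... ],
    "relationships": [ ... ]
  }
}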

 

1.1. Tables

Tables are declared as JSON objects (a minimal example follows this list), with keys for:

  • A name with a max length of 63 characters (matching the regex '^[a-zA-Z0-9][_a-zA-Z0-9]*$')
  • An optional partitioning “mode”, with corresponding settings
  • A list of columns
  • An optional list of primary-key column names
    • If a column’s name is included in the “primaryKey” list of its parent table, it will be mapped to a BigQuery column with a REQUIRED mode. Otherwise, the column will be mapped to a NULLABLE mode.
    • Important note: Setting column names in “primaryKey” will not work if the column’s “array_of” is set to true, since primary keys can’t be arrays.
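Putting these keys together, a minimal table declaration might look like the following sketch (the table and column names are hypothetical):

{
  "name": "sample",
  "columns": [
    {
      "name": "sample_id",
      "datatype": "string",
      "array_of": false
    }
  ],
  "primaryKey": ["sample_id"],
  "partitionMode": "none"
}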

The name chosen for a table will pass through directly to be the name of the corresponding BigQuery table. Tables can be partitioned in BigQuery to reduce query runtime and cost. The top-level “partitionMode” key in each table model specifies one of:

  • “none”, to create an unpartitioned table
  • “date”, to create a table partitioned by either ingest date or the values in a date column
  • “int”, to create a table partitioned by the values in an integer column

BigQuery best practices suggest always partitioning your tables, using the ingest date if there is no other meaningful choice among the table’s columns. These options map directly to the corresponding BigQuery partition settings; see the BigQuery documentation for more details. Both partitioned modes are sketched after the list below.

  • If the “date” mode is specified, “datePartitionOptions” must also be specified. The options are a JSON object with a single key for “column”. The value of that key must be either “datarepo_ingest_date” (to partition by ingest date), or the name of a column in the table with datatype DATE or TIMESTAMP.
  • If the “int” mode is specified, “intPartitionOptions” must also be specified. The options are a JSON object with four keys:
    • “column”: The name of an INT64 or INTEGER column in the table
    • “min”: The smallest value in the column that should be partitioned
    • “max”: The largest value in the column that should be partitioned
    • “interval”: The range-size to use when dividing values between “min” and “max” into partitions
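As a sketch, here is how each partitioned mode might look inside a table declaration. These keys sit alongside “name” and “columns” in the table object; the “sample_count” column and its bounds are hypothetical:

{
  "partitionMode": "date",
  "datePartitionOptions": { "column": "datarepo_ingest_date" }
}

{
  "partitionMode": "int",
  "intPartitionOptions": {
    "column": "sample_count",
    "min": 0,
    "max": 10000,
    "interval": 100
  }
}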

 

1.2. Columns

Table columns are also represented as JSON objects, with keys for:

  • A name, matching the same restrictions as table names
  • A data-type
  • An “array_of” boolean

Any column with a true value for “array_of” will be mapped to a BigQuery column with a REPEATED mode.
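For example, a hypothetical column holding multiple string values per row would be declared like this, and would become a REPEATED STRING column in BigQuery:

{
  "name": "aliases",
  "datatype": "string",
  "array_of": true
}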


 

1.3. Data Types

Data types are set on a per-column basis, within each column block:

"columns": [{
"name": "sample_id",
"datatype": "string",
"array_of": false
},
{
"name": "BAM_File_Path",
"datatype": "fileref",
"array_of": false
}
]

One of the most important data types to know is fileref. This is the data type you must set in the schema if you want that column's cells to be rendered as links to Cloud file paths when you export the snapshot to the data tab of your Terra workspace.

Most TDR types “pass through” to BigQuery types of the same name. A few extra types are supported by the TDR, either as a convenience or to add more semantic information to the table metadata.

 

TDR Datatype    BigQuery Type    Example / Notes
BOOLEAN         BOOLEAN          TRUE and FALSE
BYTES           BYTES            Variable-length binary data
DATE            DATE             'YYYY-[M]M-[D]D' (4-digit year, 1- or 2-digit month and day)
DATETIME        DATETIME         YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.F]]
TIME            TIME             [H]H:[M]M:[S]S[.DDDDDD|.F]
TIMESTAMP       TIMESTAMP        YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.F]][time zone]
FLOAT           FLOAT
FLOAT64         FLOAT
INTEGER         INTEGER
INT64           INTEGER
NUMERIC         NUMERIC
STRING          STRING
TEXT            STRING
FILEREF         STRING
DIRREF          STRING

See the BigQuery documentation for full format details on the date and time types.

 

 

2. Example

The JSON object below would generate a dataset with two tables, each having two columns. The tables are defined in the “tables” section of the JSON, with each individual table described by a set of columns, and the relationship between the matching columns is set in the “relationships” section.

 


{
  "schema": {
    "tables": [
      {
        "name": "bam_file",
        "columns": [
          {
            "name": "sample_id",
            "datatype": "string",
            "array_of": false
          },
          {
            "name": "BAM_File_Path",
            "datatype": "fileref",
            "array_of": false
          }
        ],
        "primaryKey": [],
        "partitionMode": "none",
        "datePartitionOptions": null,
        "intPartitionOptions": null,
        "rowCount": null
      },
      {
        "name": "participant",
        "columns": [
          {
            "name": "sample_id",
            "datatype": "string",
            "array_of": false
          },
          {
            "name": "phenotype",
            "datatype": "string",
            "array_of": false
          }
        ],
        "primaryKey": [],
        "partitionMode": "none",
        "datePartitionOptions": null,
        "intPartitionOptions": null,
        "rowCount": null
      }
    ],
    "relationships": [
      {
        "name": "samples",
        "from": {
          "table": "participant",
          "column": "sample_id"
        },
        "to": {
          "table": "bam_file",
          "column": "sample_id"
        }
      }
    ]
  }
}

 

This JSON object can be pasted into the "schema" parameter field of the JSON used for dataset creation. Again, a free online JSON validator can be quite helpful when writing out the full JSON. If you're struggling to create a valid JSON, it may help to copy the example code in the Swagger UI request body for that particular API and make changes to the template incrementally, validating each change.

 

Note: certain parameters, such as "tables", "relationships", and "assets", are expected to be lists, so make sure you include square brackets: [ ]

 

3. Retrieving the schema of an existing dataset

You can view the JSON schema of any dataset to which you have access by using the retrieveDataset API endpoint. If you select "SCHEMA" in the include menu, the response body will contain only the schema for the dataset.
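As a rough sketch, for a dataset shaped like the example above, the response body with "SCHEMA" selected would look something like this (the exact top-level fields may vary; the key point is that the "schema" object echoes back the same structure you supplied at creation):

{
  "id": "<dataset-uuid>",
  "name": "my_dataset",
  "schema": {
    "tables": [ ... ],
    "relationships": [ ... ]
  }
}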

 


 

4. Updating the schema

You can update the schema of an existing dataset using the updateSchema API endpoint. The endpoint currently supports adding tables and adding non-required columns to an existing table. 

 

The endpoint requires a description of the change being made along with a list of changes to apply. The changes can include adding new tables, new columns, or both:

{  
  "description": "Adding a table and column",
  "changes": {
    "addTables": [...],
    "addColumns": [...]
  }
}

 

Adding new tables

The following is an example API request payload to add a new table. Note that the items in "addTables" follow the same format as the "tables" in the dataset schema definition:
{
  "description": "Adding a table",
  "changes": {
    "addTables": [
      {
        "name": "project",
        "columns": [
          {
            "name": "id",
            "datatype": "string",
            "required": true
          },
          {
            "name": "collaborators",
            "datatype": "string",
            "array_of": true
          }
        ],
       "primaryKey": ["id"]
      }
    ]
  }
}

Adding columns to existing tables

The following is an example API request payload to add new columns to existing tables. Note that the new columns cannot be set to required. Multiple tables can be updated in the same request:

{
  "description": "Adding columns to existing tables",
  "changes": {
    "addColumns": [
      {
        "tableName": "bam_file",
        "columns": [
          {
            "name": "size",
            "datatype": "integer"
          }
        ]
      },
      {
        "tableName": "participant",
        "columns": [
          {
            "name": "age",
            "datatype": "integer"
          },
          {
            "name": "weight",
            "datatype": "integer"
          }
        ]
      }
    ]
  }
}

 
