How to create a dataset on the TDR website

Anton Kovalsky

Before you can ingest data into the Terra Data Repository (TDR), you’ll need to create a dataset to hold it. The step-by-step instructions below show how.

If you prefer to use Swagger, see How to create a TDR dataset with APIs. This might be a good option if you are comfortable with APIs and complex JSONs. 

Step-by-step instructions

Start by logging into data.terra.bio and clicking the Create Dataset button. 

TDR_Create-Dataset-button-in-UI_Screenshot.png

Step 1. Submit dataset information

In the intro form, complete the required fields and drop-downs for the dataset.

Required fields

  • Name (note that the name can only include letters, numbers, and underscores)
  • Cloud Platform (if you're using a Google-backed TDR billing profile, choose Google Cloud Platform. If you're using an Azure-backed TDR billing profile, choose Microsoft Azure).
  • Billing Profile (this is the name of the billing profile you created in the previous step)
  • Region (note that Terra's default region is us-central1)
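The naming rule in the list above can be checked before you fill in the form. Here is a minimal sketch in Python, assuming the rule is exactly "letters, numbers, and underscores" as stated. The function name `is_valid_dataset_name` is hypothetical, and TDR may enforce further limits (such as length) that this check does not cover.

```python
import re

# Assumption: the only rule checked here is the one stated in this article
# (letters, numbers, and underscores). TDR may apply additional checks.
_DATASET_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if name uses only letters, numbers, and underscores."""
    return _DATASET_NAME.fullmatch(name) is not None


print(is_valid_dataset_name("my_dataset_01"))  # underscores and digits are allowed
print(is_valid_dataset_name("my-dataset"))     # hyphens are not in the allowed set
```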

You can also add a description, designate stewards and custodians, and enable secure monitoring. Secure monitoring logs all requests to access your data and saves these logs to the same cloud location where your data are staged (e.g., a Google bucket).

Step 2. Build schema in TDR

Instructions in the browser walk through creating the dataset schema, table-by-table.

2.1. Use the blue button on the left to create a table. Repeat for each table in your schema. 

TDR_Create-dataset-schema_Step-2-Create-a-table_Screenshot.png

2.2. Use the second blue button to add columns to each table. You will select the column name and datatype. Repeat for each attribute (column) in each table. 

TDR_Create-dataset-schema_Step-2-Create-a-table_Screenshot.png

Data types (TDR, BigQuery, Azure Synapse)

When creating a dataset in TDR, you will need to supply the data type for each column. The table below will help guide your choices.

  • Most TDR types “pass-through” to BigQuery types of the same name. A few extra types are supported by the TDR, either as a convenience or to add more semantic information to the table metadata.

    TDR Datatype   BigQuery Type   Synapse Type    Examples/Warnings
    BOOLEAN        BOOLEAN         BIT             TRUE and FALSE
    BYTES          BYTES           VARBINARY       Variable length binary data
    DATE           DATE            DATE            'YYYY-[M]M-[D]D' (4-digit year, 1- or 2-digit month, and 1- or 2-digit day)
    DATETIME       DATETIME        DATETIME2       YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.F]]
    TIME           TIME            TIME            [H]H:[M]M:[S]S[.DDDDDD|.F]
    TIMESTAMP      TIMESTAMP       DATETIME2       YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.F]][time zone]
    FLOAT          FLOAT           FLOAT           FLOAT and FLOAT64 point to the same underlying data type, so they are equivalent.
    FLOAT64        FLOAT           FLOAT
    INTEGER        INTEGER         INT
    INT64          INTEGER         BIGINT
    NUMERIC        NUMERIC         REAL            For very large float data, or for data on which calculations will be performed.
    STRING         STRING          varchar(8000)
    TEXT           STRING          varchar(8000)
    FILEREF        STRING          varchar(36)     Stores UUIDs that map to an ingested file. These are translated to DRS URLs when a snapshot is created.
    DIRREF         STRING          varchar(36)

    Note on DATETIME and TIME: These data types ignore time zones. BigQuery stores and returns them in the format provided.

    Note on TIMESTAMP: TDR currently accepts timestamps only in UTC. BigQuery stores this value as a long. The TDR UI converts it to a UTC timestamp, but the previous data endpoint returns the long value. If you call the endpoint directly, you will have to perform this conversion yourself to get a readable value.
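If you read TIMESTAMP values directly from an endpoint and receive a long, the conversion mentioned in the table can be sketched as follows. This is an illustration only: the article does not state the unit of the long value, so the default of microseconds since the Unix epoch is an assumption you should verify against your own data, and the function name `long_to_utc` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Unix epoch as a timezone-aware UTC datetime.
_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def long_to_utc(value: int, unit: str = "us") -> datetime:
    """Convert an epoch offset to a timezone-aware UTC datetime.

    Assumption: the unit of the long value is not given in this article.
    "us" (microseconds) is a guess; pass "ms" or "s" if your values differ.
    """
    if unit == "us":
        return _EPOCH + timedelta(microseconds=value)
    if unit == "ms":
        return _EPOCH + timedelta(milliseconds=value)
    if unit == "s":
        return _EPOCH + timedelta(seconds=value)
    raise ValueError(f"unsupported unit: {unit!r}")


# Example with a made-up value, interpreted as microseconds.
print(long_to_utc(1_700_000_000_000_000).isoformat())
```

A quick sanity check is to convert one value you already know the date for. If the result lands in 1970 or tens of thousands of years in the future, the unit is probably wrong.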

2.3. Scroll down to the JSON view. It may be useful to copy the entire content (screenshot below) somewhere safe, for record-keeping.

You can also use this JSON as-is to create your dataset using APIs. See How to create a TDR dataset with APIs.

TDR_Create-dataset-schema_JSON-view_Screenshot.png
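For record-keeping, one option is to paste the copied JSON into a small script that confirms it parses and then saves it to a file. The sketch below assumes nothing about the fields TDR puts in the JSON; the function name `save_schema_json` and the file path are hypothetical.

```python
import json


def save_schema_json(json_text: str, path: str) -> None:
    """Check that json_text is valid JSON, then save it pretty-printed.

    Raises json.JSONDecodeError if the pasted text is incomplete or malformed,
    which usually means part of the JSON view was not copied.
    """
    parsed = json.loads(json_text)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(parsed, handle, indent=2)
        handle.write("\n")
```

For example, `save_schema_json(copied_text, "my_dataset_schema.json")` would write the file next to the script, where `copied_text` holds the content pasted from the JSON view.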

2.4. Click the blue Submit button to generate your dataset. 

What to expect

You will get a note if your dataset creation fails for any reason. The most common cause of failure is an incorrect attribute name (attributes can only contain lowercase letters and underscores). If your dataset is created successfully, you can move on to the next step: ingestion.
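Because attribute names are the most common cause of failure, it can help to check them before you submit. The sketch below applies the rule exactly as stated above (lowercase letters and underscores only). The input structure, a mapping from table name to column names, is a hypothetical convenience and is not TDR's JSON format; the function name `find_invalid_attribute_names` is also hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Assumption: the rule is taken verbatim from this article
# (lowercase letters and underscores). TDR's actual validation may differ.
_ATTRIBUTE_NAME = re.compile(r"[a-z_]+")


def find_invalid_attribute_names(
    tables: Dict[str, List[str]],
) -> List[Tuple[str, str]]:
    """Return (table, column) pairs whose column name breaks the rule."""
    problems = []
    for table_name, columns in tables.items():
        for column in columns:
            if _ATTRIBUTE_NAME.fullmatch(column) is None:
                problems.append((table_name, column))
    return problems


# Made-up schema for illustration.
planned_schema = {
    "sample": ["sample_id", "Collected-On"],
    "subject": ["subject_id"],
}
for table_name, column in find_invalid_attribute_names(planned_schema):
    print(f"Rename column {column!r} in table {table_name!r}")
```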
