How to create a dataset on the TDR website

Leyla Tarhan

If you're interested in using Terra on Azure, please email terra-enterprise@broadinstitute.org.

Before you can ingest data into the Terra Data Repository (TDR), you’ll need to create a dataset to hold that data. The step-by-step instructions below show you how.

If you prefer to use Swagger, see How to create a TDR dataset with APIs. This might be a good option if you are comfortable with APIs and complex JSON.

Step-by-step instructions

Start by logging in to data.terra.bio and clicking the Create Dataset button.

Screenshot showing the Create Dataset button on the Terra Data Repository homepage

Step 1. Submit dataset information

In the intro form, complete the required fields and drop-downs for the dataset.

Required fields

You can also add a description, designate stewards and custodians, and choose secure monitoring. Secure monitoring logs all requests to access your data and saves these logs to the same location where your data are staged in the cloud (e.g., an Azure storage container).

Step 2. Build schema in TDR

The schema sets up your dataset's tables, including their columns, data types, and relationships between tables. Instructions on the TDR website walk you through creating the schema, table-by-table.

2.1. Use the blue button on the left to create a table. Repeat for each table in your schema. 

Screenshot showing the form used to create a TDR dataset schema on the TDR website.

2.2. Use the second blue button to add columns to each table. For each column, specify a name and a data type. Repeat for each attribute (column) in each table.

Data types (TDR, BigQuery, Azure Synapse)

When creating a dataset in TDR, you will need to supply the data type for each column. The table below will help guide your choices.

  • Most TDR types pass through to BigQuery types of the same name. TDR supports a few extra types, either as a convenience or to add more semantic information to the table metadata.

    TDR datatype  BigQuery type  Synapse type   Examples / warnings
    ------------  -------------  -------------  -------------------
    BOOLEAN       BOOLEAN        BIT            TRUE and FALSE
    BYTES         BYTES          VARBINARY      Variable-length binary data
    DATE          DATE           DATE           'YYYY-[M]M-[D]D' (4-digit year, 1- or 2-digit month, and 1- or 2-digit day)
    DATETIME      DATETIME       DATETIME2      YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.F]] Note: DATETIME and TIME values carry no time zone. BigQuery stores and returns them in the format provided.
    TIME          TIME           TIME           [H]H:[M]M:[S]S[.DDDDDD|.F]
    TIMESTAMP     TIMESTAMP      DATETIME2      Format: YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.F]][time zone] Note: TDR currently only accepts timestamps in UTC. BigQuery stores this value as a long. The UI converts it to a UTC timestamp, but the result from the previous data endpoint is a long value. If you use the endpoint directly, you will have to perform this conversion yourself to get a readable value.
    FLOAT         FLOAT          FLOAT          FLOAT and FLOAT64 point to the same underlying data type, so they are equivalent.
    FLOAT64       FLOAT          FLOAT
    INTEGER       INTEGER        INT
    INT64         INTEGER        BIGINT
    NUMERIC       NUMERIC        REAL           For very large float data or for data where calculations will be performed on the data.
    STRING        STRING         varchar(8000)
    TEXT          STRING         varchar(8000)
    FILEREF       STRING         varchar(36)    Stores UUIDs that map to an ingested file. These are translated to DRS URLs when a snapshot is created.
    DIRREF        STRING         varchar(36)
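If you plan your schema in a script before entering it on the website, the table above can be kept as a small lookup. The following is a minimal Python sketch, not part of TDR itself: the names TDR_TYPE_MAP and backend_types are hypothetical, and the case-insensitive matching is a convenience added here rather than documented TDR behavior.

```python
from typing import Tuple

# Transcribed from the data-type table above:
# TDR datatype -> (BigQuery type, Synapse type).
TDR_TYPE_MAP = {
    "BOOLEAN": ("BOOLEAN", "BIT"),
    "BYTES": ("BYTES", "VARBINARY"),
    "DATE": ("DATE", "DATE"),
    "DATETIME": ("DATETIME", "DATETIME2"),
    "TIME": ("TIME", "TIME"),
    "TIMESTAMP": ("TIMESTAMP", "DATETIME2"),
    "FLOAT": ("FLOAT", "FLOAT"),
    "FLOAT64": ("FLOAT", "FLOAT"),
    "INTEGER": ("INTEGER", "INT"),
    "INT64": ("INTEGER", "BIGINT"),
    "NUMERIC": ("NUMERIC", "REAL"),
    "STRING": ("STRING", "varchar(8000)"),
    "TEXT": ("STRING", "varchar(8000)"),
    "FILEREF": ("STRING", "varchar(36)"),
    "DIRREF": ("STRING", "varchar(36)"),
}


def backend_types(tdr_type: str) -> Tuple[str, str]:
    """Return (BigQuery type, Synapse type) for a TDR datatype name.

    Hypothetical helper: normalizes case and whitespace for convenience,
    and raises ValueError for names that are not in the table.
    """
    key = tdr_type.strip().upper()
    if key not in TDR_TYPE_MAP:
        raise ValueError(f"Not a TDR datatype listed in the table: {tdr_type!r}")
    return TDR_TYPE_MAP[key]
```

For example, backend_types("fileref") returns ("STRING", "varchar(36)"), which is a reminder that file references are stored as plain strings in both backends.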

2.3. Scroll down to the JSON view. It may be useful to copy the entire content (screenshot below) somewhere safe, for record-keeping.

You can also use this JSON as-is to create your dataset using APIs. See How to create a TDR dataset with APIs.

Screenshot of the schema JSON for an example TDR dataset on the TDR website.
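For record-keeping, it can help to save the copied JSON to a file and print a quick summary of its tables and columns. The Python sketch below assumes a schema shaped like the hypothetical example_schema shown in the code; the key names ("tables", "columns", "datatype", "relationships") and the table and column names are illustrative assumptions, so treat the JSON view on the TDR website as the authoritative shape.

```python
import json

# Hypothetical schema for illustration only. The key names are assumptions;
# compare them against the JSON view of your own dataset.
example_schema = {
    "tables": [
        {
            "name": "sample",
            "columns": [
                {"name": "sample_id", "datatype": "STRING"},
                {"name": "collected_on", "datatype": "DATE"},
            ],
        },
        {
            "name": "sequencing_file",
            "columns": [
                {"name": "file_ref", "datatype": "FILEREF"},
                {"name": "sample_id", "datatype": "STRING"},
            ],
        },
    ],
    "relationships": [
        {
            "name": "file_to_sample",
            "from": {"table": "sequencing_file", "column": "sample_id"},
            "to": {"table": "sample", "column": "sample_id"},
        }
    ],
}


def summarize_schema(schema: dict) -> dict:
    """Return {table name: [(column name, datatype), ...]} for a quick review."""
    return {
        table["name"]: [(col["name"], col["datatype"]) for col in table.get("columns", [])]
        for table in schema.get("tables", [])
    }


def save_schema(schema: dict, path: str) -> None:
    """Write the schema to disk as indented JSON for record-keeping."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(schema, handle, indent=2)


if __name__ == "__main__":
    for table_name, columns in summarize_schema(example_schema).items():
        print(table_name, columns)
```

To use it with your own dataset, paste the copied JSON into a file, load it with json.load, and pass the result to summarize_schema instead of example_schema.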

2.4. Click the blue Submit button to generate your dataset. 

What to expect

You will get a notification if your dataset creation fails for any reason. The most common cause of failure is an incorrect attribute name (attributes can only contain lowercase letters and underscores). If your dataset is created successfully, you can move on to the next step: ingestion.
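Because incorrect attribute names are the most common cause of failure, you may want to check your planned names before submitting. The Python sketch below applies the rule exactly as worded above (lowercase letters and underscores only). The helper name invalid_attribute_names is hypothetical, and the service's own validation may differ in detail, so treat this as a pre-check rather than a guarantee.

```python
import re

# Rule as stated in this article: only lowercase letters and underscores.
# This is an assumption about the exact pattern, not TDR's actual validator.
_VALID_NAME = re.compile(r"[a-z_]+")


def invalid_attribute_names(names):
    """Return the names that break the lowercase-and-underscore rule, in input order."""
    return [name for name in names if not _VALID_NAME.fullmatch(name)]


if __name__ == "__main__":
    planned = ["sample_id", "SampleID", "read count", "collected_on"]
    print(invalid_attribute_names(planned))
```

An empty list means every planned name passes this check.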
