Before you can ingest data into the Terra Data Repository (TDR), you'll need to create a dataset to hold it. Learn how in the step-by-step instructions below.
If you prefer to use Swagger, see How to create a TDR dataset with APIs. This might be a good option if you are comfortable with APIs and complex JSON.
Step-by-step instructions
Start by logging into data.terra.bio and clicking the Create Dataset button.
Step 1. Submit dataset information
In the intro form, complete the required fields and drop-downs for the dataset.
Required fields
- Name (note that the name can only include letters, numbers, and underscores)
- Cloud Platform (choose Google Cloud Platform if your data are stored on Google Cloud and you're using a Google-backed TDR billing profile, or Microsoft Azure if your data are stored on Azure and you're using an Azure-backed TDR billing profile)
- Billing Profile (this is the name of your TDR billing profile, which you can create by following the instructions in How to create a TDR Billing Profile (GCP) or How to create a TDR Billing Profile (Azure))
- Region (the location of the storage bucket/container where your TDR dataset will be stored)
You can also add a description, designate stewards and custodians, and choose secure monitoring. Secure monitoring logs all requests to access your data and saves these logs to the same cloud location where your data are staged (e.g., an Azure storage container).
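These fields correspond to the request body that the TDR APIs accept for dataset creation. The fragment below is an illustrative sketch, not a definitive request: the field names are based on the TDR dataset-creation request model and may not match the current API exactly, and the name, profile ID, and region values are placeholders.

```json
{
  "name": "my_study_dataset",
  "description": "Example dataset for the ingest tutorial",
  "cloudPlatform": "gcp",
  "region": "us-central1",
  "defaultProfileId": "00000000-0000-0000-0000-000000000000",
  "enableSecureMonitoring": false
}
```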
Step 2. Build schema in TDR
The schema sets up your dataset's tables, including their columns, data types, and relationships between tables. Instructions on the TDR website walk you through creating the schema, table-by-table.
2.1. Use the blue button on the left to create a table. Repeat for each table in your schema.
2.2. Use the second blue button to add columns to each table. You will select the column name and datatype. Repeat for each attribute (column) in each table.
Data types (TDR, BigQuery, Azure Synapse)
When creating a dataset in TDR, you will need to supply the data type for each column. The table below will help guide your choices.
Most TDR types pass through to BigQuery types of the same name. A few additional types are supported by TDR, either as a convenience or to add more semantic information to the table metadata.
| TDR Datatype | BigQuery Type | Synapse Type | Examples / Warnings |
| --- | --- | --- | --- |
| BOOLEAN | BOOLEAN | BIT | `TRUE` and `FALSE` |
| BYTES | BYTES | VARBINARY | Variable-length binary data |
| DATE | DATE | DATE | `'YYYY-[M]M-[D]D'` (4-digit year, 1- or 2-digit month, and 1- or 2-digit day) |
| DATETIME | DATETIME | DATETIME2 | `YYYY-[M]M-[D]D[( \|T)[H]H:[M]M:[S]S[.F]]`. Note: Datetime and Time values are timezone-naive; BigQuery stores and returns them in the format provided. |
| TIME | TIME | TIME | `[H]H:[M]M:[S]S[.DDDDDD\|.F]` |
| TIMESTAMP | TIMESTAMP | DATETIME2 | `YYYY-[M]M-[D]D[( \|T)[H]H:[M]M:[S]S[.F]][time zone]`. Note: TDR currently accepts timestamps only in UTC. BigQuery stores this value as a long; the TDR UI converts it to a UTC timestamp, but the data endpoints return the raw long, so if you are using the endpoints directly you will have to perform this conversion yourself. |
| FLOAT | FLOAT | FLOAT | FLOAT and FLOAT64 point to the same underlying data type, so they are equivalent. |
| FLOAT64 | FLOAT | FLOAT | |
| INTEGER | INTEGER | INT | |
| INT64 | INTEGER | BIGINT | |
| NUMERIC | NUMERIC | REAL | For very large float data, or for data on which calculations will be performed. |
| STRING | STRING | varchar(8000) | |
| TEXT | STRING | varchar(8000) | |
| FILEREF | STRING | varchar(36) | Stores UUIDs that map to an ingested file; translated to DRS URLs on snapshot creation. |
| DIRREF | STRING | varchar(36) | |
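The note above about timestamps means that reading through the data endpoints directly returns a long rather than a formatted timestamp. Assuming that long is microseconds since the Unix epoch (BigQuery's internal TIMESTAMP representation; verify against your own data), the conversion is a one-liner:

```python
from datetime import datetime, timedelta, timezone

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def long_to_utc_timestamp(micros: int) -> datetime:
    """Convert a raw long from the data endpoint to a UTC datetime.

    Assumes the value is microseconds since the Unix epoch, matching
    BigQuery's internal TIMESTAMP representation -- verify this against
    your own data before relying on it.
    """
    return _EPOCH + timedelta(microseconds=micros)

print(long_to_utc_timestamp(1_700_000_000_000_000).isoformat())
# 2023-11-14T22:13:20+00:00
```

Using `timedelta` rather than float seconds avoids losing sub-second precision for large microsecond values.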
2.3. Scroll down to the JSON view. It may be useful to copy the entire content (screenshot below) somewhere safe, for record-keeping.
You can also use this JSON as-is to create your dataset using APIs. See How to create a TDR dataset with APIs.
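The JSON view produces a schema of roughly the shape sketched below. This is an illustrative example only: the table and column names are invented, and the exact property names (`datatype`, `primaryKey`, `relationships`) follow the TDR schema model as described here and may differ in the current API.

```json
{
  "tables": [
    {
      "name": "sample",
      "columns": [
        { "name": "sample_id", "datatype": "string" },
        { "name": "participant_id", "datatype": "string" },
        { "name": "bam_file", "datatype": "fileref" }
      ],
      "primaryKey": ["sample_id"]
    },
    {
      "name": "participant",
      "columns": [
        { "name": "participant_id", "datatype": "string" },
        { "name": "age", "datatype": "integer" }
      ],
      "primaryKey": ["participant_id"]
    }
  ],
  "relationships": [
    {
      "name": "sample_to_participant",
      "from": { "table": "sample", "column": "participant_id" },
      "to": { "table": "participant", "column": "participant_id" }
    }
  ]
}
```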
2.4. Click the blue Submit button to generate your dataset.
What to expect
You will see an error message if your dataset creation fails for any reason. The most common cause of failure is an invalid attribute name (attribute names can only contain lowercase letters and underscores). If your dataset is created successfully, you can move on to the next step: ingestion!
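A quick way to avoid this failure is to check names before submitting. The helper below is purely illustrative (it is not part of TDR) and encodes the rules stated above: dataset names may contain letters, numbers, and underscores, while attribute names may contain only lowercase letters and underscores.

```python
import re

# Illustrative pre-flight checks based on the naming rules described above;
# TDR itself enforces these rules server-side.
DATASET_NAME_RE = re.compile(r"^[A-Za-z0-9_]+$")
ATTRIBUTE_NAME_RE = re.compile(r"^[a-z_]+$")

def valid_dataset_name(name: str) -> bool:
    # Letters, numbers, and underscores only.
    return bool(DATASET_NAME_RE.fullmatch(name))

def valid_attribute_name(name: str) -> bool:
    # Lowercase letters and underscores only.
    return bool(ATTRIBUTE_NAME_RE.fullmatch(name))

print(valid_attribute_name("sample_id"))  # True
print(valid_attribute_name("sampleID"))   # False: uppercase not allowed
```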