Before you can ingest data into the data repo, you’ll need to create a Dataset into which you’ll ingest the data. Learn how in the step-by-step instructions below.
If you prefer to use Swagger, see (option 2) Create a TDR dataset with APIs. This might be a good option if you are comfortable with APIs and complex JSON files.
Step-by-step instructions
Start by logging into data.terra.bio and clicking the Create Dataset button.
Step 1. Submit dataset information
In the intro form, complete the required fields and drop-downs for the dataset.
Required fields
- Name (note that the name can only include letters, numbers, and underscores)
- Cloud Platform
- Billing Profile (this is the name of the billing profile you created in the previous step)
- Region (note that Terra's default region is us-central1)
You can also add a description, designate stewards and custodians, and choose secure monitoring.
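Behind the scenes, the information from this form becomes part of the dataset-creation request that TDR builds for you. As a rough, non-authoritative sketch (the key names below are assumptions; the authoritative structure is whatever the JSON view in step 2.3 and the TDR Swagger documentation show), those fields might look something like this:

```python
# Rough sketch of the Step 1 form fields as request-style key/value pairs.
# All key names and values are illustrative assumptions -- confirm the exact
# structure in the JSON view (step 2.3) or the TDR Swagger documentation.
dataset_info = {
    "name": "my_example_dataset",       # letters, numbers, and underscores only
    "description": "Optional free-text description of the dataset",
    "defaultProfileId": "00000000-0000-0000-0000-000000000000",  # billing profile ID
    "cloudPlatform": "gcp",             # cloud platform selected in the form
    "region": "us-central1",            # Terra's default region
}
```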
Step 2. Build schema in TDR
Instructions in the browser walk through creating the dataset schema, table-by-table.
2.1. Use the blue button on the left to create a table. Repeat for each table in your schema.
2.2. Use the second blue button to add columns to each table. You will select the column name and datatype. Repeat for each attribute (column) in each table.
Data types (TDR, BigQuery, Azure Synapse)
When creating a dataset in TDR, you will need to supply the data type for each column. The table below will help guide your choices.
Most TDR types “pass through” to BigQuery types of the same name. A few extra types are supported by TDR, either as a convenience or to add more semantic information to the table metadata.
| TDR Datatype | BigQuery Type | Synapse Type | Examples/Warnings |
| --- | --- | --- | --- |
| BOOLEAN | BOOLEAN | BIT | TRUE and FALSE |
| BYTES | BYTES | VARBINARY | Variable-length binary data |
| DATE | DATE | DATE | 'YYYY-[M]M-[D]D' (4-digit year, 1- or 2-digit month, and 1- or 2-digit day) |
| DATETIME | DATETIME | DATETIME2 | YYYY-[M]M-[D]D[( \|T)[H]H:[M]M:[S]S[.F]]. Note: Datetime and Time data types do not care about timezone; BQ stores and returns them in the format provided. |
| TIME | TIME | TIME | [H]H:[M]M:[S]S[.DDDDDD\|.F]. Note: TDR currently only accepts timestamps in timezone UTC. BQ stores this value as a long; the UI converts it to a UTC timestamp, but the data endpoint returns the long value, so if you call the endpoint directly you will need to perform this conversion yourself to get an understandable value. |
| TIMESTAMP | TIMESTAMP | DATETIME2 | YYYY-[M]M-[D]D[( \|T)[H]H:[M]M:[S]S[.F]][time zone] |
| FLOAT | FLOAT | FLOAT | Float and Float64 point to the same underlying data type, so they are equivalent. |
| FLOAT64 | FLOAT | FLOAT | |
| INTEGER | INTEGER | INT | |
| INT64 | INTEGER | BIGINT | |
| NUMERIC | NUMERIC | REAL | For very large float data, or for data on which calculations will be performed. |
| STRING | STRING | varchar(8000) | |
| TEXT | STRING | varchar(8000) | |
| FILEREF | STRING | varchar(36) | Stores UUIDs that map to an ingested file; these are translated to DRS URLs on snapshot creation. |
| DIRREF | STRING | varchar(36) | |
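To make the datatype choices concrete, here is a hypothetical single-table fragment written in Python and printed as JSON. The table name, column names, and key names are all illustrative assumptions; the authoritative structure is whatever the schema builder shows in its JSON view (step 2.3 below):

```python
import json

# Hypothetical schema fragment for one table, using several of the TDR
# datatypes from the table above. Names and structure are illustrative
# assumptions -- compare against the JSON view in step 2.3.
example_table = {
    "name": "sample",
    "columns": [
        {"name": "sample_id", "datatype": "string"},
        {"name": "collection_date", "datatype": "date"},
        {"name": "read_count", "datatype": "int64"},
        {"name": "quality_score", "datatype": "float"},
        {"name": "is_tumor", "datatype": "boolean"},
        {"name": "bam_file", "datatype": "fileref"},  # UUID of an ingested file
    ],
}

print(json.dumps({"tables": [example_table]}, indent=2))
```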
2.3. Scroll down to the JSON view. It may be useful to copy the entire content somewhere safe for record-keeping.
You can also use this JSON as-is to create your dataset using APIs. See Create a TDR dataset (Swagger/API option).
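If you'd like a machine-readable backup as well, a small script along these lines (a sketch; the placeholder string must be replaced with the JSON you copied from the builder) validates the copied JSON and saves a formatted copy for record-keeping or later use with the API option:

```python
import json

# Paste the JSON copied from the schema builder (step 2.3) in place of this
# placeholder, then run the script to validate it and keep a formatted copy
# on disk for record-keeping or later use with the Swagger/API option.
schema_json = '{"tables": []}'  # placeholder -- replace with your copied JSON

parsed = json.loads(schema_json)  # fails loudly if the JSON is malformed
with open("my_dataset_schema.json", "w") as f:
    json.dump(parsed, f, indent=2)

print(f"Saved schema with {len(parsed.get('tables', []))} table(s).")
```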
2.4. Click the blue Submit button to generate your dataset.
What to expect
You will get a notification if your dataset creation fails for any reason. The most common cause of failure is an invalid attribute name (attribute names can only contain lowercase letters and underscores; see the quick check below). If your dataset is created successfully, you can move on to the next step: ingestion!
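Since attribute naming is the most frequent stumbling block, you may want to sanity-check your column names before submitting. The snippet below is a minimal sketch based only on the rule stated above (lowercase letters and underscores); it is not an official TDR validator.

```python
import re

# Minimal sanity check based on the naming rule stated above: attribute
# (column) names may contain only lowercase letters and underscores.
# This is a convenience sketch, not an official TDR validator.
ATTRIBUTE_NAME = re.compile(r"[a-z_]+")

def invalid_attribute_names(names):
    """Return the names that would break the stated attribute-name rule."""
    return [n for n in names if not ATTRIBUTE_NAME.fullmatch(n)]

print(invalid_attribute_names(["sample_id", "collection_date", "Read-Count"]))
# ['Read-Count'] -- rename it (for example, to read_count) before submitting
```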