The Terra Data Repository helps you curate large datasets that can be integrated into Cloud analyses on Terra, while managing who can access your data. You can choose between multiple tools to interact with TDR, depending on your needs and background. This article guides you through which tools to use when, and where to find instructions for each step in your TDR journey.
Overview: using TDR
There are four main stages to using TDR:
- Set up a TDR billing profile
- Create a dataset and upload data
- Share the data
- Analyze the data
Different team members might work on each of these stages. For example, an administrator might set up the billing profile, a data manager might create the dataset and upload the data, a project lead might share the data, and an outside researcher might analyze the data on Terra. Because different team members have different needs and expertise, it's useful to have multiple tools available when navigating through these steps.
Tools for using TDR
There are three ways to interact with TDR:
1. TDR's user interface (UI). Logging in to https://data.terra.bio/ brings you to a graphical interface where you can create a dataset, view your data, create snapshots to share data, and more. This makes it easy to examine your data, and it's especially useful for those who don't have a background using API endpoints. However, the UI doesn't support as many functions as the Swagger APIs or Zebrafish.
2. Swagger API endpoints. TDR's full functionality is accessible through its Swagger API endpoints. This includes creating datasets, uploading data, creating and sharing snapshots, creating TDR billing profiles, managing permissions, creating assets, and checking the status of jobs launched from other interfaces. However, Swagger can be difficult to navigate if you're not already familiar with API endpoints.
3. Zebrafish. Zebrafish is a web-based tool that interfaces with the Swagger APIs so that you don't have to. It offers richer functionality than the TDR UI, but less than the Swagger APIs: you can create a dataset, configure file references in your data, upload and modify data, and create snapshots. Zebrafish also handles file references in tabular data better than the Swagger APIs.
Zebrafish is only available for data stored on Google Cloud.
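For readers weighing the Swagger route, the sketch below shows roughly what one direct API call involves: building an authenticated request to check the status of a job. The endpoint path `/api/repository/v1/jobs/{job_id}`, the bearer-token header, and the helper name `build_job_status_request` are assumptions made for illustration; confirm the exact path and authentication scheme on TDR's Swagger page before relying on them.

```python
import urllib.request

TDR_BASE_URL = "https://data.terra.bio"  # TDR host named in this article


def build_job_status_request(job_id: str, access_token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for a TDR job's status.

    Assumption: jobs are exposed at /api/repository/v1/jobs/{id} and
    accept an OAuth bearer token. Verify both in Swagger.
    """
    url = f"{TDR_BASE_URL}/api/repository/v1/jobs/{job_id}"
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/json",
        },
        method="GET",
    )


if __name__ == "__main__":
    # Placeholder values only; nothing is sent over the network here.
    request = build_job_status_request("my-job-id", "my-token")
    print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send the request. The TDR UI and Zebrafish hide this kind of plumbing, which is why they are the gentler starting point if you haven't worked with APIs before.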
Deciding which tool to use
Which of these tools should you use to manage your TDR data? The answer depends on what you’re trying to do in TDR (for example, whether you’re setting up billing or creating a dataset) and on how familiar you are with APIs.
Constraints on your tools
- Note that some tasks can only be done with Swagger APIs, while others can be done in Swagger, the TDR UI, or Zebrafish.
- The tools available to you will also depend on whether your data are stored on the Google Cloud or Azure.
Why does API familiarity matter?
In general, the Swagger APIs allow you to do more things in TDR; however, if you’re not already familiar with APIs, they can be challenging to use. While some functions are only available through the Swagger endpoints, we recommend using the TDR UI or Zebrafish for the functions they support, unless you've worked with APIs before.
The rest of this article breaks down how to choose your tool for each stage of working in TDR. When different tools are available for a step, you'll find guidance on which tool to use, based on your cloud provider (Google or Azure) and your familiarity with APIs.
Step 1. Set up a TDR billing profile
You will use APIs to set up a billing profile to cover the costs of working in TDR. The exact steps depend on whether your data are stored on Azure or Google (GCP). See How to create a TDR Billing Profile (Azure) or How to create a TDR Billing Profile (GCP) for step-by-step instructions. You can also add collaborators to your TDR billing profile.
Step 2. Create a dataset and upload data
2.1. Define a dataset schema
Once you’ve set up billing and are ready to upload data to TDR, the next step is to define your dataset’s schema. The schema sets up the tables that hold your data and metadata, the tables' columns and primary keys, and the relationships between tables. Setting up your schema is crucial for updating the tables later on. Learn more about schemas in Overview: Defining your TDR dataset schema.
- If you’re comfortable working with API endpoints, write your schema JSON, then use Swagger to create a dataset with that schema.
- If you’re not comfortable working with API endpoints, you have two options:
  - Option 1: create the dataset and the schema in the TDR UI, then use Swagger to ingest and update your data (see the next section for details).
  - Option 2: write the schema in JSON, then use Zebrafish to create the dataset and ingest your data in the same step. If you include your dataset's assets in your schema, you won't have to use Swagger to create the assets later on.
Creating a dataset through the UI vs. Zebrafish
The benefit of creating your dataset in the TDR UI is that you can define the schema using a GUI, rather than working in JSON: you’ll type the names of your tables and their columns, and define their types with drop-down menus. The downside is that you can’t upload or edit your data through the TDR UI, so you’ll then have to use the Swagger API endpoints to complete those steps.
In contrast, if you use Zebrafish you’ll have to write your schema in JSON. But once you've done the initial work to create the JSON, you can create your dataset, upload your data, and update your data through the Zebrafish interface.
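As a rough picture of what a schema JSON might contain, the sketch below builds one in Python with two tables, their columns and primary keys, and one relationship between them. The table and column names are invented, and the field names (`datatype`, `array_of`, `primaryKey`, `relationships`, and the `fileref` type) reflect one reading of TDR's schema format; treat them as assumptions and check them against the schema documentation. The helper `check_schema` is hypothetical and only catches simple naming mistakes.

```python
import json

# Hypothetical two-table schema. Field names are assumptions, not a spec.
schema = {
    "tables": [
        {
            "name": "sample",
            "columns": [
                {"name": "sample_id", "datatype": "string", "array_of": False},
                {"name": "participant_id", "datatype": "string", "array_of": False},
                {"name": "cram_file", "datatype": "fileref", "array_of": False},
            ],
            "primaryKey": ["sample_id"],
        },
        {
            "name": "participant",
            "columns": [
                {"name": "participant_id", "datatype": "string", "array_of": False},
                {"name": "age", "datatype": "integer", "array_of": False},
            ],
            "primaryKey": ["participant_id"],
        },
    ],
    "relationships": [
        {
            "name": "sample_to_participant",
            "from": {"table": "sample", "column": "participant_id"},
            "to": {"table": "participant", "column": "participant_id"},
        }
    ],
}


def check_schema(schema: dict) -> list[str]:
    """List primary keys and relationship ends that name a missing column."""
    columns = {t["name"]: {c["name"] for c in t["columns"]} for t in schema["tables"]}
    problems = []
    for table in schema["tables"]:
        for key in table.get("primaryKey", []):
            if key not in columns[table["name"]]:
                problems.append(f"{table['name']}: primary key '{key}' is not a column")
    for rel in schema.get("relationships", []):
        for end in ("from", "to"):
            ref = rel[end]
            if ref["column"] not in columns.get(ref["table"], set()):
                problems.append(f"{rel['name']}: '{end}' points at a missing column")
    return problems


if __name__ == "__main__":
    print(check_schema(schema))
    print(json.dumps(schema, indent=2))
```

Running a check like this before creating the dataset is cheap insurance, since the schema determines how the tables can be updated later on.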
2.2. Ingest and update data
- If you’re comfortable working with APIs, ingest the data to your dataset through the Swagger API endpoints. If your data changes, you can also update it through these endpoints by running a new ingest job or by soft-deleting and re-ingesting your data.
- If you’re not comfortable working with APIs and your data is saved on Azure, you currently must ingest the data to your dataset through the Swagger API endpoints. If your data changes, you can also update it through these endpoints by running a new ingest job or by soft-deleting and re-ingesting your data.
- If you’re not comfortable working with APIs and your data is saved on Google Cloud, ingest your data to your dataset through Zebrafish. This is done in the same step as creating your dataset. If your data changes, you can also update it through Zebrafish.
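The guidance in this section reduces to a small decision rule, sketched here in Python. The function name `recommend_ingest_tool` and the string labels are made up for illustration; the logic only restates the recommendations above.

```python
def recommend_ingest_tool(cloud: str, comfortable_with_apis: bool) -> str:
    """Return the ingest tool this section suggests.

    cloud: "google" or "azure" (case-insensitive).
    comfortable_with_apis: True if you've worked with API endpoints before.
    """
    cloud = cloud.strip().lower()
    if cloud not in {"google", "azure"}:
        raise ValueError(f"unsupported cloud: {cloud!r}")
    if comfortable_with_apis:
        # API-savvy users ingest through Swagger on either cloud.
        return "Swagger"
    # Zebrafish only supports data stored on Google Cloud,
    # so Azure users must currently fall back to Swagger.
    return "Zebrafish" if cloud == "google" else "Swagger"


if __name__ == "__main__":
    for cloud in ("google", "azure"):
        for comfortable in (True, False):
            print(cloud, comfortable, recommend_ingest_tool(cloud, comfortable))
```

Only one combination leads to Zebrafish: data on Google Cloud and no prior API experience.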
Step 3. Share the data
To share TDR data, you'll create a snapshot: a subset of the dataset that you want to share with a particular researcher or group.
Assets are a prerequisite for creating a snapshot. Assets are subsets of the columns in your data tables that you want to include in snapshots. Learn more about assets in How to create dataset assets in TDR.
- If you’re comfortable working with API endpoints, you can add assets to your dataset at any time using the Swagger API endpoints. Then, use Swagger to create snapshots as well.
- If you’re not comfortable working with API endpoints, you can define assets in your dataset’s schema JSON when you create your dataset through Zebrafish. The downside of this method is that you can’t modify your schema after the dataset is created. Alternatively, you can add assets at any time using the Swagger API endpoints.
  You can then create a snapshot through the TDR UI.
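Because an asset is a subset of the columns in your data tables, a common mistake is naming a column the dataset doesn't have. The sketch below shows what an asset definition might look like and checks it against a set of known columns. The field names (`rootTable`, `rootColumn`, `tables`, `columns`), the example table, and the helper `asset_problems` are all assumptions made for illustration; see How to create dataset assets in TDR for the actual format.

```python
# Columns that exist in the (hypothetical) dataset, keyed by table name.
table_columns = {
    "sample": {"sample_id", "participant_id", "cram_file"},
}

# Hypothetical asset: two of the sample table's three columns.
asset = {
    "name": "sample_files",
    "rootTable": "sample",
    "rootColumn": "sample_id",
    "tables": [
        {"name": "sample", "columns": ["sample_id", "cram_file"]},
    ],
}


def asset_problems(asset: dict, table_columns: dict[str, set[str]]) -> list[str]:
    """List tables or columns the asset names that the dataset doesn't have."""
    problems = []
    for table in asset["tables"]:
        known = table_columns.get(table["name"])
        if known is None:
            problems.append(f"unknown table: {table['name']}")
            continue
        for column in table["columns"]:
            if column not in known:
                problems.append(f"unknown column: {table['name']}.{column}")
    # The root column must also exist in the root table.
    if asset["rootColumn"] not in table_columns.get(asset["rootTable"], set()):
        problems.append(f"unknown root column: {asset['rootTable']}.{asset['rootColumn']}")
    return problems


if __name__ == "__main__":
    print(asset_problems(asset, table_columns))
```

A check like this is most useful when you define assets in the schema JSON up front, since the schema can't be modified once the dataset exists.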
Once you’ve created a snapshot, how do you decide who can access the data? See Streamlining access for approved requestors with DUOS & TDR to learn how to screen researchers who want to access your TDR dataset.
Step 4. Analyze the data
See How to export a TDR snapshot and How to use TDR snapshots with workflows to learn how to import data from TDR into a Terra workspace, and analyze it in the Cloud.