Once you’ve created a dataset on TDR, your data may continue to evolve. Learn how to keep your dataset up-to-date without interacting with API endpoints, using Zebrafish. If you’d rather use APIs to update your data, or if you’re using Terra-on-Azure, see How to ingest data into TDR with APIs.
Three types of data updates
Zebrafish is a web application that allows you to upload and edit Terra Data Repository (TDR) datasets through a web interface, rather than API endpoints.
There are three ways to update TDR data with Zebrafish: append, replace and truncate_reload.
- Append: The “append” action adds rows to new or existing tables in your dataset. For example, you could run an append job to add new samples to your dataset as they come off of the sequencer.
- Replace: The “replace” action updates rows that already exist in a table, or adds new rows if the data do not already exist. In the former case, TDR will find a row with the same primary key (a unique identifier, such as sample id) as the one you’re uploading, and replace that row with the new one. For example, you could run a replace job to update a sample’s age or .bam file id.
- Truncate_reload: The “truncate_reload” action reloads all of your tabular data based on the contents of your Google Cloud bucket. Use this action to clear out all of your tabular data and load everything afresh; for example, if a sample was removed from your study or if your data has gone through many changes.
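The difference between the three actions can be sketched with a toy in-memory model. This is only an illustration of the semantics described above, not how Zebrafish or TDR implement them; the function names, the `sample_id` key, and the row contents are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, str]


def append(table: List[Row], incoming: List[Row]) -> List[Row]:
    """Append: keep every existing row and add the incoming rows."""
    return table + incoming


def replace(table: List[Row], incoming: List[Row], key: str) -> List[Row]:
    """Replace: overwrite rows whose primary key matches, add the rest."""
    merged = {row[key]: row for row in table}
    for row in incoming:
        merged[row[key]] = row  # same key -> row is replaced; new key -> row is added
    return list(merged.values())


def truncate_reload(table: List[Row], staged: List[Row]) -> List[Row]:
    """Truncate_reload: discard the current rows and load only what is staged."""
    return list(staged)


# Hypothetical data used only for illustration.
table = [{"sample_id": "s1", "age": "30"}, {"sample_id": "s2", "age": "41"}]
update = [{"sample_id": "s2", "age": "42"}, {"sample_id": "s3", "age": "25"}]

print(len(append(table, update)))            # 4 rows: nothing is overwritten
print(replace(table, update, "sample_id"))   # s2 updated, s3 added, s1 untouched
print(truncate_reload(table, update))        # only s2 and s3 remain
```

In this model, only replace needs to know which column is the primary key, which mirrors the prerequisite that Replace jobs require primary keys in the schema.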
Unsupported Actions
Zebrafish does not currently support any of the following updating actions:
- Editing a dataset’s schema
- Deleting specific rows
- Deleting a whole dataset
- Changing who has access to a dataset
- Creating an asset (a subset of data table columns)
- Creating a snapshot of a subset of the data (rather than the full dataset)
Prerequisites
- You must be updating an existing TDR dataset where you are a steward or custodian.
- Your TDR billing profile must be shared with Zebrafish (see Add Zebrafish to your billing profile for step-by-step instructions).
- To perform a Replace action, the dataset must have a schema that specifies your data tables’ primary keys.
You cannot change a table's primary keys after adding it to a dataset. If you need to change or add the primary keys for an existing table, you must re-create it.
To check whether a table has primary keys:
- Log into https://data.terra.bio/.
- Click on the dataset’s name in the Dataset tab.
- Find the name of the table that you want to update in the Tables section of the left-hand panel, and click the plus sign next to the table’s name to expand its list of columns.
- Check that at least one column name is followed by an asterisk, which indicates a primary key.
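If you have a copy of your dataset's schema as JSON, a similar check can be sketched in Python. The layout assumed here (a `tables` list whose entries have a `name` and an optional `primaryKey` list) is an assumption about the schema file, and the example schema is hypothetical; adjust the field names to match your own schema.

```python
from typing import Any, Dict, List


def tables_without_primary_key(schema: Dict[str, Any]) -> List[str]:
    """Return the names of tables whose (assumed) primaryKey list is missing or empty."""
    missing = []
    for table in schema.get("tables", []):
        if not table.get("primaryKey"):
            missing.append(table["name"])
    return missing


# Hypothetical schema used only for illustration.
example_schema = {
    "tables": [
        {"name": "sample", "primaryKey": ["sample_id"]},
        {"name": "file_inventory", "primaryKey": None},
    ]
}

print(tables_without_primary_key(example_schema))  # ['file_inventory']
```

Any table this sketch reports would not be eligible for a Replace action.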
Step 1. Update your data on the cloud
Your updated data must be staged in a Google Cloud Storage (GCS) bucket on which Zebrafish’s production service account (zebrafish-prod-mep-cc@broad-dsde-prod.iam.gserviceaccount.com) is a principal and has the storage.objects.list permission.
In addition, your TDR dataset's ingest service account must have access to this bucket. You can find this service account under ingest service account on your dataset's Dataset Summary tab in the TDR web interface.
You do not need to store your updated data in the same bucket in which the rest of your dataset is stored.
See Step 1: Upload your data to the Cloud to learn how to stage your data in a GCS bucket.
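One way to double-check the bucket's permissions is to inspect its IAM policy (a JSON document with a `bindings` list of roles and members). The sketch below reports which service accounts lack a given role. It assumes you grant access through the predefined role `roles/storage.objectViewer`, which includes the list permission; your bucket may grant it through a different role, in which case this check would report a false gap. The dataset ingest service account shown is a hypothetical placeholder.

```python
from typing import Any, Dict, List

ZEBRAFISH_SA = (
    "serviceAccount:zebrafish-prod-mep-cc@broad-dsde-prod.iam.gserviceaccount.com"
)
# Assumption: access is granted through this read-only predefined role.
VIEWER_ROLE = "roles/storage.objectViewer"


def members_missing_role(
    policy: Dict[str, Any], role: str, members: List[str]
) -> List[str]:
    """Return the members that are not bound to `role` in the bucket policy."""
    granted = set()
    for binding in policy.get("bindings", []):
        if binding.get("role") == role:
            granted.update(binding.get("members", []))
    return [member for member in members if member not in granted]


# Hypothetical policy and ingest service account, for illustration only.
policy = {"bindings": [{"role": VIEWER_ROLE, "members": [ZEBRAFISH_SA]}]}
ingest_sa = "serviceAccount:my-dataset-ingest@example-project.iam.gserviceaccount.com"

print(members_missing_role(policy, VIEWER_ROLE, [ZEBRAFISH_SA, ingest_sa]))
```

Here the Zebrafish account is already bound, so only the placeholder ingest service account is reported as missing.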
Step 2. Gather dataset information
Append, Replace, and Truncate_reload jobs all require the following information:
- The TDR Dataset’s name (find this by locating your dataset in the Datasets tab of the TDR website).
- The dataset’s billing profile ID. Find this in your dataset’s Dataset Summary tab on the TDR website.
- The ID for the Google bucket where your data are saved on the Cloud. Find this by logging into https://console.cloud.google.com/storage/browser and clicking on your bucket.
- The path from the Google bucket to your tabular data files. For example, if the files sit in a top-level folder named tabular_data, the path would be tabular_data.
Tip: Find your Google Cloud information in Terra. If you’ve staged your data on the cloud through a Terra workspace, you can find the Google bucket in the Cloud Information section of your workspace’s dashboard, and view the bucket on the Google Console by clicking on open bucket in browser.
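To avoid typos when you enter these values later, it can help to collect and sanity-check them in one place first. The following is a minimal sketch; the class name, its fields, and the assumption that the billing profile ID is UUID-shaped are illustrative choices, not part of Zebrafish.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class UpdateJobInfo:
    """Hypothetical container for the values gathered in this step."""

    dataset_name: str
    billing_profile_id: str
    bucket: str
    tabular_path: str

    def validate(self) -> None:
        if not self.dataset_name.strip():
            raise ValueError("dataset name is required")
        # Assumption: the billing profile ID is a UUID; raises ValueError otherwise.
        uuid.UUID(self.billing_profile_id)
        if not self.bucket.strip():
            raise ValueError("bucket is required")

    def tabular_uri(self) -> str:
        """Combine bucket and path into a gs:// URI for a quick visual check."""
        bucket = self.bucket.strip()
        if bucket.startswith("gs://"):
            bucket = bucket[len("gs://"):]
        return f"gs://{bucket.strip('/')}/{self.tabular_path.strip('/')}"


# Placeholder values for illustration only.
info = UpdateJobInfo(
    dataset_name="my_dataset",
    billing_profile_id="00000000-0000-0000-0000-000000000000",
    bucket="my-bucket",
    tabular_path="/tabular_data/",
)
info.validate()
print(info.tabular_uri())  # gs://my-bucket/tabular_data
```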
Step 3. Update your TDR dataset
Once you have updated your data on the Cloud and gathered the information listed above, you’re ready to translate those changes to your TDR dataset.
1. Log into Zebrafish.
2. Click on the wavy Pipeline Monitoring icon at the top left of the screen.
3. Click on New Ingestion at the top right of the screen.
4. In the window that appears, fill in the name of the dataset that you’re updating in the TDR Dataset field.
5. Under Dataset Type, select Existing.
6. Under Dataset Action, select append, replace, or truncate_reload:
- Append: Adds new rows to your data, if any exist. The new data must comply with the existing data schema.
- Replace: Adds new rows to your data, or updates values on existing rows that match the data’s primary key(s). The data schema must specify the data table’s primary keys for this to work.
- Truncate_reload: Deletes and re-loads the tabular data in the dataset based on the data available in the corresponding Google Cloud bucket.
7. Specify your validation mode and whether you want to create a full-view snapshot for this ingestion.
8. Either choose an existing manifest or create a new manifest for the current job, following the instructions in How to create a dataset and ingest data with Zebrafish.
9. Review your manifest before submitting the job.
Check your manifest before submitting the ingestion: Ensure that the dataset_action field is set to the correct action. If it isn’t set to the correct action, you can directly edit the text in the ingestion configuration box.
Note: if you’re using an existing manifest for this job, the Review & Submit screen will only display the fields that you have changed for the current job. To check the rest of the manifest, click Back to return to the Configuration screen, then click on the eye icon next to the Manifest ID field to preview the full manifest.
10. Click Submit Ingestion.
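If you keep a copy of your manifest outside Zebrafish, the dataset_action check from step 9 can be sketched as a small helper. Only the `dataset_action` field name and its three values come from this article; treating the manifest as a dictionary with that field at the top level is an assumption, and the other field in the example is hypothetical.

```python
from typing import Any, Dict

ALLOWED_ACTIONS = {"append", "replace", "truncate_reload"}


def check_dataset_action(manifest: Dict[str, Any], intended: str) -> None:
    """Raise ValueError if the manifest's dataset_action is not the intended one."""
    if intended not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {intended!r}")
    actual = manifest.get("dataset_action")
    if actual != intended:
        raise ValueError(
            f"manifest has dataset_action={actual!r}, expected {intended!r}"
        )


# Hypothetical manifest fragment for illustration.
manifest = {"dataset_name": "my_dataset", "dataset_action": "append"}
check_dataset_action(manifest, "append")  # passes silently

try:
    check_dataset_action(manifest, "truncate_reload")
except ValueError as err:
    print(err)
```

A mismatch matters most for truncate_reload, because that action clears the existing tabular data before reloading it.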
What to expect
Once you’ve submitted the job through Zebrafish, you can monitor its progress on the Pipeline Monitoring dashboard. Your job’s status will start at “queued,” then change to “running,” then “succeeded.” While the status is “running,” hover your mouse over the status to see which step is currently running. You may need to refresh the page periodically to see these status updates.
Once your job has succeeded, you can see the results by logging into TDR and navigating to your dataset. To see your updated data table, click on View Dataset Data at the top of the left-hand panel, then select the relevant table from the drop-down menu.
If you ran an append job, you should see new rows in the data table.
If you ran a replace job, you should see updated information in rows that were updated, and new rows (if any were available in your Google bucket).
If you ran a truncate_reload job, the rows in your data table should match the rows in your Google Cloud bucket — if you deleted a row in the bucket, it should also be gone from your TDR data table.
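For a truncate_reload job, one way to confirm the result is to compare the primary-key values in your staged files with those in the TDR table. The sketch below assumes you have already collected both lists of keys yourself (for example, by reading your staged file and copying the key column from the TDR data view); the function name and sample keys are hypothetical.

```python
from typing import Dict, Iterable, List


def compare_keys(
    staged_keys: Iterable[str], table_keys: Iterable[str]
) -> Dict[str, List[str]]:
    """Report keys that appear on only one side of the comparison."""
    staged, table = set(staged_keys), set(table_keys)
    return {
        "missing_from_table": sorted(staged - table),
        "unexpected_in_table": sorted(table - staged),
    }


# Hypothetical keys: s2 was deleted from the bucket but is still in the table.
print(compare_keys(["s1", "s3"], ["s1", "s2", "s3"]))
```

After a successful truncate_reload, both lists should be empty. After an append or replace job, rows that were never in the staged files may legitimately remain in the table, so only `missing_from_table` is informative.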
Next steps
- To learn how to make updates that aren’t currently supported in Zebrafish, see How to update a dataset’s schema, How to ingest data into TDR with APIs, and the Swagger API endpoints.
- To share a subset of your TDR dataset, see How to create snapshots in TDR and How to create dataset assets in TDR.
- To run a workflow on your TDR data, see How to use TDR Snapshots with workflows.