How to update TDR data with Zebrafish

Leyla Tarhan

Once you’ve created a dataset on TDR, your data may continue to evolve. Learn how to keep your dataset up-to-date without interacting with API endpoints, using Zebrafish. If you’d rather use APIs to update your data, or if you’re using Terra-on-Azure, see How to ingest data into TDR with APIs.

Three types of data updates

Zebrafish is a web application that allows you to upload and edit Terra Data Repository (TDR) datasets through a web interface, rather than API endpoints. 

There are three ways to update TDR data with Zebrafish: append, replace and truncate_reload.

  • Append: The “append” action adds rows to new or existing tables in your dataset. For example, you could run an append job to add new samples to your dataset as they come off of the sequencer. 
  • Replace: The “replace” action updates rows that already exist in a table, or adds new rows if the data do not already exist. In the former case, TDR will find a row with the same primary key (a unique identifier, such as sample id) as the one you’re uploading, and replace that row with the new one. For example, you could run a replace job to update a sample’s age or .bam file id.
  • Truncate_reload: The “truncate_reload” action reloads all of your tabular data based on the contents of your Google Cloud bucket. Use this action to clear out all of your tabular data and load everything afresh; for example, if a sample was removed from your study or if your data has gone through many changes.
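The difference between the three actions can be illustrated with a small in-memory model. This is only a sketch of the behavior described above, not Zebrafish's or TDR's implementation; the function name `apply_update` and the `sample_id` key are hypothetical, and schema validation is ignored.

```python
from typing import Dict, List

Row = Dict[str, object]


def apply_update(existing: List[Row], staged: List[Row], action: str,
                 key: str = "sample_id") -> List[Row]:
    """Return the table contents after an update, following the article's descriptions."""
    if action == "append":
        # Append only adds rows; rows already in the table are left untouched.
        return existing + staged
    if action == "replace":
        # Replace matches on the primary key: a staged row overwrites the row
        # with the same key, and staged rows with new keys are added.
        merged = {row[key]: row for row in existing}
        for row in staged:
            merged[row[key]] = row
        return list(merged.values())
    if action == "truncate_reload":
        # Truncate_reload discards the current rows and keeps only what is staged.
        return list(staged)
    raise ValueError(f"unknown action: {action!r}")


existing = [{"sample_id": "s1", "age": 30}, {"sample_id": "s2", "age": 41}]
staged = [{"sample_id": "s2", "age": 42}, {"sample_id": "s3", "age": 25}]

for action in ("replace", "truncate_reload"):
    result = apply_update(existing, staged, action)
    print(action, [row["sample_id"] for row in result])
```

In this model, replace keeps `s1`, updates `s2`, and adds `s3`, while truncate_reload leaves only the staged rows `s2` and `s3`.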

Unsupported actions

Zebrafish does not currently support any of the following update actions:

  • Editing a dataset’s schema
  • Deleting specific rows
  • Deleting a whole dataset
  • Changing who has access to a dataset
  • Creating an asset (a subset of data table columns)
  • Creating a snapshot of a subset of the data (rather than the full dataset)

Prerequisites

  1. You must be updating an existing TDR dataset where you are a steward or custodian. 
  2. Your TDR billing profile must be shared with Zebrafish (see Add Zebrafish to your billing profile for step-by-step instructions).
  3. To perform a Replace action, the dataset must have a schema that specifies your data tables’ primary keys.
    You cannot add primary keys after creating the dataset. Instead, you must re-create the dataset using a schema that specifies primary keys. To check whether a table has a primary key:
        1. Log into https://data.terra.bio/.
        2. Click on the dataset’s name in the Datasets tab.
        3. Find the name of the table that you want to update in the Tables section of the left-hand panel, and click the plus sign next to the table’s name to expand its list of columns.
        4. Check that at least one column name is followed by an asterisk, which indicates a primary key.
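If you still have the schema file that was used to create the dataset, you could also check it for primary keys offline. The sketch below assumes the schema is JSON with a `tables` list in which each table has a `name` and an optional `primaryKey` list; those field names are assumptions, so compare them against your own schema file before relying on the result.

```python
import json
from typing import List


def tables_missing_primary_keys(schema: dict) -> List[str]:
    """Names of tables whose schema entry has no (or an empty) primaryKey list."""
    return [t["name"] for t in schema.get("tables", []) if not t.get("primaryKey")]


# Hypothetical two-table schema, for illustration only.
schema = json.loads("""
{"tables": [
  {"name": "sample",
   "columns": [{"name": "sample_id", "datatype": "string"}],
   "primaryKey": ["sample_id"]},
  {"name": "file_inventory",
   "columns": [{"name": "path", "datatype": "string"}]}
]}
""")

print(tables_missing_primary_keys(schema))  # ['file_inventory']
```

Any table listed in the output would not be eligible for a Replace action.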

Step 1. Update your data on the cloud

Your updated data must be staged in a Google Cloud Storage (GCS) bucket on which Zebrafish’s production service account (zebrafish-prod-mep-cc@broad-dsde-prod.iam.gserviceaccount.com) is a principal with the storage.objects.list permission.

In addition, your TDR dataset's ingest service account must have access to this bucket. This account is listed as the ingest service account on your dataset's Dataset Summary tab in the TDR web interface.

You do not need to store your updated data in the same bucket in which the rest of your dataset is stored.

See Step 1: Upload your data to the Cloud to learn how to stage your data in a GCS bucket.
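To confirm that a service account is a principal on the bucket, you could list the roles that the bucket's IAM policy grants it. The sketch below is one possible check, not part of Zebrafish. The helper names are hypothetical, `fetch_bucket_bindings` assumes the `google-cloud-storage` package is installed and that you are allowed to read the bucket's IAM policy, and the example bindings are made up. It only sees roles granted directly on the bucket, not roles inherited from the project or granted through a group.

```python
from typing import Iterable, List

ZEBRAFISH_SA = (
    "serviceAccount:zebrafish-prod-mep-cc@broad-dsde-prod.iam.gserviceaccount.com"
)


def roles_for_member(bindings: Iterable[dict], member: str) -> List[str]:
    """Roles that a bucket's IAM bindings grant directly to `member`."""
    return sorted(b["role"] for b in bindings if member in b.get("members", ()))


def fetch_bucket_bindings(bucket_name: str) -> List[dict]:
    """Read a bucket's IAM bindings with the google-cloud-storage client."""
    # Imported lazily so roles_for_member works without the package installed.
    from google.cloud import storage

    bucket = storage.Client().bucket(bucket_name)
    policy = bucket.get_iam_policy(requested_policy_version=3)
    return [{"role": b["role"], "members": set(b["members"])} for b in policy.bindings]


# Made-up bindings, shaped like the output of fetch_bucket_bindings.
example_bindings = [
    {"role": "roles/storage.objectViewer", "members": {ZEBRAFISH_SA}},
    {"role": "roles/storage.admin", "members": {"user:me@example.org"}},
]
print(roles_for_member(example_bindings, ZEBRAFISH_SA))
```

An empty list means the account has no direct role on the bucket. The same check can be repeated with your dataset's ingest service account in place of `ZEBRAFISH_SA`.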

Step 2. Gather dataset information

Append, Replace, and Truncate_reload jobs all require the following information:

  1. The TDR Dataset’s name (find this by locating your dataset in the Datasets tab of the TDR website).
  2. The dataset’s billing profile id. Find this in your dataset’s Dataset Summary tab on the TDR website:
    Screenshot showing the dataset summary tab for an example dataset on the TDR website. An orange rectangle highlights the default billing profile id for this dataset.
  3. The id for the Google bucket where your data are saved on the Cloud.
    Find this by logging into https://console.cloud.google.com/storage/browser and clicking on your bucket.
    Screenshot showing an example Google bucket in the Google Cloud console. Orange arrows highlight two locations where you can find your bucket's id (redacted from this example for privacy).
  4. The path from the Google bucket to your tabular data files. In the example below, the path would be tabular_data.
    Screenshot showing a folder containing tabular data in an example Google bucket. An orange rectangle highlights the name of this folder.
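It can help to collect these four values in one place before opening Zebrafish. The sketch below is only a note-taking aid. The class and field names are hypothetical, the values are placeholders, and whether Zebrafish expects the bucket id with or without a `gs://` prefix is not stated here, so the code accepts both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestInputs:
    dataset_name: str
    billing_profile_id: str
    bucket_id: str
    tabular_path: str

    def __post_init__(self) -> None:
        # All four values are required for every job type.
        for field_name, value in vars(self).items():
            if not value.strip():
                raise ValueError(f"{field_name} is empty")

    @property
    def staging_uri(self) -> str:
        """gs:// location of the tabular data files."""
        bucket = self.bucket_id
        if bucket.startswith("gs://"):
            bucket = bucket[len("gs://"):]
        return f"gs://{bucket.strip('/')}/{self.tabular_path.strip('/')}"


inputs = IngestInputs(
    dataset_name="my_dataset",  # placeholder
    billing_profile_id="00000000-0000-0000-0000-000000000000",  # placeholder
    bucket_id="gs://my-bucket/",  # placeholder
    tabular_path="tabular_data",
)
print(inputs.staging_uri)  # gs://my-bucket/tabular_data
```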

Tip: Find your Google Cloud information in Terra. If you’ve staged your data on the cloud through a Terra workspace, you can find the Google bucket in the Cloud Information section of your workspace’s dashboard, and view the bucket on the Google Console by clicking on open bucket in browser.

Step 3. Update your TDR dataset

Once you have updated your data on the Cloud and gathered the information listed above, you’re ready to translate those changes to your TDR dataset.

1. Log into Zebrafish.

2. Click on the wavy Pipeline Monitoring icon at the top left of the screen.

Screenshot showing the Pipeline Monitoring dashboard on the Zebrafish website. An orange rectangle highlights the pipeline monitoring icon on the upper left of the screen.

3. Click on New Ingestion at the top right of the screen.

4. In the window that appears, fill in the name of the dataset that you’re updating in the TDR Dataset field.

Screenshot of the new ingestion form for an example dataset on Zebrafish. An orange rectangle and the number 4 highlight the TDR Dataset field. An orange rectangle and the number 5 highlight the 'Existing' button under Dataset Type. An orange number 6 highlights the Dataset Action field, where there are options for Append, Replace, and Truncate_Reload.

5. Under Dataset Type, select Existing.

6. Under Dataset Action, select append, replace, or truncate_reload:

  • Append: Adds new rows to your data, if any exist. The new data must comply with the existing data schema.
  • Replace: Adds new rows to your data, or updates values on existing rows that match the data’s primary key(s). The data schema must specify the data table’s primary keys for this to work.
  • Truncate_reload: Deletes and re-loads the tabular data in the dataset based on the data available in the corresponding Google Cloud bucket.

7. Specify your validation mode and whether you want to create a full-view snapshot for this ingestion. 

8. Either choose an existing manifest or create a new manifest for the current job, following the instructions in How to create a dataset and ingest data with Zebrafish.

9. Review your manifest before submitting the job.

Check your manifest before submitting the ingestion: ensure that the dataset_action field is set to the correct action. If it isn’t, you can directly edit the text in the ingestion configuration box.
Screenshot showing the manifest for an example 'append' job. An orange rectangle highlights the 'dataset_action' field, which is set to 'APPEND'.

  • Note: if you’re using an existing manifest for this job, the Review & Submit screen will only display the fields that you have changed for the current job. To check the rest of the manifest, click Back to return to the Configuration screen, then click on the eye icon next to the Manifest ID field to preview the full manifest.

10. Click Submit Ingestion.
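The manifest check in step 9 can also be done on a copy of the manifest text. The sketch below assumes the manifest is JSON with a top-level `dataset_action` field. Only the value 'APPEND' is confirmed by the example above, so the format and the other values are assumptions. The function name is hypothetical.

```python
import json


def check_dataset_action(manifest_text: str, intended: str) -> str:
    """Raise if the manifest's dataset_action differs from the intended action."""
    manifest = json.loads(manifest_text)
    actual = str(manifest.get("dataset_action", "")).upper()
    if actual != intended.upper():
        raise ValueError(
            f"manifest has dataset_action={actual!r}, expected {intended.upper()!r}"
        )
    return actual


# Hypothetical manifest fragment copied from the ingestion configuration box.
manifest_text = '{"dataset_action": "APPEND"}'
print(check_dataset_action(manifest_text, "append"))  # APPEND
```

A mismatch raises a `ValueError` that names both values, which is a cue to edit the field before submitting.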

What to expect

Once you’ve submitted the job through Zebrafish, you can monitor its progress on the Pipeline Monitoring dashboard. Your job’s status will start at “queued,” then change to “running,” then “succeeded.” While the status is “running,” hover your mouse over the status to see which step is currently running. You may need to refresh the page periodically to see these status updates.

Once your job has succeeded, you can see the results by logging into TDR and navigating to your dataset. To see your updated data table, click on View Dataset Data at the top of the left-hand panel, then select the relevant table from the drop-down menu.

Screenshot of an example dataset on the TDR website. An orange rectangle highlights the 'View Dataset Data' button on the left-hand panel of the dataset summary page.
Screenshot showing how to view the data in a specific data table. An orange rectangle highlights the drop-down menu used to select a specific table.

If you ran an append job, you should see new rows in the data table.

If you ran a replace job, you should see updated information in rows that were updated, and new rows (if any were available in your Google bucket).

If you ran a truncate_reload job, the rows in your data table should match the rows in your Google Cloud bucket. If you deleted a row in the bucket, it should also be gone from your TDR data table.
