How to update TDR data using APIs

Anton Kovalsky

Learn how to use APIs to update the values stored in a Terra Data Repository (TDR) dataset. To learn how to update TDR data through a web interface instead, read How to update TDR data with Zebrafish.

Overview: Updating data with soft delete

You can update data in a TDR table in two ways: through the Zebrafish web interface, or with the "soft delete" action, which is available through the TDR's Swagger API interface. This article explains the "soft delete" approach.

  • Step 1: Gather the datarepo_row_ids and row values for the rows that need the update. Store the row IDs in a CSV file and upload the CSV to a cloud bucket.
  • Step 2: Send the soft delete manifest (CSV) to the soft delete API to remove the data you wish to change.
  • Step 3: Use the ingest API to add new rows of data, using row values from step 1.

Step 1: Create Manifest for Data to be Deleted

1.1 The scenario

The example below shows a table named “sample”. The goal is to update the data to correct a participant name from "Jill" to "Jillian".

{"sample_id": "sample_5", "participant_id": "jill", "file": 
"3ca3b18e-62a5-4b7c-b104-506a04a71abc", "type": "tumor"}

{"sample_id": "sample_6", "participant_id": "jill", "file":
"494c60e1-ce54-478f-8b19-80b51dbb0cb0", "type": "normal"}

Dataset UUID: d9f5c80b-6435-486d-93fd-e71a34b51686

Profile UUID: 390e7a85-d47f-4531-b612-165fc977d3bd

Table name: “sample”

BigQuery location: `broad-jade-dev-data.datarepo_example_dr_dataset.sample`

1.2 Querying row IDs and values

The first thing to do is to run a query that gathers the relevant row IDs and row values (scroll right in the box below to see the full query):

bq query --format prettyjson --nouse_legacy_sql 'SELECT * FROM `broad-jade-dev-data.datarepo_example_dr_dataset.sample` where sample_id = "sample_6"'

Waiting on bqjob_r45dddd9c12162066_0000017c805b1a2b_1 ... (0s) Current status: DONE   

[
  {
    "datarepo_row_id": "00540f07-4615-44ea-b99c-54ab1b558a08",
    "file": "494c60e1-ce54-478f-8b19-80b51dbb0cb0",
    "participant_id": "Jill",
    "sample_id": "sample_6",
    "type": "normal"
  }
]

1.3 Store the row IDs

Store the row IDs in a CSV file and stage it by uploading it to a cloud bucket. For this example, the bucket location and file name for the CSV you just created could be something like:

gs://staging-bucket/soft-delete.csv

In our example, this file just has one value, the datarepo_row_id: 00540f07-4615-44ea-b99c-54ab1b558a08

This is the file you will send to the soft delete API in the next step. The API will use this value to find the row to delete.
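
If you're working from the command line, a minimal sketch of this step might look like the following (it assumes gsutil is installed and that you have write access to the example bucket; the CSV contains just the row ID, one value per line):

echo "00540f07-4615-44ea-b99c-54ab1b558a08" > soft-delete.csv   # one datarepo_row_id per line
gsutil cp soft-delete.csv gs://staging-bucket/soft-delete.csv   # stage the manifest in the cloud bucket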

Make sure that TDR can access your bucket
Remember that the bucket where you save the CSV with the to-be-deleted row IDs needs to have the proper permissions in order for TDR's service account to be able to access it. If you haven't added Storage Object Viewer access for the TDR to this bucket, you can find instructions on how to do this at the top of the article on ingesting data.

1.4 Store the row values that will be used later

The previous step helps you delete the data you wish to change, but the file itself will remain in the dataset's virtual file system. The last step of this tutorial is to add back data pointing to that file. For this you'll need the value that points to the file itself, which is stored in the "file" field returned by the query in step 1.2. Store this value somewhere so that you can later build a JSON object containing the file ID and the updated data, as shown below.

{
  "file": "494c60e1-ce54-478f-8b19-80b51dbb0cb0",
  "participant_id": "Jillian",
  "sample_id": "sample_6",
  "type": "normal"
}
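
One way to produce this JSON is to re-run the query from step 1.2 and edit the result on the command line. The sketch below is only an illustration and assumes the jq CLI is installed; it drops datarepo_row_id (TDR assigns row IDs at ingest time), corrects the participant name, and writes the result to a local file for use in step 3:

bq query --format json --nouse_legacy_sql \
  'SELECT * FROM `broad-jade-dev-data.datarepo_example_dr_dataset.sample` where sample_id = "sample_6"' \
  | jq -c '.[0] | del(.datarepo_row_id) | .participant_id = "Jillian"' \
  > samples.upsert.json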

Step 2: Soft Delete API

Once you've staged your soft delete manifest CSV in the cloud, you're ready to delete the rows you want to remove from your dataset. Remember that once a row is gone, any files it pointed to will still be present in the dataset's virtual file system, but they will no longer be visible through the TDR UI, and they can't be shared in any snapshots until you add back data pointing to those files.

Remember to authorize Swagger every time you use it
This article includes instructions on using API commands through the Swagger UI. All instructions related to Swagger require you to first authenticate yourself whenever you open a window with the Swagger UI.

Instructions
- Click Authorize near the top of the page.
- Choose an authentication method: googleoauth or oidc.
- If you're running the createProfile endpoint to create a google-backed billing profile, use the googleoauth authentication and check all four boxes (including the last one about billing, which may not be checked by default).
- Otherwise, choose either authentication method, but not both.
- Click Authorize again.
- Input the appropriate credentials.
- When you close the pop-up window, do not click Sign Out.

You should now be able to execute the commands below by clicking the “Try it out” button next to the command of your choice. For a more detailed description of this authentication step, see this article on Authenticating in Swagger.

2.1 Use the soft delete API

softDelete request body:

{
 "deleteType": "soft", 
 "specType": "gcsFile",
 "tables": [
   {
     "gcsFileSpec": {
       "fileType": "csv",
       "path": "gs://ruchi-staging-bucket/soft_delete.csv"
     },
     "tableName": "sample"
   }
 ]
}

Request body parameters

  • “deleteType” will be “soft”, and “specType” will be “gcsFile”, because the list of rows to be deleted is provided in a file in GCS.
  • “tables” is the list of tables that the updates are being applied to. Note that a separate soft delete CSV manifest has to be created for each table being updated.
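
If you'd rather not use the Swagger UI, the same request can be sent with curl. This is only a sketch: the host and endpoint path below (POST /api/repository/v1/datasets/{id}/deletes on data.terra.bio) are assumptions you should confirm against the Swagger page for your TDR environment, and the example reuses the Dataset UUID from step 1.1:

DATASET_ID="d9f5c80b-6435-486d-93fd-e71a34b51686"

curl -X POST "https://data.terra.bio/api/repository/v1/datasets/${DATASET_ID}/deletes" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
        "deleteType": "soft",
        "specType": "gcsFile",
        "tables": [
          {
            "gcsFileSpec": { "fileType": "csv", "path": "gs://staging-bucket/soft-delete.csv" },
            "tableName": "sample"
          }
        ]
      }'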

2.2 Validate the delete

The easiest way to validate that the soft delete finished is to look at the dataset in the Data Repo and check that the soft-deleted rows are no longer visible.
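
You can also re-run the query from step 1.2. Assuming the dataset's BigQuery view filters out soft-deleted rows and the delete job has completed, it should now return no rows:

bq query --format prettyjson --nouse_legacy_sql 'SELECT * FROM `broad-jade-dev-data.datarepo_example_dr_dataset.sample` where sample_id = "sample_6"'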

Step 3: Re-Ingest Updated Data

The final step is to use the ingest API to re-ingest data pointing to the files whose rows you're updating.

3.1 Build your updated JSON

Use the JSON you acquired with the query in step 1.2 and modify it as shown in step 1.4, or however you want to modify it with updated data.

JSON example

{
   "file": "494c60e1-ce54-478f-8b19-80b51dbb0cb0", 
   "participant_id": "Jillian",
   "sample_id": "sample_6", 
   "type": "normal"
 }

3.2 Store the JSON in a cloud bucket 

As in step 1.3, you'll want to stage the JSON above as a separate file in a Google Cloud bucket (again, make sure the TDR has been granted access to this bucket, as described in the article on ingesting data). In this example the path to this file would look something like this:

gs://staging-bucket/samples.upsert.json
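
A minimal command-line sketch of this staging step, assuming gsutil is installed and the updated JSON was saved locally as samples.upsert.json (as in the sketch in step 1.4):

gsutil cp samples.upsert.json gs://staging-bucket/samples.upsert.json   # stage the updated row for ingest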

3.3 Re-ingest updated values

Use the ingest API to ingest the updated data in the JSON file by executing the API request.

ingest request body example

{
 "format": "json",
 "ignore_unknown_values": true,
 "load_tag": "soft-delete-upsert-1",
 "max_bad_records": 0,
 "path": "gs://staging-bucket/samples.upsert.json",
 "profile_id": "390e7a85-d47f-4531-b612-165fc977d3bd",
 "resolve_existing_files": true,
 "table": "sample"
}

Note that "resolve_existing_files" is set to true, as you may want to update the file ID.
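
As with the soft delete call, you can send this request outside the Swagger UI with curl. Again, this is only a sketch: the ingest endpoint path (POST /api/repository/v1/datasets/{id}/ingest) and host are assumptions to verify against your Swagger page, and the dataset UUID is the one from step 1.1:

DATASET_ID="d9f5c80b-6435-486d-93fd-e71a34b51686"

curl -X POST "https://data.terra.bio/api/repository/v1/datasets/${DATASET_ID}/ingest" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
        "format": "json",
        "ignore_unknown_values": true,
        "load_tag": "soft-delete-upsert-1",
        "max_bad_records": 0,
        "path": "gs://staging-bucket/samples.upsert.json",
        "profile_id": "390e7a85-d47f-4531-b612-165fc977d3bd",
        "resolve_existing_files": true,
        "table": "sample"
      }'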

How to validate updates

You should now be able to validate that this update process worked by going to the dataset via the TDR UI and finding the re-populated row with the updated data.
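
If you want to double-check from the command line as well, re-running the query from step 1.2 should return the re-ingested row with "participant_id": "Jillian" once the ingest job has completed:

bq query --format prettyjson --nouse_legacy_sql 'SELECT * FROM `broad-jade-dev-data.datarepo_example_dr_dataset.sample` where sample_id = "sample_6"'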


