How to soft-delete and re-ingest TDR data using APIs

Anton Kovalsky

Learn how to use APIs to remove and replace rows from a data table stored in a Terra Data Repository dataset, which is particularly useful if your tables' primary keys are not set.

To learn about alternative ways to update your TDR data using APIs, read How to update TDR data with APIs. If your data are stored on Google Cloud, you can update TDR data through a web interface instead; see How to update TDR data with Zebrafish.

Overview: Updating TDR data tables

You can update data in a TDR table in three ways:

  1. Zebrafish - The Zebrafish web interface allows you to update your data without navigating API endpoints. Zebrafish is only available for datasets stored on the Google Cloud. Learn more by reading How to update TDR data with Zebrafish.
  2. Ingest data with Swagger - Running a replace or merge ingest job through TDR's ingestDataset API endpoint will overwrite data in an existing data table to reflect changes. Learn more by reading How to ingest data into TDR with APIs. Your table's primary key(s) must be set to use this approach.
  3. Soft-delete and re-ingest data with Swagger - Use the applyDatasetDataDeletion API endpoint to soft-delete specific rows from a data table, then use the ingestDataset API endpoint to upload new data. This is useful when you need to completely clear out and refresh old data, or when your tables do not already have primary keys. This article explains the "soft delete" approach.

Overview: updating data tables with soft-delete

There are three steps to soft-deleting your data:

  • Step 1: Gather the datarepo_row_ids and row values for the rows that need the update. Store the row IDs in a CSV and upload the CSV to a cloud bucket.
  • Step 2: Send the soft delete manifest (CSV) to the soft delete API to remove the data you wish to change.
  • Step 3: Use the ingest API to add new rows of data, using row values from step 1.

Step 1: Create Manifest for Data to be Deleted

1.1: The scenario

The example below shows a table named “sample”. The goal is to update the data to correct a participant name (participant_id) from "Jill" to "Jillian".

{"sample_id": "sample_5", "participant_id": "jill", "file": 
"3ca3b18e-62a5-4b7c-b104-506a04a71abc", "type": "tumor"}

{"sample_id": "sample_6", "participant_id": "jill", "file":
"494c60e1-ce54-478f-8b19-80b51dbb0cb0", "type": "normal"}

Dataset UUID: d9f5c80b-6435-486d-93fd-e71a34b51686

Profile UUID: 390e7a85-d47f-4531-b612-165fc977d3bd

Table name: “sample”

BigQuery location: `broad-jade-dev-data.datarepo_example_dr_dataset.sample`

1.2 Query row IDs and values

The first thing to do is to run a query that gathers the relevant row ids and row values (scroll right in the box below to see the full query): 

bq query --format prettyjson --nouse_legacy_sql 'SELECT * FROM `broad-jade-dev-data.datarepo_example_dr_dataset.sample` where sample_id = "sample_6"'

Waiting on bqjob_r45dddd9c12162066_0000017c805b1a2b_1 ... (0s) Current status: DONE   

[
  {
    "datarepo_row_id": "00540f07-4615-44ea-b99c-54ab1b558a08",
    "file": "494c60e1-ce54-478f-8b19-80b51dbb0cb0",
    "participant_id": "Jill",
    "sample_id": "sample_6",
    "type": "normal"
  }
]

1.3 Store the row IDs

First, store the row IDs in a CSV file and stage it by uploading it to a cloud bucket. For this example, the bucket location and file name for the CSV could be something like:

gs://staging-bucket/soft-delete.csv

In our example, this file just has one value, the datarepo_row_id: 00540f07-4615-44ea-b99c-54ab1b558a08

This is the file you will send to the soft delete API in the next step. The API will use this value to find the row to delete.
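As a rough sketch, you could create and stage the manifest from the command line (the local file name and bucket below are the placeholders used elsewhere in this article; substitute your own):

# Write one datarepo_row_id per line to the manifest CSV
echo "00540f07-4615-44ea-b99c-54ab1b558a08" > soft-delete.csv

# Copy the manifest to your staging bucket
gsutil cp soft-delete.csv gs://staging-bucket/soft-delete.csv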

Make sure that TDR can access your bucket
Remember that the bucket where you save the CSV with the to-be-deleted row IDs needs to have the proper permissions in order for TDR's service account to be able to access it. If you haven't granted Storage Object Viewer access to TDR on this bucket, you can find instructions on how to do this at the top of the article on ingesting data.

1.4 Store the row values that will be used later

The previous step helps you delete the data you wish to change, but the file itself will remain in the dataset's virtual file system. The last step of this tutorial is to add back data pointing to the file again. For this you'll need the value that points to the file itself, which is stored in the "file" field in the query from step 1.2. Store this value somewhere so that you can later build a JSON object containing the file ID and the updated data, as shown below:

{
   "file": "494c60e1-ce54-478f-8b19-80b51dbb0cb0", 
   "participant_id": "Jillian",
   "sample_id": "sample_6", 
   "type": "normal"
 }
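If you have jq installed, one way to build this object from the query output in step 1.2 is to drop the datarepo_row_id field and overwrite participant_id. This is only a sketch; query-result.json is a hypothetical local file holding the array returned by the bq query:

# -c keeps the record on a single line; the result becomes the file you stage in step 3.2
jq -c '.[0] | del(.datarepo_row_id) | .participant_id = "Jillian"' query-result.json > samples.upsert.json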

Step 2: Soft Delete API

Once you've staged your soft delete manifest CSV in the cloud, you're ready to delete the data you want to remove from your dataset. Remember that once a row is gone, any files it pointed to will still be present in the dataset's virtual file system, but those files will no longer be visible through the TDR UI, and they cannot be shared in any snapshots until you add back data pointing to them.

Remember to authorize Swagger every time you use it
See How to authenticate/troubleshoot Swagger for TDR for step-by-step instructions.

2.1 Use the soft delete API.

applyDatasetDataDeletion request body:

{
 "deleteType": "soft", 
 "specType": "gcsFile",
 "tables": [
   {
     "gcsFileSpec": {
       "fileType": "csv",
       "path": "gs://ruchi-staging-bucket/soft_delete.csv"
     },
     "tableName": "sample"
   }
 ]
}

Request body parameters

  • “deleteType” will be “soft”, and “specType” will be “gcsFile”, since the information about what needs to be deleted is provided in a file in GCS.
  • “tables” is the list of tables the updates are being applied to. Note that a separate soft delete CSV manifest has to be created for each table being updated.
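If you'd rather submit the request from the command line than through the Swagger page, a curl sketch of the same call is shown below. It assumes the production TDR host (data.terra.bio) and the applyDatasetDataDeletion path shown in Swagger, and uses gcloud to supply the bearer token; substitute your own dataset UUID and manifest path.

curl -X POST "https://data.terra.bio/api/repository/v1/datasets/d9f5c80b-6435-486d-93fd-e71a34b51686/deletes" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
        "deleteType": "soft",
        "specType": "gcsFile",
        "tables": [
          {
            "gcsFileSpec": { "fileType": "csv", "path": "gs://staging-bucket/soft-delete.csv" },
            "tableName": "sample"
          }
        ]
      }'

Like other TDR write operations, this should return a job ID that you can monitor (for example, through the jobs endpoints in Swagger) until the delete completes.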

2.2 Validate the delete

The easiest way to validate that the soft delete finished is to look at the dataset in the Data Repo and confirm that the soft-deleted rows are no longer visible.
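If your dataset is stored on Google Cloud, you can also re-run the BigQuery query from step 1.2 and confirm that it no longer returns the deleted row:

bq query --format prettyjson --nouse_legacy_sql 'SELECT * FROM `broad-jade-dev-data.datarepo_example_dr_dataset.sample` where sample_id = "sample_6"'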

Step 3: Re-Ingest Updated Data

The final step is to use the ingest API to re-ingest rows that point to the files whose values you're updating.

3.1 Build your updated JSON

Use the JSON you acquired with the query in step 1.2 and modify it as shown in step 1.4, or however you want to modify it with updated data.

JSON example

{
   "file": "494c60e1-ce54-478f-8b19-80b51dbb0cb0", 
   "participant_id": "Jillian",
   "sample_id": "sample_6", 
   "type": "normal"
 }

3.2 Store the JSON in a cloud bucket 

Similarly to step 1.3, you'll want to stage the JSON above as a separate file in a Google bucket (again, make sure the TDR has been granted access to this bucket, as described in the article on ingesting data). In this example the path to this file would look something like this:

gs://staging-bucket/samples.upsert.json
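As in step 1.3, you can stage the file with gsutil. This is a minimal sketch, assuming you saved the JSON locally as samples.upsert.json:

# Copy the updated record to your staging bucket
gsutil cp samples.upsert.json gs://staging-bucket/samples.upsert.json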

3.3 Re-ingest updated values

Use the ingestDataset API endpoint to ingest the updated data from the JSON file by executing the API request.

ingest request body example

{
 "format": "json",
 "ignore_unknown_values": true,
 "load_tag": "soft-delete-upsert-1",
 "max_bad_records": 0,
 "path": "gs://ruchi-staging-bucket/samples.upsert.json",
 "profile_id": "390e7a85-d47f-4531-b612-165fc977d3bd",
 "resolve_existing_files": true, #set to true, as you may want to update file id
 "table": "sample"
}
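As with the soft delete in step 2, you can submit this request with curl instead of the Swagger page. The sketch below makes the same assumptions as before (production TDR host, the ingestDataset path shown in Swagger, gcloud for the bearer token); substitute your own dataset UUID and bucket path.

curl -X POST "https://data.terra.bio/api/repository/v1/datasets/d9f5c80b-6435-486d-93fd-e71a34b51686/ingest" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
        "format": "json",
        "ignore_unknown_values": true,
        "load_tag": "soft-delete-upsert-1",
        "max_bad_records": 0,
        "path": "gs://staging-bucket/samples.upsert.json",
        "profile_id": "390e7a85-d47f-4531-b612-165fc977d3bd",
        "resolve_existing_files": true,
        "table": "sample"
      }'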

How to validate updates

You should now be able to validate that this update process worked by going to the dataset via the TDR UI and finding the re-populated row with the updated data.
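If your dataset is on Google Cloud, you can also re-run the query from step 1.2; the row should come back with a new datarepo_row_id and the updated participant_id of "Jillian".

bq query --format prettyjson --nouse_legacy_sql 'SELECT * FROM `broad-jade-dev-data.datarepo_example_dr_dataset.sample` where sample_id = "sample_6"'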


