How to update metadata using soft delete

Anton Kovalsky

Learn how to update metadata (i.e. row values in an asset table) in the Terra Data Repository.

Overview: Updating metadata with soft delete

The current best practice for updating metadata in a TDR table is to use the "soft delete" action available through the TDR's Swagger API interface.

  • Step 1: Gather the datarepo_row_ids and row values for the rows that need the update. Store the row IDs in a CSV and upload the CSV to a cloud bucket.
  • Step 2: Send the soft delete manifest (CSV) to the soft delete API to remove the metadata you wish to change.
  • Step 3: Use the ingest API to add new rows of data, using row values from step 1.

Step 1: Create Manifest for Data to be Deleted

1.1: The scenario

The example below shows a table named "sample". The goal is to update the metadata, correcting the participant name from "Jill" to "Jillian".

The same process applies when updating any data type in a given row. If you are updating a data file, the new file must be ingested first so that you have the file ID to use in the ingest.

{"sample_id": "sample_5", "participant_id": "jill", "file": 
"3ca3b18e-62a5-4b7c-b104-506a04a71abc", "type": "tumor"}

{"sample_id": "sample_6", "participant_id": "jill", "file":
"494c60e1-ce54-478f-8b19-80b51dbb0cb0", "type": "normal"}

Dataset UUID: d9f5c80b-6435-486d-93fd-e71a34b51686

Profile UUID: 390e7a85-d47f-4531-b612-165fc977d3bd

Table name: “sample”

BigQuery location: `broad-jade-dev-data.datarepo_example_dr_dataset.sample`

1.2: Querying row ID and values 

The first thing to do is run a query that gathers the relevant row IDs and row values:

bq query --format prettyjson --nouse_legacy_sql 'SELECT * FROM `broad-jade-dev-data.datarepo_example_dr_dataset.sample` where sample_id = "sample_6"'

Waiting on bqjob_r45dddd9c12162066_0000017c805b1a2b_1 ... (0s) Current status: DONE   

[
  {
    "datarepo_row_id": "00540f07-4615-44ea-b99c-54ab1b558a08",
    "file": "494c60e1-ce54-478f-8b19-80b51dbb0cb0",
    "participant_id": "Jill",
    "sample_id": "sample_6",
    "type": "normal"
  }
]

1.3 Store the row IDs

Next, store the row IDs in a CSV file and stage it by uploading it to a cloud bucket. For this example, the bucket location and file name for the CSV could be something like:

gs://staging-bucket/soft-delete.csv

In our example, this file just has one value, the datarepo_row_id: 00540f07-4615-44ea-b99c-54ab1b558a08
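
As a minimal sketch, assuming the example row ID above and an example bucket named staging-bucket, the manifest could be created and staged like this:

# The manifest is just the datarepo_row_id(s) to delete, one per line, with no header
echo "00540f07-4615-44ea-b99c-54ab1b558a08" > soft-delete.csv

# Stage the manifest in the cloud bucket (example bucket name)
gsutil cp soft-delete.csv gs://staging-bucket/soft-delete.csv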

This is the file you will send to the soft delete API in the next step; the API will use this value to find the row of metadata to delete. Remember that this bucket needs the proper permissions for the TDR's service account to access it. If you haven't granted the TDR Storage Object Viewer access on this bucket, you can find instructions at the top of the article on ingesting data.

1.4 Store the row values that will be used later

The previous step lets you delete the metadata you wish to change, but the file itself will remain in the dataset's virtual file system. The last step of this tutorial is to add back metadata pointing to the file. For this you'll need the value that points to the file itself, which is stored in the "file" field returned by the query in step 1.2. Store this value somewhere so that you can later build a JSON object containing the file ID and the updated metadata, as shown below:

{
  "file": "494c60e1-ce54-478f-8b19-80b51dbb0cb0",
  "participant_id": "Jillian",
  "sample_id": "sample_6",
  "type": "normal"
}

Step 2: Soft Delete API

Once you have your soft delete manifest CSV staged, you're ready to delete the metadata. Remember that once the metadata is gone, any files the metadata pointed to will still be present in the dataset's virtual file system in the TDR, but they will no longer be visible through the TDR UI, and they cannot be included in any snapshots until you add back metadata pointing to them.

Remember to authorize Swagger every time you use it. This article includes instructions on using API commands through the Swagger UI. All instructions related to Swagger require you to first authenticate yourself whenever you open a window with the Swagger UI.

Instructions
Click “Authorize” near the top of the page, check all of the boxes in the pop up and hit “Authorize” again, and then input the appropriate credentials to authenticate. Make sure you close the subsequent pop up without clicking the “Sign Out” button.

You should now be able to execute the commands below by clicking the “Try it out” button next to the command of your choice. For a more detailed description of this authentication step, see this article on Authenticating in Swagger.

Use the soft delete API, supplying the dataset UUID from step 1.1 and the request body below.

softDelete request body

{
 "deleteType": "soft", 
 "specType": "gcsFile",
 "tables": [
   {
     "gcsFileSpec": {
       "fileType": "csv",
       "path": "gs://ruchi-staging-bucket/soft_delete.csv"
     },
     "tableName": "sample"
   }
 ]
}

Request body parameters

  • "deleteType" will be "soft", and "specType" will be "gcsFile", since the information about what needs to be deleted is provided in a file in GCS.
  • "tables" is the list of tables that the updates are being applied to. Note that a separate soft delete CSV manifest has to be created for each table being updated.
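
Outside of the Swagger UI, the same request can also be submitted directly with curl. The sketch below is only illustrative: it assumes the data deletion endpoint lives at /api/repository/v1/datasets/{id}/deletes on the data.terra.bio host and that you are authenticated with gcloud; confirm the exact host and path on your TDR Swagger page.

# Hypothetical host and endpoint path -- confirm both on your TDR Swagger page
DATASET_ID="d9f5c80b-6435-486d-93fd-e71a34b51686"

curl -X POST "https://data.terra.bio/api/repository/v1/datasets/${DATASET_ID}/deletes" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @soft-delete-request.json   # the request body shown above, saved locally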

Validating the delete

The easiest way to validate that the soft delete finished is to look at the dataset in the Data Repo UI and confirm that the soft deleted rows are no longer visible. You can also re-run the query from step 1.2, as shown below.
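
Assuming the dataset's BigQuery view excludes soft deleted rows, the step 1.2 query should now return zero rows:

# If the live view excludes soft deleted rows, this count should now be 0
bq query --nouse_legacy_sql \
  'SELECT count(*) AS n FROM `broad-jade-dev-data.datarepo_example_dr_dataset.sample` WHERE sample_id = "sample_6"'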

Step 3: Re-Ingest Updated Data

The final step is to use the ingest API to re-ingest metadata pointing to the files whose metadata you're updating.

3.1 Build your updated JSON

Take the JSON returned by the query in step 1.2, drop the datarepo_row_id field, and modify the remaining values as shown in step 1.4 (or with whatever updated metadata you need).

JSON example

{
  "file": "494c60e1-ce54-478f-8b19-80b51dbb0cb0",
  "participant_id": "Jillian",
  "sample_id": "sample_6",
  "type": "normal"
}

3.2 Store the JSON in a cloud bucket 

Similarly to step 1.3, you'll want to stage the JSON above as a separate file in a Google bucket (again, make sure the TDR has been granted access to this bucket, as described in the article on ingesting data). In this example the path to this file would look something like this:

gs://staging-bucket/samples.upsert.json
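
As a minimal sketch, assuming the same example bucket and that the ingest API expects newline-delimited JSON (one record per line), the upsert file can be written and staged like this:

# Write the updated record as a single line of newline-delimited JSON (drop datarepo_row_id)
cat > samples.upsert.json <<'EOF'
{"file": "494c60e1-ce54-478f-8b19-80b51dbb0cb0", "participant_id": "Jillian", "sample_id": "sample_6", "type": "normal"}
EOF

# Stage the upsert file in the cloud bucket (example bucket name)
gsutil cp samples.upsert.json gs://staging-bucket/samples.upsert.json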

3.3 Re-ingest updated values

Use the ingest API to ingest the updated metadata from the samples.upsert.json file by executing the request below.

ingest request body example

{
 "format": "json",
 "ignore_unknown_values": true,
 "load_tag": "soft-delete-upsert-1",
 "max_bad_records": 0,
 "path": "gs://staging-bucket/samples.upsert.json",
 "profile_id": "390e7a85-d47f-4531-b612-165fc977d3bd",
 "resolve_existing_files": true,
 "table": "sample"
}

Note that "resolve_existing_files" is set to true because you may want to update the file ID.
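
As with the soft delete, the ingest request can also be submitted outside of the Swagger UI. This is only a sketch: it assumes the ingest endpoint is POST /api/repository/v1/datasets/{id}/ingest on the data.terra.bio host; confirm the exact host and path on your TDR Swagger page.

# Hypothetical host and endpoint path -- confirm both on your TDR Swagger page
DATASET_ID="d9f5c80b-6435-486d-93fd-e71a34b51686"

curl -X POST "https://data.terra.bio/api/repository/v1/datasets/${DATASET_ID}/ingest" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @ingest-request.json   # the request body shown above, saved locally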

How to validate updates

You should now be able to validate that the update worked by going to the dataset in the TDR UI and finding the re-populated row with the updated metadata. You can also confirm the update with the same BigQuery query used in step 1.2, as shown below.
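
# The re-ingested row should now show participant_id "Jillian"
bq query --format prettyjson --nouse_legacy_sql \
  'SELECT * FROM `broad-jade-dev-data.datarepo_example_dr_dataset.sample` WHERE sample_id = "sample_6"'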


