How to ingest and update TDR data with APIs

Anton Kovalsky

Learn how to use Swagger API endpoints to ingest new or updated data (including individual files or data tables) into the Terra Data Repository (TDR). To completely remove or replace data table rows, see How to soft-delete and re-ingest TDR data using APIs.

If you prefer not to use API endpoints and your data are stored in Google Cloud, see How to create a dataset and ingest data with Zebrafish and How to update TDR data with Zebrafish.

Overview: Ingesting data into TDR

Once you’ve created a TDR dataset, you’re ready to start ingesting data files (such as CRAM or FASTQ files) and populate the columns of your dataset’s tables. Ingesting makes data files available in your Data Repo datasets for snapshotting. 

Adding new data

To add rows to a TDR data table, use the ingestDataset API endpoint with the default updateStrategy (append). To add files to a TDR dataset, use the ingestFile API endpoint (for a single file) or the bulkFileLoad API endpoint (for multiple files).

Updating data in tables

There are three ways to use TDR's Swagger API endpoints to update a data table as the data evolve:

  1. Replace - Use the ingestDataset API endpoint to ingest data with a replace update strategy. This replaces entire row(s) of a data table with new data. For instructions, see Step 1. Set up the ingestDataset request body (JSON), under Ingest Data Tables below. Your table must have primary keys defined to use this strategy.
  2. Merge - Use the ingestDataset API endpoint to ingest data with a merge update strategy. This updates specific columns of existing rows while leaving the other columns untouched (a minimal merge sketch follows this list). For instructions, see Step 1. Set up the ingestDataset request body (JSON), under Ingest Data Tables below. Your table must have primary keys defined to use this strategy.
  3. Soft-delete and re-upload - Use the applyDatasetDataDeletion API endpoint to soft-delete specific rows, then use the ingestDataset API endpoint to upload new data. This approach requires more steps than the merge or replace approaches, but it is useful when you need to completely clear out your data, or when your table does not have primary keys. For instructions, see How to soft-delete and re-ingest TDR data using APIs.
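
For a concrete sense of the difference, here is a minimal sketch of a merge request body. The table and column names (a table named sample with primary key sample_id and a column processing_count) are hypothetical; the full set of parameters is explained under Ingest Data Tables below. Only the primary key and the columns being changed are supplied; all other columns keep their existing values.

{
  "format": "array",
  "table": "sample",
  "load_tag": "merge_processing_count_2023-02-15",
  "updateStrategy": "merge",
  "resolve_existing_files": true,
  "records": [
    {
      "sample_id": "NA12878",
      "processing_count": 3
    }
  ]
}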

Updating files

There are two ways to update TDR files:

  1. Replace or merge - Use the ingestDataset API endpoint with the replace or merge update strategy to update files as well as tabular data (see the sketch after this list). For instructions, see Step 1. Set up the ingestDataset request body (JSON), under Ingest Data Tables below.
  2. Delete and re-ingest the file - Alternatively, use the deleteFile endpoint to remove the file, then ingest the new version with the ingestFile endpoint.
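
As a rough illustration of the first option (again with hypothetical table and column names), a replace ingest that swaps in a new BAM for an existing sample might look like the sketch below. Because the strategy is replace, the full row, including the primary key and any columns you want to keep, must be supplied. Note that, per the resolve_existing_files behavior described later in this article, re-using an existing targetPath makes TDR keep the existing file, so the new file version is given a new targetPath here.

{
  "format": "array",
  "table": "sample",
  "load_tag": "replace_NA12878_bam_2023-02-16",
  "updateStrategy": "replace",
  "resolve_existing_files": true,
  "records": [
    {
      "sample_id": "NA12878",
      "sample_type": "WGS",
      "processing_count": 2,
      "BAM_File_Path": {
        "sourcePath": "gs://data-repo-ingest-site/NA12878.remapped.bam",
        "targetPath": "/file1/NA12878.remapped.bam",
        "description": "Updated BAM for NA12878"
      }
    }
  ]
}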

If you'd rather not use APIs to update your data, and your data are stored in the Google Cloud, see How to create a dataset and ingest data with Zebrafish and How to update TDR data with Zebrafish.

Updating your data doesn't update existing snapshots
When you create a snapshot of TDR data, the data in the snapshot are frozen in time. Adding or updating the data in your dataset will not affect the data in the snapshot.

Step 1. Stage your data in the Cloud

Before TDR can ingest data, the data need to be staged, or stored in the cloud (a Google bucket or an Azure blob container).

Staging your data on GCP vs. Azure
The ingest process is slightly different for data staged in Google Cloud versus Azure.

However, you don't have to use the same cloud to stage your data and host your TDR dataset:
- If your data are staged in Google Cloud, you can ingest them into either a Google- or an Azure-based TDR dataset.
- If your data are staged in Azure, you can currently only ingest them into an Azure-based TDR dataset.

Regardless of your cloud provider, you can stage your data in a Terra workspace or outside of Terra (in an external Google Bucket or Azure container).

  • To stage your TDR data in the Google Cloud, you can add it to a Terra workspace linked to a Google billing account, or to an external Google bucket. TDR will use a service account to copy your data from your bucket into a TDR dataset.

    Information to gather

    1. The billing profile UUID generated when creating your TDR spend profile.
      You can retrieve the billing profile UUID by executing the enumerateProfiles endpoint.
    2. The dataset UUID generated during the createDataset step.
      You can retrieve the dataset UUID by executing the enumerateDatasets endpoint.
    3. The TDR service account associated with your dataset.
      This account must have the Storage Object Viewer role on any GCP buckets used as a source for ingests (see instructions below).
    4. The path (gs:// URL) of the bucket where the data to be ingested are staged.

    How to connect the service account to your data

    1. Retrieve the service account

    Use the retrieveDataset endpoint to get the service account (in the “ingestServiceAccount” field of the response body) to use for ingestion. It will either be the global TDR service account or a dedicated TDR service account.

    Alternatively, you can find the service account under ingest service account on the dataset summary tab in TDR.

    Screenshot of an example TDR dataset's summary page. A red rectangle highlights the ingest service account.
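
    For reference, here is a trimmed sketch of what the relevant part of the retrieveDataset response might look like. The dataset name and service account shown are hypothetical placeholders, and other response fields are omitted:

    {
      "id": "your-dataset-UUID",
      "name": "example_dataset",
      "ingestServiceAccount": "datarepo-ingest@my-tdr-project.iam.gserviceaccount.com"
    }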

    2. Grant permissions on the bucket

    You need to give TDR permission to access the bucket where the data is staged. How you do this depends on whether the data is in Terra workspace storage or an external bucket.

    • If your data are staged in a Terra workspace:

      2.1. Share the workspace (Reader access) with the service account.

      Screenshot of the Share workspace modal with current collaborators acliffe@broadinstitute.org (owner) and the service account datarepo-jade-api@terra-datarepo-production.iam.gserviceaccount.com (reader).

    • If your data are staged in an external Google bucket:

      2.1. Go to the Google Cloud console storage browser and grant the service account the necessary permissions on the data bucket (Role > Storage Object Viewer).

      2.2. Add your Terra profile's proxy group email (which acts on behalf of your pet service account) to the data bucket permissions (Role > Storage Object Viewer).

      Screenshot of a Terra user profile page with the proxy group, found below the institution field, highlighted.

    Make sure to use the right service account
    To ingest your data, it must be accessible to the same service account that TDR used to create your dataset. By default, TDR creates a dedicated service account for your dataset, which makes the ingest process faster and more secure. In some cases, TDR may use a generic service account instead.

    In all cases, the ingest will not work unless you give the service account assigned to your dataset access to the cloud bucket where your data are staged.

  • To stage your data in Azure, you can add it to a Terra workspace linked to an Azure billing account, or to an external Azure storage container. Note that data staged in Azure can only be added to an Azure-based TDR dataset.

    Information to gather

    1. The billing profile UUID generated when creating your TDR spend profile.
      You can retrieve the billing profile UUID by executing the enumerateProfiles endpoint.
    2. The dataset UUID generated during the createDataset step.
      You can retrieve the dataset UUID by executing the enumerateDatasets endpoint.
    3. Signed URLs for each file you intend to include in your TDR dataset.
      A signed URL temporarily grants TDR access to a file during the ingest process.

      Keep in mind that, once the URL expires, you will have to generate a new signed URL in order for TDR to ingest a file.

    How to generate a signed URL

    • If your file is in an Azure-backed Terra workspace:

      1. Click the Browse workspace files icon on the right-hand panel of any workspace screen, then navigate through your folder structure using the menu on the left-hand panel to locate your file.

      Screenshot showing the Browse workspace files icon, which looks like a folder and is located below the workspace's Cloud Environment Configuration icon.

      Screenshot showing the files stored in an example Azure-backed workspace's storage. An orange rectangle highlights the menu on the left-hand side used to navigate the workspace storage folder structure.

      2. Click the file you want to ingest into TDR and locate the Terminal download command at the bottom of the window that appears.

      Screenshot of the file's download window showing the Terminal download command.

      From this command, copy only the URL (the portion in single quotes). This URL is valid for 8 hours.

    • If your file is in an external Azure storage container:

      1. Navigate to your storage container in the Azure portal.

      2. Locate your file, click the three dots to the right of the file name, and select Generate SAS.

      Screenshot showing how to open the Generate SAS menu for an example Azure file.

      3. Under Permissions, select Read.

      4. Specify when the signed URL should start and stop being valid (by default, the URL is valid for 8 hours).

      5. Click Generate SAS Token and URL.

      Screenshot showing how to generate a signed URL for an example Azure file. Orange rectangles highlight the dropdown menu used to assign read permissions, the menu used to set the URL's validity period, and the button used to generate the URL.
    The resulting signed URL will have the following format: https://[azure storage account].blob.core.windows.net/[file path]/[file name]?sv=[SAS token].
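
    Once you have a signed URL, you can use it wherever an ingest request needs a source path - for example, as the "sourcePath" of a fileref column in the ingestDataset records described below. A minimal, hypothetical records entry might look like the following (the storage account, container, file name, and SAS token are placeholders):

    {
      "subject_id": "Sub1",
      "BAM_File_Path": {
        "sourcePath": "https://mystorageaccount.blob.core.windows.net/mycontainer/Sub1.bam?sv=<SAS token>",
        "targetPath": "/file1/Sub1.bam",
        "description": "BAM for Sub1"
      }
    }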

Ingest Data Tables

Step 1. Set up the ingestDataset request body (JSON)

The ingestDataset API endpoint allows you to simultaneously ingest files (e.g., FASTQ files) and data stored in tables. This API first ingests the data files and then populates a table with new rows that include paths to the ingested files (DRS URIs). Note that you'll need to make one API request per table of data ingested.

Do you need primary keys?
In order to update your data with a replace or merge ingestDataset API job, your tables must have primary keys (columns used to uniquely identify each row in a table). A table's primary keys are specified in the dataset schema when creating the dataset and cannot be changed without re-creating the table. See How to write a TDR dataset schema to learn how to specify your primary keys.

If your tables do not have primary keys, you can still run an append ingestDataset job to ingest new data. However, this is not a good way to update existing rows, because append will not overwrite old data. If you need to update a data table without primary keys, see How to soft-delete and re-ingest TDR data using APIs for an alternative.
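
For reference, primary keys are declared per table in the dataset schema at creation time. The sketch below shows one way a table definition with a primary key might look; it is an assumption based on the createDataset schema model rather than something covered in this article, so verify the exact field names against the Swagger UI and How to write a TDR dataset schema.

{
  "tables": [
    {
      "name": "sample",
      "primaryKey": ["sample_id"],
      "columns": [
        { "name": "sample_id", "datatype": "string" },
        { "name": "sample_type", "datatype": "string" },
        { "name": "processing_count", "datatype": "integer" }
      ]
    }
  ]
}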

ingestDataset JSON

The base of the request body should be formatted as follows. Note that you'll format the data records (the metadata and data files to be ingested) in the next step.

{
  "format": "array",
  "load_tag": "/*ingest_to_example_table_2023-02-15_16-43*/",
  "table": "/*example_table*/",
  "updateStrategy": "append",
  "resolve_existing_files": true,
  "records": [/*{DATA RECORDS}*/],
  "bulkMode": false
}

JSON parameters

  • "load_tag" - a string unique to this load request, used for incremental retries.
  • "table" - specifies which table in your dataset you wish to ingest this set of data into.
  • "updateStrategy" has three options:
    • "append" (this is the default, if not specified)
      Adds your new data as new rows. Key information is ignored – that is, if other rows exist in your table with the same primary key(s), this results in additional rows added with the same primary key(s). Primary key(s) are not necessary to append rows.
    • "replace"
      Looks for other row(s) with the same primary key(s) and soft deletes them as it ingests the new rows, effectively replacing the old row(s) with the new. Any rows that don't match an existing row's primary key will be added to the table. Use this update strategy to replace data with "null" values. You must provide the full row of new data in the "records" field of the request body, including the primary keys, when using the "replace" update strategy. The target table must have primary key(s) defined to use "replace" to update existing rows in a data table. 
    • "merge"
      Use if you want to update a few fields in a row while leaving the remaining values untouched. In a merge ingest, the caller can ingest partial records: any fields left unset fall back to the existing values in the record matching the primary key(s). For each updated row, include the primary key and the columns you wish to update in the "records" field of the request body. The target table must have primary key(s) defined, and each updated row's key must match exactly 1 row in the target table. Do not use this update strategy to replace data with "null" values; instead, use "replace."
  • "resolve_existing_files" - tells TDR whether to allow files that may have already been ingested into your dataset without throwing an error. We suggest setting this to true; if you set it to false and try to ingest a file that you've previously ingested, TDR will throw an error. If set to true and this happens, TDR will not re-ingest the file and will use the existing file_id in the metadata. TDR performs a targetPath comparison to determine whether the file has already been ingested.
  • "records" - contains the metadata and file references to be ingested - see step 2.
  • "bulkMode" - loads data faster when ingesting a large number of files (e.g. more than 10,000 files) at once. The performance does come at the cost of some safeguards (such as guaranteed rollbacks and potential recopying of files) and it also forces exclusive locking of the dataset (i.e. you can’t run multiple ingests at once).
  • "format" - has three options for inputting records information:
    • "csv" - uses a separate file you write to your staging bucket. If you use this method, you should not include "records" but you do need to add a "path" parameter that includes the full path to the json file in the cloud. NOTE that CSV format ingests do not support concurrent file uploads; if you wish to use a CSV to ingest your metadata, you will need to first ingest your data files separately.
    • "array" is the easiest to use and is demonstrated here. With this option, you’ll provide all the metadata and file information (data records) as a nested JSON right in the request body - in the "records" parameter. If you use this, you will not specify the "path" parameter. While this solution is very convenient, for larger ingest jobs (e.g. more than 1000 records) it is recommended that you use the "json" or "csv" ingest formats.
    • "Json" - uses a separate file you write to your staging bucket. If you use this method, you should not include "records," but you will need to add a "path" parameter with the full path to the json file in the cloud. The format of the JSON object in this ingest method is newline delimited JSON (ND JSON).
  • Example nested JSON

    Inside the "records" field of your request body, you will create an array of json structures - one json structure for each row you want to ingest. See the example below for a nested JSON to populate data records (metadata and files to ingest) (the array and JSON options).

    Formatting a single data row's JSON

    Here's an example of metadata for a single row of data to be ingested into a TDR table.

    {
      "sample_id": "NA12878",
      "sample_type": "WGS",
      "processing_count": 2,
      "Collaborators": [],
      "BAM_File_Path": {
        "sourcePath": "gs://data-repo-ingest-site/NA12878.unmapped.bam",
        "targetPath": "/file1/NA12878.bam",
        "description": "BAM for NA12878"
      }
    }

    What’s in this example row?

    This row of data contains information for five data fields: sample_id, sample_type, processing_count, Collaborators, and BAM_File_Path.

    • "sample_id" is likely the primary key of the table.
    • "sample_type" is a metadata field with datatype string.
    • "processing_count" is a metadata field with datatype integer.
    • "Collaborators" is a metadata field with an array datatype; in this example it is an empty array.
    • "BAM_File_Path" is a metadata field with datatype fileref. Here you are also giving TDR the information about the file itself to be ingested. When ingesting a file, be sure to include the two parameters:
      • "sourcePath" - The URL path to the file that column will point to (BAMs, VCFs, etc). If your data are staged in a Google bucket, you can find this URL via the Google cloud console storage browser. If your data are staged in an Azure container, use a signed URL (see instructions in Step 1). When you submit your ingest, TDR will copy the file at "sourcePath" into TDR storage
      • "targetPath" - an arbitrary location within your Data Repo virtual file system. This value must begin with a forward slash ("/"). TDR will use this path as an alias that will be stored as metadata along with a file_id that TDR generates in the DRS record for this particular file; the file_id is what will be shown in the TDR dataset metadata, and it can be used with TDR APIs to retrieve the DRS record for the file. When you export this field in a snapshot to a workspace or preview a snapshot’s data in TDR, it will render as a DRS URI link; the DRS entry can be queried to retrieve both the targetPath file alias you provided here at ingest and the cloud-specific file access URL.
      • [optional] "description" - if you wish to provide a description for the file, you may do so. This will also be included in the DRS entry for the file.

    Example records object (array)

    The full "records" object is an array of (if desired) many single-line JSON objects; in this example, we include data for three rows, corresponding to samples NA12878, NA12879, and NA12880:

    [
      {
        "sample_id": "NA12878",
        "sample_type": "WGS",
        "processing_count": 2,
        "BAM_File_Path": {
          "sourcePath": "gs://data-repo-ingest-site/NA12878.unmapped.bam",
          "targetPath": "/file1/NA12878.bam",
          "description": "BAM for NA12878"
        }
      },
      {
        "sample_id": "NA12879",
        "sample_type": "WGS",
        "processing_count": 1,
        "BAM_File_Path": {
          "sourcePath": "gs://data-repo-ingest-site/NA12879.unmapped.bam",
          "targetPath": "/file1/NA12879.bam",
          "description": "BAM for NA12879"
        }
      },
      {
        "sample_id": "NA12880",
        "sample_type": "WGS",
        "processing_count": 4,
        "BAM_File_Path": {
          "sourcePath": "gs://data-repo-ingest-site/NA12880.unmapped.bam",
          "targetPath": "/file1/NA12880.bam",
          "description": "BAM for NA12880"
        }
      }
    ]

1.2. Once you've assembled the records metadata, copy it into the "records" field of your base ingestDataset request body.

ingestDataset request body (array format)

{
  "format": "array",
  "load_tag": "/*ingest_to_example_table_2023-02-15_16-43*/",
  "table": "/*example_table*/",
  "updateStrategy": "append",
  "resolve_existing_files": true,
  "records": [
    {
      "sample_id": "NA12878",
      "sample_type": "WGS",
      "processing_count": 2,
      "BAM_File_Path": {
        "sourcePath": "gs://data-repo-ingest-site/NA12878.unmapped.bam",
        "targetPath": "/file1/NA12878.bam",
        "description": "BAM for NA12878"
      }
    },
    {
      "sample_id": "NA12879",
      "sample_type": "WGS",
      "processing_count": 1,
      "BAM_File_Path": {
        "sourcePath": "gs://data-repo-ingest-site/NA12879.unmapped.bam",
        "targetPath": "/file1/NA12879.bam",
        "description": "BAM for NA12879"
      }
    },
    {
      "sample_id": "NA12880",
      "sample_type": "WGS",
      "processing_count": 4,
      "BAM_File_Path": {
        "sourcePath": "gs://data-repo-ingest-site/NA12880.unmapped.bam",
        "targetPath": "/file1/NA12880.bam",
        "description": "BAM for NA12880"
      }
    }
  ]
}
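
For comparison, if you use the "json" (or "csv") format described above, the records live in a file in your staging bucket and the request body points to that file with a "path" parameter instead of including "records". Below is a minimal sketch with a hypothetical bucket and file name; for the "json" format, the staged file is newline-delimited JSON, one record per line, each record shaped like the single-row example above.

{
  "format": "json",
  "load_tag": "/*ingest_to_example_table_2023-02-15_16-43*/",
  "table": "/*example_table*/",
  "path": "gs://data-repo-ingest-site/example_table_records.json",
  "updateStrategy": "append",
  "resolve_existing_files": true,
  "bulkMode": false
}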

Troubleshooting JSON formatting
Getting all of the brackets just right in a nested JSON like this can be a little tricky! To avoid misplacing a comma or a bracket, try a free online JSON validator.

If you're struggling to create a valid JSON, it may help to copy and paste the example code in the Swagger UI request body for that particular API. You can then make changes to the template incrementally, validating each change.

Finally, certain parameters - such as "tables", "relationships", and "assets" - are expected to be lists, so make sure you include square brackets: [ ].

Step 2. Call the ingestDataset API

Remember to authorize Swagger every time you use it
See How to authenticate/troubleshoot Swagger for TDR for step-by-step instructions.

2.1. Copy the entire request body into the "ingest request" field in the ingestDataset Swagger endpoint.

2.2. Make sure you've also copied the dataset_id into the "id" field.

2.3. Click "Execute".

When you ingest a file to TDR, several things happen
1. TDR copies the file into a TDR-managed bucket.
2. TDR creates a unique file_id that's stored in the metadata for this field.
3. When a snapshot is created that includes this field, TDR creates a DRS record for the file.

The DRS URI can be used to retrieve metadata and access information about the file. Read more about DRS here.

  • A file reference is a link to a specific file's location in the cloud, which can be used to read or write data during an analysis. Referencing files from a TDR data table keeps your data organized and makes it easy to access those files in a downstream analysis.

    To add a file reference to a data table and upload the file to your TDR dataset at the same time, follow the "ingestDataset request body (array format)" example in Step 1. Set up the ingestDataset request body (JSON). When creating the data table, be sure to set the relevant column's type to fileref.

    To add a file reference to a data table without uploading the file to your TDR dataset:

    1. When creating the data table, set the relevant column's type to fileref

    2. Collect your file's URL or UUID, indicating its location in the cloud. See "How to find your file's URL or UUID" in Step 1. Set up the request body (JSON), under Ingest Data Files below, to learn how to identify your file's URL or UUID.

    3. Ingest your data table using the ingestDataset API endpoint. In the records section of the request body, specify your file's signed URL, URI, or UUID as in the "file_reference" field in this example:

    {
      "format": "array",
      "load_tag": "adding file for subject 1",
      "table": "subject",
      "updateStrategy": "append",
      "resolve_existing_files": true,
      "records": [
        {
          "subjectID": "Sub1",
          "file_reference": "81157a66-c3b7-497e-9d2e-ec5b99ed0613"
        }
      ]
    }

What to expect

A successful request will return a 200 response that includes a job identifier ("id" in the response JSON). You can use the retrieveJob and retrieveJobResult endpoints to check that your job was completed successfully and troubleshoot any errors.
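
As a rough guide only, the job response looks something like the sketch below. The exact fields come from the TDR job model, so treat these names as assumptions and check the retrieveJob response schema in the Swagger UI; the "id" value is what you pass to the retrieveJob and retrieveJobResult endpoints.

{
  "id": "a1b2c3d4-5678-90ab-cdef-112233445566",
  "job_status": "running",
  "status_code": 202,
  "description": "Ingest from array to example_table"
}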

Once your job has finished running, you will also be able to see your tabular data on the TDR website:

1. Log into https://data.terra.bio/

2. In the datasets tab, click on your dataset’s name.

3. You’ll see a summary of your dataset, including files, tables, and their columns:

Screenshot of an example dataset on the Terra Data Repository's web interface.

4. Click on view dataset data to view your data tables.

Troubleshooting ingest errors

  • If your ingest job fails with an error about primary keys, check that you specified primary keys for the data you ingested, and that there are no duplicate rows with the same primary keys in your data.

Ingest Data Files

If you're only ingesting data files, follow the instructions in this section. If you're ingesting data files and data tables at the same time, follow the instructions in Ingest Data Tables.

Step 1. Set up the request body (JSON)

To ingest a single file, use the ingestFile API endpoint. To ingest multiple files, use the bulkFileLoad or bulkFileLoadArray API endpoint (a sketch of a bulkFileLoadArray request body appears at the end of this step).

ingestFile JSON

Format the ingestFile request body to ingest a single file as follows:

{
  "source_path": "/*gs URI, signed URL, or TDR UUID for the file's location in the cloud*/",
  "target_path": "/*full path to the file's ultimate location in the TDR dataset, starting with '/'*/",
  "profileId": "/*your TDR dataset's default billing profile ID*/",
  "description": "/*optional description of the file*/"
}

For example:

{
  "source_path": "gs://fc-22d0c32d-c79a-4569-9922-425259dc49f9/data_files/subject1.bam",
  "target_path": "/files/subject1.bam",
  "profileId": "/*your TDR dataset's default billing profile ID*/",
  "description": "subject 1 bam file"
}
  • How to find your file's URL or UUID

    If your file is already stored in TDR, use the listFiles API endpoint to list all of the files in your TDR dataset, then locate your file's UUID in the fileId field.

    If your file is stored outside of TDR in a Google-backed Terra workspace, locate your file in the Files section of the workspace's Data tab, then click the clipboard icon to the right of the file name to copy its URI.

    If your file is stored in Google Cloud, outside of TDR and Terra, locate your file in the Google Cloud console. Click the three-dot icon at the right-hand end of your file's row, then click Copy gsutil URI to copy the file's URI to your clipboard.

    If your file is stored in Azure, follow the instructions in Step 1. Stage your data in the Cloud to generate a signed URL.
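
This article does not show the bulk-load request bodies. As a rough sketch only (the field names here are assumptions based on the bulkFileLoadArray Swagger model, so verify them in the Swagger UI before use), a bulkFileLoadArray request that ingests two files might look like this:

{
  "profileId": "/*your TDR dataset's default billing profile ID*/",
  "loadTag": "bulk_load_2023-02-15",
  "maxFailedFileLoads": 0,
  "loadArray": [
    {
      "sourcePath": "gs://fc-22d0c32d-c79a-4569-9922-425259dc49f9/data_files/subject1.bam",
      "targetPath": "/files/subject1.bam",
      "description": "subject 1 bam file"
    },
    {
      "sourcePath": "gs://fc-22d0c32d-c79a-4569-9922-425259dc49f9/data_files/subject2.bam",
      "targetPath": "/files/subject2.bam",
      "description": "subject 2 bam file"
    }
  ]
}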

Step 2. Call the API

Remember to authorize Swagger every time you use it
See How to authenticate/troubleshoot Swagger for TDR for step-by-step instructions.

2.1. Copy the entire request body into the "ingest request" field in the ingestFile Swagger API endpoint.

2.2. Make sure you've also copied the dataset_id into the "id" field.

2.3. Click "Execute".

When you ingest a file to TDR, several things happen
1. TDR copies the file into a TDR-managed bucket.
2. TDR creates a unique file_id that's stored in the metadata for this field.
3. When a snapshot is created that includes this field, TDR creates a DRS record for the file.

The DRS URI can be used to retrieve metadata and access information about the file. Read more about DRS here.

What to expect

A successful request will return a 200 response that includes a job identifier ("id" in the response JSON). You can use the retrieveJob and retrieveJobResult endpoints to check that your job was completed successfully and troubleshoot any errors.

Once your job has finished running, you will also be able to see your dataset's files by running the listFiles endpoint.

Next Steps

Once you've ingested data into your TDR dataset, you're off and running! Next, you can continue adding to and editing your dataset. Once the data are ready, you can Create a snapshot to share your data and analyze your data in a workflow.

 
