Learn how to ingest data into the Terra Data Repository (TDR). Read on for instructions on using the Swagger UI to ingest individual files or files in bulk, and how to ingest files and metadata together.
Remember to authorize Swagger every time you use it
This article includes instructions on using API commands through the Swagger UI. All Swagger-based instructions require you to first authenticate yourself whenever you open a new window with the Swagger UI.
Instructions
Click “Authorize” near the top of the page, check all of the boxes in the pop-up, hit “Authorize” again, and then input the appropriate credentials to authenticate. Make sure you close the subsequent pop-up without clicking the “Sign Out” button.
You should now be able to execute the commands below by clicking the “Try it out” button next to the command of your choice. For a more detailed description of this authentication step, see this article on Authenticating in Swagger.
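The steps above cover the Swagger UI. If you prefer to script the same calls, you can pass your Google credentials as a bearer token instead. Below is a minimal sketch using Python and the google-auth library; it only obtains a token and builds the authorization header, and it assumes Application Default Credentials are configured (for example, via gcloud auth application-default login).
Authenticating outside the Swagger UI (Python sketch)
# Minimal sketch: obtain a Google access token for TDR API calls made outside the Swagger UI.
import google.auth
import google.auth.transport.requests

SCOPES = [
    "https://www.googleapis.com/auth/userinfo.email",
    "https://www.googleapis.com/auth/userinfo.profile",
]

credentials, _ = google.auth.default(scopes=SCOPES)
credentials.refresh(google.auth.transport.requests.Request())

# Send this header with every TDR API request, just as Swagger does once you authorize.
HEADERS = {"Authorization": f"Bearer {credentials.token}"}
The later sketches in this article reuse this HEADERS value wherever a bearer token is needed.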
Ingesting overview
Once you’ve created a dataset, you’re ready to start ingesting data files (such as CRAM or FASTQ files) and populating the columns of your dataset’s tables. Ingesting makes data files available in your Data Repo datasets for snapshotting.
There are two parts to ingesting data to use in the Terra Data Repo:
- Upload the genomic data files (e.g., BAM files) to the virtual file system
- Populate the dataset's tables with associated metadata (sample IDs, data file IDs, links to genomic data files, etc.)
The Terra Data Repo combines these steps into a single ingest job.
Before you start - Setup prerequisites
Stage data in a Google bucket
Before TDR can ingest data, it needs to be staged in cloud storage (a Google bucket or an Azure blob container). TDR will use the bucket or container UUID to find and copy the data into TDR storage.
The staging area can be either:
- An external Google bucket
- Workspace storage (the Google bucket that backs a Terra workspace)
Parameter values you will need in this step
- The billing profile UUID (generated when creating your spend profile). You can retrieve the billing profile UUID by executing the enumerateProfiles endpoint (see the sketch after this list).
- The dataset UUID (generated during the createDataset step). You can retrieve the dataset UUID by executing the enumerateDatasets endpoint.
- The TDR service account associated with your dataset. This account must have the Storage Object Viewer role on any GCP buckets used as a source for ingests (see instructions below).
- The UUID for the bucket where the data to be ingested is staged.
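If you'd rather look up the billing profile and dataset UUIDs programmatically than through the Swagger UI, here is a minimal sketch using Python and the requests library. The base URL and endpoint paths are assumptions based on the TDR Swagger page (enumerateProfiles and enumerateDatasets); confirm them against your environment, and reuse the HEADERS value from the authentication sketch above.
Looking up the billing profile and dataset UUIDs (Python sketch)
# Sketch: list billing profiles and datasets to find their UUIDs (endpoint paths assumed).
import requests

TDR_BASE = "https://data.terra.bio"                    # assumed production base URL
HEADERS = {"Authorization": "Bearer <access token>"}   # see the authentication sketch above

# enumerateProfiles: each item's "id" is a billing profile UUID
profiles = requests.get(f"{TDR_BASE}/api/resources/v1/profiles", headers=HEADERS).json()
for profile in profiles.get("items", []):
    print("billing profile:", profile["profileName"], profile["id"])

# enumerateDatasets: each item's "id" is a dataset UUID
datasets = requests.get(f"{TDR_BASE}/api/repository/v1/datasets", headers=HEADERS).json()
for dataset in datasets.get("items", []):
    print("dataset:", dataset["name"], dataset["id"])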
Note the updated service account procedure
Datasets may have their own dedicated service accounts for ingestion. Previously, all datasets used the global TDR service account (datarepo-jade-api@terra-datarepo-production.iam.gserviceaccount.com).
How to set up the dataset's service account (prerequisite 3)
1. Retrieve the service account
Use the retrieveDataset endpoint to get the service account (in the “ingestServiceAccount” field of the response body) to use for ingestion. It will either be the global TDR service account or a dedicated TDR service account.
Alternatively, you can find the service account on the dataset summary tab in the TDR.
2. Grant permissions on the bucket
You need to give TDR permission to access the bucket where the data is staged. How you do this depends on whether the data is in Terra workspace storage or an external bucket.
If the data is staged in Terra workspace storage:
2.1. Share the workspace (Reader access) with the service account.
If the data is staged in an external Google bucket:
2.1. Go to the Google Cloud console storage browser and grant the service account the necessary permissions on the data bucket (Role > Storage Object Viewer).
2.2. Add your Terra profile's proxy email (for your pet service account) to the data bucket permissions (Role > Storage Object Viewer).
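If you prefer to grant the bucket role programmatically rather than through the console, here is a minimal sketch using the google-cloud-storage Python client. The bucket name and service account below are placeholders; substitute the values from your own dataset, and note that you need permission to modify the bucket's IAM policy.
Granting Storage Object Viewer on the staging bucket (Python sketch)
# Sketch: grant the TDR ingest service account read access to an external staging bucket.
from google.cloud import storage

BUCKET_NAME = "my-staging-bucket"  # placeholder staging bucket
INGEST_SA = "<ingestServiceAccount from retrieveDataset>"  # placeholder service account

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({
    "role": "roles/storage.objectViewer",
    "members": {f"serviceAccount:{INGEST_SA}"},
})
bucket.set_iam_policy(policy)
print(f"Granted roles/storage.objectViewer to {INGEST_SA} on gs://{BUCKET_NAME}")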
When using the generic TDR service account (caveats)
Set the dedicatedIngestServiceAccount parameter to "false" in the dataset create endpoint. If dedicatedIngestServiceAccount=true, the general TDR service account will NOT work.
Ingest files and metadata together
To ingest files and metadata at the same time, you can use a single ingestDataset API endpoint request. This API first ingests the data files and then populates a table with new rows that include paths to the ingested files (DRS URIs).
Note that you'll need to make one request per table of data ingested.
Step 1. Set up base IngestDataset request body (JSON)
The base of the request body should be formatted as follows. Note that you’ll format the data records (the metadata and data files to be ingested) in the next step.
ingestDataset JSON
{
"format": "array",
"load_tag": "/*ingest_to_example_table_2023-02-15_16-43*/",
"table": "/*example_table*/",
"updateStrategy": "append",
"resolve_existing_files": true,
"records": [/*{DATA RECORDS}*/],
"bulkMode": false
}
JSON parameters
- "load_tag" - a string unique to this load request, used for incremental retries.
- "table" - specifies which table in your dataset you wish to ingest this set of data into.
- "updateStrategy" has three options:
- "append" (this is the default, if not specified)
Adds your new data as new rows. Key information is ignored – that is, if other rows exist in your table with the same primary key(s), this results in additional rows added with the same primary key(s). - "replace"
Looks for other row(s) with the same primary key(s) and soft deletes them as it ingests the new rows, effectively replacing the old row(s) with the new. The target table must have primary key(s) defined. The source data must not contain duplicate primary key values. - "merge"
Use if you want to update a few fields in a dataset.table.row while leaving the remaining values untouched. In a merge ingest, the caller can ingest partial records: any fields left unset fall back to the existing record matching its primary key(s). The target table must have primary key(s) defined. Each ingest row must have its primary key(s) specified, and match exactly 1 row in the target table. The source data must not contain duplicate primary key values.
- "append" (this is the default, if not specified)
- "resolve_existing_files" - tells TDR whether to allow files that may have already been ingested into your dataset without throwing an error. We suggest setting this to true; if you set it to false and try to ingest a file that you've previously ingested, TDR will throw an error. If set to true and this happens, TDR will not re-ingest the file and will use the existing file_id in the metadata. TDR performs a targetPath and checksum comparison to determine whether the file has already been ingested.
- "records" - contains the metadata and file references to be ingested - see step 2.
- "bulkMode" - loads data faster when ingesting a large number of files (e.g. more than 10,000 files) at once. The performance does come at the cost of some safeguards (such as guaranteed rollbacks and potential recopying of files) and it also forces exclusive locking of the dataset (i.e. you can’t run multiple ingests at once).
- "format" - has three options for inputting records information:
- "csv" - uses a separate file you write to your staging bucket. If you use this method, you should not include "records" but you do need to add a "path" parameter that includes the full path to the json file in the cloud. NOTE that CSV format ingests do not support concurrent file uploads; if you wish to use a CSV to ingest your metadata, you will need to first ingest your data files separately.
- "array" is the easiest to use and is demonstrated here. With this option, you’ll provide all the metadata and file information (data records) as a nested JSON right in the request body - in the "records" parameter. If you use this, you will not specify the "path" parameter. While this solution is very convenient, for larger ingest jobs (e.g. more than 1000 records) it is recommended that you use the "json" or "csv" ingest formats.
- "Json" - uses a separate file you write to your staging bucket. If you use this method, you should not include "records," but you will need to add a "path" parameter with the full path to the json file in the cloud. The format of the JSON object in this ingest method is newline delimited JSON (ND JSON).
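For reference, here is a sketch of what a "json"-format request might look like. The staging-bucket path is a placeholder, and the control file it points to holds one record per line, using the same record structure shown in the examples that follow.
ingestDataset JSON (json format, sketch)
{
"format": "json",
"load_tag": "/*ingest_to_example_table_2023-02-15_16-43*/",
"table": "/*example_table*/",
"updateStrategy": "append",
"resolve_existing_files": true,
"path": "gs://my-staging-bucket/example_table_records.json",
"bulkMode": false
}
Here, gs://my-staging-bucket/example_table_records.json is a newline-delimited JSON file, for example:
{"sample_id": "NA12878", "sample_type": "WGS", "processing_count": 2, "BAM_File_Path": {"sourcePath": "gs://data-repo-ingest-site/NA12878.unmapped.bam", "targetPath": "/file1/NA12878.bam"}}
{"sample_id": "NA12879", "sample_type": "WGS", "processing_count": 1, "BAM_File_Path": {"sourcePath": "gs://data-repo-ingest-site/NA12879.unmapped.bam", "targetPath": "/file1/NA12879.bam"}}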
Example nested JSON
Inside the "records" field of your request body, you will create an array of json structures - one json structure for each row you want to ingest. See the example below for a nested JSON to populate data records (metadata and files to ingest) (the array and JSON options).
Formatting a single data row's JSON
Here's an example of metadata for a single row of data to be ingested into a TDR table.
{
"sample_id": "NA12878",
"sample_type": "WGS",
"processing_count": 2,
"Collaborators": [
],
"BAM_File_Path": {
"sourcePath": "gs://data-repo-ingest-site/NA12878.unmapped.bam",
"targetPath": "/file1/NA12878.bam",
"description": "BAM for NA12878"
}
}
What’s in this example row?
This row of data contains information for five data fields: sample_id, sample_type, processing_count, Collaborators, and BAM_File_Path.
- "sample_id" is likely the primary key of the table.
- "sample_type" is a metadata field with datatype string.
- "processing_count" is a metadata field with datatype int.
- "Collaborators" is a metadata field with an array datatype; the empty array in this example means no values are ingested for that field.
- "BAM_File_Path" is a metadata field with datatype fileref. Here you are also giving TDR the information about the file itself to be ingested. When ingesting a file, be sure to include these two parameters:
- "sourcePath" - The URL path to the file that column will point to (BAMs, VCFs, etc), which you can find via the Google cloud console storage browser. When you submit your ingest, TDR will copy the file at "sourcePath" into TDR storage
- "targetPath" - an arbitrary location within your Data Repo virtual file system. This value must begin with a forward slash ("/"). TDR will use this path as an alias that will be stored as metadata along with a file_id that TDR generates in the DRS record for this particular file; the file_id is what will be shown in the TDR dataset metadata, and it can be used with TDR APIs to retrieve the DRS record for the file. When you export this field in a snapshot to a workspace or preview a snapshot’s data in TDR, it will render as a DRS URI link; the DRS entry can be queried to retrieve both the targetPath file alias you provided here at ingest and the cloud-specific file access URL.
- [optional] "description" - if you wish to provide a description for the file, you may do so. This will also be included in the DRS entry for the file.
Example records object (array)
The full "records" object is an array of (if desired) many single-line JSON objects; in this example, we include data for three rows, corresponding to samples NA12878, NA12879, and NA12880:
[
{
"sample_id": "NA12878",
"sample_type": "WGS",
"processing_count": 2,
"BAM_File_Path": {
"sourcePath": "gs://data-repo-ingest-site/NA12878.unmapped.bam",
"targetPath": "/file1/NA12878.bam",
"description": "BAM for NA12878"
}
},
{
"sample_id": "NA12879",
"sample_type": "WGS",
"processing_count": 1,
"BAM_File_Path": {
"sourcePath": "gs://data-repo-ingest-site/NA12879.unmapped.bam",
"targetPath": "/file1/NA12879.bam",
"description": "BAM for NA12879"
}
},
{
"sample_id": "NA12880",
"sample_type": "WGS",
"processing_count": 4,
"BAM_File_Path": {
"sourcePath": "gs://data-repo-ingest-site/NA12880.unmapped.bam",
"targetPath": "/file1/NA12880.bam",
"description": "BAM for NA12880"
}
}
]
1.2. Once you've assembled the records metadata, copy it into the "records" array in your base IngestDataset request body.
IngestDataset request body (array format)
{
"format": "array",
"load_tag": "/*ingest_to_example_table_2023-02-15_16-43*/",
"table": "/*example_table*/",
"updateStrategy": "append",
"resolve_existing_files": true,
"records": [
{
"sample_id": "NA12878",
"sample_type": "WGS",
"processing_count": 2,
"BAM_File_Path": {
"sourcePath": "gs://data-repo-ingest-site/NA12878.unmapped.bam",
"targetPath": "/file1/NA12878.bam",
"description": "BAM for NA12878"
}
},
{
"sample_id": "NA12879",
"sample_type": "WGS",
"processing_count": 1,
"BAM_File_Path": {
"sourcePath": "gs://data-repo-ingest-site/NA12879.unmapped.bam",
"targetPath": "/file1/NA12879.bam",
"description": "BAM for NA12879"
}
},
{
"sample_id": "NA12880",
"sample_type": "WGS",
"processing_count": 4,
"BAM_File_Path": {
"sourcePath": "gs://data-repo-ingest-site/NA12880.unmapped.bam",
"targetPath": "/file1/NA12880.bam",
"description": "BAM for NA12880"
}
}
]
}
Step 2: Call the ingestDataset API
2.1. Copy the entire request body into the "ingest request" field in the IngestDataset Swagger endpoint.
2.2. Make sure you've also copied the dataset_id into the "id" field.
2.3. Click "Execute".
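If you'd rather submit the same request outside the Swagger UI, here is a minimal Python sketch. The base URL and endpoint path are assumptions based on the ingestDataset entry on the Swagger page; the dataset UUID, token, and records are placeholders drawn from the examples above.
Submitting the ingest request programmatically (Python sketch)
# Sketch: call ingestDataset directly (endpoint path assumed: POST /api/repository/v1/datasets/{id}/ingest).
import requests

TDR_BASE = "https://data.terra.bio"                    # assumed production base URL
DATASET_ID = "<your dataset UUID>"                     # the "id" field in Swagger
HEADERS = {"Authorization": "Bearer <access token>"}   # see the authentication sketch above

ingest_request = {
    "format": "array",
    "load_tag": "ingest_to_example_table_2023-02-15_16-43",
    "table": "example_table",
    "updateStrategy": "append",
    "resolve_existing_files": True,
    "records": [
        {
            "sample_id": "NA12878",
            "sample_type": "WGS",
            "processing_count": 2,
            "BAM_File_Path": {
                "sourcePath": "gs://data-repo-ingest-site/NA12878.unmapped.bam",
                "targetPath": "/file1/NA12878.bam",
            },
        },
    ],
    "bulkMode": False,
}

response = requests.post(
    f"{TDR_BASE}/api/repository/v1/datasets/{DATASET_ID}/ingest",
    headers=HEADERS,
    json=ingest_request,
)
response.raise_for_status()
print("Ingest job id:", response.json()["id"])  # use with retrieveJob / retrieveJobResult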
When you ingest a file to TDR, several things happen:
1. TDR copies the file into a TDR-managed bucket.
2. TDR creates a unique file_id that's stored in the metadata for this field.
3. When a snapshot is created that includes this field, TDR creates a DRS record for the file.
The DRS URI can be used to retrieve metadata and access information about the file. Read more about DRS here.
What to expect
A successful request will return a 200 response that includes a job identifier ("id" in the response JSON). You can use the retrieveJob and retrieveJobResult endpoints to check that your job was completed successfully and troubleshoot any errors.
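If you submitted the job outside the Swagger UI, you can poll the same endpoints programmatically. Here is a minimal sketch; the endpoint paths and the "job_status" field are assumptions based on the retrieveJob and retrieveJobResult entries on the Swagger page, and the job ID and token are placeholders.
Polling the ingest job (Python sketch)
# Sketch: poll retrieveJob until the job finishes, then fetch retrieveJobResult (paths assumed).
import time
import requests

TDR_BASE = "https://data.terra.bio"                    # assumed production base URL
JOB_ID = "<job id from the ingest response>"           # placeholder
HEADERS = {"Authorization": "Bearer <access token>"}   # see the authentication sketch above

while True:
    job = requests.get(f"{TDR_BASE}/api/repository/v1/jobs/{JOB_ID}", headers=HEADERS).json()
    if job.get("job_status") != "running":             # expect "succeeded" or "failed"
        break
    time.sleep(10)

result = requests.get(f"{TDR_BASE}/api/repository/v1/jobs/{JOB_ID}/result", headers=HEADERS)
print(job.get("job_status"), result.json())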