How to ingest data into the Terra Data Repo

Anton Kovalsky

Learn how to ingest data into the Terra Data Repository (TDR). Read on for instructions on using the Swagger UI to ingest individual files, ingest files in bulk, or combine file and metadata ingestion in a single step.

Once you’ve created a dataset, you’re ready to start ingesting data and populating the columns of your dataset tables. Ingesting data makes those files visible in your Data Repo datasets and snapshots.

There are two parts to ingesting data to use in the Terra Data Repo:

  • Step 1: Uploading the genomic data files (e.g., BAM files) to the virtual file system
  • Step 2: Populating the dataset's tables with associated metadata (sample IDs, data file IDs, links to genomic data files, etc.)

You can do these steps separately or at the same time.

Separating the steps can be useful when you're starting datasets from scratch. It helps you test your schema, and lets you add files in small batches to make sure the ingestion process works properly. Once your ingestion strategy is in place, you can start creating more complex JSON objects that include the metadata and the bucket addresses for the files themselves all in one.

Step 0. Before you start - Setup prerequisites

Remember to authorize Swagger every time you use it. This article includes instructions on using API commands through the Swagger UI. All Swagger-related instructions require you to first authenticate yourself each time you open a window with the Swagger UI.

Instructions
Click “Authorize” near the top of the page, check all of the boxes in the pop-up, hit “Authorize” again, and then input the appropriate credentials to authenticate. Make sure you close the subsequent pop-up without clicking the “Sign Out” button.

You should now be able to execute the commands below by clicking the “Try it out” button next to the command of your choice. For a more detailed description of this authentication step, see this article on Authenticating in Swagger.
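If you later want to call the same endpoints from a script instead of the Swagger UI, you'll need the same kind of OAuth bearer token. Here is a minimal sketch, assuming the gcloud CLI is installed and logged in with the Google account you use for Terra; the Python examples later in this article reuse this pattern.

# Minimal sketch: obtain the same OAuth bearer token the Swagger UI uses,
# so you can call TDR endpoints from a script instead of the browser.
# Assumes the gcloud CLI is installed and you have run `gcloud auth login`
# with the account that is registered in Terra/TDR.
import subprocess

def get_bearer_token() -> str:
    """Return a Google OAuth access token for the currently logged-in gcloud user."""
    return subprocess.run(
        ["gcloud", "auth", "print-access-token"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

# Every scripted request would then send this header:
# headers = {"Authorization": f"Bearer {get_bearer_token()}"}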

Prerequisites

To follow the ingestion steps below, you need to gather three prerequisites.

  1. Find the profile ID you generated when creating your spend profile.
  2. Find the UUID you generated during the createDataset step.
  3. Grant the appropriate service account the Storage Object Viewer role on the bucket you're ingesting from, using the Google Cloud console storage browser (see the note below on which service account to use).

Note the updated service account procedure: Datasets can now create their own dedicated service accounts for ingestion. They no longer need to use the global TDR service account (datarepo-jade-api@terra-datarepo).

Previously, you had to grant the TDR service account "Storage Object Viewer" access to the bucket from which you're ingesting data. Now, you can use the dataset's dedicated service account.

To use the dataset's dedicated service account (NEW!):

  • 3.1. Use the retrieveDataset endpoint to get the dedicated service account to use for ingestion.

  • 3.2. Add the dedicated service account to the data bucket permissions (Role > Storage Object Viewer).

To use the global TDR service account instead:

  • 3.1. Set the dedicatedIngestServiceAccount parameter to "false" in the dataset create endpoint. If dedicatedIngestServiceAccount=true, the TDR service account will NOT work.

  • 3.2. If dedicatedIngestServiceAccount=false, you can add the TDR service account (datarepo-jade-api@terra-datarepo) to the data bucket as before (Role > Storage Object Viewer).

  • 3.3. Add your personal proxy email for your Terra profile's pet service account to the data bucket permissions (Role > Storage Object Viewer).
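If you prefer to look up the dedicated service account from a script rather than the Swagger UI, here is a minimal sketch. The host (data.terra.bio) and the retrieveDataset path are assumptions based on the public TDR Swagger page; confirm them in the Swagger UI you're using, and look for the ingest service account field in the JSON that comes back (the exact field name may vary between TDR versions).

# Sketch only: look up a dataset's dedicated ingest service account.
# Assumptions: the TDR host is data.terra.bio and retrieveDataset is exposed
# as GET /api/repository/v1/datasets/{id}; confirm both in your Swagger UI.
import json
import subprocess
import requests

TDR_HOST = "https://data.terra.bio"       # assumption; use your TDR instance
DATASET_ID = "your-dataset-uuid-here"     # the UUID from the createDataset step

token = subprocess.run(
    ["gcloud", "auth", "print-access-token"],
    capture_output=True, text=True, check=True,
).stdout.strip()

resp = requests.get(
    f"{TDR_HOST}/api/repository/v1/datasets/{DATASET_ID}",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()

# Print the full dataset description and look for the dedicated ingest
# service account field (check the field name against your Swagger output).
print(json.dumps(resp.json(), indent=2))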

Step 1. Ingest data files

When you are done ingesting the data files, you will move to Step 2: Ingesting metadata tables (filling out table columns). 

Note that you can combine these two steps; see the Alternative: Ingest files and metadata together section below.

Step 1.1. Ingest individual data files (option 1)

Use the ingestFile API endpoint.

ingestFile request body

{
  "source_path": "gs://test-bucket/path/to/a/file.file_name",
  "target_path": "/metrics/file.file_name",
  "profileId": "/* the profile id you generated when you created your spend profile */",
  "description": "A delicious file to ingest"
}

Request body parameters

  • source_path: The gs://[source path] of the object to ingest
  • target_path: The arbitrary path the file will occupy in the dataset’s virtual file system
  • profileId: Your spend profile ID

The response body from a successful request will include

  • An “id” field that can be used to track the status of the job, using the retrieveJob API endpoint.
  • A single file ingest “fileID” field.

The fileID field

The value of this field is the ID of the file within the virtual filesystem of the dataset, which refers to the file from within the dataset’s tabular data.

You will use the fileID in the CSV you ingest to populate the tables of the dataset (step 2).
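If you'd rather submit the same request from a script than through Swagger, here is a minimal sketch. The host (data.terra.bio) and the endpoint path POST /api/repository/v1/datasets/{id}/files are assumptions taken from the Swagger page; confirm them in the Swagger UI before relying on this.

# Sketch only: submit a single-file ingest request programmatically.
# Assumptions: TDR host is data.terra.bio and ingestFile is exposed as
# POST /api/repository/v1/datasets/{id}/files; confirm in the Swagger UI.
import subprocess
import requests

TDR_HOST = "https://data.terra.bio"       # assumption; use your TDR instance
DATASET_ID = "your-dataset-uuid-here"
PROFILE_ID = "your-billing-profile-uuid"

token = subprocess.run(
    ["gcloud", "auth", "print-access-token"],
    capture_output=True, text=True, check=True,
).stdout.strip()

body = {
    "source_path": "gs://test-bucket/path/to/a/file.file_name",
    "target_path": "/metrics/file.file_name",
    "profileId": PROFILE_ID,
    "description": "A delicious file to ingest",
}

resp = requests.post(
    f"{TDR_HOST}/api/repository/v1/datasets/{DATASET_ID}/files",
    headers={"Authorization": f"Bearer {token}"},
    json=body,
)
resp.raise_for_status()

# The response includes a job "id" you can poll with the retrieveJob endpoint;
# the job result will contain the fileID used in Step 2.
print(resp.json())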

Step 1.2. Ingest files in bulk (option 2)

To ingest many files at once, you have two options:

1.2.1. An array-based method (using a single line JSON) 

1.2.2. A file-based method (using a newline-delimited JSON load file).

Array-based versus file-based: pros and cons

Array-based
- Using an array in a single-line JSON is simpler
- but requires bundling the entire request into a single JSON object.

File-based (using a load file)
- Requires a little more effort up front (to create the load file)
- Useful when the batch of files to ingest is so big that constructing a JSON describing them all is unreasonable, or when the contents of the batch were computed by a program that outputs to Cloud Storage.

Note: In either case, files that are successfully ingested will remain in the dataset, even if other files fail to be copied in. This is intended to support retries of requests. If a bulk ingest job reaches a failed terminal state, and the exact same payload is resubmitted, the data repo will attempt to re-ingest only those files that failed in the previous attempt.

1.2.1: Ingesting in bulk using the array-based method

Use the bulkFileLoadArray API endpoint.

bulkFileLoadArray request body

{
  "profileId": "/* the profile id you generated when you created your spend profile */",
  "maxFailedFileLoads": 1,
  "loadTag": "my-test-array-load",
  "loadArray": [
    {
      "sourcePath": "gs://test-bucket/path/to/a/file.metrics",
      "targetPath": "/file1/file.metrics",
      "description": "Fake metrics file"
    },
    {
      "sourcePath": "gs://test-bucket/path/to/a/file.bam",
      "targetPath": "/file1/file.bam",
      "description": "Fake bam file"
    }
  ]
}

Request body parameters

  • loadTag: A unique identifier for the batch of files, used for incremental retries
  • loadArray: An array of individual file-ingest requests (each entry has the same fields as a single-file ingest request, written with the camelCase key names shown above, repeated for however many files you’re ingesting)
  • maxFailedFileLoads: An integer setting the number of failures permitted before the bulk ingest is stopped (the default is 0)
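You don't have to hand-write the loadArray when you have many files. The sketch below builds the request body from a Python list of gs:// paths and prints it, ready to paste into the Swagger UI request body; the target paths and descriptions it generates are just illustrative choices.

# Sketch: build a bulkFileLoadArray request body from a list of gs:// paths,
# then print it so you can paste it into the Swagger UI "Try it out" box.
import json
import os

PROFILE_ID = "your-billing-profile-uuid"     # placeholder
source_paths = [
    "gs://test-bucket/path/to/a/file.metrics",
    "gs://test-bucket/path/to/a/file.bam",
]

request_body = {
    "profileId": PROFILE_ID,
    "maxFailedFileLoads": 1,
    "loadTag": "my-test-array-load",
    "loadArray": [
        {
            "sourcePath": path,
            # Target path is arbitrary; here we just reuse the file name.
            "targetPath": f"/ingested/{os.path.basename(path)}",
            "description": f"Bulk ingest of {os.path.basename(path)}",
        }
        for path in source_paths
    ],
}

print(json.dumps(request_body, indent=2))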

Extra step to retrieve fileIDs to populate the dataset

When you ingest files this way, there is an extra step to retrieve the fileIDs needed to populate your dataset. For this step to work, you'll need to remember the string you used in the "loadTag" parameter above, which you chose when you submitted the request to the bulkFileLoadArray API endpoint.

Then, you will need that loadTag, along with the UUID for the dataset where you've ingested those files, to use the getLoadHistoryForLoadTag API endpoint.

What to expect: response body

The response body of this API will contain per file summaries of each file ingested with that loadTag, along with fileIDs for each one.
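If you want to pull the load history from a script rather than the Swagger UI, the sketch below is one way to do it. The host, the load-history path, and the query parameters are all assumptions; confirm them against the getLoadHistoryForLoadTag entry in the Swagger UI before use.

# Sketch only: fetch per-file load history (including fileIDs) for a load tag.
# Assumptions to confirm in the Swagger UI: the host, the endpoint path
# /api/repository/v1/datasets/{id}/load-history, and its query parameters.
import json
import subprocess
import requests

TDR_HOST = "https://data.terra.bio"       # assumption; use your TDR instance
DATASET_ID = "your-dataset-uuid-here"
LOAD_TAG = "my-test-array-load"           # must match the loadTag you submitted

token = subprocess.run(
    ["gcloud", "auth", "print-access-token"],
    capture_output=True, text=True, check=True,
).stdout.strip()

resp = requests.get(
    f"{TDR_HOST}/api/repository/v1/datasets/{DATASET_ID}/load-history",
    headers={"Authorization": f"Bearer {token}"},
    params={"loadTag": LOAD_TAG, "offset": 0, "limit": 100},
)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))   # look for the fileID of each file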

1.2.2: Ingesting in bulk using the file-based method

  • Use the bulkFileLoad API endpoint.
  • Instead of a “loadArray” parameter, the request body for this case expects a “loadControlFile” parameter.
  • The value for this key must be a gs:// path pointing to a newline-delimited JSON file, with one file-ingest request per line.
  • The requests must be objects with the same keys as in the array-based loading request.

bulkFileLoad request body

{
  "profileId": "7f377a47-6e22-4b14-9ad7-0668f21cad5f",
  "maxFailedFileLoads": 1,
  "loadTag": "my-test-array-load",
  "loadControlFile": "gs://staging-bucket/files.list"
}

Where the contents of the gs://staging-bucket/files.list file are:

{"sourcePath": "gs://test-bucket/path/to/a/file.metrics", "targetPath": "/file1/file.metrics", "description": "Fake metrics file", "mime_type": "text/plain"}
{"sourcePath": "gs://test-bucket/path/to/a/file.bam", "targetPath": "/file1/file.bam", "description": "Fake bam file"}

Response body

  • Currently, the response body returned by a successful file-based bulk ingest job contains only summary information about the ingest.
  • For file-level detail on the ingest process, look at the associated BigQuery table through the console.
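If you're generating the load control file itself programmatically, the sketch below writes the example files.list shown above, keeping each request on its own line; you would then copy it to your staging bucket (for example with gsutil cp).

# Sketch: write a newline-delimited JSON load control file, one file-ingest
# request per line. Copy the result to your staging bucket afterwards,
# e.g. `gsutil cp files.list gs://staging-bucket/files.list`.
import json

records = [
    {
        "sourcePath": "gs://test-bucket/path/to/a/file.metrics",
        "targetPath": "/file1/file.metrics",
        "description": "Fake metrics file",
        "mime_type": "text/plain",
    },
    {
        "sourcePath": "gs://test-bucket/path/to/a/file.bam",
        "targetPath": "/file1/file.bam",
        "description": "Fake bam file",
    },
]

with open("files.list", "w") as out:
    for record in records:
        # json.dumps never inserts line breaks, so each record stays on one line.
        out.write(json.dumps(record) + "\n")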

Step 2. Ingest metadata tables

Now that you have both your dataset as a template and some files in the dataset's virtual file system, you can populate the columns of the dataset's tables by staging the desired metadata either 1) in a CSV file or 2) in a newline-delimited JSON file.


Nested data arrays must use the JSON format

The two options are mostly equivalent, except that it’s not possible to ingest array/repeated columns using CSV. Nested data such as arrays within columns should be described using the JSON format, and ingested using the ingestDataset API endpoint.

Step 2.1. Ingesting metadata using a CSV (option 1)

2.1.1. Create a CSV with the metadata

  • a. Create a leading row (a header) and a blank leading column (any placeholder will do).

    b. Skipping the leading column, make sure the column headers correspond to columns in the schema used to create the dataset.

    c. Make sure the metadata in each cell corresponds to the datatype you set for that column according to your schema.

    d. To point to any files you've ingested in the previous steps, paste the fileID returned in the response body of those APIs (or retrieved from the retrieveJobResults API or the getLoadHistoryForLoadTag API, as described above).

    e. To have these files show up as links to cloud paths when snapshots from this dataset are exported to Terra workspaces, make sure the column that holds the fileIDs had its data type set to "fileref" in the schema when the dataset was created. For example:

    datarepo_row_id | sample_id | BAM_File_Path
    abc123 | NA12878 | 77cf5940-b2e9-485f-8304-a2e3acf55591
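If you'd rather build this CSV with a script than in a spreadsheet editor, here is a minimal sketch that reproduces the example table above. The file name metadata.csv and the example fileID are placeholders; substitute your own values.

# Sketch: write the example metadata CSV above with Python's csv module.
# The first column is just a placeholder, and the remaining headers must
# match the column names in your dataset schema.
import csv

rows = [
    # (placeholder, sample_id, BAM_File_Path: a fileID returned by the file ingest)
    ("abc123", "NA12878", "77cf5940-b2e9-485f-8304-a2e3acf55591"),
]

with open("metadata.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["datarepo_row_id", "sample_id", "BAM_File_Path"])
    writer.writerows(rows)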

2.1.2. Upload the CSV to a Google bucket location.

2.1.3. Give the service account the role of Storage Object Viewer on the bucket.

  • Dedicated service account: retrieve it with the retrieveDataset endpoint (see Step 0).
  • Generic TDR service account: use datarepo-jade-api@terra-datarepo.

2.1.4. Go to the ingestDataset API endpoint and enter the UUID of the dataset you're ingesting into.

2.1.5. Execute the API request with a JSON request body (see example below).

IngestDataset Request body (json)

{
  "format": "csv",
  "load_tag": "My CSV ingest",
  "path": "gs://example-bucket/inputs/*.csv",
  "table": "example_table",
  "max_bad_records": 0,
  "csv_allow_quoted_newlines": true,
  "csv_quote": "|",
  "csv_null_marker": "NA",
  "csv_skip_leading_rows": 1,
  "profile_id": "/* the profile id you generated when you created your spend profile */"
}

Parameter requirements 

  • The "format" MUST be set to "csv".
  • To skip the leading row and column (see the example spreadsheet above), the "csv_skip_leading_rows" parameter is set to 1 in this example. This is the current recommendation for this style of ingest.
  • The "path" parameter should be set to the gs:// bucket path (aka the gsutil URI) of the CSV file in whichever Google bucket you stored it.

(Screenshot: the file's URL in the Google Cloud console storage browser.)
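As with the single-file ingest, you can submit this request from a script instead of the Swagger UI. This is a sketch only: the host (data.terra.bio) and the endpoint path POST /api/repository/v1/datasets/{id}/ingest are assumptions taken from the Swagger page, so confirm them there before relying on this.

# Sketch only: submit the CSV ingestDataset request programmatically.
# Assumptions: TDR host is data.terra.bio and ingestDataset is exposed as
# POST /api/repository/v1/datasets/{id}/ingest; confirm in the Swagger UI.
import subprocess
import requests

TDR_HOST = "https://data.terra.bio"       # assumption; use your TDR instance
DATASET_ID = "your-dataset-uuid-here"
PROFILE_ID = "your-billing-profile-uuid"

token = subprocess.run(
    ["gcloud", "auth", "print-access-token"],
    capture_output=True, text=True, check=True,
).stdout.strip()

body = {
    "format": "csv",
    "load_tag": "My CSV ingest",
    "path": "gs://example-bucket/inputs/metadata.csv",   # your uploaded CSV
    "table": "example_table",
    "max_bad_records": 0,
    "csv_skip_leading_rows": 1,
    "profile_id": PROFILE_ID,
}

resp = requests.post(
    f"{TDR_HOST}/api/repository/v1/datasets/{DATASET_ID}/ingest",
    headers={"Authorization": f"Bearer {token}"},
    json=body,
)
resp.raise_for_status()
print(resp.json())   # contains a job id you can track with retrieveJob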

Step 2.2. Ingest metadata using a JSON (option 2)

If you're generating metadata information programmatically, there's a good chance the output comes in a JSON format. Ingesting metadata by JSON is essentially identical to ingesting by CSV. You use the same ingestDataset API endpoint.

For a simpler example of using a JSON to ingest just metadata, the JSON equivalent to the CSV example above would look like this:

{"datarepo_row_id": "abc123", "sample_id": "NA12878", "BAM_File_Path": "*fileID*"}

Request body details 

The JSON should be a newline-delimited JSON. An advantage of the JSON approach is that it lets you ingest nested arrays in columns. The Alternative: Ingest files and metadata together section below outlines how to use this to ingest files and metadata simultaneously.

  • As before, the "*fileID*" is the identification number for the already-ingested file.
  • If your schema has this column (BAM_File_Path in this example) set as "fileref" data type, and you've successfully ingested the file you wish to point toward, this field will render as a Google cloud bucket hyperlink when you export a snapshot with this row to a workspace.
  • You can get the fileID from the result body of the API you used to ingest the file, or from the retrieveJobResults API or the getLoadHistoryForLoadTag API, as described above.
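If your pipeline already produces rows as Python dictionaries, a short sketch like the one below writes them out as the newline-delimited JSON this endpoint expects. The file name metadata.json and the aliases array column are hypothetical examples.

# Sketch: write table rows as newline-delimited JSON for the ingestDataset
# endpoint. Each dictionary's keys must match column names in your schema,
# and a fileref column takes the fileID of an already-ingested file.
import json

rows = [
    {"sample_id": "NA12878", "BAM_File_Path": "77cf5940-b2e9-485f-8304-a2e3acf55591"},
    # An array/repeated column (not possible with CSV) would look like:
    # {"sample_id": "NA12879", "BAM_File_Path": "...", "aliases": ["alias1", "alias2"]},
]

with open("metadata.json", "w") as out:
    for row in rows:
        out.write(json.dumps(row) + "\n")   # one row per line, no line breaks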

IngestDataset request body (json)

{
  "format": "json",
  "load_tag": "My JSON ingest",
  "path": "gs://example-bucket/inputs/*.json",
  "table": "example_table",
  "max_bad_records": 0,
  "profile_id": "/* the profile id you generated when you created your spend profile */"
}

Alternative: Ingest files and metadata together

To ingest files and metadata at the same time, you use a single API request to populate a dataset with new rows that include paths to files stored in Google buckets. You'll use the same ingestDataset API endpoint (above), with a slightly more structured JSON.

What's different?

The only difference is that the JSON you'll upload to Google Cloud will have a nested structure, with the column designated as your "fileref" data type column containing a full JSON object as its input.

Instructions

1. First create a newline-delimited JSON (see the step-by-step instructions below).

2. Upload the JSON to Google Cloud.

3. Find the URL path to the file in the Google cloud console storage browser.

4. Then use the exact same request body as shown in Step 2 option 2.

IngestDataset request body (json)

{
  "format": "json",
  "load_tag": "My JSON ingest",
  "path": "gs://example-bucket/inputs/*.json",
  "table": "example_table",
  "max_bad_records": 0,
  "profile_id": "/* the profile id you generated when you created your spend profile */"
}

Fields in the JSON

  • sourcePath: The URL path to the file that the column will point to (BAMs, VCFs, etc.), which you can also find via the Google Cloud console storage browser
  • targetPath: An arbitrary location within your Data Repo virtual file system
  • description: An arbitrary string of characters to help you keep track of the file
  • mimeType: This field is required because of how Google Cloud works. It can be set to "text/plain" in this JSON. So long as the data type of this column is set as "fileref" in the dataset's schema, the column cell will render correctly when a snapshot containing it is exported to a workspace

Formatting the single-line JSON

Note: Since the format is a newline-delimited JSON, the whole JSON needs to be a single line with no line breaks, as shown below (scroll right in the sample code field to see the full example):

{"sample_id": "NA12878", "BAM_File_Path": {"sourcePath": "gs://data-repo-ingest-site/NA12878.unmapped.bam", "targetPath": "/file1/NA12878.bam", "description": "BAM for ingesting", "mimeType": "text/plain"}}

If you're creating your JSONs for ingestion programmatically, avoid adding line breaks within each record. If you're building them by hand, you can use an online CSV-to-JSON converter.

To avoid line-breaks in the JSON with a CSV to JSON converter

  1. Create a CSV with just the fileref column's inputs, upload it to the converter and click "Convert".
    Note: There are some convenient settings available in this converter, such as the "Minify" setting, which will generate a JSON without all of the line breaks. Be careful about the Array/Hash settings. The Array setting will include an extra pair of [square brackets] which may cause problems, as the API specifically expects non-array JSON objects in certain cases.

  2. Create another CSV table corresponding to the table of the dataset where you're ingesting, and paste that JSON object into the cell corresponding to the fileref column. Now, upload this CSV to the converter and click "Convert". The result is a nested JSON object within the "BAM_File_Path" cell.

  3. Create a JSON file by pasting this JSON into a text editor and saving the file with a ".json" extension.

Now you can upload this file to your Google Cloud storage, and use the URL as the path in the ingestDataset API endpoint.
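If you'd rather script this step than use a web converter, a short sketch can produce the same single-line records directly. It assumes a hypothetical input CSV with sample_id, source_path, and target_path columns; adjust the column names and the fileref column (BAM_File_Path here) to match your own table.

# Sketch: build combined file + metadata records (newline-delimited JSON)
# from a simple CSV. Assumes a hypothetical input CSV with the columns
# sample_id, source_path, target_path; adjust to your own inputs.
import csv
import json

with open("combined_inputs.csv", newline="") as src, open("combined.json", "w") as out:
    for row in csv.DictReader(src):
        record = {
            "sample_id": row["sample_id"],
            # The fileref column holds a nested object instead of a fileID.
            "BAM_File_Path": {
                "sourcePath": row["source_path"],
                "targetPath": row["target_path"],
                "description": f"BAM for {row['sample_id']}",
                "mimeType": "text/plain",
            },
        }
        out.write(json.dumps(record) + "\n")   # keeps each record on one line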
