Ingesting data

Anton Kovalsky

This article includes instructions for using API commands through the Swagger UI. Whenever you open a window with the Swagger UI, you must authenticate before executing any commands: click "Authorize" near the top of the page, check all of the boxes in the pop-up, hit "Authorize" again, and then input the appropriate credentials. Make sure you close the subsequent pop-up without clicking the "Sign Out" button. You should then be able to execute the commands below by clicking the "Try it out" button next to the command of your choice. For a more detailed description of this authentication step, see this article on Authenticating in Swagger.



 

Once you've created a dataset, you're ready to start ingesting data. To populate your dataset, you first upload any files (e.g. BAM files) to the dataset's virtual file system, and then ingest a CSV or newline-delimited JSON table to populate the dataset's tables with the relevant metadata.

There are two parts to ingesting data for use in the Terra Data Repo: (A) ingesting the files containing genomic data, and (B) ingesting the metadata that accompanies those files (sample IDs, phenotypic information, etc.) and makes those files visible in your Data Repo datasets and snapshots. You can do these steps separately or at the same time. Separating the steps is especially useful when you're starting a dataset from scratch: it helps you test your schema and lets you add files in small batches to make sure the ingestion process works properly. Once your ingestion strategy is in place, you can start creating more complex JSON objects that include the metadata and the bucket addresses for the files themselves all in one request.

 

1. Ingesting individual files

2. Ingesting multiple files in bulk

3. Ingesting metadata tables

4. Combined file and metadata ingestion

 

0. Setup prerequisites

Before doing any of these, you'll need to complete the following prerequisites:

 

Step 0.1. Add the TDR service account (datarepo-jade-api@terra-datarepo-production.iam.gserviceaccount.com) to the permission list of your bucket by granting it the Storage Object Viewer role.

 


 

Step 0.2. Do the same for your personal proxy email, which you can find under your Terra profile.
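
If you prefer to grant these permissions with code rather than in the Google Cloud console, here is a minimal, hedged sketch using the google-cloud-storage Python client. The bucket name and proxy email below are placeholders, and the service account address is the one given in the CSV-ingest section later in this article; either way, the role to grant is Storage Object Viewer.

from google.cloud import storage

TDR_SERVICE_ACCOUNT = (
    "datarepo-jade-api@terra-datarepo-production.iam.gserviceaccount.com"
)

client = storage.Client()
bucket = client.bucket("your-ingest-bucket")  # placeholder: your staging bucket

# Grant Storage Object Viewer to the TDR service account and your proxy email.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {
        "role": "roles/storage.objectViewer",
        "members": {
            f"serviceAccount:{TDR_SERVICE_ACCOUNT}",
            # Placeholder: substitute the proxy email shown in your Terra
            # profile (Terra proxies are Google groups, hence the group: prefix).
            "group:your-terra-proxy@firecloud.org",
        },
    }
)
bucket.set_iam_policy(policy)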

 


 

1. Ingesting individual files

To ingest individual files, use the ingestFile API endpoint with the following parameters in the command body:

  • "source_path": The gs:// path of the object to ingest from your bucket
  • "target_path": The arbitrary path the file will occupy in the dataset's virtual file system
  • "profileId": Your spend profile ID
  • "description": An optional free-text description to help you keep track of the file
{
  "source_path": "gs://test-bucket/path/to/a/file.file_name",
  "target_path": "/metrics/file.file_name",
  "profileId": "/* the profile id you generated when you created your spend profile */",
"description": "A delicious file to ingest"
}

 

The response body that results from successfully submitting this request will include an “id” field that can be used to track the status of the job, using the retrieveJob API endpoint. The response body returned by a successful single-file ingest job will contain a “fileId” field. The value of this field is the ID of the file within the virtual filesystem of the dataset, which must be used to refer to the file from within the dataset’s tabular data. Use the file ID in the CSV you ingest in order to populate the tables of the dataset.
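
For readers scripting this outside the Swagger UI, here is a hedged Python sketch of that flow (submit the ingest, poll the job, read the fileId). The base URL, endpoint paths, and response field names are assumptions based on the endpoint names above, so confirm them against your Swagger page; the dataset UUID, bucket paths, and profile ID are placeholders.

import subprocess
import time

import requests

BASE = "https://data.terra.bio/api/repository/v1"  # assumption: production TDR host
dataset_id = "your-dataset-uuid"                    # placeholder

# Assumption: a Google bearer token for the same identity you use to
# authenticate in the Swagger UI works here.
token = subprocess.check_output(
    ["gcloud", "auth", "print-access-token"], text=True
).strip()
headers = {"Authorization": f"Bearer {token}"}

body = {
    "source_path": "gs://test-bucket/path/to/a/file.file_name",
    "target_path": "/metrics/file.file_name",
    "profileId": "your-spend-profile-uuid",
    "description": "A delicious file to ingest",
}

# Assumption: ingestFile is exposed at POST /datasets/{id}/files
job_id = requests.post(
    f"{BASE}/datasets/{dataset_id}/files", json=body, headers=headers
).json()["id"]

# Poll retrieveJob (assumed GET /jobs/{id}) until the job stops running;
# the "job_status" field name and "running" value are also assumptions.
while requests.get(f"{BASE}/jobs/{job_id}", headers=headers).json().get("job_status") == "running":
    time.sleep(10)

# Assumption: the job result, including "fileId", is at GET /jobs/{id}/result
result = requests.get(f"{BASE}/jobs/{job_id}/result", headers=headers).json()
print(result["fileId"])  # reference this ID in the metadata you ingest later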

 

2. Ingesting multiple files in bulk

 

If you want to ingest many files at once, you have two options:

  • The array-based method, which is simpler but requires bundling the entire request into a single JSON object.
  • The file-based method, which requires a little more effort, but is useful when the batch of files to ingest is so big that constructing a JSON describing them all is unreasonable, or when the contents of the batch were computed by a program that outputs to Cloud Storage.

Note that in either case, files that are successfully ingested will remain in the dataset, even if other files fail to be copied in. This behavior is intended to support retries of requests. If a bulk ingest job has reached a failed terminal state, re-submitting the exact same payload will cause the data repo to only attempt to re-ingest the files that failed in the previous attempt.

 

2.1. The array-based method

This method uses the bulkFileLoadArray API endpoint. In addition to the spend profile ID, the request body must contain:

  • “loadTag”: A unique identifier for the batch of files, used for incremental retries
  • "loadArray": An array of individual file-ingest requests (one object per file, with the same fields as the single-file request above, but note that the keys here are camelCase: "sourcePath", "targetPath")
  • “maxFailedFileLoads”: An integer setting the number of failures that should be permitted before the bulk ingest is stopped (the default is 0)
{
  "profileId": "/* the profile id you generated when you created your spend profile */",
  "maxFailedFileLoads": 1,
  "loadTag": "my-test-array-load",
  "loadArray": [
    {
      "sourcePath": "gs://test-bucket/path/to/a/file.metrics",
      "targetPath": "/file1/file.metrics",
      "description": "Fake metrics file"
    },
    {
      "sourcePath": "gs://test-bucket/path/to/a/file.bam",
      "targetPath": "/file1/file.bam",
      "description": "Fake bam file"
    }
  ]
}

 

When you ingest files this way, there is an extra step to retrieve the file IDs you'll later need in the CSV that populates your dataset's tables. You'll need the string you used as the "loadTag" parameter above, along with the UUID of the dataset into which you ingested the files, to call the getLoadHistoryForLoadTag API endpoint. Its response body contains a per-file summary of each file ingested under that loadTag, including the file ID for each one.
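
A hedged Python sketch of that lookup follows; the endpoint path and response shape are assumptions based on the endpoint name, so confirm them against the Swagger page, and the dataset UUID is a placeholder.

import subprocess

import requests

BASE = "https://data.terra.bio/api/repository/v1"  # assumption: production TDR host
dataset_id = "your-dataset-uuid"                    # placeholder
load_tag = "my-test-array-load"                     # the loadTag you used above

token = subprocess.check_output(
    ["gcloud", "auth", "print-access-token"], text=True
).strip()
headers = {"Authorization": f"Bearer {token}"}

# Assumption: getLoadHistoryForLoadTag is exposed at
# GET /datasets/{id}/loadHistory/{loadTag}
history = requests.get(
    f"{BASE}/datasets/{dataset_id}/loadHistory/{load_tag}", headers=headers
).json()

# Assumption about the response shape: a list of per-file summaries, each
# carrying the target path and the fileId to paste into your metadata table.
for item in history.get("items", []):
    print(item.get("targetPath"), item.get("fileId"))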

 

2.2. The file-based method

In this case you'll use the bulkFileLoad API endpoint. Instead of a "loadArray" parameter, the request body expects a "loadControlFile". The value of this key must be a gs:// path pointing to a newline-delimited JSON file, with one file-ingest request per line; the requests must be objects with the same keys as in the array-based loading request. An equivalent request to the array-based example above is:

 

{
  "profileId": "7f377a47-6e22-4b14-9ad7-0668f21cad5f",
  "maxFailedFileLoads": 1,
  "loadTag": "my-test-array-load",
  "loadControlFile": "gs://staging-bucket/files.list"
}


Where the contents of the gs://staging-bucket/files.list file are:

{"sourcePath": "gs://test-bucket/path/to/a/file.metrics","targetPath":
"/file1/file.metrics","description": "Fake metrics file","mime_type": "text/plain"}
{"sourcePath": "gs://test-bucket/path/to/a/file.bam","targetPath":
"/file1/file.bam","description": "Fake bam file"}

The response body returned by a successful file-based bulk ingest job currently contains only summary information about the ingest. For file level detail on the ingest process, you’ll need to look at the associated BigQuery table through the console.

 

3. Ingesting metadata tables

Now that you have both your dataset as a template and some files in the dataset's virtual file system, you can populate the dataset's tables by staging the desired metadata as a CSV spreadsheet or a newline-delimited JSON file and ingesting it using the ingestDataset API endpoint. The two options are mostly equivalent, except that it's not possible to ingest array/repeated columns using CSV.

 

3.1 By CSV

Start by creating a spreadsheet with the following criteria (a minimal programmatic sketch follows this list):

  • It needs a leading row (a header) and a leading column (any placeholder will do, but a column to skip from the left is necessary)
  • Make sure the column headers correspond to columns in the schema you used to create the dataset
  • Make sure the metadata you put in each cell matches the data type you set for that column in your schema
  • To point to any files you've ingested in the previous steps, paste the file ID corresponding to the "fileId" field returned in the response body of those APIs (or from the retrieveJobResults API or the getLoadHistoryForLoadTag API, as described in the sections above). If you want these to show up as links to cloud paths when snapshots from this dataset are exported to Terra workspaces, make sure the column the file IDs go into had its data type set to "fileref" in the schema when the dataset was created
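
If you're generating this spreadsheet programmatically, here is a minimal sketch using Python's csv module. The column names (sample_id, BAM_File_Path) and the file ID are placeholders borrowed from the examples in this article; they must match your own schema and the IDs returned by your file ingests.

import csv

# Placeholder rows: (sample_id, fileId returned by the file-ingest job)
rows = [
    ("NA12878", "11111111-2222-3333-4444-555555555555"),
]

with open("sample_metadata.csv", "w", newline="") as out:
    writer = csv.writer(out)
    # Leading header row; the first column is the placeholder column that
    # gets skipped, per the criteria above.
    writer.writerow(["placeholder", "sample_id", "BAM_File_Path"])
    for sample_id, file_id in rows:
        writer.writerow(["", sample_id, file_id])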


 

Upload this to a Google bucket location to which you've given the Terra Data Repo's service account (datarepo-jade-api@terra-datarepo-production.iam.gserviceaccount.com) the role of a Storage Object Viewer, as described at the beginning of this article.

 

Next, go to the ingestDataset API endpoint, enter the UUID of the dataset to which you're ingesting, and execute the API request with a JSON request body like the one below:

{
  "format": "csv",
  "load_tag": "My CSV ingest",
  "path": "gs://example-bucket/inputs/*.csv",
  "table": "example_table",
  "max_bad_records": 0,
  "csv_allow_quoted_newlines": true,
  "csv_quote": "|",
  "csv_null_marker": "NA",
  "csv_skip_leading_rows": 1,
  "profile_id": "/* the profile id you generated when you created your spend profile */"
}

Some things to note:

  • The "format" MUST be set to "csv"
  • The ""csv_skip_leading_rows" parameter is set to "1" in this example because we want to skip the leading row (and column) of the example spreadsheet we described above. This is the current recommended way of accomplishing this style of ingest
  • The "path" parameter should be set to the "gs://" bucket path (aka the gsutil URI) to the CSV file in whichever Google bucket you stored it (again, find this URL in the Google cloud console storage browser):


 

3.2 By JSON

If you're programmatically generating metadata, there's a good chance that output comes in JSON format. The process of ingesting metadata by JSON is essentially identical to ingesting by CSV: you use the same ingestDataset API endpoint, with a typical request body as shown here:

{
  "format": "json",
  "load_tag": "My JSON ingest",
  "path": "gs://example-bucket/inputs/*.json",
  "table": "example_table",
  "max_bad_records": 0,
  "profile_id": "/* the profile id you generated when you created your spend profile */"
}

 

The JSON should be a newline-delimited JSON file. Another advantage of the JSON approach is that it allows you to ingest nested arrays of columns. Section 4 below outlines how this can be used to ingest files and metadata simultaneously. For the simpler example of using a JSON to ingest just metadata, the JSON equivalent to the CSV example above would look like this:

{"datarepo_row_id": "abc123", "sample_id": "NA12878", "BAM_File_Path": "*fileID*"}

As before, the *fileID* is the identification number for the already-ingested file. If your schema has this column (BAM_File_Path in this example) set to the "fileref" data type, and you've successfully ingested the file you wish to point toward, this field will render as a GCP bucket hyperlink when you export a snapshot containing this row to a workspace. You can get the fileID from the result body of the API you used to ingest the file, or from the retrieveJobResults API or the getLoadHistoryForLoadTag API, as described above.

 

4. Combined file and metadata ingestion

If you want to ingest files and metadata at the same time, using only one API request to populate a dataset with new rows that include paths to files stored in Google buckets, you can do so with the same ingestDataset API endpoint described in section 3, only with a slightly more structured JSON. You follow the exact same procedure as in section 3.2: create a newline-delimited JSON, upload it to GCP, find the URL path to the file in the Google Cloud console storage browser, and use the same request body shown in section 3.2.

The only difference is that the JSON you upload to GCP has a nested structure: the column designated as your "fileref" data type column contains a full JSON object as its value. This nested JSON object contains four fields:

  • "sourcePath": This is URL path to the file that column will point to (BAMs, VCFs, etc), which you also find via the Google cloud console storage browser
  • "targetPath": This is an arbitrary location within your Data Repo virtual file system
  • "description": An arbitrary string of characters to help you keep track of the file
  • "mimeType": This field is required because of how GCP works. It can be set to "text/plain" in this JSON. So long as the data type of this column is set as "fileref" in the dataset's schema, the column cell will render correctly when a snapshot containing it is exported to a workspace

The only other thing to note about this nested JSON is that, since the format is newline-delimited JSON, each whole record needs to be a single line with no line breaks, as shown below (scroll right in the sample code field to see the full example):

{"sample_id": "NA12878", "BAM_File_Path": {"sourcePath": "gs://data-repo-ingest-site/NA12878.unmapped.bam", "targetPath": "/file1/NA12878.bam", "description": "BAM for ingesting", "mimeType": "text/plain"}}

 

As in the sketch above, if you're creating your JSONs for ingestion programmatically, you need to avoid creating line breaks. You can also do this manually using a CSV-to-JSON converter like this one, in two steps:

(1) Create a CSV with just the fileref column's inputs, upload it to the converter, and click "Convert". Note that there are some convenient settings available in this converter, such as the "Minify" setting, which will generate a JSON without line breaks. Also be careful with the Array/Hash settings: the Array setting will include an extra pair of [square brackets], which may cause problems, as the API specifically expects non-array JSON objects in certain cases.

 


 

(2) Create another CSV table corresponding to the table of the dataset where you're ingesting, and paste that JSON object into the cell corresponding to the fileref column. Now upload this CSV to the converter and click "Convert". The output will contain the nested JSON object within the "BAM_File_Path" field.

 


 

Create a JSON file by pasting this JSON into a text editor and saving the file with a ".json" extension. Now you can upload this file to your GCP storage and use its gs:// path as the "path" parameter in the ingestDataset request.
