Step 5 - Ingest data to TDR

Finally, you'll push data in the staging workspace to the linked TDR dataset. If you choose, you can clean up the staging workspace. 

5.1. Grant TDR permission to read data in external Buckets

For a TDR dataset to reference data file objects that don't exist in its own managed cloud storage, the dataset's service account needs read access to the bucket(s) where those data file objects live. Follow the step-by-step instructions below for the location where your data file objects are stored.

Who can skip this step? If your data file objects are (or will be) stored in the staging workspace Bucket, you can proceed directly to 5.2. Push data to TDR. Otherwise, you must complete this step.

To find your TDR dataset's service account

Go to Workspace Data in the staging workspace (orange arrow in the screenshot below) and look for the value of the "data_ingest_sa" key. You can copy it to the clipboard to use in the steps below.

[Screenshot: the data_ingest_sa key and its service account value in the staging workspace's Workspace Data tab]
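
If you prefer to look this value up programmatically, the sketch below uses the FISS Python client (the firecloud package) to read the staging workspace's attributes; the billing project and workspace names are placeholders you would replace with your own.

    # Minimal sketch (assumes the FISS client: pip install firecloud).
    # Reads the TDR ingest service account and dataset id from the staging
    # workspace's Workspace Data. Project/workspace names are placeholders.
    from firecloud import api as fapi

    BILLING_PROJECT = "my-billing-project"
    WORKSPACE_NAME = "my-staging-workspace"

    response = fapi.get_workspace(BILLING_PROJECT, WORKSPACE_NAME)
    response.raise_for_status()

    attributes = response.json()["workspace"]["attributes"]
    print("data_ingest_sa:", attributes.get("data_ingest_sa"))
    print("dataset_id:    ", attributes.get("dataset_id"))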

Step-by-step instructions

Follow the directions below that correspond to where your data object files are stored (in a different Terra workspace Bucket or in an external GCS bucket).

  • To provide the service account with access to a different Terra workspace Bucket, share the other workspace with the service account (Reader access) in Terra by clicking the three vertical dot action icon at the top right.

[Screenshot: Share Workspace popup with the service account email entered as a Reader and an arrow pointing to the blue Add button]

[Screenshot: Share Workspace popup showing the service account added as a Reader and an arrow pointing to the blue Save button at the bottom right]

  • To provide the service account with access to an external GCS bucket:

    1. Navigate to the bucket in the Google Cloud console and grant the service account the Storage Object Viewer role on the bucket (click the Add principal button in the bucket's Permissions panel).

    2. To make sure you can push the data to TDR properly, you'll also need to grant the proxy email for your Terra account the Storage Object Viewer role on the bucket.

    [Screenshot: the bucket's permissions in the Google Cloud console, with an arrow pointing to the Add principal button and the proxy email and service account listed under the Storage Object Viewer role]

    To find the proxy email for your Terra account, go to your Terra profile and look for "Proxy Group".
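
If you prefer to grant the bucket role from code rather than in the console, here is a minimal sketch using the google-cloud-storage Python client. The bucket name, service account, and proxy group email are placeholders, and the member prefixes ("serviceAccount:", "group:") should be checked against your own setup.

    # Minimal sketch: grant roles/storage.objectViewer on an external GCS bucket
    # to the TDR ingest service account and your Terra proxy group.
    # Requires google-cloud-storage and credentials allowed to edit the bucket's
    # IAM policy. All names below are placeholders.
    from google.cloud import storage

    BUCKET_NAME = "my-external-bucket"
    MEMBERS = {
        "serviceAccount:tdr-ingest-sa@my-datarepo-project.iam.gserviceaccount.com",
        "group:my-terra-proxy-group@firecloud.org",  # "Proxy Group" from your Terra profile
    }

    client = storage.Client()
    bucket = client.bucket(BUCKET_NAME)

    # Fetch the current IAM policy, append a binding, and write it back.
    policy = bucket.get_iam_policy(requested_policy_version=3)
    policy.bindings.append({"role": "roles/storage.objectViewer", "members": MEMBERS})
    bucket.set_iam_policy(policy)
    print(f"Granted Storage Object Viewer on gs://{BUCKET_NAME}")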

5.2. Push data to TDR

Step-by-step instructions

5.2.1. Open the TerraWorkspaceTableToTDRIngest WDL in the Workflows page of the staging workspace. 

5.2.2. Enter the tables to push into the terra_tables variable in the workflow configuration, as a comma-separated list of table names enclosed in double quotation marks, with no spaces between them (e.g., "table_1,table_2").

Task name                     Variable       Type     Input value
GCPWorkspaceToDatasetIngest   terra_tables   String   "sample,subject,file_manifest"

5.2.3. Change any other parameters as desired, based on the workflow's README. In general, the workflow comes pre-configured so that you should not need to change any other variables, particularly the first time through.

5.2.4. Ensure Run workflow with inputs defined by file paths is selected and Save the configuration. Then click the blue Launch button to the right of the Outputs tab to kick off the workflow.

5.2.5. If the workflow fails, follow the three steps in How to troubleshoot failed workflows to access the error logs.

Failed workflows: common messages and causes
If you see the error message below, it may mean that you are not a user on the TDR Billing profile for the target dataset.
Job GCPWorkspaceToDatasetIngest.IngestWorkspaceDataToDataset:NA:1 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.

What to do
Contact the TDR dataset owner and ask to be added to the billing profile.

5.2.6. Review the data within TDR. Once the data have been pushed, navigate to TDR in a browser and look for your dataset.

  • The identifier for your dataset will be recorded in the Workspace Data for the staging workspace, as the value associated with the "dataset_id" key.
  • From within TDR, you should be able to review the schema and click on "View Dataset Data" to actually look at the data that have been pushed into TDR.
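
If you'd rather confirm the dataset programmatically, the sketch below calls TDR's REST API directly; the endpoint path shown is the standard retrieve-dataset call, but treat it (and the auth scopes) as assumptions to verify against the TDR API documentation. The dataset_id is the value from Workspace Data.

    # Minimal sketch: fetch a TDR dataset's summary by its dataset_id.
    # Assumes the endpoint GET https://data.terra.bio/api/repository/v1/datasets/{id}
    # and that your application-default Google credentials belong to a Terra user
    # with access to the dataset (e.g. after `gcloud auth application-default login`).
    import google.auth
    import requests
    from google.auth.transport.requests import Request

    DATASET_ID = "00000000-0000-0000-0000-000000000000"  # "dataset_id" from Workspace Data

    credentials, _ = google.auth.default(
        scopes=[
            "https://www.googleapis.com/auth/userinfo.email",
            "https://www.googleapis.com/auth/userinfo.profile",
        ]
    )
    credentials.refresh(Request())

    response = requests.get(
        f"https://data.terra.bio/api/repository/v1/datasets/{DATASET_ID}",
        headers={"Authorization": f"Bearer {credentials.token}"},
    )
    response.raise_for_status()
    print(response.json().get("name"))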

5.3. (Optional) Clean up the staging workspace

Once you've reviewed the data in TDR, use the CleanUpStagingWorkspace workflow to clean up the staging workspace.

What does the CleanUp workflow do?

  • For self-hosted TDR datasets (where data files remain in the staging workspace but are being referenced by TDR), this will remove files in the staging workspace that are not being referenced by the TDR dataset.
  • For TDR-hosted TDR datasets (where data files have been physically copied into TDR), this will remove files in the staging workspace that are already present in the TDR dataset to avoid paying for two copies of data.

Step-by-step instructions

5.3.1. Navigate to the CleanUpStagingWorkspace workflow and confirm that the billing_project, workspace_name, and dataset_id variables in the workflow configuration point to the proper staging workspace (billing_project and workspace_name) and TDR dataset (dataset_id).

5.3.2. Update the output_file variable in the workflow configuration to the full GCS path of a TSV the workflow will create, containing the list of files to delete. The WDL writes to the workspace Bucket by default; this variable lets you specify a different directory path.

5.3.3. Update the files_paths_to_ignore variable to a comma-separated list of file paths to ignore for deletion. This is included for flexibility only and is not typically used.

5.3.4. Update the google_project variable to a Google project to use for requester pays buckets, if necessary. The Google project can be associated with a Terra workspace, but can also be an external project.

5.3.5. Update the run_deletes variable to True to actually execute the file deletions. It is recommended that you set this variable to False for the first run and review the output file before executing the deletions.
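
One way to do that review before re-running with run_deletes set to True is to pull the TSV down from GCS and skim it. Below is a minimal sketch using the google-cloud-storage client; the bucket name and object path are placeholders standing in for the output_file value you configured.

    # Minimal sketch: download and skim the CleanUpStagingWorkspace output TSV
    # (the list of files the workflow would delete). Bucket and object path are
    # placeholders for the configured output_file.
    import csv
    import io

    from google.cloud import storage

    BUCKET_NAME = "fc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"  # staging workspace Bucket
    OBJECT_PATH = "cleanup/files_to_delete.tsv"               # path portion of output_file

    client = storage.Client()
    blob = client.bucket(BUCKET_NAME).blob(OBJECT_PATH)
    rows = list(csv.reader(io.StringIO(blob.download_as_text()), delimiter="\t"))

    print(f"{len(rows)} candidate files listed in gs://{BUCKET_NAME}/{OBJECT_PATH}")
    for row in rows[:5]:
        print(row)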

Additional references

Imported WDLs

  • CreateWorkspaceFileManifest - README
  • TerraSummaryStatistics - README
  • TerraWorkspaceTableToTDRIngest - README
  • CleanUpStagingWorkspace - README
