Writing workflow outputs to the data table

This article explains how to write output file metadata to the input data table.

To learn how to automate some of this setup step by using a JSON file (especially useful if you anticipate using similar configurations many times), see Getting workflows up and running faster with a JSON file.
To learn how to configure additional cost-saving options in Terra, see Workflow setup: Runtime options.

Workflow setup: Outputs overview

Generated data are stored in the Workspace bucket by default

The files are stored in directories named with the submission and task UUIDs. To access the files, select the Files icon at the bottom left of the Data page.

When using the data table, you can choose to write output file metadata back to the table

You can have the workflow write links to the output files directly in the data table. In the configuration form, you specify the column name under which the outputs are added. If the column doesn't already exist, Terra creates it when the generated data is written to the workspace bucket.

Why write outputs to the data table? Writing to the data table associates generated output with the input data file (the output files are written alongside the input files in the table), and helps organize your outputs in a way that is meaningful to you. It makes it easy to use the data for downstream analysis.

Outputs in Google bucket (file folder is non-human-friendly UUID)

The default folders for generated data are named after the submission ID, a non-human-readable string of numbers and letters. This ensures that you will not overwrite generated data (because the directory names are unique), but it makes finding the files challenging.

Note the directory structure of the generated files: Files / c01b2b13-c2f5-4ea0-bc3b-319c963385ed / CramToBamFlow / 5bf8b92e-6ffa-4627-8265-6c5021d76677 / call-CramToBamTask
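The directory structure above can be split into its parts programmatically. The following is a minimal Python sketch, not part of Terra: the function and field names are hypothetical, and it assumes the path always follows the submission UUID / workflow name / second UUID / call folder pattern shown in the example.

```python
from typing import NamedTuple


class OutputLocation(NamedTuple):
    """Parts of an output directory path (field names are illustrative)."""
    submission_id: str   # first UUID: one per submission
    workflow_name: str   # e.g. CramToBamFlow
    run_id: str          # second UUID in the path
    call_name: str       # e.g. call-CramToBamTask


def parse_output_dir(path: str) -> OutputLocation:
    """Split a 'Files / <uuid> / <workflow> / <uuid> / <call>' path into parts."""
    parts = [p.strip() for p in path.split("/") if p.strip()]
    # Drop the leading "Files" label shown in the Terra file browser, if present.
    if parts and parts[0] == "Files":
        parts = parts[1:]
    if len(parts) < 4:
        raise ValueError(f"expected at least 4 path components, got {len(parts)}")
    return OutputLocation(*parts[:4])


loc = parse_output_dir(
    "Files / c01b2b13-c2f5-4ea0-bc3b-319c963385ed / CramToBamFlow / "
    "5bf8b92e-6ffa-4627-8265-6c5021d76677 / call-CramToBamTask"
)
print(loc.workflow_name)  # CramToBamFlow
```

Only the workflow name and call folder are human-readable; the two UUIDs carry no information about which sample the files belong to, which is why the data table association is useful.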

Outputs in the data table (clear associations)

Here's the same output file in the data table. Running the workflow generated the aligner_output_crai and aligner_output_cram columns. Note: The unique collaborator_sample_id references the entire row, associating the generated data with the primary data.

When you might not want to write outputs to the table

Although we generally recommend using data tables, you may prefer not to if your data doesn't fit into a data table in a way that makes sense for your analysis, or if you want to test a new method in Terra quickly, with as little setup as possible.

How to write output file metadata to the input data table

1. Go to the Outputs tab.

2. For each output variable, click into the attribute field.

You'll see a drop-down menu that lists all columns in the root entity data table.

3. Choose an existing column or type in a new name to add a new column of data to the table.

Workflow setup form example

Be careful not to overwrite metadata in the data table

If you use the same output name for multiple runs, Terra will overwrite the links in the data table with the most recent output data link. Data from a previous run will still exist in the workspace bucket, however.

To compare results from different configurations, give your outputs a name that indicates which is which (see the example below).
[Screenshot: workflow configuration with multiple test outputs]
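The overwrite behavior can be illustrated with a small Python sketch. This models one data table row as a plain dictionary; it is not Terra's API, and the bucket paths and the "_configA"/"_configB" column suffixes are hypothetical placeholders.

```python
# One row of a data table, modeled as a plain dict (illustrative model only).
row = {"collaborator_sample_id": "sample_1"}


def write_output(row: dict, column: str, link: str) -> None:
    """Store an output link under a column, replacing any existing value."""
    row[column] = link


# Two runs that reuse the same column name: the second link replaces the first.
write_output(row, "aligner_output_cram", "gs://example-bucket/submission-1/out.cram")
write_output(row, "aligner_output_cram", "gs://example-bucket/submission-2/out.cram")

# Two runs with distinct column names: both links stay in the table.
write_output(row, "aligner_output_cram_configA", "gs://example-bucket/submission-3/out.cram")
write_output(row, "aligner_output_cram_configB", "gs://example-bucket/submission-4/out.cram")

for column, value in row.items():
    print(f"{column}: {value}")
```

In this model, the link from the first run is no longer in the row, although the file itself would still sit in its own submission folder in the bucket.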

How to verify workflow output files

If an output attribute has the format "this.your_filename", the workflow writes the output metadata to the "your_filename" column of the data table. After a successful run, you will see the additional data for these output files in the data table (or metadata, in the case of large generated data files stored in the workspace bucket).
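As a rough illustration of the naming rule, the Python sketch below derives the column name from an output attribute. The helper name is hypothetical, and it assumes the attribute is exactly "this." followed by the column name.

```python
def column_for_attribute(attribute: str) -> str:
    """Return the data table column named by a 'this.<column>' attribute."""
    prefix = "this."
    if not attribute.startswith(prefix) or len(attribute) == len(prefix):
        raise ValueError(f"expected 'this.<column>', got {attribute!r}")
    return attribute[len(prefix):]


print(column_for_attribute("this.your_filename"))  # your_filename
```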

For example, after completing Exercise 1 in the Workflows-QuickStart tutorial, there is an additional column in the student data table. The Cumulative_GPA column (circled in the screenshot below) stores the output data from running the workflow.

[Screenshot: added column in the student data table]

Real-life use case: Generated data in workspace storage

In many cases, the outputs of a workflow are large data files (like CRAMs or VCFs from variant calling). In these cases, the full path to the generated data in workspace storage is written to the data table. Whether or not you write metadata to the data table, you can find the output files in your workspace Google bucket by clicking the "Files" icon in the left column of the Data tab.

[Screenshot: finding output files under Files on the Data page]

Note about output file folder names in the workspace bucket

Each time you launch a workflow, Terra assigns a unique submission ID to that submission. This submission ID is the name of the output folder in the workspace Google bucket. Outputs from multiple submissions of the same workflow in the same workspace will not be overwritten, since they are in different submission ID folders.