Workflows on Terra handle inputs in one of five ways, based on the category the samples - or "entities" - fall into. These entity types are not native to the Workflow Description Language (WDL); rather, they are analogous to object classes developed for specific uses. The concept of entity types in Terra/GATK is inherited from TCGA's metadata style, which was designed to categorize different types of data files, such as files that track treatment regimens, files that track the progress of samples through pipelines, and so on.
In Terra, the concept is significantly simpler. The 5 classes - called root entity types - are used to organize data that need to be very carefully grouped. For example, a cancer study may include multiple samples from multiple individuals, and grouping the data properly is critical for each step of the analysis. This article will help you understand some of the technical details surrounding these entity types:
- Sample
- Sample set
- Participant
- Participant set
- Pair set
Sample: The most basic data entity, representing the nucleotide sequence data produced by a single sequencing lane.
Sample set: This nested entity type is used in certain tools specifically designed to carry out steps that require a group of multiple samples as input, such as joint genotyping or panel-of-normals generation.
Participant: This entity type uses a participant ID to keep track of samples that belong to the same individual, for instance when study subjects provide multiple types of biological samples (saliva, blood, etc.).
Participant set: This entity type can be multiply nested, as it can group multiple participants, each with multiple samples per participant.
Pair set: This entity type is tailored for somatic workflows, which pair a tumor sample with a matched normal sample taken from the same patient; a pair set groups multiple such tumor-normal pairs.
Here are some examples of .tsv spreadsheets representing different entity types:
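As a sketch, a flat sample table and a sample set membership table might look like the following. The entity IDs, attribute names, and bucket paths below are invented for illustration; the `entity:`/`membership:` header convention is Terra's load-file format.

```python
import csv
import io

# Hypothetical load-table TSVs. The first column header tells Terra the
# entity type: "entity:<type>_id" for a flat table, and
# "membership:<type>_set_id" for the membership file that builds a set.
sample_tsv = (
    "entity:sample_id\tparticipant\tbam\n"
    "sample_A\tsubject_01\tgs://my-bucket/sample_A.bam\n"
    "sample_B\tsubject_01\tgs://my-bucket/sample_B.bam\n"
)

sample_set_tsv = (
    "membership:sample_set_id\tsample\n"
    "cohort_1\tsample_A\n"
    "cohort_1\tsample_B\n"
)

# Parsing the flat table: one row per sample entity.
rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))
print([r["entity:sample_id"] for r in rows])  # ['sample_A', 'sample_B']

# Parsing the membership table: one row per (set, member) pair.
members = list(csv.DictReader(io.StringIO(sample_set_tsv), delimiter="\t"))
print([m["sample"] for m in members])  # ['sample_A', 'sample_B']
```

Note that the membership file does not repeat any sample attributes; it only points at sample IDs that must already exist in the flat table.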
In some cases, when working with nested data, you have to upload the .tsv files in a specific order, and you must then select “Create participant, sample, pair associations” when prompted for the tables to connect. This option appears in a pop-up window when you upload the files. For example, when you run Mutect2, if you upload the tables without selecting that checkbox, the workflow will fail because it reads from the Pair table but has no knowledge of the Sample table from which the Pair table is supposed to read.
The order is as follows ("A > B" means entity type A must be uploaded before entity type B):
- participants > samples
- samples > pairs
- participants > participant sets
- samples > sample sets
- pairs > pair sets
- set membership > set entity (e.g. participants > samples > sample set membership > sample set entity)
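The ordering rules above amount to a dependency sort. A minimal sketch, using illustrative lowercase type names rather than anything tied to a particular workspace:

```python
# Each entity type lists the types that must already exist before its
# TSV is uploaded (mirrors the "A > B" rules above).
DEPENDS_ON = {
    "participant": [],
    "sample": ["participant"],
    "pair": ["sample"],
    "participant_set": ["participant"],
    "sample_set": ["sample"],
    "pair_set": ["pair"],
}

def upload_order(types):
    """Return the given entity types sorted so that every type comes
    after everything it depends on (a simple topological sort)."""
    ordered, seen = [], set()
    def visit(t):
        if t in seen:
            return
        for dep in DEPENDS_ON.get(t, []):
            if dep in types:
                visit(dep)
        seen.add(t)
        ordered.append(t)
    for t in types:
        visit(t)
    return ordered

print(upload_order(["sample_set", "pair", "participant", "sample"]))
# ['participant', 'sample', 'sample_set', 'pair']
```

Any ordering this helper produces satisfies the rules; several valid orders exist because, for example, pairs and sample sets are independent of each other.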
Is this data entity flat or nested?
Many workflows expect nested data entities as inputs. You can think of the difference between a flat and a nested entity as the difference between a single file and folder containing a set of files. This distinction is similar (and related) to whether the entity should contain a single value or an array, but is also tied to the organization of the samples.
The difference between a flat data entity and a nested one is hierarchical - nested data entities are tables that organize individual pieces of data from flat entities. For example, a workspace may contain six separate samples, as well as a sample set composed of those individual samples. You should be able to see both the individual samples and the sample set in your data tab. If you click on a set, you should be able to see how the individual rows in that set point to individual samples or participant IDs. In the Optimus pipeline featured workspace, we include data from 8 tissue samples taken from 2 participants - 6 tissue samples from a mouse and 2 tissue samples from a human:
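In code terms, the hierarchy might be modeled roughly like this (the participant, sample, and set names are invented; the featured workspace uses its own IDs):

```python
# Flat tables hold individual entities; a set is just another table
# whose rows point back at those entities by ID.
participants = {"mouse_1": {}, "human_1": {}}

samples = {
    "s1": {"participant": "mouse_1"},
    "s2": {"participant": "mouse_1"},
    "s3": {"participant": "human_1"},
}

sample_sets = {"all_tissue": {"samples": ["s1", "s2", "s3"]}}

# Every member of a set must resolve to a row in the flat sample table.
members = sample_sets["all_tissue"]["samples"]
assert all(s in samples for s in members)
print([samples[s]["participant"] for s in members])
# ['mouse_1', 'mouse_1', 'human_1']
```

This is why the upload order matters: the set's rows are references, and a reference to a sample that has not been uploaded yet cannot resolve.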
Another important example applies to somatic workflows in general. Our Somatic Variant Discovery Best Practice guidelines stress the importance of using a matched normal sample to compare to the tumor sample. Workflows that compare tumor tissue with normal tissue typically expect a pair set as the input. If you attempt to launch a Mutect2 workflow by selecting the tumor and the normal files as individual samples, the workflow will fail.
The WRONG way:
The RIGHT way:
Is the data in question a single file/string or an array?
A data entity might refer to only one file or string (a URL pointing to a single BAM stored in the cloud), or it might refer to a set of files (for example, a set of BAMs for a cohort), in which case the entity must be treated as an array rather than a singular value.
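The distinction shows up directly in an inputs JSON; a hedged sketch with invented workflow and input names:

```python
import json

# A single-file input is a bare string; an array input is a list.
single_input = {"MyWorkflow.input_bam": "gs://bucket/sample_A.bam"}
array_input = {"MyWorkflow.input_bams": [
    "gs://bucket/sample_A.bam",
    "gs://bucket/sample_B.bam",
]}

assert isinstance(single_input["MyWorkflow.input_bam"], str)
assert isinstance(array_input["MyWorkflow.input_bams"], list)
print(json.dumps(array_input, indent=2))
```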
With GATK workflows, the entity types are preset for you, and you just need to make sure you know which data type is expected for the input. One easy way to check this is by navigating to the Workflows tab in your Terra account, clicking on a given workflow and using the tabs at the bottom of the page to look at the inputs and outputs for this workflow:
You can see under the column labeled "Type" that this particular workflow expects an array of files. This example is taken from our Optimus pipeline featured workspace, which expects an array due to its experimental design. If you wanted to run this workflow on single files rather than an array, you would need to modify the WDL script and re-upload the modified version. If you do this correctly, the value in the "Type" column will change to File.
Another difference between flat and nested entity types is in their syntax. If you go to the workflows tab, select a workflow, click on Inputs and scroll through its expected inputs, you can see the entity type under the Type column, and the syntax for handling that type under the Attributes column:
- For individual samples: This.Name
- For sample sets: This.Samples.Name
If your desired input is a single file, the syntax simply points directly at the file. If your desired input is a set of files nested inside of a folder, the syntax must first point to the correct folder, and then point to the desired files within. Looking at the Type and Attributes columns serves as a quick way to check how your workflow is set up.
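As a simplified model (not Terra's actual implementation), the two expressions resolve like this:

```python
# For a flat entity, "this.name" reads an attribute directly; for a set,
# "this.samples.name" first follows the set's member list, then reads
# the attribute from each member. All names here are illustrative.
samples = {
    "s1": {"name": "sample_A"},
    "s2": {"name": "sample_B"},
}
sample_set = {"samples": ["s1", "s2"]}

def resolve_flat(entity, attr):
    return entity[attr]

def resolve_nested(entity_set, member_table, attr):
    return [member_table[m][attr] for m in entity_set["samples"]]

print(resolve_flat(samples["s1"], "name"))          # sample_A
print(resolve_nested(sample_set, samples, "name"))  # ['sample_A', 'sample_B']
```

Note how the nested form necessarily produces an array: one value per member of the set.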
If you want to edit a workflow, you can do so either by writing your own WDL script from scratch, or by downloading the WDL script for your workflow of interest, modifying it as you wish, and re-uploading the method yourself. A detailed description of creating and editing workflows can be found here.
Editing the WDL script will change the expected input configuration. You can see this by clicking on the workflow in the Workflows tab and looking at the Inputs section there. Below is an example taken from the workflow that generates a "Panel of Normals" (PoN). When generating a PoN, this WDL script expects some of the following input types:
- A set of BAM files representing the list of normal samples. Since the purpose of this workflow is to create a PoN from a set of files, this input is handled as an Array.
- A reference file. Since a single reference file can be useful in a variety of tasks, this input is handled as a File.
- The name of a database used for informing the PoN generation (in this case, the gnomAD database is used to inform the tool of the allelic fractions within this germline resource). Since this task does not need to localize the entire gnomAD database, it is sufficient to designate the input as a String matching the name of the database. The name of the PoN file is also just a String.
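Put together as an inputs JSON, those three shapes might look like the following (the workflow and input names are invented; check the actual PoN WDL for the real ones):

```python
import json

# Hedged sketch of the input types described above.
pon_inputs = {
    "CreatePoN.normal_bams": [                       # Array[File]
        "gs://bucket/normal_1.bam",
        "gs://bucket/normal_2.bam",
    ],
    "CreatePoN.ref_fasta": "gs://bucket/ref.fasta",  # File
    "CreatePoN.gnomad": "gnomad",                    # String (name only)
    "CreatePoN.pon_name": "my_pon.vcf",              # String
}
print(json.dumps(pon_inputs, indent=2))
```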
For a detailed explanation of configuring your workflow inputs in the Terra interface, see this article. To learn more about scripting in WDL, you can read our WDL user guide or check out the OpenWDL community that was formed to steward the WDL language specification and advocate its adoption.