Workspace tables can make your research easier by helping you manage all the original and generated data in different storage locations in the cloud. The Data Tables QuickStart workspace will introduce you to workspace data tables with hands-on practice using them to organize and analyze data in Terra.
Start by going to the Data Tables QuickStart and making your own copy.
Overview: How to manage data in the cloud with workspace tables
- What does a data table look like?
- Where's the data?
Exercise 1.1. Explore the specimen table in the Data QuickStart workspace
Exercise 1.2. Run a workflow analysis on a single specimen in a table
Exercise 1.3. Follow-up thought questions
First - Make your own copy of the Data QuickStart workspace
The Data-QuickStart featured workspace is “Read only”. For hands-on practice, you'll need to be able to store data in your workspace bucket and run workflows. Making your own copy of the Data-QuickStart workspace gives you that power. If you haven't already done so, make your own copy of this workspace by following the directions below.
Start by clicking the circle with three dots in the upper right-hand corner and selecting "Clone" from the dropdown menu:
Step-by-step instructions + video tip
Once you're in your own copy of the workspace, you'll be ready to get hands-on to learn about data tables!
Overview: How to manage data in the cloud with workspace tables
One advantage of analyzing in the cloud is that you're not limited to data stored on your local machine or cluster. You can run a workflow analysis on data you've uploaded to the workspace bucket or an external bucket, on data available in Terra's Data Library, or on data in numerous other repositories. Even better, you can analyze data from many different sources together in one big analysis.
But to make the most of large numbers of large data files stored in different places, you need a way to keep original data organized and accessible to analysis tools in your workspace, and to keep track of generated data. In Terra, the dedicated way to organize and access project data is with workspace data tables. Data tables include unique IDs for every kind of data, and can associate metadata and links to the physical location of genomic data. You can run a workflow analysis directly on data from a table, and import data into a notebook cloud environment as well.
Tables link project components in the workspace
What does a table look like? What does it include?
Tables are basically spreadsheets built into your workspace, so a table looks a lot like a spreadsheet.
- Rows - each one is a separate entity, like a particular sample or participant
- Columns - each is a different variable or type of metadata corresponding to the particular entity. Below are two classic examples of tables - one for genomic data and one for phenotypic data.
Terra accepts any kind of data table, so you can keep track of whatever "entities" you need. The two examples below are fairly common, but they are not the only types of tables. You can use the step-by-step instructions further down to create your own table.
Example tables - genomic and phenotypic data
Example 1: Genomic data table
Tables can help keep track of genomic data - both original and generated data files - no matter where the data are physically located. A table of genomic data must have at least two columns to hold: 1) the unique ID for each distinct entity and 2) a link to the data file (the "cram_path" column below is a link to a CRAM file in a Google bucket). The table can include as many other columns as you need - for example, for additional metadata (such as the data type - see below - or when and how the data were collected):
Example 2: Phenotypic data table
You can store phenotypic data directly in a workspace table. A shared unique ID (such as the participant_id) links a participant's phenotypic data to genomic data in a different table:
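To make the idea of a shared unique ID concrete, here is a minimal Python sketch of two tables joined on `participant_id`. It only models the concept and is not how Terra stores tables; every ID, value, and bucket path is a made-up placeholder, and `link_tables` is a hypothetical helper.

```python
# Hypothetical rows: IDs, values, and bucket paths are placeholders, not real data.
genomic_table = [
    {"sample_id": "sample_01", "participant_id": "participant_A",
     "cram_path": "gs://example-bucket/sample_01.cram", "data_type": "WGS"},
    {"sample_id": "sample_02", "participant_id": "participant_B",
     "cram_path": "gs://example-bucket/sample_02.cram", "data_type": "WGS"},
]

phenotypic_table = [
    {"participant_id": "participant_A", "age": 54, "sex": "F"},
    {"participant_id": "participant_B", "age": 61, "sex": "M"},
]


def link_tables(genomic_rows, phenotypic_rows, key="participant_id"):
    """Attach each participant's phenotypic values to their genomic rows."""
    phenotypes_by_id = {row[key]: row for row in phenotypic_rows}
    linked = []
    for row in genomic_rows:
        # A participant with no phenotypic row contributes no extra columns.
        phenotype = phenotypes_by_id.get(row[key], {})
        linked.append({**phenotype, **row})
    return linked


linked_rows = link_tables(genomic_table, phenotypic_table)
for row in linked_rows:
    print(row["sample_id"], row["participant_id"], row["age"], row["cram_path"])
```

Keeping the two tables separate and joining on the shared ID means phenotypic values are stored once per participant, even when a participant has several genomic files.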
Where's the data?
Data in a workspace table can physically be anywhere in the cloud.
They're not actually "in" the table (or even in the workspace, really...).
Being able to use the data that's physically located in an external bucket
Setting the stage - Thinking about YOUR data
What does your data look like?
Before setting up your workspace tables, consider what data you have and how you'll use that data for analysis. Some questions to consider:
- What's the smallest piece of data you will be working with? This will be the table's "root entity type" - also the name of your table. In the QuickStart, we will be looking at specimens - and sets of specimens - and their associated data files. You may categorize your data as samples, or lanes of reads.
- Does your workflow take a single input file and generate a single output? Or does it take a number of input data files to generate one analysis output file?
What do the QuickStart workflows do?
The QuickStart workflows don't do any actual analysis. They're custom-
The first workflow takes in a particular kind of genomic data file
The second workflow takes in a set of FASTQ files and returns one text
Exercise 1.1. Explore the Data QuickStart specimen table
The Data QuickStart workspace has been preloaded with a table of nine specimens, which we will work with (and add to) throughout the exercises.
1. Navigate to the workspace Data page
You can do this by clicking on the "Data" tab at the top of your workspace. The Data page includes separate sections for input data (at the top), preloaded human reference files, and workspace-wide files (such as additional references, interval files, and docker files), as well as a link to the Google bucket (by clicking on the "Files" icon).
You'll see one "specimen" table in the top left of the TABLES column, which is reserved for input files. The number in parentheses tells you how many "specimens" there are:
Note: Your table can have any name (it can refer to any kind of entity)! It doesn't have to be specimens or samples, or participants.
2. To expand the table, click on the "specimen" link
Each row in the table corresponds to a distinct specimen, and each column is a different type of information about that specimen:
- "specimen ID" - a unique value used to identify each specimen
- "participant" - what species the specimen belongs to (human or mouse)
- "r1_fastq" - the actual specimen data, in this case a link to the FASTQ file that the workflow will use as input
3. Additional information in other columns
Tables can have as many columns of metadata as you need. In fact, the "participant" column in this table is extra information - some specimens come from a mouse and some from human participants. You could include a column with the date the specimen was collected, for example, or information about the patient, or additional data files - whatever you want to associate with that specimen.
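If it helps to think of the table as a file, the sketch below parses a small tab-separated version of a specimen table with Python's standard `csv` module. The three rows, their IDs, and the `gs://` links are placeholders, and the `entity:` prefix on the ID column is an assumption about how the table looks when saved as a TSV file, so check it against a file downloaded from your own workspace.

```python
import csv
import io

# Hypothetical TSV text; the IDs and gs:// links are placeholders.
SPECIMEN_TSV = (
    "entity:specimen_id\tparticipant\tr1_fastq\n"
    "specimen_1\thuman\tgs://example-bucket/specimen_1.fastq\n"
    "specimen_2\tmouse\tgs://example-bucket/specimen_2.fastq\n"
    "specimen_3\thuman\tgs://example-bucket/specimen_3.fastq\n"
)


def read_table(tsv_text):
    """Parse TSV text into a list of rows, one dict per entity."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for record in reader:
        # Drop the assumed "entity:" prefix so the ID column reads "specimen_id".
        rows.append({name.split(":", 1)[-1]: value for name, value in record.items()})
    return rows


specimens = read_table(SPECIMEN_TSV)
print(f"specimen ({len(specimens)})")  # mirrors the count shown next to the table name

# Extra metadata columns, such as "participant", make it easy to pick out subsets.
human = [row["specimen_id"] for row in specimens if row["participant"] == "human"]
print(human)
```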
Exercise 1.2. Run a workflow on a specimen in a table
To really understand how to use tables in Terra, it will help to run an "analysis" on data in a table - even though the Data QuickStart analysis just reads the data file and outputs the header to a text file. After selecting one specimen from this simple table, you'll run a workflow on it to see how the workflow reads input from a table. Then, you'll see how the table helps keep output and input data organized, and associated with the right entity, by looking at the same table after the workflow is complete.
Choose the specimen (step-by-step instructions)
- Choose a specimen to process (after expanding the table, click on any one of the nine specimens)
- Click on the three vertical dots beside the "1 specimen selected" text and then choose the "Open with" option
- Click the "Workflow" button in the modal that appears
- To select the workflow, click on "1-Single-Input-Workflow"
You'll be redirected to the workflow input form, where you will set up (configure) and launch the workflow. The configuration form is mostly filled out, but we'll walk through each option as you confirm that you have the right ones.
Confirm workflow configuration
- Input choice radio button (on the left side, in the middle of the form): "Run workflow(s) with inputs defined by data table"
- Step 1 dropdown: Verify the input table "root entity type" = the "specimen" table
- Step 2 dropdown: Verify the data you want to process ("1 selected specimen")
- Use call caching: Should be checked
- Delete intermediate outputs: Leave unchecked
What's a "root entity" in Terra?
According to the dictionary, an "entity" is "a thing with
Set the Inputs in the configuration form
- Input file (FASTQ): The first input variable is the FASTQ file, `r1_fastq`. You will use a particular format that tells the WDL to look in the table: begin the field with `this.` (note that this formatting requirement is unique to Terra).
When you start typing, you will get all available table column names in a dropdown. Select "this.r1_fastq" from the dropdown:
- Specimen ID: Scroll down to the "SampleName" variable - this is the unique ID for the specimen (the first column). Again, start typing `this.` and select "this.specimen_id" from the Attributes dropdown:
Note that the workflow we use in the data-QuickStart only has two input attribute fields. Many workflows include additional inputs such as reference data, index files, disk sizes and so on. You can change any of these in the same Inputs attributes form.
Variables versus attributes
The "Variable" is what is in the WDL code. The "Attribute" is the name of the column in the table. In this case, the names match exactly for the data input (`r1_fastq` in both) but not for the ID (`sample_id`, what the workflow calls the ID, is different from `specimen_id`, what is in the table).
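As a rough illustration of what the `this.` format does, the sketch below looks up each input expression in the selected row. It is not Terra's implementation: `resolve_inputs` is a hypothetical helper, the row values and bucket path are placeholders, and the variable names on the left of `input_config` simply stand in for the workflow's inputs.

```python
def resolve_inputs(input_config, row):
    """Replace each 'this.<column>' expression with the value from the row."""
    prefix = "this."
    resolved = {}
    for variable, expression in input_config.items():
        if expression.startswith(prefix):
            column = expression[len(prefix):]
            if column not in row:
                raise KeyError(f"table has no column named {column!r}")
            resolved[variable] = row[column]
        else:
            # Anything without the prefix is treated as a literal value in this sketch.
            resolved[variable] = expression
    return resolved


# One selected row of the specimen table (placeholder values).
selected_specimen = {
    "specimen_id": "specimen_1",
    "participant": "human",
    "r1_fastq": "gs://example-bucket/specimen_1.fastq",
}

# Workflow variable (left) mapped to a table attribute (right).
input_config = {
    "r1_fastq": "this.r1_fastq",
    "sample_id": "this.specimen_id",
}

workflow_inputs = resolve_inputs(input_config, selected_specimen)
print(workflow_inputs)
```

The variable and the attribute are looked up independently, which is why the ID can have one name in the workflow and another in the table.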
Set up the workflow to write output to the table
- Go to the "Outputs" section of the form
- Fill in the attribute for the "First_Line_Output" variable
You'll use the same formatting you did for Input Attributes. The workflow will make a new column in the table with the header you specify after `this.` in the output attribute field. You can name the attribute whatever you want, as long as you start with the `this.` formatting.
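Writing an output works like the same lookup in reverse: the text after `this.` becomes a column header, and the link to the generated file is stored in the selected row. Here is a minimal sketch under those assumptions, where `write_output`, the attribute name `first_line`, and the file path are all hypothetical.

```python
def write_output(row, output_attribute, value):
    """Return a copy of the row with the output stored under a new column."""
    prefix = "this."
    if not output_attribute.startswith(prefix):
        raise ValueError("output attribute must start with 'this.'")
    column = output_attribute[len(prefix):]
    updated = dict(row)  # copy, so the original row is left untouched
    updated[column] = value
    return updated


specimen_row = {
    "specimen_id": "specimen_1",
    "r1_fastq": "gs://example-bucket/specimen_1.fastq",
}

# "first_line" is whatever name you typed after "this." in the Outputs section.
updated_row = write_output(
    specimen_row, "this.first_line", "gs://example-bucket/outputs/first_line.txt"
)
print(sorted(updated_row))
```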
Why write outputs to the data table?
Data generated from running a workflow is stored in the
To make it easier to keep track of generated data, and to
Save inputs and outputs and launch the workflow
- Save your changes (configuration) by clicking the blue "Save" button above the Attributes column:
Notice that the "Run Analysis" button will turn blue:
- Click on the "RUN ANALYSIS" button
- Click the "LAUNCH" button to confirm your launch:
Congratulations! You've launched your first workflow on a data file in a workspace table!
Next step: Monitor your workflow
Once you press Launch, Terra will submit your workflow to the virtual computer in the cloud and redirect you to your workspace Job History page, where you can monitor your submission to make sure everything is going well. You'll see the status in the Job History page:
It could take a few minutes to queue and submit the workflow. If you were running a lot of computation, this is where you could go away and wait for your workflow to finish. Luckily, the Data QuickStart workflow should run very quickly: it only reads the top line of the input FASTQ file and writes the text to the output file. It will take longer to set up the virtual machine than to actually run the WDL! When it is done running, you will see the Status turn to "Done" along with a cheery green check mark:
Exercise 1.3. Follow-up thought questions
Wait until your workflow successfully completes and open the specimen table in your data page again. Then think about the following.
1. What additional information is in the specimen table?
Hopefully you can see from this example how connecting generated data to the original sample in the table helps you manage and organize data. Writing links to the generated files in a table makes it convenient to use the outputs in downstream analysis. Note how all the data files associated with this particular specimen are in the same row of the data table.
What does the workflow output look like?
The output of this workflow is a text file that contains the FASTQ headers
2. Where's the data? How do tables help me find what I need?
Without the links in the data table, you would need to go down four levels of nested folders (labeled with the random strings of numbers and letters) in Google Cloud Storage to find the same files. Go to the "Files" icon at the bottom left of the workspace Data page to see!
In contrast, the links in a table have names that you will recognize (because you set them up). If you click on a link, you'll be able to view the file in the GCP console, see the full path to the output in the workspace bucket, or download it to your local machine.
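To see how much of the full path is nested folders, you can split a link from the table into its parts. The sketch below uses only the Python standard library; the bucket name and the four folder names are placeholders (in a real workspace they are the long random strings described above), and `split_gs_uri` is a hypothetical helper.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def split_gs_uri(uri):
    """Split a gs:// link into bucket name, folder names, and file name."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs":
        raise ValueError(f"not a gs:// link: {uri}")
    object_path = PurePosixPath(parsed.path.lstrip("/"))
    return parsed.netloc, list(object_path.parent.parts), object_path.name


# Placeholder link; copy a real one from the new column in your specimen table.
output_link = (
    "gs://example-workspace-bucket/"
    "folder-1/folder-2/folder-3/folder-4/first_line.txt"
)

bucket, folders, filename = split_gs_uri(output_link)
print(bucket)
print(len(folders), "nested folders:", folders)
print(filename)
```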
3. Why did the workflow add columns to the data table?
Remember when you filled in the workflow configuration form? Under Inputs, you selected "this.r1_fastq" for the "r1_fastq" variable and "this.specimen_id" for the "sample_id" variable. Using the "this." format when setting up the workflow tells the workflow to get its input values, such as the location of the input file, from the data table.
The attribute field becomes the header of the new table column
If you go to the "Outputs" part of the configuration form, you'll see the name you specified in the output attribute is the header in the new column:
Next up: Add your own data table