Workspace tables can make your research easier by helping manage all the original and generated data in different storage locations in the cloud. The Data Tables Quickstart workspace will introduce you to workspace data tables with hands-on practice using them to organize and analyze data in Terra.
Overview: How to manage data in the cloud with data tables
- What does a data table look like?
- Where's the data?
Exercise 1.1. Explore the specimen table in the Data Tables Quickstart workspace
Exercise 1.2. Run a workflow analysis on a single specimen in a table
Exercise 1.3. Follow-up thought questions
First - Make your own copy of the Data Quickstart workspace
The Terra-Data-Tables-Quickstart featured workspace is “Read only”. For hands-on practice, you'll need to be able to store data in your workspace bucket and run workflows. Making your own copy of the Data-Tables-Quickstart workspace gives you that power. If you haven't already done so, make your own copy of this workspace by following the directions below.
Start by clicking on the circle with three dots in the upper right-hand corner and select "Clone" from the dropdown menu:
Step-by-step instructions + video tip
Once you're in your own copy of the workspace, you'll be ready to get hands-on to learn about data tables!
Overview: How to manage data in the cloud with data tables
One advantage of analyzing in the cloud is that you're not limited to data stored on your local machine or cluster. You can run a workflow analysis on data you've uploaded to the workspace bucket or an external bucket, or on data available in Terra's Data Library or numerous other data repositories. Even better, you can analyze data from many different sources together in one big analysis.
But to make the most of large numbers of large data files stored in different places, you need a way to keep original data organized and accessible to analysis tools in your workspace, and to keep track of generated data. In Terra, the dedicated way to organize and access project data is with workspace data tables. Data tables include unique IDs for every kind of data, and can associate metadata and links to the physical location of genomic data. You can run a workflow analysis directly on data from a table, and import data into a notebook cloud environment as well.
Tables link project components in the workspace
What does a table look like? What does it include?
Tables are basically spreadsheets built into your workspace, so a table looks a lot like a spreadsheet.
- Rows - each one is a separate entity, like a particular sample or participant
- Columns - each is a different variable or type of metadata corresponding to the particular entity. Below are two classic examples of tables - one for genomic data and one for phenotypic data.
Terra accepts any kind of data table, so you can keep track of whatever "entities" you need. The two examples below are fairly common types of tables, but not the only ones. You can use the step-by-step instructions further down to create your own table.
Example tables - genomic and phenotypic data
Example 1: Genomic data table
Tables can help keep track of genomic data - both original and generated data files - no matter where the data are physically located. A table of genomic data must have at least two columns to hold: 1) the unique ID for each distinct entity and 2) a link to the data file (the "cram_path" column below is a link to a CRAM file in a Google bucket). The table can include as many other columns as you need - for example, for additional metadata (such as the data type - see below - or when and how the data were collected):
Example 2: Phenotypic data table
You can store phenotypic data directly in a workspace table. A shared unique ID (such as the participant_id) links a participant's phenotypic data to genomic data in a different table:
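To make the link between the two example tables concrete, here is a minimal Python sketch (not Terra code) that models each table as a list of rows and joins them on the shared participant_id. Only the cram_path and participant_id column names come from the examples; the sample IDs, the "age" column, and the gs:// paths are invented for illustration.

```python
# Hypothetical rows: a genomic table (ID + link to a file) and a
# phenotypic table (values stored directly in the table).
genomic_table = [
    {"sample_id": "sample-001", "participant_id": "p-01",
     "cram_path": "gs://example-bucket/sample-001.cram"},
    {"sample_id": "sample-002", "participant_id": "p-02",
     "cram_path": "gs://example-bucket/sample-002.cram"},
]
phenotypic_table = [
    {"participant_id": "p-01", "age": 54},
    {"participant_id": "p-02", "age": 61},
]


def join_on_participant(genomic_rows, phenotypic_rows):
    """Attach each participant's phenotypic values to their genomic rows."""
    by_participant = {row["participant_id"]: row for row in phenotypic_rows}
    joined = []
    for row in genomic_rows:
        # Participants with no phenotypic row simply contribute nothing.
        phenotype = by_participant.get(row["participant_id"], {})
        joined.append({**phenotype, **row})
    return joined


joined_rows = join_on_participant(genomic_table, phenotypic_table)
for row in joined_rows:
    print(row["sample_id"], row["age"], row["cram_path"])
```

The point of the sketch is only that the shared ID is what ties the rows together; the data files themselves never move.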
Data in a workspace table can physically be anywhere in the cloud
The data files aren't actually "in" the table (or even in the workspace, really...). Tables just help keep track of data - so you don't have to remember where the data files are actually located.
Being able to use the data that's physically located in an external bucket can be especially nice when you are working with large data files, as you do not have to pay to store the original data. It also means you can apply the techniques from this Quickstart tutorial to any data in Google Cloud Storage, as long as you have access to the data.
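As a small illustration of the idea that a table cell holds only a pointer to the data, the Python sketch below splits a Google Cloud Storage link into its bucket and object name using the standard library. The bucket and file names are hypothetical, and the helper split_gs_path is introduced here purely for illustration.

```python
from urllib.parse import urlparse


def split_gs_path(gs_path):
    """Split a gs:// link into (bucket name, object name)."""
    parsed = urlparse(gs_path)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// path: {gs_path!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical link, as it might appear in a table cell.
bucket, object_name = split_gs_path(
    "gs://example-external-bucket/reads/sample-001.cram"
)
print(bucket)
print(object_name)
```

Whether the bucket belongs to your workspace or to someone else makes no difference to the table: the cell stores the same kind of link either way.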
The Quickstart workflows don't do any actual analysis. They're custom-built to run quickly
The first workflow takes in a particular kind of genomic data file (sequencer reads in
The second workflow takes in a set of FASTQ files and returns one text file with a list of all
Exercise 1.1. Explore the Data Tables Quickstart specimen table
The Data Tables Quickstart workspace has been preloaded with a table of nine specimens, which we will work with (and add to) throughout the exercises.
1. Navigate to the workspace Data page
You can do this by clicking on the "Data" tab at the top of your workspace. The Data page includes separate sections for input data (at the top), preloaded human reference files, and workspace-wide files (such as additional references, interval files, and docker files), as well as a link to the Google bucket (by clicking on the "Files" icon).
You'll see one "specimen" table in the top left of the TABLES column, which is reserved for input files. The number in parentheses tells you how many "specimens" there are:
Note: Your table can have any name (it can refer to any kind of entity)! It doesn't have to be specimens, samples, or participants.
2. Explore data for specimens in the table by clicking on the "specimen" link
Each row in the table corresponds to a distinct specimen, and each column is a different type of information about that specimen:
- "specimen ID" - a unique value used to identify each specimen
- "participant" - what species the specimen belongs to (human or mouse)
- "r1_fastq" - the actual specimen data, in this case a link to the FASTQ file that the workflow will use as input
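A table like this can also be written down as tab-separated text. The Python sketch below builds such a file with the standard csv module. It assumes the TSV load-file convention in which the first column header has the form entity:<table name>_id; treat that header format as an assumption to check against the Terra documentation. The specimen IDs and bucket paths are made up.

```python
import csv
import io

# Hypothetical specimens; the column names mirror the table described above.
specimens = [
    {"specimen_id": "specimen-1", "participant": "human",
     "r1_fastq": "gs://example-bucket/specimen-1_R1.fastq"},
    {"specimen_id": "specimen-2", "participant": "mouse",
     "r1_fastq": "gs://example-bucket/specimen-2_R1.fastq"},
]


def to_tsv(rows, table_name="specimen"):
    """Render rows as tab-separated text with the ID column first."""
    id_key = f"{table_name}_id"
    other_columns = [key for key in rows[0] if key != id_key]
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Assumed header convention: "entity:<table name>_id" marks the ID column.
    writer.writerow([f"entity:{id_key}"] + other_columns)
    for row in rows:
        writer.writerow([row[id_key]] + [row[col] for col in other_columns])
    return buffer.getvalue()


tsv_text = to_tsv(specimens)
print(tsv_text)
```

Each specimen becomes one line, and each column keeps the same meaning it has in the table view.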
Tables can have as many columns of metadata as you need. The minimum data table
In fact, the "participant" column in this table is extra information - some specimens
Exercise 1.2. Run a workflow on a specimen in a table
To really understand how to use tables in Terra, it will help to run an "analysis" on data in a table - even though the Data Quickstart analysis workflow doesn't do any real analysis. It just reads the data file and outputs the header to a text file.
After selecting one specimen from this simple table, you'll run a pre-configured workflow on it to see how the workflow reads input from a table. By looking at the same table after the workflow is complete, you'll see how the table helps keep output and input data organized and associated with the right entity.
Choose the specimen (expand for instructions)
- Choose a specimen to process
After expanding the data table, click on any one of the nine specimens.
- Click on the three vertical dots beside the "1 specimen selected" text and then choose the "Open with" option
- Click the "Workflow" button in the modal that appears
- To select the workflow, click on "1-Single-Input-Workflow"
You'll be redirected to the workflow input form, where you will set up (configure) and launch the workflow. The configuration form is filled out, but it's always a good idea to look it over to confirm.
Confirm workflow configuration (optional)
1. Choose radio button (left column, in the middle of the form): "Run workflow(s) with inputs defined by data table"
2. Step 1 dropdown: Select root entity type "specimen"
3. Step 2 button: Text beside the button should say "1 selected specimen"
4. "Use call caching" should be checked
5. "Delete intermediate outputs" should be unchecked
According to the dictionary, an "entity" is "a thing with distinct and
Tables are identified (named) according
The inputs and outputs for the Data-Tables-Quickstart have been pre-configured to read from and write to the data table. To learn more about setting up a workflow analysis, see the Workflows-Quickstart or this article on how to set up a workflow analysis.
Data generated from running a workflow is stored in the workspace bucket by default. However, tracking down generated data can be difficult because the folder names are random strings of characters, which are not human-readable.
To make it easier to keep track of generated data, and to associate the outputs with the right input data, you can tell the workflow to write outputs to the workspace table where it got the inputs using the "Outputs" section of the configuration form.
Launch the workflow
Once you have selected the specimen to run, you can run the workflow:
- Click on the "RUN ANALYSIS" button
- Click the "LAUNCH" button to confirm your launch:
You've launched your first workflow on a data file in a workspace table!
Next step: Monitor your workflow
Once you press "Launch", Terra will submit your workflow to the virtual computer in the cloud and redirect you to your workspace Job History page, where you can monitor your submission to make sure everything is going well.
You'll see the status of your submission in the Job History page:
It could take a few minutes to queue and submit the workflow. If you were running a lot of computation, this is where you could go away and wait for your workflow to finish. Luckily, the Data Quickstart workflow should run very quickly: it only reads the top line of the input FASTQ file and writes that text to the output file. It will take longer to set up the virtual machine than to actually run the WDL! When it is done running, you will see the Status turn to "Done" along with a cheery green check mark:
Exercise 1.3. Follow-up thought questions
Wait until your workflow successfully completes and open the specimen table in your data page again. Then think about the following.
1. What additional information is in the specimen table? (click for answer)
Hopefully you can see from this example how connecting generated data to the original sample in the table helps you manage and organize data. Writing links to the generated files in a table makes it convenient to use the outputs in downstream analysis. Note how all the data files associated with this particular specimen are in the same row of the data table.
The output of this workflow is a text file that contains the FASTQ headers from the input
2. Where's the data? How do tables help me find what I need? (click for answer)
Without the links in the data table, you would need to go down four levels of nested folders (labeled with the random strings of numbers and letters) in Google Cloud Storage to find the same files. Go to the "Files" icon at the bottom left of the workspace Data page to see!
In contrast, the links in a table have names that you may recognize. Go back to the "Outputs" tab of the workflow configuration card to check where this gets set up:
If you click on the link for the output file, you'll be able to view the file in the GCP console, see the full path to the output in the workspace bucket, or download it to your local machine.
3. Why did the workflow add columns to the data table? (click for answer)
Using the "this." format in the workflow configuration form tells the workflow to get the location of the input file from the data table. In the Data Tables Quickstart, it was already set up for you.
The attribute field becomes the header of the new table column
If you go to the "Outputs" part of the configuration form, you'll see the name that was pre-configured in the output attribute is the header in the new column:
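The Python sketch below imitates, in a very simplified way, what the "this." format does; it is not how Terra is implemented. An input expression such as this.r1_fastq is looked up in the selected row, and an output expression names the column that receives the result. The output attribute name first_line_output, the row values, and the output path are all hypothetical.

```python
def attribute_name(expression):
    """Return the column name from an expression of the form 'this.<name>'."""
    prefix = "this."
    if not expression.startswith(prefix) or len(expression) == len(prefix):
        raise ValueError(f"expected 'this.<attribute>', got {expression!r}")
    return expression[len(prefix):]


def resolve_input(expression, row):
    """Look up the input value for one row of the table."""
    return row[attribute_name(expression)]


def write_output(expression, row, value):
    """Store an output in the row; the attribute becomes the column header."""
    row[attribute_name(expression)] = value


# Hypothetical row from the specimen table.
row = {"specimen_id": "specimen-1",
       "r1_fastq": "gs://example-bucket/specimen-1_R1.fastq"}

input_file = resolve_input("this.r1_fastq", row)
# Hypothetical output location; real outputs land in the workspace bucket.
write_output("this.first_line_output", row,
             "gs://example-workspace-bucket/outputs/specimen-1_header.txt")

print(input_file)
print(sorted(row))
```

After the call to write_output, the row has a new key that did not exist before, which is the same effect as the new column appearing in the table once the workflow finishes.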