Workspace tables can make your research easier by helping manage all the original and generated data in different storage locations in the cloud. The Data Tables Quickstart workspace will introduce you to workspace data tables with hands-on practice using them to organize and analyze data in Terra.
First: Make your own copy of the Data Quickstart workspace
The Terra-Data-Tables-Quickstart featured workspace is “Read only”. For hands-on practice, you'll need to be able to store data in your workspace bucket and run workflows. Making your own copy of the Data-Tables-Quickstart workspace gives you that power. If you haven't already done so, make your own copy of this workspace by following the directions below.
Start by clicking on the circle with three dots in the upper right-hand corner and select "Clone" from the dropdown menu:
- Rename your copy something memorable
It may help to write down the name of your workspace
- Choose your billing project
Note that this can be free credits! Don’t worry, you’ll have plenty left over when you’ve completed the Quickstart exercises.
- Do not select an Authorization Domain, since these are only required when using restricted-access data
- Click the “Clone Workspace” button to make your own copy
Once you're in your own copy of the workspace, you'll be ready to get hands-on to learn about data tables!
Overview: How to manage data in the cloud with data tables
One advantage of analyzing in the cloud is that you're not limited to data stored on your local machine or cluster. You can run a workflow analysis on data you've uploaded to the workspace bucket or an external bucket, on data available in Terra's Data Library, or on data from numerous other data repositories. Even better, you can analyze data from many different sources together in one big analysis.
But to make the most of large numbers of large data files stored in different places, you need a way to keep original data organized and accessible to analysis tools in your workspace, and to keep track of generated data. In Terra, the dedicated way to organize and access project data is with workspace data tables. Data tables include unique IDs for every kind of data, and can associate metadata and links to the physical location of genomic data. You can run a workflow analysis directly on data from a table, and import data into a notebook cloud environment as well.
Tables link project components in the workspace
What does a table look like? What does it include?
Tables are like spreadsheets built into your workspace, so a table looks a lot like a spreadsheet.
- Rows - each one is a separate entity, like a particular sample or participant
- Columns - each is a different variable or type of metadata corresponding to the particular entity. Below are two classic examples of tables - one for genomic data and one for phenotypic data.
Example tables: genomic and phenotypic data
Terra accepts any kind of data table, so you can keep track of whatever "entities" you need. The two examples below are fairly common types of tables, but not the only ones. You can use the step-by-step instructions further down to create your own table.
Example 1: Genomic data table
Tables can help keep track of genomic data - both original and generated data files - no matter where the data are physically located.
A table of genomic data must have at least two columns to hold:
1) the unique ID for each distinct entity and
2) a link to the data file (the "cram_path" column below is a link to a CRAM file in a Google bucket)
The table can include as many other columns as you need for additional metadata (such as the data type, see below, or when and how the data were collected).
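As a rough sketch of the minimum genomic table described above, the Python below builds a two-column, tab-separated file. The `entity:specimen_id` header follows the naming convention Terra load files use for the ID column; the specimen IDs, bucket name, and file paths are made-up placeholders, not data from the Quickstart workspace.

```python
import csv
import io

# Hypothetical rows: one unique ID plus one link to a data file per entity.
rows = [
    {"entity:specimen_id": "specimen_01",
     "cram_path": "gs://example-bucket/specimen_01.cram"},
    {"entity:specimen_id": "specimen_02",
     "cram_path": "gs://example-bucket/specimen_02.cram"},
]


def table_to_tsv(rows):
    """Serialize a list of row dicts as tab-separated text, one line per entity."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(rows[0]),  # column order taken from the first row
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


tsv_text = table_to_tsv(rows)
print(tsv_text)
```

Extra metadata columns would simply be extra keys in each row dict.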
Example 2: Phenotypic data table
You can store phenotypic data directly in a workspace table.
A shared unique ID (such as the participant_id) links a participant's phenotypic data to genomic data in a different table.
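To illustrate how a shared ID ties the two kinds of tables together, here is a minimal Python sketch that looks up each genomic row's phenotypic data by `participant_id`. The table contents, column names other than `participant_id` and `cram_path`, and the helper `attach_phenotypes` are all hypothetical.

```python
# Hypothetical phenotypic table, keyed by participant_id.
phenotypes = {
    "participant_A": {"age": 54, "sex": "F"},
    "participant_B": {"age": 61, "sex": "M"},
}

# Hypothetical genomic table: each row points back to a participant.
genomic_rows = [
    {"sample_id": "sample_1", "participant_id": "participant_A",
     "cram_path": "gs://example-bucket/sample_1.cram"},
    {"sample_id": "sample_2", "participant_id": "participant_B",
     "cram_path": "gs://example-bucket/sample_2.cram"},
]


def attach_phenotypes(genomic_rows, phenotypes):
    """Return new rows combining each genomic row with its participant's phenotypic data."""
    combined = []
    for row in genomic_rows:
        merged = dict(row)  # copy so the original table is left untouched
        merged.update(phenotypes.get(row["participant_id"], {}))
        combined.append(merged)
    return combined


combined = attach_phenotypes(genomic_rows, phenotypes)
print(combined[0])
```

Neither table stores the other's data; the shared ID is the only link between them.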
Where's the data? Data in a workspace table can physically be anywhere in the cloud
The data isn't actually "in" the table (or even in the workspace, really...). Tables just help keep track of data - so you don't have to remember where the data files are actually located.
Being able to use the data that's physically located in an external bucket can be especially nice when you are working with large data files, as you do not have to pay to store the original data. It also means you can apply the techniques from this Quickstart tutorial to any data in Google Cloud Storage, as long as you have access to the data.
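Because a table cell only holds a link, it can help to see what such a link encodes. The sketch below splits a `gs://` link into its bucket and object name using plain string handling; the function name `parse_gs_uri` and the example bucket are assumptions for illustration, and nothing here contacts Google Cloud Storage.

```python
def parse_gs_uri(uri):
    """Split 'gs://bucket/path/to/object' into (bucket, object_name)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    bucket, _, object_name = uri[len(prefix):].partition("/")
    if not bucket or not object_name:
        raise ValueError(f"URI must name both a bucket and an object: {uri!r}")
    return bucket, object_name


# Hypothetical link to a file in an external bucket.
bucket, object_name = parse_gs_uri(
    "gs://example-external-bucket/reads/specimen_01.fastq"
)
print(bucket, object_name)
```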
1.1. Explore the Data Tables Quickstart specimen table
The Data Tables Quickstart workspace has been preloaded with a table of nine specimens, which we will work with (and add to) throughout the exercises.
What do the workflows in the Quickstart tutorial actually do? The Quickstart workflows don't do any actual analysis. They're custom-built to run quickly - and that's all.
The first workflow takes in a particular kind of genomic data file (sequencer reads in FASTQ format) and returns the file header in a text file.
The second workflow takes in a set of FASTQ files and returns one text file with a list of all the input file headers.
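The Quickstart workflows themselves are written in WDL and are not reproduced here. As a loose Python analogue of what the two workflows are described as doing, the sketch below reads the first line of one FASTQ file, then collects the first line of each file in a set. The function names and the tiny FASTQ files it creates are invented for the example.

```python
import tempfile
from pathlib import Path


def first_header(fastq_path):
    """Return the first line of a FASTQ file (the header of the first read)."""
    with open(fastq_path, "r") as handle:
        return handle.readline().rstrip("\n")


def collect_headers(fastq_paths):
    """Return the first header of each file, in input order."""
    return [first_header(path) for path in fastq_paths]


# Create two tiny, made-up FASTQ files to run the functions on.
workdir = Path(tempfile.mkdtemp())
paths = []
for name in ("read_a", "read_b"):
    path = workdir / f"{name}.fastq"
    path.write_text(f"@{name} example\nACGT\n+\nIIII\n")
    paths.append(path)

single = first_header(paths[0])   # analogue of the single-input workflow
headers = collect_headers(paths)  # analogue of the multi-input workflow
print(single)
print(headers)
```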
1. Navigate to the workspace Data page
You can do this by clicking on the "Data" tab at the top of your workspace. The Data page includes separate sections for input data (at the top), preloaded human reference files, and workspace-wide files (such as additional references, interval files, and docker files), as well as a link to the Google bucket (by clicking on the "Files" icon).
You'll see one "specimen" table in the top left of the TABLES column, which is reserved for input files. The number in parentheses tells you how many "specimens" there are:
Note: Your table can have any name (it can refer to any kind of entity)! It doesn't have to be specimens or samples, or participants.
2. Explore data for specimens in the table by clicking on the "specimen" link
Each row in the table corresponds to a distinct specimen, and each column is a different type of information about that specimen:
- "specimen ID" - a unique value used to identify each specimen
- "participant" - what species the specimen belongs to (human or mouse)
- "r1_fastq" - the actual specimen data, in this case a link to the FASTQ file that the workflow will use as input
Additional information goes in additional columns. Tables can have as many columns of metadata as you need. The minimum data table includes the ID column and one data column.
In fact, the "participant" column in this table is extra information.
Some specimens come from a mouse and some from human participants. You could include a column with the date the specimen was collected, for example, or information about the patient, or additional data files - whatever you want to associate with that specimen.
1.2. Run a workflow on a specimen in a table
To really understand how to use tables in Terra, it will help to run an "analysis" on data in a table - even though the Data Quickstart analysis workflow doesn't do any real analysis. It just reads the data file and outputs the header to a text file.
After selecting one specimen from this simple table, you'll run a pre-configured workflow on it to see how the workflow reads input from a table. Once the workflow is complete, you'll look at the same table again to see how it helps keep output and input data organized and associated with the right entity.
Step 1: Choose the specimen
1. Choose a specimen to process
After expanding the data table, click on any one of the nine specimens.
2. Click on the three vertical dots beside the "1 specimen selected" text and then choose the "Open with" option
3. Click the "Workflow" button in the modal that appears.
4. To select the workflow, click on "1-Single-Input-Workflow".
You'll be redirected to the workflow input form, where you will set up (configure) and launch the workflow. The configuration form is filled out, but it's always a good idea to look it over to confirm.
Step 2: Confirm workflow configuration (optional)
You should see the following on the workflow configuration form:
1. Choose radio button (left column, in the middle of the form): "Run workflow(s) with inputs defined by data table"
2. Step 1 dropdown: Select root entity type "specimen"
3. Step 2 button: Text beside the button should say "1 selected specimen"
4. Use call caching should be checked
5. Delete intermediate outputs should be unchecked
What's a "root entity"? According to the dictionary, an "entity" is "a thing with distinct and independent existence." In Terra, the "root entity" is the smallest amount of data that a workflow can use as input.
Tables are identified (named) according to their root entity. You can have tables of entities like samples (a "sample" table), tissues (a "tissue" table), or unicorns (a "unicorn" table), for example.
The inputs and outputs for the Data-Tables-Quickstart have been pre-configured to read from and write to the data table. To learn more about setting up a workflow analysis, see the Workflows-Quickstart or this article on how to set up a workflow analysis.
Why write outputs to the data table? Data generated from running a workflow is stored in the workspace bucket by default. However, tracking down generated data can be difficult because the folder names are random strings of characters, which are not human-readable.
To make it easier to keep track of generated data, and to associate the outputs with the right input data, you can tell the workflow to write outputs to the workspace table where it got the inputs using the "Outputs" section of the configuration form.
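The idea of writing outputs back to the table can be sketched in a few lines of Python: the link to the generated file is stored under a new column in the same row that supplied the input. This is only a conceptual model, not Terra's implementation; the table contents, the column name `first_line_file`, and the output path are hypothetical.

```python
def record_output(table, entity_id, attribute, output_path):
    """Store a link to a generated file in the row that supplied the input."""
    table[entity_id][attribute] = output_path
    return table


# Hypothetical specimen table: entity ID -> row of column values.
specimen_table = {
    "specimen_01": {"r1_fastq": "gs://example-bucket/specimen_01.fastq"},
}

record_output(
    specimen_table,
    "specimen_01",
    "first_line_file",  # hypothetical output column name
    "gs://example-workspace-bucket/a1b2c3/header.txt",  # made-up output location
)
print(specimen_table["specimen_01"])
```

After the call, the input link and the output link sit side by side in one row, which is what makes the generated file easy to find later.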
Step 3: Launch the workflow
Once you have selected the specimen to run, you can run the workflow.
1. Click the "RUN ANALYSIS" button.
2. Click the "LAUNCH" button to confirm.
1.3. Monitor your workflow
Once you press "Launch", Terra will submit your workflow to the virtual computer in the cloud and redirect you to your workspace Job History page, where you can monitor your submission to make sure everything is going well.
You'll see the status of your submission in the Job History page:
It could take a few minutes to queue and submit the workflow. If you were running a lot of computation, this is where you could go away and wait for your workflow to finish. Luckily, the Data Quickstart workflow should run very quickly: it only reads the top line of the input FASTQ file and writes the text to the output file. (It will take longer to set up the virtual machine than to actually run the WDL!) When it is done running, you will see the Status turn to "Done" along with a cheery green check mark:
1.4. Follow-up questions
Wait until your workflow completes successfully, then open the specimen table in your Data page again and think about the following.
- There's a link to the output generated by the workflows in a new column in the specimen table. See the additional column (screenshot below)? Does the name look familiar? (check in the Outputs part of the configuration card!).
Hopefully you can see from this example how connecting generated data to the original sample in the table helps you manage and organize data. Writing links to the generated files in a table makes it convenient to use the outputs in downstream analysis. Note how all the data files associated with this particular specimen are in the same row of the data table.
What does the workflow output look like? The output of this workflow is a text file that contains the FASTQ headers from the input file. If you click on the link, you'll get this screen, with the output file name, a preview of what's in it, and the file size:
- Generated data files are stored in the workspace bucket by default, but with random strings identifying the bucket and the folders.
Without the links in the data table, you would need to go down four levels of nested folders (labeled with the random strings of numbers and letters) in Google Cloud Storage to find the same files. Go to the "Files" icon at the bottom left of the workspace Data page to see!
In contrast, the links in a table have names that you may recognize. Go back to the "Outputs" tab of the workflow configuration card to check where this gets set up:
If you click on the link for the output file, you'll be able to view the file in the GCP console, see the full path to the output in the workspace bucket, or download the file to your local machine.
The WDL was set up to write to the data table
Using the "this. " format in the workflow configuration form tells the workflow to get the location for the input file from the data table. In the Data Tables QuickStart, it was already set up for you.
For a tutorial on setting up workflows, see the Terra Workflows QuickStart.
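To make the "this." idea concrete, here is a simplified Python sketch of how a reference such as `this.r1_fastq` could be resolved against one table row. It is a conceptual model only, not how Terra evaluates expressions internally; the function `resolve_input`, the row contents, and the choice to pass other values through unchanged are assumptions.

```python
def resolve_input(expression, row):
    """Resolve 'this.<column>' against a single table row.

    Anything that does not start with 'this.' is returned unchanged,
    which is a simplification made for this sketch.
    """
    prefix = "this."
    if not expression.startswith(prefix):
        return expression
    column = expression[len(prefix):]
    if column not in row:
        raise KeyError(f"row has no column named {column!r}")
    return row[column]


# Hypothetical row from the specimen table.
row = {
    "specimen_id": "specimen_01",
    "r1_fastq": "gs://example-bucket/specimen_01.fastq",
}

resolved = resolve_input("this.r1_fastq", row)
print(resolved)
```

The same expression works for every row, so one configuration can be reused for any specimen you select.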
The attribute field (in the workflow configuration form) becomes the header of the new table column
If you go to the "Outputs" part of the configuration form, you'll see the name that was pre-configured in the output attribute is the header in the new column:
Congratulations! You've completed Part 1 of the Data Tables Quickstart!