Terra (GCP) Quickstart 1: Data tables

Workspace tables can make your life easier by helping you manage data - including files from different cloud storage locations - in one place. Part 1 of the Terra Quickstart will give you hands-on practice organizing and accessing data with workspace data tables.

If you're new to working in the cloud, see Understanding data in the cloud for a useful conceptual overview of data in Terra.

About the Quickstart mock study

The Terra Quickstart is a set of three tutorials that walk through a fake (made up by User Ed!) study. Each tutorial covers Terra functionality used in a "typical" analysis journey in Terra. As with a real study, you will do all the tutorial exercises in a single Terra workspace.

Explore data > Process raw data > Visualize results

Study question: Is there a correlation between height and grades in school?

Your mission, should you choose to accept it, is to discover if there is a correlation between student height and grades by the end of all three Quickstart parts. Yes, the mock study is a little silly, but in doing it you'll learn how to use functionality typical for many bioinformatics investigations.

Part 1: Data tables: The workspace includes survey data (height plus subject grades for language arts, math and science) for 86 middle school students in the study.
Part 2: Workflows: You'll run a workflow to calculate the average GPA of each student in the study.
Part 3: Notebooks: You'll run an interactive Jupyter notebook plotting height versus the average GPA from Part 2 to answer the study question.

Overview: Data tables tutorial learning objectives

When working in Terra, you'll use data tables - like spreadsheets built into your workspace - to store and organize data. The data tables quickstart is intended to help you become familiar with data tables in Terra and how to use them to store and modify data in the cloud, right in your workspace.

After working through the exercises in the quickstart, you will know how to:

Organize and manage data in a data table in your Terra (GCP) workspace
Edit an existing table
Upload a table (TSV)
Import realistic biomedical data tables from a public workspace

Three steps to complete the Data tables Quickstart

Explore and manipulate data in the student table (preloaded)
Make a subset of students (a student_set table)
(optional) Import/explore a more realistic example data table for a use case that interests you

Estimated time and cost to complete
You should be able to complete the Quickstart tutorial in half an hour. Running the tutorial will cost less than $0.25 (Google Cloud data storage and VM costs).

Additional requirements and costs
You will need to have a Terra Billing project and your own copy of the Quickstart workspace to complete the tutorial.

First: Make your own copy of the Terra on GCP Quickstart workspace

The Terra on GCP Quickstart is “Read only”. For hands-on practice, you'll need to be able to upload data to workspace storage, which has a cost. Making your own copy of the Quickstart workspace gives you that power!

If you haven't already done so, you'll need to make your own copy of this workspace following the directions below.

Start by clicking the circle with three dots in the upper right-hand corner, then select Clone from the dropdown menu.

Rename your copy something memorable
It may help to write down the name of your workspace
Choose your billing project
Note that this can be free credits! Don’t worry, you’ll have plenty left over when you’ve completed the Quickstart exercises.
Do not select an Authorization Domain, since these are only required when using restricted-access data
Click the “Clone Workspace” button to make your own copy

Once you're in your own copy of the workspace, you can get hands-on to learn about data tables!

Walkthrough demo

1. Explore data in the student table

Tables are like spreadsheets built into your workspace. Part 1 of the data tables tutorial is a guided tour of how to organize and manipulate data in tables in Terra. It works much like your favorite spreadsheet editor.

For a conceptual overview of how tables can help organize data in the cloud, see Managing data in the cloud with tables.

Step-by-step instructions

The exercises below guide you through a number of spreadsheet-like actions you can take to manage and manipulate data tables right in your workspace (without downloading the table TSV and editing in a spreadsheet editor).

1.1. Open and examine the data table

1. Open your copy of the Terra on GCP Quickstart workspace.

2. Click on the Data tab. From the left hand side, click on the student table to see the full table.

3. Take a minute to look over the information in the columns and rows.

Thought questions

For background, don't forget to read the section About the Quickstart mock study above.

Answer: Each row in the student table is a different student.

Rows represent separate entities of data
Each row is a unique input (an "entity"). You can define and name the entity type of your table to match the input data you have.

If you were working with a sample table, each row would be a distinct sample.
Each column is a different piece of information corresponding to the student.

Tip: the first column in a table is always the key (unique ID) for the entity in that row. In the data tables quickstart, the first column is the Student ID.

The unique ID is often not human-readable (a random combination of numbers and letters). To make the table easier for a human to understand, there is often a column with a human-friendly ID. In this case, it is a column with the student's first name. Additional columns store the primary data (the GPAs for three different subjects, plus the student's height and grade level). There are also some columns of useful information, like the units used for the height measurement.
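To make the row, column, and key layout concrete, here is a minimal Python sketch that models a few rows of a student-like table and checks that the first column holds a unique ID for every row. The column names and values are invented for illustration and are not the actual Quickstart data; the sketch only mirrors the structure described above, not how Terra stores tables internally.

```python
import csv
import io

# Hypothetical rows: names and values are invented for illustration only.
TABLE_TSV = (
    "student_id\tfirst_name\tgpa_math\theight\theight_units\n"
    "AA101\tAda\t3.5\t60\tinches\n"
    "BB202\tBen\t3.1\t62\tinches\n"
    "CC303\tCam\t3.9\t58\tinches\n"
)


def load_table(tsv_text):
    """Parse TSV text into (key_column_name, list of row dicts)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)
    key_column = reader.fieldnames[0]  # the first column is the entity key
    return key_column, rows


def check_unique_keys(key_column, rows):
    """Return the number of distinct IDs; raise ValueError on a duplicate."""
    seen = set()
    for row in rows:
        key = row[key_column]
        if key in seen:
            raise ValueError(f"duplicate ID: {key}")
        seen.add(key)
    return len(seen)


key_column, rows = load_table(TABLE_TSV)
print(key_column, check_unique_keys(key_column, rows))
```

Each dictionary in `rows` plays the role of one entity (one student), and each dictionary key plays the role of a column.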

What data can be stored in a table?
Like spreadsheets, tables are very flexible and can contain as many rows and columns as you need, of whatever data you have. For example, a sample table could include information about the samples (like a column for the study name or accession number, or the date the sample was taken) as well as a link to the genomic data stored in a Google bucket.

It doesn't hurt to add additional columns of anything you want to remember about that entity. You can always hide columns in your workspace!
The student table includes both primary data (heights and GPAs that will be used in the study) and metadata (additional useful data, in this case). It could also include links to files stored externally in a Google bucket, like pictures or other school records.

Primary data versus metadata
Primary data in a Terra data table is the sort that would traditionally be in a CSV. Examples are phenotypic data, i.e., a subject’s clinical information such as disease symptoms, lab results, or demographic data, including age, ethnicity, and gender.

Metadata is data about data. This could be links to genomic data files, or other information about the genomic data files (file size, the date they were created, experimental process). This is how Terra "stores" large genomic files in a table. The data files are physically located in workspace or external cloud storage. The table keeps track of the links to the files, and lets you use the data no matter where it's actually located.
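As a rough illustration of a table row that mixes values with links to files, the Python sketch below picks out the columns whose value looks like a Google Cloud Storage URI (these begin with `gs://`). The row, its column names, and the bucket path are all hypothetical; this is only a way to picture the idea, not a description of Terra's own logic.

```python
# Hypothetical sample row: names, values, and the bucket path are invented.
sample_row = {
    "sample_id": "S001",
    "age": "42",
    "cram_path": "gs://example-bucket/S001.cram",
    "cram_size_bytes": "1048576",
}


def file_link_columns(row):
    """Return the column names whose value looks like a gs:// URI."""
    return [
        name for name, value in row.items()
        if str(value).startswith("gs://")
    ]


print(file_link_columns(sample_row))
```

The table cell holds only the link; the file itself stays wherever it is stored in the cloud.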

1.2. Customize your table view

To customize what you see in your workspace while still keeping all the information, you can hide or move certain columns. In this exercise you'll hide the Height_units column (you don’t need this column to complete your mission).

Hide the Height_units column

1. Click on the gear icon on the top menu to open the Settings.

2. Uncheck the Height_units column to hide it.

3. Using the pixelated bar, drag and drop the Grade column to below the Height_units column.

4. Click Done.

Note that you can save the custom view, to share with colleagues, for example.

1.3. Sort and search data

Need to find a particular student but don't want to scroll through the entire table? You can search within a single table or across all tables in the workspace using the search fields.

You can also sort data in ascending or descending order by column.

Search within this table

1. Try looking for Hrdika in the table using the Search field at the top right of the table.

Sort by column values

2. Click the blue arrow at the top of the first column to sort the student table by ascending or descending order of the student IDs.
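If it helps to think of search and sort in code terms, the following Python sketch does the equivalent on a small in-memory list of rows. The rows are hypothetical, and the functions are an analogy for what the Search field and the column arrow do in the interface, not Terra's implementation.

```python
# Hypothetical rows standing in for a few entries of a student table.
students = [
    {"student_id": "CC303", "first_name": "Cam"},
    {"student_id": "AA101", "first_name": "Ada"},
    {"student_id": "BB202", "first_name": "Ben"},
]


def search(rows, text):
    """Return rows in which any field contains `text` (case-insensitive)."""
    needle = text.lower()
    return [
        row for row in rows
        if any(needle in str(value).lower() for value in row.values())
    ]


def sort_by(rows, column, descending=False):
    """Return a new list of rows sorted by the given column."""
    return sorted(rows, key=lambda row: row[column], reverse=descending)


print(search(students, "ada"))
print([row["student_id"] for row in sort_by(students, "student_id")])
```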

1.4. Change a single field of data

Change a student's name

1. Click on the student named Dulce in the first-name column.

2. Click on the pencil icon to open the edit value box.

3. Change Dulce’s name to Sabrosa, then click Save changes.

1.5. Add or delete data

Add a row (i.e. a new student)

1. Add a row by clicking the pencil icon next to Edit, then select the Add row option.

2. You will need to add a new student ID (ZZ637) and fill in the values for the new student you are adding (First name, GPA_language arts, GPA maths, GPA_science, Grade, Height, Height units) then click add.

You can make up these values!

Delete the row you just created

3. Select the checkbox for that row, then click Edit (pencil icon above the table) and Delete selected rows.

Add a column of data

1. Click on the pencil icon next to Edit, then select the Add column option.

2. Enter the column name and the type of value the column will hold.

For example, if you wanted to add the students’ last names, you would enter Last-name in the Column name field and select value type “string”.

You can make up a column of metadata about the student!

Delete or clear a column

3. Delete or clear the column you just created by clicking on the three dot action icon in the column header and selecting the appropriate option.

For a comprehensive list of things you can do in a workspace data table, see the Organizing data with tables section.

2: Make a subset of the student data

There are times when you may want to run an analysis on the same subset of the data many times (when testing a workflow, for example). Selecting the right rows of data manually every time is error-prone and tedious, but Terra has a way to define particular subsets, which you'll learn about below.

Step-by-step instructions to make a student_set table

2.1. Sort the student table alphabetically by their ID by clicking the arrow in the student_id column. Make sure it's sorted in ascending order (As at the top...).

2.2. Check off the top eight rows of students.

2.3. Click the Edit button (pencil icon) above the table entries.

2.4. Choose Save selection as set from the menu.

2.5. Name your set subset-8 and click save.

What to expect

Notice that there is a new student_set table in the workspace. Open this table to see what it contains.

The student_set table includes one row and two columns.

Each row is a unique set (subset of the student data)

There's only one row, which corresponds to the set subset-8 that you just made.

Columns

The first column is the unique student set ID: subset-8. The second column is an array that includes all the students in the set. Note that the values in the student_set table reference the unique student IDs from the student table.
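The relationship between the two tables can be pictured with a short Python sketch: sort the IDs in ascending order, take the first few, and store them as a list under the set's ID. The student IDs, the set name `subset-3`, and the helper functions are hypothetical; the sketch only illustrates that a set row holds references to IDs in the student table rather than copies of the student data.

```python
# Hypothetical student IDs standing in for the first column of a student table.
student_ids = ["BB202", "AA101", "DD404", "CC303", "EE505"]


def make_set(set_id, all_ids, size):
    """Sort IDs ascending, take the first `size`, and store them under set_id."""
    members = sorted(all_ids)[:size]
    return {set_id: members}


def missing_members(set_table, all_ids):
    """Return any set members that do not match an ID in the student table."""
    known = set(all_ids)
    return [
        member
        for members in set_table.values()
        for member in members
        if member not in known
    ]


student_set = make_set("subset-3", student_ids, 3)
print(student_set)
print(missing_members(student_set, student_ids))
```

Because the set stores only IDs, any change to a student's row in the student table is automatically what you see when you look that student up through the set.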

Additional tables in a workspace

Besides the student and student_set (input data tables), there are two other special kinds of tables in every Terra Workspace: Reference Data (in the middle of the left-hand column), and Workspace Data (under Reference Data).

Reference data - Includes a variety of pre-loaded reference data you can add, for easy reference. These reference files can be quite large! They are stored and paid for by the Broad.
Workspace Data - This special table is for keeping workspace-level files and variables that you might use for analyses across different inputs. Examples include Docker files, CSVs stored in external buckets, or reference data not available in the pre-configured options. See How to add workspace-level input data for more details.

We won’t be using these in the data tables quickstart, but you’ll get to work with them when you move on to the Workflows Quickstart tutorial.

3 (optional): Import (realistic) data to a workspace

Now that you've worked with the Quickstart mock data, you hopefully understand a bit more about data tables in your workspace and why they’re useful when working in the cloud in Terra.

Now let's explore more realistic data tables (in the Showcase Workspaces Library) and walk through how to add one to your workspace. To see a more relevant example of a data table in Terra, pick a workspace from the Featured Workspaces Library. You can filter (on the left) by scientific use case or experimental strategy. Once you've found a workspace that interests you, follow the steps below to import it to your workspace to explore.

Step-by-step instructions to copy data from another workspace

1. Go to the Data page.

2. Click on the three vertical dots to the right of the table to import.

3. Click Export to workspace.

4. Choose your workspace from the dropdown and click the blue Copy button.

Modifying a template data table
Once you copy the data table to your own workspace you can edit it right in Terra (see How to edit and modify data tables for more details). You can also download a TSV and modify with your favorite spreadsheet editor. Note that the first column header for a sample table, for example, sometimes has the formatting entity:sample_id. The "entity:" and "_id" parts are optional.
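If you prefer to build or edit a table TSV with a script rather than a spreadsheet editor, a minimal Python sketch might look like the one below. It writes tab-separated text whose first column header uses the entity:sample_id formatting mentioned in the note above. The `to_tsv` helper, the sample IDs, and the `study` column are hypothetical, and the sketch assumes the first key of each row dictionary is the ID.

```python
import csv
import io


def to_tsv(entity_type, rows):
    """Render rows (list of dicts) as TSV text with an entity-style ID header.

    Assumes the first key of each row dict holds the unique ID.
    """
    columns = list(rows[0].keys())
    header = [f"entity:{entity_type}_id"] + columns[1:]
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        writer.writerow([row[column] for column in columns])
    return buffer.getvalue()


# Hypothetical rows, invented for illustration only.
samples = [
    {"sample_id": "S001", "study": "demo"},
    {"sample_id": "S002", "study": "demo"},
]

print(to_tsv("sample", samples))
```

To save the result to a file, write the returned string with `open("sample.tsv", "w")`; the file name here is only an example.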

Takeaway and next steps (analyze the data)

Now that you've completed the Data Quickstart, you should know/understand:

Data tables in Terra work like spreadsheets
How to modify/organize data in a data table

Next: Run a workflow to process the raw data (optional)

You've explored the mock study data, and it's time to run an analysis!

In the Workflows-Quickstart you will run a pre-configured workflow to get the average GPA (averaged over all three subjects) for students in the 8-person cohort and then set up a workflow from scratch to process the complete dataset of 86 students.
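As a preview of the arithmetic involved, averaging a student's GPA over the three subjects could be sketched in Python as follows. The function name and the example grades are hypothetical, and this is not the workflow's actual code, just the same calculation done by hand.

```python
from statistics import mean


def average_gpa(language_arts, math, science):
    """Return the mean of the three subject GPAs for one student."""
    return mean([language_arts, math, science])


# Hypothetical grades for one student.
print(average_gpa(3.0, 3.5, 4.0))
```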

Final step: Plot the results in a notebook

If you don’t need to learn how to set up and run workflows, you can skip right to the Notebooks-Quickstart tutorial to learn how to set up and run an interactive analysis to visualize the processed data.