Get up and running with workflows on Terra in less than half an hour. This is the second in a series of three Quickstart tutorials that walk through a mock study of the correlation between height and grades for a cohort of 7th, 8th, and 9th graders. In the Workflows tutorial, you'll process the study data using a workflow to calculate the average GPA of the students in your study.
Prerequisite
You should have already completed the Data Tables tutorial. You will work in the same copy of the Terra on GCP Quickstart workspace to get hands-on setting up and running workflows.
Workflows tutorial learning objectives
The Workflows Quickstart tutorial is intended to familiarize you with the process of setting up and running workflows on data stored or referenced in a workspace data table.
After working through the exercises in the quickstart, you will know how to
- Set up and run a workflow to run on single entities in an entity table
- Generate and use a Workspace Data table for workspace-level resources
- Set up and run a workflow to run on a set of entities
- Run a workflow on output data from a previous analysis (bonus)
You will also understand
- How and why to write output data back to the input table when running a workflow.
- How using sets can help streamline running multiple workflows on the same subset of data.
Three steps to complete the workflows quickstart
- Calculate each student's average GPA by running a pre-configured workflow on data from the student table for the subset of 8 students you created in the Data Tables tutorial.
- Calculate the average GPA for every student in the dataset by setting up and running a workflow from scratch on data in the student table.
- (optional bonus) Calculate the 7th grade class average GPA by setting up and running a workflow on generated data from part 2.
Estimated time and cost to complete
You should be able to complete the Quickstart tutorial in less than an hour. Running the tutorial will cost less than $0.25 (Google Cloud data storage and VM costs).
Workflows Quickstart step-by-step guide and video
Work in your own copy of the Quickstart workspace
Now that you've gotten familiar with the mock study data, you're ready to process it. Running a workflow to calculate each student's grade point average will give you hands-on experience analyzing data with workflows!
Prerequisites
You should already have completed Terra (GCP) Quickstart Part 1: Data tables tutorial in your own copy of the Terra on GCP Quickstart workspace.
Video walkthrough instructions
1: Run a preconfigured workflow
In this exercise, the workflow has already been set up for you, so all you need to do is select data (which students to run) and launch the workflow. It's often good to test a workflow on a subset of the data, so you'll run this first workflow on the subset-8 cohort of eight students.
What you will learn
This exercise will give you a feel for the mechanics of running a workflow as well as how to monitor a workflow once you submit it.
Step-by-step instructions to run your first workflow
1.1. Start by going to the Workflows page.
1.2. Select the 1_CalculateStudentGPA workflow (click the card). This will reveal the workflow configuration form where you'll set up the workflow to run on your data.
1.3. Confirm root entity type = "student". The root entity type is the table that contains the input data.
1.4. Click the "Select Data" button. This will take you to the Select Data form.
1.5. Select the Choose Existing Sets of Students radio button.
1.6. Choose the subset-8 set (created in the Data Tables Quickstart).
1.7. Click the blue OK button to finalize your selection.
1.8. Click the run analysis button to launch the workflows. Terra will launch 8 workflow jobs in parallel (one for each student in the subset-8 set).
1.9. Refresh the Job History page to monitor the submission status.
1.10. When the job is complete, you'll see a green checkmark in the Status column. Go back to the Data page, click on the student table to open it and answer the following questions.
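The fan-out in step 1.8 can be pictured with a small sketch. This is a toy model, not Terra's implementation: the student IDs, the dictionary layout of the set table, and the `plan_jobs` helper are all hypothetical, chosen only to show that selecting a set while the root entity type is student yields one workflow job per member of the set.

```python
# Toy model of a submission: one workflow job per entity in the selected set.
# The student IDs and the table layout are invented for illustration.

student_set = {
    "subset-8": ["S01", "S02", "S03", "S04", "S05", "S06", "S07", "S08"],
}


def plan_jobs(workflow_name, set_table, set_id):
    """Return one job description per member of the chosen set."""
    members = set_table[set_id]
    return [{"workflow": workflow_name, "entity": member} for member in members]


jobs = plan_jobs("1_CalculateStudentGPA", student_set, "subset-8")
print(len(jobs))  # number of jobs equals the number of students in the set
```

Each job in this sketch reads its inputs from a single row of the student table, which is why the submission shows one entry per student.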
Thought questions
-
The "root entity type" is the table that contains the input data.
In this exercise, it is the student table (with the arrow pointing to it). The data the workflow will use are each student's GPAs for language arts, math, and science (circled in the screenshot below).
-
After running the workflow, there is an additional column in the student data table.
The final_GPA column (circled in the screenshot below) stores the output data from running the workflow.
Where did the new column come from?
The workflow was configured to write outputs back to the input data table in a new column.
To see this, go back and look at the Outputs tab of the workflow configuration form (click the 1_CalculateStudentGPA card and the Outputs tab).
Notice the name of the new column is the attribute value for the GPA variable in the Outputs tab (screenshot above).
2: Set up and run the workflow from scratch
In this exercise, you'll run the same workflow, but this time the configuration card is blank. You'll learn how to set up a workflow with a blank configuration card, which is handy because not all workflows will be pre-configured!
What you will learn
This walks you through setting up the workflow from scratch. You will need to add the input attributes for this workflow yourself, using exercise 1 for reference.
Step-by-step instructions to set up a workflow
In addition to choosing what students to analyze, there are two additional steps to configure a workflow to run on data in a table.
- Specify input values (what column in the data table corresponds to what variables in the WDL)
- Set outputs to be written back to the data table
Step 1: Choose input data
2.1. Select the 2_CalculateStudentGPA workflow (click the card).
2.2. Click the Select data button and select all the students by clicking the check box at the top left of the table.
Step 2: Specify input values
You will need to fill in the attribute fields for all the required variables.
2.3. Go to the Input tab of the setup form.
2.4. Start by clicking into the first attribute field.
2.5. Select the appropriate attribute from the dropdown menu.
Hint: Use the variable name (second column) to help figure out what attribute to choose.
What does this formatting mean?
The prefix this. tells Terra to look in the root entity table. The dropdown includes all the columns (possible input data) from the root entity table.
2.6. Repeat for each variable with a blank attribute field except num_scores (you'll do this next!).
2.7. Click the blue Save button at the top right of the form.
Step 3: Configure workspace-level variables
The third variable, num_scores, is used across all inputs. In this case, it's the total number of courses the workflow averages over, and it has the same value (3) for every student. Such workspace-level variables have a special table, the Workspace Data table (in the Other data section on the left-hand side of the Data page).
2.8. Start typing workspace. in the attribute field.
2.9. Select workspace.number-of-courses from the dropdown.
2.10. Click the Save button at the top to save your selection.
What does this formatting mean?
The prefix workspace. tells Terra to look in the Workspace Data table. The dropdown includes all the columns from the Workspace Data table.
What to expect

| Task Name | Variable | Type | Attribute |
| --- | --- | --- | --- |
| CalculateStudentGPA | language_score | Float | this.GPA_language_arts |
| CalculateStudentGPA | math_score | Float | this.GPA_maths |
| CalculateStudentGPA | num_scores | Int | workspace.number-of-courses |
| CalculateStudentGPA | science_score | Float | this.GPA_science |
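To make the two prefixes concrete, here is a minimal sketch of how the attribute bindings listed above could be resolved for one student. It is an illustration only, not how Terra works internally: the `resolve` function, the dictionary layouts, and the score values are assumptions, while the column and variable names come from the tutorial.

```python
# Toy resolver for the two attribute prefixes used in the setup form.
# Score values are invented; Terra's real lookup logic is not shown here.

student_row = {  # one row of the root entity table (hypothetical values)
    "student_id": "S01",
    "GPA_language_arts": 3.0,
    "GPA_maths": 3.5,
    "GPA_science": 4.0,
}
workspace_data = {"number-of-courses": 3}  # stands in for the Workspace Data table


def resolve(attribute, row, workspace):
    """Look up this.<column> in the row and workspace.<key> in workspace data."""
    prefix, _, name = attribute.partition(".")
    if prefix == "this":
        return row[name]
    if prefix == "workspace":
        return workspace[name]
    raise ValueError(f"unsupported attribute expression: {attribute}")


bindings = {
    "language_score": "this.GPA_language_arts",
    "math_score": "this.GPA_maths",
    "num_scores": "workspace.number-of-courses",
    "science_score": "this.GPA_science",
}
inputs = {var: resolve(attr, student_row, workspace_data) for var, attr in bindings.items()}
print(inputs)
```

The per-student values change from row to row, while the workspace value is read from the same place for every job.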
Step 4: Write outputs back to the table
You can set up the workflow to write outputs back to the input table. In this case, the output is a single number per student, which Terra will add in a new column in the input table. If your workflow generates large data files (stored in workspace storage by default), this step will write the data file URI to the input table, making it much easier to associate outputs with inputs.
2.11. Start in the Outputs tab of the setup form.
2.12. For the first output variable, "gpa", go to the attribute field and type this. followed by a column name for your output in the table.
Hint: Use something different from final_GPA (from Exercise 1).
2.13. Click the blue OK button to finalize your selection.
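The write-back described in Step 4 can be sketched as follows. Everything here is an assumption made for illustration: the averaging formula (sum of the three scores divided by num_scores), the `write_back` helper, the score values, and the column name `my_gpa` are hypothetical and are not taken from the workflow's WDL.

```python
# Sketch: compute one output and store it in the column named by this.<column>.

def calculate_gpa(language_score, math_score, science_score, num_scores):
    # Assumed formula: sum of the subject scores divided by the number of courses.
    return (language_score + math_score + science_score) / num_scores


def write_back(row, output_attribute, value):
    """Store a workflow output in the row under the column named after 'this.'."""
    if not output_attribute.startswith("this."):
        raise ValueError("this sketch only handles this.<column> outputs")
    column = output_attribute[len("this."):]
    row[column] = value
    return row


row = {"student_id": "S01", "GPA_language_arts": 3.0, "GPA_maths": 3.5, "GPA_science": 4.0}
gpa = calculate_gpa(row["GPA_language_arts"], row["GPA_maths"], row["GPA_science"], 3)
write_back(row, "this.my_gpa", gpa)
print(row["my_gpa"])
```

Because the result lands in the same row as the inputs, the output stays associated with the student it was computed for.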
Step 5: Launch the workflow
2.14. Now that you have selected the data and set up the workflow inputs and outputs, you can click the run analysis button to launch the workflows.
Exercise 2 Thought Questions
-
This workflow accepts single entities as inputs - all the input data are found in a single row in the student table (the subject grades are all different variables corresponding to separate columns in the table).
-
Answer: The root entity table is defined as the table that contains the primary input for a workflow.
In this exercise, the root entity type is student.
-
Answer: You specified the columns corresponding to each input variable in the workflow configuration form.
In this exercise, each student's GPAs (for language arts, math, and science) are stored in the student table.
-
Answer: You specify the table with the input data when you choose the root entity type (arrow).
Then you tell the workflow which column contains the input data (or file) using the format this.YOUR-DATA-COLUMN-NAME in the Attributes field of the setup form (circled).
3: (optional) Run a follow-up workflow on output data
In this exercise, you will use the seventh grade students' average GPAs (outputs from exercise 2) to calculate the class average.
What you will do
- Make a set of the students in seventh grade
- Configure the 3_CalculateClassGPA workflow to take the 7th graders' GPAs as input and output a single average GPA to the student_set table.
- Run the 3_CalculateClassGPA workflow on the seventh grade set.
What you will learn
This exercise demonstrates how to set up and run a follow-up workflow on generated data.
What is different
The 3_CalculateClassGPA workflow takes in multiple students' total GPAs (an array of data) and outputs a single value. Because of this nested table structure, the workflow setup is a little different.
Step-by-step instructions to set up and run the follow-up workflow
Step 1: Create the set of seventh graders
3.1. Go to the student table and click the three-dot action menu at the top right of the Grade column to filter by grade.
3.2. Type 7 into the field with the magnifying glass icon (where it says "Exact match filter") and hit enter or return to filter.
3.3. Click the checkbox at the top left of the table to select all the students in seventh grade.
3.4. Click the Edit icon and select Save selection as a set from the dropdown.
3.5. Name the set 7th-graders and Save.
3.6. Select the 3_CalculateClassGPA workflow (click the card).
3.7. Select the root entity type student_set.
3.8. Click the Select Data button, and select the 7th-graders set you just created.
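Steps 3.1 to 3.5 amount to filtering the student table and saving the matching IDs as one row of the student_set table. A minimal sketch under stated assumptions: the `save_selection_as_set` helper, the student IDs, and the grade values are invented, while the column names Grade, student_set_id, and students follow the tutorial.

```python
# Sketch: filter a table by grade and save the selection as a set.
# Student IDs and grades are made up for illustration.

students = [
    {"student_id": "S01", "Grade": 7},
    {"student_id": "S02", "Grade": 8},
    {"student_id": "S03", "Grade": 7},
    {"student_id": "S04", "Grade": 9},
]


def save_selection_as_set(rows, set_id, predicate):
    """Build a student_set row listing the ID of every row that passes the filter."""
    members = [row["student_id"] for row in rows if predicate(row)]
    return {"student_set_id": set_id, "students": members}


seventh_graders = save_selection_as_set(students, "7th-graders", lambda row: row["Grade"] == 7)
print(seventh_graders)
```

The set row stores only references to students, not copies of their data, which is why the follow-up workflow has to reach through the set into the student table.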
Step 2: Configure the workflow
3.9. Take a look at the configuration pane (filled in) to answer the following questions.
-
This workflow accepts an array as input (the average GPA for each student in the class) and outputs a single value for the class.
-
Answer: The student_set table. The array of students in the students column is the smallest piece of input data.
-
The formatting of the subject_scores variable attribute demonstrates how to tell Terra where the primary data is when the tables are nested like this.

| Task name | Variable | Type | Attribute |
| --- | --- | --- | --- |
| CalculateStudentGPA | subject_scores | Array[Float] | this.students.average_gpa |

Remember to replace "average_gpa" with the name you gave the output in part 2!
Breaking down the attribute formatting
Each part of the attribute string gives Terra instructions on where to find the data.

| this. | students. | average_gpa |
| --- | --- | --- |
| Look in the root entity table | Get the IDs from the students column | Go to this column in the student table for the input |

Notice the "s" at the end of "students"! This is a Terra formatting quirk that you will need to remember. The dropdown only offers columns from the **root entity table**. If your tables are nested, as in this case, you will need to type in the full attribute string correctly!
Step 3: Run the workflow
3.10. Click the Launch workflow button.
Thought questions
-
Answer: Although you will be using data from all 29 students in the seventh grade, it is a single workflow, with a single output value.
-
Answer: There is one output value (the class average GPA) for the entire set. For this exercise, the workflow is configured to write to the student_set table.
What to expect
Terra will add a column to the table when the workflow is complete. You can find the name of the output if you look at the Outputs in the workflow configuration form. Your student_set table will include the example row below.

| student_set_id | students | class_gpa |
| --- | --- | --- |
| 7th-graders | AV612, BM445, BY969... (29 entities) | 1.234 |
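The data flow of exercise 3 can be summarized in a short sketch: follow the attribute string this.students.average_gpa from the set row into the student table, collect an array, and write a single value back to the set row. This is a toy model under stated assumptions. The `resolve_nested` helper, the student IDs, and the GPA values are invented, and a plain mean is assumed for the class average since the tutorial does not show the workflow's formula.

```python
# Sketch: resolve a nested attribute string and write one value per set.
# IDs and GPA values are hypothetical; the mean is an assumed formula.

student_table = {
    "S01": {"average_gpa": 3.0},
    "S02": {"average_gpa": 4.0},
    "S03": {"average_gpa": 2.0},
}
set_row = {"student_set_id": "7th-graders", "students": ["S01", "S02", "S03"]}


def resolve_nested(attribute, root_row, member_table):
    """Resolve this.<members_column>.<column> into a list with one value per member."""
    prefix, members_column, column = attribute.split(".")
    if prefix != "this":
        raise ValueError(f"unsupported attribute expression: {attribute}")
    return [member_table[member_id][column] for member_id in root_row[members_column]]


subject_scores = resolve_nested("this.students.average_gpa", set_row, student_table)
set_row["class_gpa"] = sum(subject_scores) / len(subject_scores)  # one output for the whole set
print(set_row["class_gpa"])
```

One workflow run consumes the whole array, so the output belongs to the set row rather than to any individual student.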
Takeaways and next steps (plot results in a notebook)
After completing the Workflows Quickstart tutorial, you should be able to
- Set up and run a workflow on single entities of data
- Write output files to the input table (and understand why this is useful!!)
- Set up and run a follow-up workflow on a set of data
Next: Quickstart part 3: Plot the results in a notebook
Learn how to set up and run an interactive analysis to visualize data in the Notebooks Quickstart.
Bonus! Along the way, you will answer the question "How does a student's height correlate with their GPA?"