Get up and running with workflows on Terra in less than half an hour. This is the second in a series of three Quickstarts that walk through a mock study of the correlation between height and grades for a cohort of 7th, 8th, and 9th graders. You will first run a preconfigured workflow, then set up and run the same workflow from a blank configuration card. As a bonus, you can run a third, follow-up workflow to analyze data generated by the earlier exercises.
Workflows tutorial learning objectives
The workflows 101 quickstart is intended to help you become familiar with setting up and running workflows on data stored or referenced in a workspace data table.
After working through the exercises in the quickstart, you will know how to
- Set up and run a workflow to run on single entities in an entity table
- Generate and use a Workspace Data table for workspace-level resources
- Set up and run a workflow to run on a set of entities
- Run a workflow on output data from a previous analysis (bonus)
You will also understand
- How and why to write output data back to the input table
- How using sets can streamline running multiple workflows on the same subset of data
Workflows quickstart flow
Three steps to complete the workflows quickstart
- Calculate the students' average GPA by running a pre-configured workflow on data in the student table.
- Calculate the students' average GPA by setting up and running a workflow from scratch on data in the student table.
- (optional bonus) Calculate the class average GPA by setting up and running a workflow on generated data from part 2.
Estimated time and cost to complete
You should be able to complete the Quickstart tutorial in about an hour. Running the tutorial will cost less than $0.25 (Google Cloud data storage and VM costs).
You will need to have a Terra Billing project and your own copy of the Quickstart workspace to complete the tutorial.
About the mock study in the T101 Quickstarts
This is the second in a series of three Quickstarts that walk through a completely fake study of the correlation between height and grades for a cohort of 7th, 8th, and 9th graders.
- Data tables quickstart: Explore survey data from 86 students in the study
- Workflows quickstart: Run a workflow to calculate the cumulative GPA of students in the study
- Notebooks quickstart: Run a Jupyter notebook to plot height versus GPA
Your mission, should you choose to accept it, is to discover if there is a correlation between student height and grades by the end of the Quickstarts.
First: Make your own copy of the T101 Workflows Quickstart workspace
The T101-Workflows-Quickstart (featured workspace) is “Read only”. For hands-on practice, you'll need to be able to upload data to workspace storage, which has a cost. Making your own copy of the T101-Workflows-Quickstart workspace gives you that power. If you haven't already done so, make your own copy of this workspace by following the directions below.
Start by clicking the circle with three dots in the upper right-hand corner and selecting "Clone" from the dropdown menu:
- Rename your copy something memorable
It may help to write down the name of your workspace
- Choose your billing project
Note that this can be free credits! Don’t worry, you’ll have plenty left over when you’ve completed the Quickstart exercises.
- Do not select an Authorization Domain, since these are only required when using restricted-access data
- Click the “Clone Workspace” button to make your own copy
Workflows Quickstart step-by-step guide and video
Once you're in your own copy of the workspace, you can get hands-on to learn about analyzing data with workflows!
Video walkthrough instructions
Exercise 1: Run a preconfigured workflow on student data
In this exercise, the workflow has already been set up for you, so all you need to do is select students (data) and launch the workflow.
What you will learn
This exercise will give you a feel for the mechanics of running a workflow as well as how to monitor a workflow once you submit it.
Step-by-step instructions to run your first workflow
1.1. Start by going to the Workflows page.
1.2. Select the 1_CalculateStudentGPA workflow (click the card). This will reveal the workflow configuration form where you'll set up the workflow to run on your data.
1.3. Confirm root entity type = "student". The root entity type is the table that contains the input data.
1.4. Click the "Select Data" button. This will take you to the Select Data form.
1.5. Select all students by clicking the box at the top of the first column.
1.6. Click the blue OK button to finalize your selection.
1.7. Click the run analysis button and launch the workflows. Terra will launch 86 workflow jobs in parallel (one for each student).
1.8. Refresh the Job History page to monitor the submission status.
1.9. When the job is complete (you'll see a green checkmark in the Status column), go back to the Data page, click on the student table to open it and answer the following questions.
The "root entity type" is the table that contains the input data.
In this exercise, it is the student table (with the arrow pointing to it). The data the workflow will use are each student's GPAs for language arts, math, and science (circled in the screenshot below).
After running the workflow, there is an additional column in the student data table.
The Cumulative_GPA column (circled in the screenshot below) stores the output data from running the workflow.
Where did the new column come from?
The workflow was configured to write outputs back to the input data table.
To see this, go back and look at the Outputs tab of the workflow configuration form (click the 1_CalculateStudentGPA card and the Outputs tab).
Notice the name of the new column is the same as the attribute for the output variable GPA.
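To make the write-back concrete, here is a minimal Python sketch of what Exercise 1 does conceptually. It is not the workflow's actual WDL or Terra's implementation: the column names `language_arts_gpa`, `math_gpa`, and `science_gpa`, the grade values, and the function name are assumptions for illustration. Only the `Cumulative_GPA` output column and the one-job-per-student behavior come from the exercise.

```python
# Illustrative sketch only: the real workflow is written in WDL and runs on Terra.
# Column names other than Cumulative_GPA, and all grade values, are made up.

def calculate_student_gpa(subject_scores, num_scores):
    """Average one student's subject GPAs (what a single workflow job computes)."""
    return sum(subject_scores) / num_scores

# A tiny stand-in for the student table: one row (entity) per student.
student_table = {
    "AV612": {"language_arts_gpa": 3.0, "math_gpa": 4.0, "science_gpa": 3.5},
    "BM445": {"language_arts_gpa": 2.0, "math_gpa": 3.0, "science_gpa": 4.0},
}

# One job per selected row, like the 86 jobs Terra launches in Exercise 1.
for student_id, row in student_table.items():
    scores = [row["language_arts_gpa"], row["math_gpa"], row["science_gpa"]]
    # Writing the output back adds a new column to the same input row.
    row["Cumulative_GPA"] = calculate_student_gpa(scores, num_scores=3)

print(student_table["AV612"]["Cumulative_GPA"])  # 3.5
```

Because each result lands in the row it was computed from, inputs and outputs stay associated without any extra bookkeeping.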
Exercise 2: Set up and run the workflow from scratch
In this exercise, you will run the same workflow, but this time the configuration card is blank.
What you will learn
This walks you through setting up the workflow from scratch. You will need to add the input attributes for this workflow yourself, using exercise 1 for reference.
Step-by-step instructions to set up a workflow
In addition to choosing what students to analyze, there are two additional steps to configure a workflow to run on data in a table:
- Specify **input** values (i.e. what column in the data table corresponds to what variables in the WDL)
- Set outputs to be written back to the data table
Step 1: Specify input values
First, select the 2_CalculateStudentGPA workflow (click the card). You will need to fill in the attribute fields for all the required variables.
2.1. Click the Select data button and select the Choose Existing Sets of Students radio button.
2.2. Choose the student-subset-8 (created in the Data Tables Quickstart).
2.3. Go to the Input tab of the setup form.
2.4. Start by clicking into the first attribute field.
2.5. Select the appropriate attribute from the dropdown menu.
Hint: Use the variable name (second column) to help figure out what attribute to choose.
What does this formatting mean?
The prefix this. tells Terra to look in the root entity table. The dropdown includes all the columns (possible input data) from the root entity table.
2.6. Repeat for each variable with a blank attribute field except num_scores (you'll do this next!).
2.7. Click the blue Save button at the top right of the form.
Step 2: Configure workspace-level variables
The third variable, num_scores, is used across all inputs. In this case, it's the total number of courses the workflow averages over (the same value, 3, for all students). Such workspace-level variables have a special table, the Workspace Data table (in the Other data section on the left-hand side of the Data page).
2.8. Start typing workspace. in the attribute field.
2.9. Select workspace.number-of-courses from the dropdown.
2.10. Click the Save button at the top to save your selection.
What does this formatting mean?
The prefix workspace. tells Terra to look in the Workspace Data table. The dropdown includes all the columns from the table.
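One way to picture the two prefixes is as a small lookup function. The Python sketch below is an assumption-laden illustration, not Terra's implementation: the function name, the `math_gpa` column, and the sample values are made up, while `workspace.number-of-courses` and its value of 3 come from this exercise.

```python
# Illustrative sketch only; this is not Terra's implementation.
# The function name, the math_gpa column, and its value are made up.

def resolve_attribute(expression, root_row, workspace_data):
    """Look up a `this.` or `workspace.` attribute string."""
    prefix, _, name = expression.partition(".")
    if prefix == "this":
        return root_row[name]        # column in the root entity table
    if prefix == "workspace":
        return workspace_data[name]  # key in the Workspace Data table
    raise ValueError(f"Unsupported prefix in {expression!r}")

workspace_data = {"number-of-courses": 3}  # same value for every student
student_row = {"math_gpa": 4.0}            # one row of the student table

print(resolve_attribute("this.math_gpa", student_row, workspace_data))
print(resolve_attribute("workspace.number-of-courses", student_row, workspace_data))
```

The `this.` lookup changes with every row the workflow runs on, while the `workspace.` lookup returns the same value for every row.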
What to expect
Step 3: Write outputs back to the table
You can set up the workflow to write outputs back to the input table. In this case, outputs are a number that Terra will add in a new column in the input table. If your workflow generates large data files in workspace storage by default, this step will write the data file URI to the input table, making it much easier to associate outputs with inputs.
2.11. Start in the Outputs tab of the setup form.
2.12. For the first output variable, “gpa”, go to the attribute field and type in "this." + a column name for your output files in the table.
Hint: Use something different from Cumulative_GPA (the column from Exercise 1).
2.13. Click the blue OK button to finalize your selection.
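The Outputs tab is essentially a mapping from each output variable to a column. As a rough Python sketch (not Terra's implementation; the function name, the column name `exercise2_gpa`, and the value 3.5 are hypothetical), the mapping could be applied to a row like this:

```python
# Illustrative sketch only; not Terra's implementation. The function name,
# the column name "exercise2_gpa", and the value 3.5 are made up.

def write_outputs_back(row, workflow_outputs, output_attributes):
    """Store each workflow output in the column named by its `this.` attribute."""
    for variable, attribute in output_attributes.items():
        if not attribute.startswith("this."):
            raise ValueError(f"Expected 'this.<column>', got {attribute!r}")
        column = attribute[len("this."):]
        row[column] = workflow_outputs[variable]  # creates the column if it is new
    return row

student_row = {"Cumulative_GPA": 3.5}              # column from Exercise 1
output_attributes = {"gpa": "this.exercise2_gpa"}  # what you type in the Outputs tab
write_outputs_back(student_row, {"gpa": 3.5}, output_attributes)

print(sorted(student_row))  # ['Cumulative_GPA', 'exercise2_gpa']
```

Choosing a new column name keeps the Exercise 1 results intact, which is why the hint suggests a name other than the one used there.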
Step 4: Launch the workflow
2.14. Now that you have selected the data and set up the workflow inputs and outputs, you can click the run analysis button to launch the workflows.
Exercise 2 Thought Questions
This workflow accepts single entities as inputs - all the input data are found in a single row in the student table (the subject grades are all different variables corresponding to separate columns in the table).
Answer: The root entity table is defined as the table that contains the primary input for a workflow.
In this exercise, the root entity type is student.
Answer: You specified the columns corresponding to each input variable in the workflow configuration form.
In this exercise, each student's GPAs (for language arts, math, and science) are data stored in the student table.
Answer: You specify the table with the input data when you choose the root entity type (arrow).
Then you tell the workflow which column contains the input data (or file) using the format this.YOUR-DATA-COLUMN-NAME in the Attributes field of the setup form (circled).
Exercise 3: (optional) Run a follow up workflow on output data
In this exercise, you will use each student's cumulative GPA (outputs from exercise 2) to calculate the class average.
What you will do
- Make a set of the students in seventh grade
- Configure the 3_CalculateClassGPA workflow to take the 7th graders' GPAs as input and output a single average GPA to the student_set table.
- Run the 3_CalculateClassGPA workflow on the seventh grade set.
What you will learn
This exercise demonstrates how to set up and run a follow-up workflow on generated data.
What is different
The 3_CalculateClassGPA workflow takes in multiple students' total GPAs (an array of data) and outputs a single value. Because of this nested table structure, the workflow setup is a little more complex.
Step-by-step instructions to set up and run the follow-up workflow
Step 1: Create the set of seventh graders
3.1. Go to the student table and click the three-dot action menu at the top right of the Grade column to filter by grade.
3.2. Type 7 into the field with the magnifying glass icon (where it says "Exact match filter") and hit enter or return to filter.
3.3. Click the checkbox at the top left of the table to select all the students in seventh grade.
3.4. Click the Edit icon and select Save selection as a set from the dropdown.
3.5. Name the set 7th-graders and Save.
3.6. Select the 3_CalculateClassGPA workflow (click the card).
3.7. Select the root entity type
3.8. Click the Select Data button, and select the 7th-graders set you just created.
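Conceptually, saving a selection as a set adds one row to the student_set table whose students column lists the selected student IDs. Here is a minimal Python sketch of that idea. It is not how Terra stores data: the function name, the Grade values, and the ID `ZZ000` are assumptions for illustration.

```python
# Illustrative sketch only; not Terra's implementation.
# The function name, the Grade values, and the ID "ZZ000" are made up.

student_table = {
    "AV612": {"Grade": 7},
    "BM445": {"Grade": 7},
    "ZZ000": {"Grade": 8},  # hypothetical student in another grade
}

def save_selection_as_set(table, grade, set_name, set_table):
    """Filter rows by grade and record the matching IDs as one set row."""
    members = [sid for sid, row in table.items() if row["Grade"] == grade]
    set_table[set_name] = {"students": members}
    return set_table

student_set_table = {}
save_selection_as_set(student_table, 7, "7th-graders", student_set_table)

print(student_set_table)  # {'7th-graders': {'students': ['AV612', 'BM445']}}
```

The set row stores references to students rather than copies of their data, so anything written to the student table later is still reachable through the set.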
Step 2: Configure the workflow
3.8. Take a look at the configuration pane (filled in) to answer the following questions.
This workflow accepts an array as input (the Cumulative_GPA for each student in the class) and outputs a single value for the class.
student_set table. The array of students in the students column is the smallest piece of input data.
The formatting of the subject_scores variable attribute demonstrates how to tell Terra where the primary data is when the tables are nested like this.
| Task name | Variable | Type | Attribute |
| --- | --- | --- | --- |
| CalculateStudentGPA | subject_scores | Array[Float] | this.students.Cumulative_GPA |
Breaking down the attribute formatting
Each part of the attribute string gives Terra instructions on where to find the data.
| this. | students. | Cumulative_GPA |
| --- | --- | --- |
| Look in the root entity table | Get the IDs from the students column | Go to this column in the student table for the input |
Notice the "s" at the end of student!
This is a Terra formatting quirk that you will need to remember. The dropdown only offers columns from the **root entity table**. If your tables are nested, as in this case, you will need to type in the full attribute string correctly!
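The three-part attribute string can be read as a two-hop lookup. The Python sketch below walks through it under stated assumptions: it is not Terra's implementation, and the function name and GPA values are made up. It also shows why leaving off the "s" fails, since the set row has no column named `student`.

```python
# Illustrative sketch only; not Terra's implementation.
# The function name and the GPA values are made up.

def resolve_set_attribute(expression, set_row, member_table):
    """Resolve `this.<reference column>.<member column>` to an array of values."""
    prefix, reference_column, member_column = expression.split(".")
    if prefix != "this":
        raise ValueError(f"Expected a 'this.' prefix, got {expression!r}")
    member_ids = set_row[reference_column]  # hop 1: IDs listed in the set row
    # Hop 2: read the named column from each referenced row.
    return [member_table[member_id][member_column] for member_id in member_ids]

student_table = {
    "AV612": {"Cumulative_GPA": 3.5},
    "BM445": {"Cumulative_GPA": 3.0},
}
set_row = {"students": ["AV612", "BM445"]}  # one row of the student_set table

print(resolve_set_attribute("this.students.Cumulative_GPA", set_row, student_table))
# [3.5, 3.0]

# Typing "student" instead of "students" fails: the set row has no such column.
try:
    resolve_set_attribute("this.student.Cumulative_GPA", set_row, student_table)
except KeyError as missing:
    print("No column named", missing)
```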
Step 3: Run the workflow.
3.9. Click the Launch workflow button.
Answer: Although you will be using data from all 29 students in the seventh grade, it is a single workflow, with a single output value.
Answer: There is one output value (the class average GPA) for the entire set. For this exercise, the workflow is configured to write to the student_set table.
What to expect
Terra will add a column to the table when the workflow is complete. You can find the name of the output if you look at the Outputs in the workflow configuration form. Your student_set table will include the example row below.
| student_set_id | students | class_gpa |
| --- | --- | --- |
| 7th-graders | AV612, BM445, BY969... (29 entities) | 1.234 |
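The contrast with Exercises 1 and 2 is that one job consumes the whole array and writes one value to the set's row. A minimal Python sketch of that behavior follows. The GPA values are made up, and averaging with `statistics.mean` is an assumption about what the workflow computes, not its actual WDL.

```python
# Illustrative sketch only; not Terra's implementation or the workflow's WDL.
# GPA values are made up; using statistics.mean is an assumption.
from statistics import mean

student_gpas = {"AV612": 3.5, "BM445": 3.0, "BY969": 2.5}
student_set_table = {"7th-graders": {"students": list(student_gpas)}}

jobs_launched = 0
for set_id, set_row in student_set_table.items():
    # One job per set row: the input is an array, the output is a single value.
    scores = [student_gpas[student_id] for student_id in set_row["students"]]
    set_row["class_gpa"] = mean(scores)
    jobs_launched += 1

print(jobs_launched)                                 # 1
print(student_set_table["7th-graders"]["class_gpa"])  # 3.0
```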
Takeaways and next steps (plot results in a notebook)
After completing the Quickstart, you should know/understand
- How to set up and run a workflow on single entities of data
- How and why to write output files to the input table
- How to set up and run a follow-up workflow on a set of data
Next step: Plot the results in a notebook
Learn how to set up and run an interactive analysis to visualize data in the T101-Notebooks-Quickstart.
Bonus! Along the way, you will answer the question "How does a student's height correlate with their GPA?"