Learn the basics of how to customize and launch a notebook analysis on Terra with the Terra Notebook Quickstart tutorial. Learning objectives, the time and cost to complete each part, and step-by-step instructions are below. Because the four parts build on each other, you need to do them in order. However, you don’t need to complete them in one sitting.
What you will learn
1) How to create a custom cohort from data in the Data Library
2) How to bring data (from the Data Library) into a notebook for analysis
3) How to set up the virtual Cloud Environment to run a notebook analysis
4) How to work in a Jupyter notebook
How much will it cost? How long will it take?
First - Make your own copy of the Quickstart workspace
The Terra-Notebooks-Quickstart workspace is “Read only”. For hands-on practice, you'll need to be able to spin up a Cloud Environment and run a notebook. Making your own copy of the workspace gives you that power. If you haven't already done so, make your own copy following the directions below.
Start by clicking on the circle with three dots in the upper right-hand corner and selecting "Clone" from the dropdown menu. Then follow the directions below to complete the form:
Step-by-step instructions + video tip
Notebooks Quickstart Overview
Once you're in your own copy of the workspace, you'll be ready to practice setting up and running an interactive analysis in a notebook. Notebooks let you interact with your data in real time, and because they include documentation, you don't have to be a coding expert to run a notebook analysis. The four steps in the Quickstart tutorial are:
This diagram illustrates the platforms and tools you will be using.
Notebooks in this workspace
What is a notebook?
A Jupyter notebook is an application that runs in a virtual Cloud Environment, which includes a VM and a detachable Persistent Disk. The notebook includes executable code (in either Python or R) and markdown for documentation.
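To make the mix of code and documentation concrete, the sketch below builds a tiny notebook file by hand using only Python's standard library. The cell contents and the helper name `make_minimal_notebook` are made up for illustration. The overall layout (a `cells` list whose entries have a `cell_type` of `markdown` or `code`) follows version 4 of the Jupyter notebook file format.

```python
import json


def make_minimal_notebook():
    """Return a dict laid out like a version-4 Jupyter notebook (.ipynb) file."""
    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        "metadata": {},
        "cells": [
            {
                # Documentation cell: rendered as formatted text.
                "cell_type": "markdown",
                "metadata": {},
                "source": ["# Example analysis\n", "This cell is documentation."],
            },
            {
                # Code cell: executed by the notebook kernel.
                "cell_type": "code",
                "metadata": {},
                "execution_count": None,  # filled in once the cell has run
                "outputs": [],
                "source": ["print('hello from a code cell')"],
            },
        ],
    }


notebook = make_minimal_notebook()
cell_types = [cell["cell_type"] for cell in notebook["cells"]]
print(cell_types)

# A notebook file on disk is this structure serialized as JSON.
notebook_text = json.dumps(notebook, indent=1)
```

You normally never write this structure yourself. The notebook interface creates and updates it as you add and run cells.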
0_Introto Jupyter_notebooks (optional)
If you're not familiar with Jupyter notebooks, this gives a general intro (or refresher) on basics like how to insert a markdown or code cell, how to run code, etc.
This notebook introduces setting up a Cloud Environment and running a notebook in a Terra workspace. The setup notebook installs additional libraries and packages you need for your analysis. You will run this notebook every time you create a new Cloud Environment Persistent Disk (“Cloud Environment”) in this workspace.
Here is where you will bring the cohort data from Parts 1 and 2 of the tutorial into the Cloud Environment Persistent Disk for analysis. A couple of plotting functions serve as a proxy for an analysis.
Explore two additional ways to access data in the cloud (unstructured data in a Google bucket and tabular data in BigQuery) in a notebook.
Step 1: Explore data in the Data Library
Before running a notebook analysis, you will need data! In this step, you'll
1) Access and explore data using a data explorer in the Data Library and
2) Use selection criteria to define a subset (custom cohort) of participants for analysis.
This step should take 5 - 10 minutes and won't cost anything.
Step-by-step instructions to explore data in the Data Library
1.1. Go to the "Data Library" at https://app.terra.bio/#library/datasets
1.2. Click the button to browse the "1,000 Genomes Low Coverage" dataset. You can see there are several parameters, with bars that indicate how many participants in the dataset satisfy those parameters. You'll use those parameters to narrow down the dataset to just those subjects you want to study.
1.3. Select the selection criteria for your study subset ("cohort") by clicking on one or more bars in the display panes. You can immediately see how many subjects satisfy your criteria.
For example, to restrict your study to participants of South Asian descent whose exome sequencing center was either BGI or BCM, you would choose those criteria in the cards, as shown in the screenshots below:
All of the selection criteria you have chosen are listed at the top of the page.
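The logic behind the example cohort can be sketched in a few lines of Python. This is only an illustration of how the criteria combine (one population AND either of two sequencing centers). The records, field names, and the `select_cohort` helper are hypothetical and are not taken from the dataset or from Terra.

```python
# Hypothetical participant records; field names and values are illustrative only.
participants = [
    {"id": "P001", "super_population": "South Asian", "exome_center": "BGI"},
    {"id": "P002", "super_population": "South Asian", "exome_center": "OTHER"},
    {"id": "P003", "super_population": "European", "exome_center": "BCM"},
    {"id": "P004", "super_population": "South Asian", "exome_center": "BCM"},
]


def select_cohort(records, super_population, exome_centers):
    """Keep records that match the population AND any one of the centers."""
    return [
        r for r in records
        if r["super_population"] == super_population
        and r["exome_center"] in exome_centers
    ]


cohort = select_cohort(participants, "South Asian", {"BGI", "BCM"})
cohort_ids = [r["id"] for r in cohort]
print(cohort_ids, len(cohort_ids))
```

In the data explorer you get the same effect by clicking bars. No code is required for this step.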
Step 2: Export study data to the Terra workspace
The datasets in Terra's Data Library are integrated with the rest of the platform, making it seamless to export data to a workspace for analysis. By the end of this step, you’ll know how to export a subset of 1,000 Genomes data from the Library to your workspace for analysis.
This step should take a few minutes and won't cost anything.
A note about controlled data
Note that if the data are restricted-access, you will need to link your authorization to your Terra account. For some datasets, you will need linked authorization to view the data using a data explorer. To learn more about linking authorization to access controlled data on external platforms, see this article.
Step-by-step instructions to import data from the Data Library to a workspace
2.1. Click "Save Cohort" (the blue button at top right) to save your cohort to a Terra workspace. Take note of the number of participants in your cohort (circled in the screenshot below):
2.2. Give your selection a name you will remember easily.
2.3. Designate a destination workspace: choose "Select an existing workspace" and then select your copy of this workspace from the dropdown menu.
2.4. Click "Import"
You'll be taken to the "Data" tab of your workspace copy. Notice the two tables: a "BigQuery" table, which was in the original workspace, and a "cohort" table.
Thought exercise: Where's the data?
What are data tables and where are the data?
Data tables are similar to spreadsheets that help organize and keep track of data in the cloud that you will use in an analysis in Terra.
Note that often the data files are not actually stored in your workspace bucket; instead, tables include links to files stored in Cloud storage. One advantage is that someone else (Google, in this case) pays to store the large genomic data files. You just bring what you need into the VM for analysis. Feel free to expand the two tables and poke around to see what the data you just imported look like.
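As a rough illustration of a table that holds links rather than files, the sketch below takes a made-up table row containing a `gs://` link and splits it into a bucket name and an object path. The row, its column names, the bucket name, and the `split_gs_url` helper are all hypothetical.

```python
from urllib.parse import urlparse


def split_gs_url(url):
    """Split a gs:// URL into (bucket name, object path)."""
    parsed = urlparse(url)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URL: {url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical table row: the table stores a link, not the file itself.
row = {"participant_id": "P001", "cram": "gs://example-bucket/samples/P001.cram"}
bucket, object_path = split_gs_url(row["cram"])
print(bucket, object_path)
```

The table cell is only a short string. The file it points to stays in its original bucket until an analysis reads it.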
What is in the BigQuery tables? Click for answer!
The information in this table will allow Terra to grab the data you need for a notebook analysis - you'll see that in Part 4!
What is in the cohort table? Click for answer!
How will Terra use the tables to get the data for a notebook analysis? Click for the answer!
The query language allows you to import the data for only those participants in your subset by joining the subset IDs (from the SQL search in the cohort table) with the BigQuery tables. Note that you don't have to know SQL programming to compose your query!
To learn more about where your data "live" in Terra for analysis, see this article.
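The idea of joining subset IDs against a larger table can be sketched with Python's built-in `sqlite3` module. SQLite stands in for BigQuery here purely so the example runs anywhere. The table names, column names, and rows are hypothetical and do not reflect the actual 1,000 Genomes schema or the query Terra generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cohort (participant_id TEXT PRIMARY KEY);
    CREATE TABLE sample_info (participant_id TEXT, population TEXT);
    INSERT INTO cohort VALUES ('P001'), ('P004');
    INSERT INTO sample_info VALUES
        ('P001', 'South Asian'),
        ('P002', 'European'),
        ('P004', 'South Asian');
    """
)

# Join on the shared ID so only cohort members are returned.
query = """
    SELECT s.participant_id, s.population
    FROM sample_info AS s
    JOIN cohort AS c ON c.participant_id = s.participant_id
    ORDER BY s.participant_id
"""
rows = conn.execute(query).fetchall()
conn.close()
print(rows)
```

Participant P002 is in the larger table but not in the cohort, so the join leaves it out.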
Step 3: Run the setup notebook
You will likely need to install a number of additional libraries and packages in your Cloud Environment VM or cluster. This notebook does that step. Note that since these packages are installed in the Cloud Environment detachable Persistent Disk (PD), you only need to run it once!
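The setup notebook in this workspace is R-based, so the following is only a Python analog of the idea, not the notebook's actual code. It checks which required packages are missing from the current environment, which is the kind of check that tells you whether setup still needs to run. The package list and the `find_missing` helper are hypothetical.

```python
import importlib.util


def find_missing(packages):
    """Return the top-level packages from `packages` that cannot be imported here."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


# Hypothetical requirements list: "json" ships with Python, the other name is made up.
required = ["json", "package_that_is_not_installed_xyz"]
missing = find_missing(required)
print("Need to install:", missing)
```

Once the missing packages are installed, a check like this would report an empty list for as long as the installation is kept.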
You will learn how to
1) Start and run a notebook in Terra
2) Customize the notebook Cloud Environment
How much will it cost? How long will it take?
Running the setup notebook should take about 20 minutes (including the time to create the virtual machine or cluster) and cost less than $0.25.
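As a back-of-the-envelope check, the figures above (about 20 minutes, less than $0.25) imply an upper bound on the hourly cost of the default environment. The sketch below only rearranges those two numbers. It is not a statement of Terra's or Google's actual pricing, and the `implied_hourly_rate` helper is hypothetical.

```python
def implied_hourly_rate(total_cost_usd, minutes):
    """Hourly rate implied by a total cost spread over `minutes` of runtime."""
    return total_cost_usd * 60 / minutes


upper_bound = implied_hourly_rate(0.25, 20)
print(f"Implied rate: under ${upper_bound:.2f} per hour")
```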
When you first open a notebook in a workspace, Terra creates your Cloud Environment (VM or cluster). This can take 5-10 minutes. During this time, don’t refresh the page or try to resume the notebook.
During creation, you will see a read-only copy of the notebook and a note at the top of the browser that Terra is creating the virtual environment (in the orange rectangle).
If you open any notebook again in your workspace, it won't take as long, as Terra will only need to resume the application compute, not create it.
Step-by-step instructions to run the setup notebook
3.1. Click on the “1_R_environment_setup” notebook in the "Notebooks" tab of your cloned copy of the Quickstart workspace.
3.2. Spin up a Cloud Environment VM (default settings) by clicking the "Create" button:
To learn more about customizing the virtual Cloud Environment where your notebook runs, see this article.
3.3. Click on the "Edit" button at the top of the Preview so you can edit and run the code in the notebook:
To learn more about "Edit" versus "Playground" mode, see this article.
3.4. Wait for the Cloud Environment VM to start. Note: this can take 4-5 minutes the first time you create a notebook virtual environment in a workspace.
3.5. Skim the read-only view while the VM spins up to understand what's in the notebook.
3.6. Once the VM is up and running, run the first code cell: click in the cell and then click 'Run' in the menu at the top to execute the code in that cell. Note: you can also use the shortcut “shift” + “return” to run a cell.
3.7. Wait for the * at the left of the cell to turn into a number (for example, [*] --> [1]), which indicates that the code in the cell has finished executing.
3.8. Click in the next cell and then click 'Run' to execute its code, wait for execution to complete, and review the results.
3.9. Repeat steps 3.7 and 3.8 until you've executed the code in all cells of the notebook. Make sure to read the documentation so you understand the point of each code block. Note that you don't have to be able to code to run a notebook analysis, as long as your notebook has enough documentation!
3.10. When you've run all the code cells, close the notebook by clicking the green x in the top right.
Step 4: Run the tutorial notebooks
There are three tutorial notebooks that cover different applications.
In this notebook, you will bring the BigQuery data from the data table to the notebook environment and do a bit of graphing to confirm the data. This notebook is an R-based notebook, but you do not have to understand R in order to run it.
After running this notebook, you should understand how to load the cohort data you imported from the Data Library into the notebook application memory in order to run an analysis. The notebook includes some steps to verify the data, but does not do an actual analysis. At this point, you could insert code cells to do your own analysis.
This notebook walks through three different ways to bring BigQuery data into a notebook for analysis.
This tutorial explores the two places to store your own data in Terra - the Cloud Environment Persistent Disk and the Workspace bucket. After going through this notebook, you should understand how and why to move data between the PD, the Workspace bucket and an external bucket.
For more information about Terra architecture and where your data files live in it, see this article.
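Moving a file between the Persistent Disk and the Workspace bucket usually comes down to a copy command. The sketch below builds `gsutil cp` commands without running them. It assumes the workspace bucket URL is exposed to the notebook in an environment variable named `WORKSPACE_BUCKET`. The fallback bucket name, the file path, and the `build_copy_command` helper are hypothetical placeholders.

```python
import os
import shlex


def build_copy_command(source, destination):
    """Build (but do not run) a gsutil command that copies one file."""
    return ["gsutil", "cp", source, destination]


# Assumption: the workspace bucket URL is available in WORKSPACE_BUCKET;
# the fallback bucket name is a made-up placeholder.
workspace_bucket = os.environ.get("WORKSPACE_BUCKET", "gs://example-workspace-bucket")

local_file = "results/plot.png"  # hypothetical file on the Persistent Disk
remote_file = f"{workspace_bucket}/results/plot.png"

backup_cmd = build_copy_command(local_file, remote_file)   # PD -> bucket
restore_cmd = build_copy_command(remote_file, local_file)  # bucket -> PD

print(shlex.join(backup_cmd))
print(shlex.join(restore_cmd))
# To actually run a command: subprocess.run(backup_cmd, check=True)
```

Swapping the source and destination is all that distinguishes saving a file to the bucket from bringing it back to the disk. Copying from an external bucket follows the same pattern with a different `gs://` source.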
How much will it cost? How long will it take?
This notebook should take about twenty minutes and cost less than $0.25 to complete.
Step-by-step instructions to run the tutorial notebooks
Once you understand the basic process of running a notebook such as the setup notebook in Part 3, you will run the tutorial notebooks the same way. There are three tutorial notebooks in the Quickstart. The first two do the same thing, with two different techniques. The third explores how to store and move your own data between the two options for workspace storage.
4.1. Open the notebook in Edit mode following the steps above.
4.2. Read the documentation and run the notebook cells in order
To learn more about interactive statistics and visualization with Jupyter notebooks, see the Terra notebooks documentation.
To learn more about how to customize the Cloud Environment, see this article.
For notebooks-focused featured workspaces, see the Showcase Library.