Notebooks Quickstart Guide

Allie Hajian

Learn the basics of how to customize and launch a notebook analysis on Terra with the Terra Notebooks Quickstart tutorial. See the learning objectives and the time and cost to complete each part, as well as step-by-step instructions, below. Note that because the four parts build on each other, you need to do them in order. However, you don’t need to complete them in one sitting.



QuickStart learning objectives + time and cost to complete

  What you will learn
    1) How to create a custom cohort from data in the Data Library
    2) How to bring data (from the Data Library) into a notebook for analysis
    3) How to set up the virtual Cloud Environment to run a notebook analysis
    4) How to work in a Jupyter notebook 

How much will it cost? How long will it take? 
The tutorial has four parts - each takes between five and twenty minutes to complete. The entire QuickStart should take around forty minutes and will cost less than $1.00 (GCP compute charges).


Contents

Make your own copy of the Quickstart workspace
Notebooks Quickstart Overview
Step 1: Explore data in the Data Library
Step 2: Export study data to the Terra workspace
    - Thought exercises: Where’s the data?
Step 3: Run the setup notebook
Step 4: Run the tutorial notebooks
Additional notebooks resources

First - Make your own copy of the Quickstart workspace

The Terra-Notebooks-Quickstart workspace is “Read only”. For hands-on practice, you'll need to be able to spin up a Cloud Environment and run a notebook, and making your own copy of the workspace gives you that power. If you haven't already done so, make your own copy following the directions below.

Start by clicking on the circle with three dots at the upper right-hand corner and select "Clone" from the dropdown menu. Then follow the directions below to complete the form:
Notebooks-QuickStart_Clone-workspace_Screen_shot.png

Step-by-step instructions + video tip

Notebooks_-QuickStart_Clone-workspace_Screen_shot.png
  1. Rename your copy something memorable
    It may help to write down the name of your workspace

  2. Choose your billing project
    Note that this can be "getting started" credits from GCP. Don’t worry, you’ll have plenty left over when you’ve completed the QuickStart exercises.

  3. Do not select an Authorization Domain, since these are only required when using restricted-access data

  4. Click the “Clone Workspace” button to make your own copy

 

Notebooks Quickstart Overview

Once you're in your own copy of the workspace, you'll be ready to practice setting up and running an interactive analysis in a notebook. Notebooks let you interact with your data in real time, and because they include documentation, you don't have to be a coding expert to run a notebook analysis. The four steps in the Quickstart tutorial are:

Notebooks_QuickStart_flow.png

This diagram illustrates the platforms and tools you will be using. 

Notebooks-QuickStart_symbol_flow.png

Notebooks in this workspace

What is a notebook?
A Jupyter notebook is an application that runs in a virtual Cloud Environment, which includes a VM and a detachable Persistent Disk. The notebook includes executable code (in either Python or R) and markdown for documentation.

0_Introto Jupyter_notebooks (optional)
If you're not familiar with Jupyter notebooks, this gives a general intro (or refresher) on basics like how to insert a markdown or code cell, how to run code, etc.

1_R_environment_setup
This notebook introduces setting up a Cloud Environment and running a notebook in a Terra workspace. The setup notebook installs additional libraries and packages you need for your analysis. You will run this notebook every time you create a new Cloud Environment Persistent Disk (“Cloud Environment”) in this workspace. 

2_BigQuery_cohort_analysis
Here is where you will bring the cohort data from Parts 1 and 2 of the tutorial into the Cloud Environment Persistent Disk for analysis. A couple of plotting functions serve as a proxy for an analysis.

3_Access_and__plot_public_BigQuery (optional)
Explore two additional ways to access data in the cloud (unstructured data in a Google bucket and tabular data in BigQuery) in a notebook.


Step 1: Explore data in the Data Library

Before running a notebook analysis, you will need data! In this step, you'll
  1) Access and explore data using a data explorer in the Data Library and
  2) Use selection criteria to define a subset (custom cohort) of participants for analysis.

This step should take 5 - 10 minutes and won't cost anything.

Step-by-step instructions to explore data in the Data Library

1.1. Go to the "Data Library" at https://app.terra.bio/#library/datasets

1.2. Click the button to browse the "1,000 Genomes Low Coverage" dataset. You can see there are several parameters, with bars that indicate how many participants in the dataset satisfy those parameters. You'll use those parameters to narrow down the dataset to just those subjects you want to study. 

1.3. Choose the selection criteria for your study subset ("cohort") by clicking on one or more bars in the display panes. You can immediately see how many subjects satisfy your criteria. 

Tutorial example
For example, to restrict your study to participants of South Asian descent whose exome sequencing center was either BGI or BCM, you would choose those criteria in the cards as shown in the screenshots below:

Notebooks-QuickStart_Part-1_Super-population-criteria_Screen_shot.png Notebooks-QuickStart_Part-1_Exome-center-criteria_Screen_shot.png

You can see all the selection criteria at the top:

Notebooks-QuickStart_Part-1_Selection-criteria-total_Screen_shot.png
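
The selection logic in the tutorial example can be sketched in a few lines of Python. This is only an illustration, not how the data explorer is implemented: the participant records and the field names (super_population, exome_center) are made up for the sketch. It assumes that values chosen within one card are alternatives, while criteria from different cards must all hold, which matches the "South Asian descent AND (BGI OR BCM)" example above.

```python
# Hypothetical participant records; the real dataset lives in BigQuery.
participants = [
    {"id": "P1", "super_population": "SAS", "exome_center": "BGI"},
    {"id": "P2", "super_population": "SAS", "exome_center": "WUGSC"},
    {"id": "P3", "super_population": "EUR", "exome_center": "BCM"},
    {"id": "P4", "super_population": "SAS", "exome_center": "BCM"},
]

# One entry per card: field name -> set of accepted values.
criteria = {
    "super_population": {"SAS"},
    "exome_center": {"BGI", "BCM"},
}


def select_cohort(records, criteria):
    """Keep records that match every card (AND) on any selected value (OR)."""
    return [
        r for r in records
        if all(r[field] in accepted for field, accepted in criteria.items())
    ]


cohort = select_cohort(participants, criteria)
print([r["id"] for r in cohort])
```

Adding a value to a card widens the cohort, while adding a new card narrows it.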


Step 2: Export study data to the Terra workspace

The datasets in Terra's Data Library are integrated with the rest of the platform, making it seamless to export data to a workspace for analysis. By the end of this step, you’ll know how to export a subset of 1,000 Genomes data from the Library to your workspace for analysis. 

This step should take a few minutes and won't cost anything.

A note about controlled data
Note that if the data are restricted-access, you will need to link your authorization to your Terra account. For some datasets, you will need linked authorization to view the data using a data explorer. To learn more about linking authorization to access controlled data on external platforms, see this article

Step-by-step instructions to import data from the Data Library to a workspace

2.1. Click "Save Cohort" (blue button at top right) to save in a Terra workspace. Take note of the number of participants in your cohort (circled in the screenshot below):

Notebooks-QuickStart_Save-cohort_Screen_shot.png

2.2. Remember to name your selection something you will remember easily!

2.3. Designate a destination workspace: Choose "Select an existing workspace" and then your copy of this workspace from the dropdown menu

Notebooks-QuickStart_Part-1_Select-existing-workspace_Screen_shot.png

2.4. Click "Import"
You'll be taken to the "Data" tab of your workspace copy. Notice the two kinds of tables: the "BigQuery" tables, which were already in the original workspace, and a new "cohort" table.

Thought exercise: Where's the data?

Once your export is complete, go back to your workspace and take a look at the data tab. When you export data from the Data Library, Terra generates data tables in your workspace. In this case, Terra generated a "cohort" table when you "exported" your cohort from the Data Library (you may have noted that the BigQuery tables were already in the workspace). 

Notebooks-QuickStart_Part-2_Data-tables_Screen_shot.png

What are data tables and where are the data?

Data tables are similar to spreadsheets that help organize and keep track of data in the cloud that you will use in an analysis in Terra.

Note that often the data files are not actually stored in your workspace bucket; instead, the tables include links to files stored in Cloud storage. One advantage is that someone else (Google, in this case) pays to store the large genomic data files. You just bring what you need into the VM for analysis. Feel free to expand the two tables and poke around to see what the data you just imported look like. 
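
To make the idea of "links to files" concrete, here is a minimal Python sketch. It assumes the links are Cloud Storage URIs of the form gs://bucket/path; the row contents, the column name "cram", and the bucket name are placeholders invented for the example, not values from the Quickstart workspace.

```python
def split_gs_uri(uri):
    """Split a gs:// link into (bucket, object_path)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a Cloud Storage link: {uri}")
    bucket, _, object_path = uri[len(prefix):].partition("/")
    return bucket, object_path


# Hypothetical data table row: metadata plus a link to the file itself.
row = {
    "sample_id": "sample-001",
    "cram": "gs://example-public-bucket/crams/sample-001.cram",
}

bucket, object_path = split_gs_uri(row["cram"])
print(bucket)
print(object_path)
```

The row itself is tiny. The large file stays in the bucket named by the link until an analysis actually needs it.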

What is in the BigQuery tables? Click for answer!

The BigQuery tables reference the 1,000 Genomes dataset stored by Google. The first BigQuery table includes participant information and the second includes sample (i.e. genomic) data files. The data are stored in BigQuery tables accessible by anyone.

Notebooks-QuickStart_Part-2_Big-Query-tables_Screen_shot.png

The information in this table will allow Terra to grab the data you need for a notebook analysis - you'll see that in Part 4!

What is in the cohort table? Click for answer!

The cohort table is what you exported from the data explorer. It's not data at all, but a SQL query that returns a list of IDs for those participants that satisfy the selection criteria you specified in Part 1. The actual SQL query is in the fourth column (circled in the screenshot below). 

Notebooks-QuickStart_Part-2_Cohort_table_Screen_shot.png

How will Terra use the tables to get the data for a notebook analysis? Click for the answer!

In Step 4 of the tutorial, Terra will use the information in the tables to get the data you want and bring it into the Cloud Environment VM memory for analysis.

The query language allows you to import data for only those participants in your subset by joining the subset IDs (from the SQL query in the cohort table) with the BigQuery tables. Note that you don't have to know SQL programming to compose your query!
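
The join described above can be sketched as follows. This is a rough illustration rather than the tutorial notebook's code: an in-memory SQLite database stands in for BigQuery so the example runs anywhere, and every table name, column name, and value is hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for BigQuery; all names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE participants (participant_id TEXT, super_population TEXT, exome_center TEXT);
    CREATE TABLE samples (participant_id TEXT, sample_file TEXT);
    INSERT INTO participants VALUES
        ('P1', 'SAS', 'BGI'), ('P2', 'EUR', 'BCM'), ('P3', 'SAS', 'BCM');
    INSERT INTO samples VALUES
        ('P1', 'p1.cram'), ('P2', 'p2.cram'), ('P3', 'p3.cram');
""")

# Plays the role of the query stored in the cohort table: it returns IDs only.
cohort_query = """
    SELECT participant_id FROM participants
    WHERE super_population = 'SAS' AND exome_center IN ('BGI', 'BCM')
"""

# Join the cohort IDs to the sample table to fetch data for the cohort only.
join_query = f"""
    SELECT s.participant_id, s.sample_file
    FROM samples AS s
    JOIN ({cohort_query}) AS c ON s.participant_id = c.participant_id
    ORDER BY s.participant_id
"""

rows = conn.execute(join_query).fetchall()
print(rows)
conn.close()
```

The cohort query is reused as a subquery, so changing your selection criteria changes which rows come back without touching the rest of the code.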

To learn more about where your data "live" in Terra for analysis, see this article


Step 3: Run the setup notebook

You will likely need to install a number of additional libraries and packages in your Cloud Environment VM or cluster. This notebook does that step. Note that since these packages are installed in the Cloud Environment detachable Persistent Disk (PD), you only need to run it once!
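
The setup notebook in this tutorial is R-based, so the snippet below is not its code. It is a small Python analogue of the same idea, assuming you want to check which packages are already present before installing anything. The second package name is deliberately made up so that something shows up as missing.

```python
import importlib.util


def missing_packages(names):
    """Return the package names that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "json" ships with Python; the second name is a made-up placeholder.
wanted = ["json", "hypothetical_analysis_pkg_xyz"]
to_install = missing_packages(wanted)
print(to_install)
```

Once the missing packages have been installed on the Persistent Disk, a check like this would come back empty on later runs, which is why the setup step only needs to happen once per disk.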

You will learn how to
    1) Start and run a notebook in Terra
    2) Customize the notebook Cloud Environment

How much will it cost? How long will it take? 
Running the setup notebook should take about 20 minutes (including the time to create the virtual machine or cluster) and cost less than $0.25.



What happens when I open a notebook for the first time? 

  When you first open a notebook in a workspace, Terra creates your Cloud Environment (VM or cluster). This can take 5-10 minutes. During this time, don’t refresh the page or try to resume the notebook. 

During creation, you will see a read-only copy of the notebook and a note at the top of the browser that Terra is creating the virtual environment (in the orange rectangle).

If you open any notebook again in your workspace, it won't take as long, as Terra will only need to resume the application compute, not create it.


Step-by-step instructions to run the setup notebook

3.1. Click on the “1_R_environment_setup” notebook in the "Notebooks" tab of your cloned copy of the QuickStart workspace

3.2. Spin up a Cloud Environment VM (default settings) by clicking the "Create" button:

Notebooks-QuickStart_Create-Cloud-Environment_Screen_shot.png

To learn more about customizing the virtual Cloud Environment where your notebook runs, see this article. 

3.3. Click on the "Edit" button at the top of the Preview so you can edit and run the code in the notebook:

Notebooks-QuickStart_Open-in-edit-mode_Screen_shot.png

To learn more about "Edit" versus "Playground" mode, see this article

3.4. Wait for the Cloud Environment VM to start. Note that this can take 4-5 minutes the first time you create a notebook virtual environment in a workspace. 

3.5. Skim the read-only view while the VM spins up to understand what's in the notebook

3.6. Once the VM is up and running, run the first code cell: click in the cell and then click 'Run' from the menu at the top to execute the code in that cell. Note that you can also use the shortcut “shift” + “return” to run a cell. 

3.7. Wait for the * at the left of the cell to turn into a number, i.e. [*] --> [4], which indicates that the code in the cell has finished executing

3.8. Click in the next cell and then click 'Run' to execute the code in that cell, wait for execution to complete, and review the results

3.9. Repeat steps 3.7 and 3.8 until you've executed the code in all cells in the notebook. Make sure to read the documentation so you understand the point of each code block. Note that you don't have to be able to code to run a notebook analysis, as long as your notebook has enough documentation!

3.10. When you've run all the code cells, close the notebook by clicking the green x in the top right.

Step 4: Run the tutorial notebooks

There are three tutorial notebooks that cover different applications.     

2_BigQuery_cohort_analysis
In this notebook, you will bring the BigQuery data from the data table to the notebook environment and do a bit of graphing to confirm the data. This notebook is an R-based notebook, but you do not have to understand R in order to run it.

After running this notebook, you should understand how to bring the cohort data you exported from the Data Library into the notebook application memory in order to run an analysis. The notebook includes some steps to verify the data, but does not do an actual analysis. At this point, you could insert code cells to do your own analysis.

3_Access_and_plot_BigQuery_data
This notebook walks through three different ways to bring BigQuery data into a notebook for analysis.

4_Working_with_data_in_your_cloud_environment
This tutorial explores the two places to store your own data in Terra - the Cloud Environment Persistent Disk and the Workspace bucket. After going through this notebook, you should understand how and why to move data between the PD, the Workspace bucket and an external bucket.

For more information about Terra architecture and where your data files live in it, see this article.
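
As a rough sketch of moving a file between the Persistent Disk and the Workspace bucket, the Python below builds gsutil cp commands without running them. It assumes the workspace bucket address is exposed to the notebook through a WORKSPACE_BUCKET environment variable; if that is not the case in your environment, substitute the bucket address shown on your workspace dashboard. The fallback bucket name and the file paths are placeholders.

```python
import os


def copy_command(source, destination):
    """Build (but do not run) a gsutil command that copies one file."""
    return ["gsutil", "cp", source, destination]


# Assumption: the workspace bucket is available as gs://<bucket-name>.
# The fallback value and the file names below are placeholders.
workspace_bucket = os.environ.get("WORKSPACE_BUCKET", "gs://example-workspace-bucket")

# Persistent Disk -> Workspace bucket
upload = copy_command("results/plot.png", f"{workspace_bucket}/results/plot.png")
# Workspace bucket -> Persistent Disk
download = copy_command(f"{workspace_bucket}/results/plot.png", "results/plot.png")

print(" ".join(upload))
print(" ".join(download))
```

To actually perform the copy, you could pass either list to subprocess.run, or type the printed command into a notebook terminal.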

How much will it cost? How long will it take? 
This notebook should take about twenty minutes and cost less than $0.25 to complete.

Step-by-step instructions to run the tutorial notebooks

Once you understand the basic process of running a notebook such as the setup notebook in Part 3, you will run the tutorial notebooks the same way. There are three tutorial notebooks in the Quickstart. The first two do the same thing, with two different techniques. The third explores how to store and move your own data between the two options for workspace storage.

4.1. Open the notebook in Edit mode following the steps above.

4.2. Read the documentation and run the notebook cells in order 



Additional notebooks resources

  To learn more about interactive statistics and visualization with Jupyter notebooks, click here for Terra notebooks documentation. 

To learn more about how to customize the Cloud Environment, see this article

For notebooks-focused featured workspaces, see the Showcase Library
