This article includes step-by-step instructions to run the third of four optional notebooks in the Notebooks Quickstart workspace.
For this demo, you will use a data explorer to create a custom cohort of 1,000 Genomes data from the Terra Data Library.
Before running the notebook, you will need to complete two steps (detailed instructions below):
- Choose and import a custom cohort of 1,000 Genomes data from the Terra Data Library
- Export the cohort to your workspace
This diagram illustrates the platforms and tools you will be using.
Step 1: Explore data in the Data Library
Before running a notebook analysis, you will need data! In this step, you'll
a) Access and explore data using a data explorer in the Data Library and
b) Use selection criteria to define a subset (custom cohort) of participants for analysis.
This step should take 5 - 10 minutes and won't cost anything.
1.1. Go to the "Data Library" at https://app.terra.bio/#library/datasets
1.2. Click the button to browse the "1,000 Genomes Low Coverage" dataset. You can see there are several parameters, with bars that indicate how many participants in the dataset satisfy those parameters. You'll use those parameters to narrow down the dataset to just those subjects you want to study.
1.3. Choose the selection criteria for your study subset ("cohort") by clicking on one or more bars in the display panes. You can immediately see how many subjects satisfy your criteria.
For example, to restrict your study to participants of South Asian descent whose exome sequencing center was either BGI or BCM, you would click those values in the corresponding cards, as in the screenshots below.
You can see all the selection criteria at the top:
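To make the example above concrete, here is a minimal Python sketch of how such criteria narrow a dataset. It assumes that values chosen within one card are alternatives (BGI or BCM) and that different cards must all match, which is what the example implies. The records, field names, and values are hypothetical and are not the real 1,000 Genomes schema.

```python
# Hypothetical participant records; field names and values are illustrative only.
participants = [
    {"id": "P1", "super_population": "South Asian", "exome_center": "BGI"},
    {"id": "P2", "super_population": "European", "exome_center": "BGI"},
    {"id": "P3", "super_population": "South Asian", "exome_center": "BCM"},
    {"id": "P4", "super_population": "South Asian", "exome_center": "OTHER"},
]

# One entry per card: the set of values clicked in that card.
criteria = {
    "super_population": {"South Asian"},
    "exome_center": {"BGI", "BCM"},
}


def in_cohort(participant, criteria):
    """Return True if the participant matches every card (assumed AND across
    cards), where any one of a card's selected values counts as a match."""
    return all(participant[field] in allowed for field, allowed in criteria.items())


cohort_ids = [p["id"] for p in participants if in_cohort(p, criteria)]
print(len(cohort_ids), "participants in cohort:", cohort_ids)
```

The count printed at the end plays the same role as the participant count the data explorer shows as you click.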
Step 2: Export study data to the Terra workspace
The datasets in Terra's Data Library are integrated with the rest of the platform, making it seamless to export data to a workspace for analysis. By the end of this step, you’ll know how to export a subset of 1,000 Genomes data from the Library to your workspace for analysis.
This step should take a few minutes and won't cost anything.
TIP: A note about controlled data
If the data are restricted-access, you will need to link your authorization to your Terra account. For some datasets, you will need linked authorization even to view the data in a data explorer. To learn more about linking authorization to access controlled data on external platforms, see this article.
2.1. Click "Save Cohort" (blue button at top right) to save in a Terra workspace. Take note of the number of participants in your cohort (circled in the screenshot below):
2.2. Give your cohort a name you will remember easily!
2.3. Designate a destination workspace: choose "Select an existing workspace", then pick your copy of this workspace from the dropdown menu.
2.4. Click "Import"
You'll be taken to the "Data" tab of your workspace copy. Notice the two kinds of tables: the "BigQuery" tables, which were already in the original workspace, and a new "cohort" table.
- Once your export is complete, go back to your workspace and take a look at the data tab. When you export data from the Data Library, Terra generates data tables in your workspace. In this case, Terra generated a "cohort" table when you "exported" your cohort from the Data Library (you may have noted that the BigQuery tables were already in the workspace).
Data tables are similar to spreadsheets that help organize and keep track of data in the cloud that you will use in an analysis in Terra.
Note that the data files are often not actually stored in your workspace bucket. Instead, tables include links to files stored in cloud storage. One advantage is that someone else (Google, in this case) pays to store the large genomic data files. You just bring what you need into the VM for analysis. Feel free to expand the two tables and poke around to see what the data you just imported look like.
- The BigQuery tables reference the 1,000 Genomes dataset stored by Google. The first BigQuery table includes participant information and the second includes sample (i.e. genomic) data files. The data are stored in BigQuery tables accessible by anyone.
The information in these tables allows Terra to grab the data you need for a notebook analysis.
- The cohort table is what you exported from the data explorer. It's not data at all, but a SQL query that returns a list of IDs for those participants who satisfy the selection criteria you specified in Step 1. The actual SQL query is in the fourth column (circled in the screenshot below).
- In Step 4 of the tutorial, Terra will use the information in the tables to get the data you want and bring it into the Cloud Environment VM memory for analysis.
The query language lets you import data for only those participants in your subset by joining the subset IDs (returned by the SQL query in the cohort table) with the BigQuery tables. Note that you don't need to know SQL to compose your query!
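For readers curious what that join looks like, here is a minimal sketch that uses Python's built-in sqlite3 module as a local stand-in for BigQuery. The table names, column names, rows, and gs:// paths are all hypothetical, and the cohort query is a simplified illustration rather than the query Terra actually generates. The point is only the pattern: the saved cohort query returns participant IDs, and joining it against a sample table keeps just the rows for those participants.

```python
import sqlite3

# In-memory SQLite database standing in for the BigQuery tables (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE participant_info (
        participant_id TEXT PRIMARY KEY,
        super_population TEXT,
        exome_center TEXT
    );
    CREATE TABLE sample_info (participant_id TEXT, file_url TEXT);

    INSERT INTO participant_info VALUES
        ('P1', 'SAS', 'BGI'),
        ('P2', 'EUR', 'BGI'),
        ('P3', 'SAS', 'BCM'),
        ('P4', 'SAS', 'OTHER');
    INSERT INTO sample_info VALUES
        ('P1', 'gs://example-bucket/P1.bam'),
        ('P2', 'gs://example-bucket/P2.bam'),
        ('P3', 'gs://example-bucket/P3.bam'),
        ('P4', 'gs://example-bucket/P4.bam');
    """
)

# Hypothetical equivalent of the SQL stored in the cohort table:
# it returns only a list of participant IDs, not the data itself.
cohort_query = """
    SELECT participant_id
    FROM participant_info
    WHERE super_population = 'SAS' AND exome_center IN ('BGI', 'BCM')
"""

# Join the cohort IDs with the sample table to fetch only the subset's rows.
join_query = f"""
    SELECT s.participant_id, s.file_url
    FROM sample_info AS s
    JOIN ({cohort_query}) AS c ON s.participant_id = c.participant_id
    ORDER BY s.participant_id
"""

rows = conn.execute(join_query).fetchall()
for participant_id, file_url in rows:
    print(participant_id, file_url)
```

Because the cohort is stored as a query rather than a copy of the data, the same pattern works regardless of how large the underlying tables are.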
To learn more about Terra's native storage options, see Terra architecture and where your files live in it.
Step 3: Run the notebook
Running the Option 3: Data in the Data Library notebook should take less than 20 minutes (including the time to create the virtual machine or cluster) and cost less than $0.25.