Terra's mission is “to help accelerate research by integrating data, analysis tools, and built-in security components to deliver frictionless research flows from data to results”. What does that mean? How can Terra help you with your research? This article summarizes the cloud components you'll use when working in Terra, and how working in the cloud differs from working locally.
Project data and tools - together in a Terra workspace
Whether you're interested in running pipelines, performing statistics, or visualizing your data, you can access and manage all the tools and data you need in a Terra workspace dedicated to your research project.
Things you can do in a Terra workspace
- Store and track data - including from multiple sources in the cloud - in spreadsheet-like data tables.
- Access and store bulk analysis workflow tools from Dockstore and the Broad Methods Repository.
- Analyze data with both batch and interactive analysis modes.
- Collaborate in a shared space with built-in security features.
Workspaces function like a (very powerful) desktop computer, except the working parts are all in the cloud and you operate it from your browser.
Working in the cloud versus working locally
Working in the cloud requires a mindset shift.
- Data in the cloud can be anywhere
When analyzing on a local machine, data is stored locally on a physical disk connected to your computer or HPC cluster. In Terra, your analysis will run in the cloud on data stored in cloud storage (in a data repository, or Google bucket).
Data doesn't have to be in the same cloud location when working in Terra. You can analyze anything you're authorized to use - as long as it has a Uniform Resource Identifier (URI) that lets Terra access it. Although you have dedicated workspace storage in Terra, you don't need to have any data stored there to use any of Terra's analysis tools. You can pull input data into the virtual machine (VM) from external cloud storage to analyze, without ever copying or paying to store the primary data. Your workspace includes a Data page to help keep track of data, even when the files live in disparate places, such as access-controlled institutional repositories.
- Terra "spins up" the compute resources you need, when you need them
When running an analysis on your HPC cluster or local machine, you use the local (fixed) computer resources. When you run an analysis in your Terra workspace, Terra requests a VM (compute and disk resources) from Google to run your code. Because it's set up when you need it, you can customize and even change the CPU and disk sizes of the VM in Terra when you run your analysis. The VM only exists when it's running.
- You pay for cloud compute resources as you use them
In a traditional model, you pay for fixed computer disk space and CPUs, whether or not you are using them. Even if you’re using an institutional HPC cluster, you pay overhead for the maintenance of this resource. When working in the cloud, you pay for only the Google compute and disk resources you use. Google costs are passed through a Terra Billing project with no markup, which is paid by a Google Billing account.
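As a rough illustration of the pay-as-you-go model described above, the sketch below compares a short analysis on a VM that exists only while it runs with the same machine left on for a whole month. The hourly and per-GB rates are made-up placeholders (not actual Google prices), and the function name is hypothetical.

```python
HOURS_PER_MONTH = 730  # approximate average number of hours in a month


def on_demand_cost(vm_hours, vm_rate_per_hour, disk_gb, disk_rate_per_gb_month):
    """Cost of a VM and its disk when both exist only for `vm_hours` hours."""
    compute = vm_hours * vm_rate_per_hour
    # Disk is billed per GB-month, so prorate it by the fraction of the month used.
    disk = disk_gb * disk_rate_per_gb_month * (vm_hours / HOURS_PER_MONTH)
    return compute + disk


# Placeholder rates for illustration only (assumptions, not real prices).
VM_RATE = 0.20    # dollars per VM-hour
DISK_RATE = 0.04  # dollars per GB per month

ten_hour_run = on_demand_cost(10, VM_RATE, 100, DISK_RATE)
always_on = on_demand_cost(HOURS_PER_MONTH, VM_RATE, 100, DISK_RATE)

print(f"10-hour analysis: ${ten_hour_run:.2f}")
print(f"Same machine left on all month: ${always_on:.2f}")
```

With real rates substituted in, the same arithmetic gives a first-pass estimate before you launch an analysis.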
- There are two kinds of analysis, which run on two separate cloud systems
Within your workspace, you can run workflows (bulk analysis) or interactive analysis apps (Galaxy, Jupyter Notebooks, or RStudio). The two different kinds of analysis run on two different VMs, which Terra sets up for you. Each analysis system has its own CPU, memory, and hard drive.
It's as if every workspace comes with two separate computers, one set up for workflows, one set up for interactive analysis apps. This means sometimes you need to transfer data even when working in the same workspace - e.g., to run a notebook analysis on data generated in a workflow, or to allow colleagues access to data generated from an app like RStudio.
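The transfer step mentioned above can be sketched as follows: a small helper that builds a `gsutil cp` command to copy a workflow output from the workspace bucket onto the disk of the interactive analysis VM. This is a minimal sketch under stated assumptions: it assumes the notebook environment exposes the bucket address in a `WORKSPACE_BUCKET` environment variable, and the file paths and fallback bucket name are hypothetical. The command is only printed, not executed.

```python
import os
import shlex


def copy_command(source, destination):
    """Build (but do not run) a gsutil command that copies one file."""
    return ["gsutil", "cp", source, destination]


# Assumption: the environment provides WORKSPACE_BUCKET (e.g. "gs://...").
# The fallback value is a hypothetical placeholder.
bucket = os.environ.get("WORKSPACE_BUCKET", "gs://hypothetical-workspace-bucket")

# Hypothetical workflow output to bring onto the interactive analysis VM.
cmd = copy_command(f"{bucket}/workflow-outputs/sample1.vcf.gz", "./sample1.vcf.gz")
print(shlex.join(cmd))
```

Swapping the two arguments copies in the other direction, for example to put a file generated in RStudio into the workspace bucket where colleagues can reach it.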
- Built-in security
Although you can "point to" any data that has a URI (including DRS URIs), Terra can only access data you're authorized to work with.
- Terra will only be able to pull controlled data into the VM if you have linked your authorization (e.g., NIH or dbGaP) to Terra.
- Workspace Owners precisely control who has access to the data and tools when sharing workspaces.
- For additional protection (and to prevent accidentally sharing with unauthorized people), you can assign an Authorization Domain (AD) to a workspace. Especially useful when working with controlled data, ADs protect both primary and generated data in the workspace and any copies. In the case of controlled data, the AD is restricted to the access list.
For an in-depth look at how cloud resources are integrated into Terra, see Terra architecture and where your files live in it.
Analysis tools on Terra
Pipelining (workflows) | Interactive analysis | Jupyter Notebooks | RStudio | Galaxy
If you have additional software needs, join our community to request a feature.
Data in the cloud and in your workspace
One advantage of working in the cloud is being able to access data beyond what's stored on local disks. Terra can use any data that has a Uniform Resource Identifier (URI) - including data files in workspace storage or an external repository like Gen3 - in an analysis.
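To make the idea of "any data that has a URI" concrete, here is a minimal sketch that groups a list of input files by where they live, using only the Python standard library. The bucket names, repository host, and function names are hypothetical; nothing here contacts Terra or Google.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def storage_location(uri):
    """Return (scheme, bucket-or-host) for a cloud file URI."""
    parts = urlsplit(uri)
    if not parts.scheme or not parts.netloc:
        # A bare local path has no scheme or host, so it is not a cloud URI.
        raise ValueError(f"not a usable URI: {uri!r}")
    return parts.scheme, parts.netloc


def group_by_location(uris):
    """Map each storage location to the list of URIs stored there."""
    groups = defaultdict(list)
    for uri in uris:
        groups[storage_location(uri)].append(uri)
    return dict(groups)


# Hypothetical inputs: one file in workspace storage, two in external locations.
inputs = [
    "gs://hypothetical-workspace-bucket/results/sample1.vcf.gz",
    "gs://hypothetical-reference-bucket/hg38/genome.fasta",
    "drs://repository.example.org/0a1b2c3d",
]

for (scheme, host), files in group_by_location(inputs).items():
    print(f"{scheme}://{host}: {len(files)} file(s)")
```

All three files could feed the same analysis even though none of them share a storage location.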
Three ways to store/access data in a Terra workspace
- Workspace cloud storage (i.e., Google bucket)
- Cloud Environment Persistent Disk storage
- Data tables that include cloud location metadata to access files in external locations
Where you store your data depends on what data you have and how you will analyze it (i.e., workflow versus interactive analysis). Read on to learn more.
For a deeper dive on Terra and Google Cloud infrastructure, see Finding your data and tools in Terra (platform architecture).
1. Workspace cloud storage (i.e., Google bucket)
Each workspace has a dedicated Google bucket for cloud storage.
The workspace bucket
- Is accessible from outside Terra (by anyone with sufficient workspace permissions, or by anyone at all if the bucket is public)
- Is created/deleted when the workspace is created/deleted
- Has the same access roles as the workspace
- Is integrated with workflows (you can pull workflow inputs directly from a table)
- Stores generated data from a workflow by default
How is a workspace bucket different from any other Google bucket? Unlike external Google buckets, the workspace bucket is covered by Terra's built-in security (see Terra security posture to learn more) and interfaces with Terra directly so you can see what’s inside it and manipulate it from the Files section of the workspace Data page. You can access data in a Google bucket - workspace or external - from one or more data tables, which integrate directly with workflows in Terra.
2. Cloud Environment Persistent Disk
The Cloud Environment Persistent Disk is created when you create a Cloud Environment for the first time. It stores data generated from an interactive analysis running in the Cloud Environment. You can customize the disk size and type when you launch Jupyter, RStudio, or Galaxy.
Cloud Environment Persistent Disks:
- Are unique to each user (collaborators in the same workspace cannot access each other's PD)
- Can be customized (size and type) when you launch Galaxy, Jupyter, or RStudio
- Exist until you explicitly delete them (even if you delete the Cloud Environment)
Galaxy versus Jupyter and RStudio PDs
There is one Cloud Environment (and one PD) for Jupyter Notebooks and RStudio and a second distinct Cloud Environment (and PD) for Galaxy. Note that this means each collaborator can have two Cloud Environments and two PDs at the same time in the same workspace.
3. Data tables (in the workspace data tab)
Data tables are part of the Terra infrastructure. They function as built-in spreadsheets to help keep track of metadata and keep primary and generated data organized.
Types of data in a table
- Can store primary data (structured or tabular data such as phenotypic or demographic data or personal health records) and can reference data files in cloud storage
- Can reference URIs (metadata) for large input data files in external or workspace cloud storage
Take advantage of datasets in the Terra Data Library
Streamlined access to large numbers of large datasets is one advantage of working in Terra. You can use Terra to search and access many public and controlled-access datasets. While you can upload any data to your workspace, you can also save money on data storage and egress by analyzing data from an existing repository without re-copying it.
Where's the data? In Terra, "importing" data is a bit of a misnomer. When you "import" data from an existing repository, you are importing links to the data in the cloud, not the actual data.
This metadata tells your workspace tools where those data files are located. You don't actually need to copy the raw data into your workspace to analyze it.
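A data table that holds links rather than files can be sketched as a small tab-separated file: one row per sample, with a column of URIs pointing at data that stays in its original location. The sketch below assumes the `entity:<table>_id` header convention for the first column of an uploaded table; the sample IDs, column names, and bucket paths are hypothetical.

```python
import csv
import io


def build_table_tsv(table_name, rows, columns):
    """Return TSV text with one row per entity and one column per attribute."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Assumed header convention: the first column names the table and its ID.
    writer.writerow([f"entity:{table_name}_id", *columns])
    for row_id, values in rows.items():
        writer.writerow([row_id, *(values[column] for column in columns)])
    return buffer.getvalue()


# Hypothetical samples: the "cram" column holds links, not the files themselves.
samples = {
    "sample1": {"cram": "gs://hypothetical-repo/sample1.cram", "participant_age": "54"},
    "sample2": {"cram": "gs://hypothetical-repo/sample2.cram", "participant_age": "61"},
}

print(build_table_tsv("sample", samples, ["cram", "participant_age"]))
```

The table itself is only a few hundred bytes, however large the CRAM files it points to are.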
Understanding data storage options in Terra
To learn more about the Terra ecosystem and where your data are stored, see Terra architecture and where your files live in it.
Pipelining with workflows
You can perform whole pipelines on Terra - from preprocessing and trimming sequencing data to aligning and downstream analyses - using workflows. Workflows on Terra are written in the human-readable Workflow Description Language (WDL). You can search for and import workflows into your workspace from Dockstore or the Broad Methods Repository.
Finding workflows in curated workspaces
You’ll find workflows for analyzing and processing different types of sequencing data in Terra’s Showcase and Tutorials Library. Check out some of the available workflows in these curated workspaces to identify tools that match your research interests.
You need to be registered on Terra to view Terra workspaces. If you haven't registered yet, follow the registration steps below.
You can also find workflows in Dockstore and the Broad Methods Repository.
Interactive analysis - Jupyter Notebooks, RStudio and Galaxy
Integrated apps allow you to run complex statistics and visualization in real time on large amounts of data, and visualize results immediately. Click the title links below for more detailed documentation.
Document and share analyses with collaborators inside the Terra platform. Integrated Jupyter Notebooks contain code cells to run interactive analysis (in R or Python) and markdown cells to enable detailed documentation of your analyses and data.
If you're looking for a richer IDE experience for R development than Jupyter Notebooks, try RStudio.
RStudio on Terra includes
- Variable explorer, R Markdown editor, debugger, terminal
- Support for launching RShiny apps
- First class Bioconductor support
- Git integration
Galaxy on Terra
Looking for additional tools that are accessible, reproducible, transparent, and community-centered?
See Galaxy interactive environments to learn more about
- How to launch a Galaxy instance
- Navigating the Galaxy interface
- How to import data to your Galaxy instance
- How to install additional tools in the tools panel
Choose the VM software you need
Interactive analysis tools run on a Cloud Environment, which includes a virtual machine (VM) and storage (VM memory plus a Detachable Persistent Disk). You can customize the software installed on your VM by selecting one of the preinstalled Cloud Environments on Terra or choosing a custom environment by specifying a Docker container ("Docker") or using a startup script. Dockers ensure you and your colleagues analyze with the same software, making your results reproducible.
To learn more about virtual Cloud Environment options, read this guide.
The right performance at a cost that's right for you
To balance compute efficiency and cost, Terra lets you choose the compute power of your Cloud Environment VM. The default environment is sufficient for many bioinformatics analyses. You can also customize the software, compute power, and disk type and size by selecting a custom option. Running especially large computations? Choose a Spark Cluster under “Custom” and run in parallel on the machines you specify.
Four steps to get started on Terra
1. Register your account
Register your Google or institutional account at app.terra.bio (this part is free). If you have a Google account, just click on the menu at the top left, or follow these step-by-step instructions. If you do not have a Google account, see how to set up a Google account with a non-gmail address.
2. Claim $300 in Google credits
$300 in Google cloud credits helps you explore Terra before committing your own grant dollars. See this step-by-step guide!
3. Explore showcase and tutorial workspaces
You aren’t limited to the workspaces suggested in this overview! Look through the entire Showcase and Tutorials Library of more than 30 examples for a variety of curated use-cases. Showcase workspaces include descriptions, downsampled data, and cost estimates so you can try different tools and gain the confidence to run on your own data.
4. Join our community
Join our community and see how a cloud-native platform - built by the Broad Institute of MIT and Harvard Data Sciences Platform and Verily Life Sciences - can transform the way you do bioinformatics research. Use the forum to post questions or search Terra Support for tutorials.
Practice with these data-focused Showcase workspaces
- Terra-Notebooks-QuickStart: Import public-access 1000 Genomes data from BigQuery, a cloud data warehouse with built-in machine learning
- Terra-Data-Tables-QuickStart: Learn how to use data tables to organize, access and analyze data - including sets of data - in the cloud.
- Introduction to TCGA Dataset: Explore controlled-access TCGA data
- ENCODE Tutorial: Import an ENCODE ChIP-seq dataset
Genomic analyses (GATK4 Best Practices workflows)
- GATK4 Exome-Analysis-Pipeline
- GATK4 Whole-Genome-Analysis-Pipeline
- GATK4 Mitochondria-SNPs-Indels-hg38
Single-cell RNA-seq analyses
- HCA_Optimus_Pipeline: Processing workflow for 10x Genomics datasets
- HCA_Smart-seq2_Multi_Sample_Pipeline: Processing workflow for Smart-seq2 datasets
- Cumulus: Workflows for large-scale single-cell and single-nuclei datasets
- DNA-Methylation-Preprocessing: Workflow for conducting methylation analyses
- ENCODE Tutorial: Workflow for ChIP-seq signal enrichment analyses
Example notebooks in Showcase workspaces
Explore Jupyter Notebooks-based analyses in Terra's Showcase workspaces. To see a read-only copy, select a workspace below and click the workspace Analyses tab.
To run the notebook, make your own copy (clone) of the workspace.
- Hail-Notebook-Tutorials: Practice genomic analysis with Hail
- Bioconductor: Explore two Notebooks dedicated to RNA-seq Bioconductor packages
- Cumulus: Try a Notebook featuring Pegasus software for single-cell analysis