Research in the cloud on Terra

Allie Hajian

Terra's mission is “to help accelerate research by integrating data, analysis tools, and built-in security components to deliver frictionless research flows from data to results.” What does that mean? How can Terra help you with your research? This article summarizes the types of analyses you can do, how to access datasets in the Terra Data Library, and how to get started on Terra.

Working in the cloud versus working locally

Cloud-native analysis enables research on the tremendous amounts of data coming online and has many advantages over working locally. However, working in the cloud requires a mindset shift. In some ways, it's like renting instead of owning your data, tools, and compute resources. Below are some highlights to help define a new mental model for bioinformatics in the cloud.

Data can be (almost) anywhere in the cloud

When analyzing on a local machine or HPC, data is stored locally on a physical disk connected to your computer or HPC cluster. In Terra, your analysis will run in the cloud on data stored in public cloud storage infrastructure like a data repository or Google bucket.

The data doesn't have to be in the same cloud location

You can analyze anything you're authorized to use - as long as it has a Uniform Resource Identifier (URI) that lets Terra access it. Although you have dedicated workspace storage in Terra, you don't need any data stored there to use any of Terra's analysis tools. You can pull input data into the virtual machine (VM) from external cloud storage without ever copying or paying to store the primary data.
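To make the idea of a URI concrete, here is a minimal sketch that splits a cloud storage URI into its parts using only the Python standard library. The bucket name and file path are made up for illustration, and this is not how Terra itself resolves URIs.

```python
from urllib.parse import urlparse


def describe_uri(uri: str) -> dict:
    """Split a cloud URI into its scheme, location (e.g., bucket), and path."""
    parsed = urlparse(uri)
    return {
        "scheme": parsed.scheme,          # e.g., "gs" for Google Cloud Storage
        "location": parsed.netloc,        # e.g., the bucket name
        "path": parsed.path.lstrip("/"),  # the object path inside the bucket
    }


# Hypothetical URI: the bucket and file names are placeholders.
example = describe_uri("gs://example-bucket/samples/sample1.cram")
print(example)
```

The point is that the URI alone carries everything a tool needs to locate the file, which is why the data can stay where it already lives.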

Managing data in different cloud locations

Your workspace includes a Data page with integrated spreadsheet-like tables to help keep track of different kinds of data, even when the files live in disparate places.

Terra "spins up" the compute resources you need, when you need them

When running an analysis on your HPC cluster or local machine, you use the local (fixed) compute resources. When you run an analysis in your Terra workspace, Terra requests a VM (compute and disk resources) from Google to run your code. Because it's set up when you need it, you can customize and even change the CPU and disk sizes of the VM in Terra when you run your analysis. The VM only exists when it's running.

You pay for cloud resources as you use them

In a traditional model, you pay for fixed storage, disk, and CPUs whether or not you use them. Even if you’re using an institutional HPC cluster that is "free," someone is paying for those resources, and you often pay overhead for their maintenance. When working in the cloud, you pay only for the Google resources you use. Google costs are passed through a Terra Billing project with no markup and are ultimately covered by a Google Cloud Billing account.
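As a rough illustration of pay-as-you-go arithmetic, the sketch below adds a compute charge (hours used times an hourly rate) to a storage charge (GB stored times a monthly rate). The rates here are invented placeholders, not Google's or Terra's prices; check current Google Cloud pricing for real numbers.

```python
def estimate_cost(vm_hours: float, vm_rate_per_hour: float,
                  disk_gb: float, disk_rate_per_gb_month: float,
                  months: float) -> float:
    """Estimate a bill as compute-time cost plus storage cost."""
    compute = vm_hours * vm_rate_per_hour
    storage = disk_gb * disk_rate_per_gb_month * months
    return round(compute + storage, 2)


# Hypothetical rates, for illustration only.
total = estimate_cost(
    vm_hours=20, vm_rate_per_hour=0.10,        # 20 hours of VM time
    disk_gb=50, disk_rate_per_gb_month=0.04,   # 50 GB kept for one month
    months=1,
)
print(f"Estimated cost: ${total:.2f}")
```

Note that the compute term drops to zero when no VM is running, while the storage term continues for as long as the data is kept.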

Terra has two analysis modes, which run on two separate cloud systems

You can run workflows (bulk analysis) or interactive analysis apps (Galaxy, Jupyter Notebooks, or RStudio) within your workspace. The two kinds of analyses run on two separate VMs, which Terra sets up for you. Each analysis system has its own CPU, memory, and hard drive.

It's as if every workspace comes with two separate computers: one set up for workflows and one set up for interactive analysis apps. This means you sometimes need to transfer data even when working in the same workspace - e.g., to run a notebook analysis on data generated in a workflow or to give colleagues access to data generated from an app like RStudio.
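One way to picture such a transfer is copying a workflow output from workspace storage onto the disk of the interactive analysis VM. The sketch below only builds and prints a gsutil cp command rather than running it. It assumes the notebook environment exposes the bucket address in a WORKSPACE_BUCKET environment variable and falls back to a placeholder if not; the output path is hypothetical.

```python
import os
import shlex


def build_copy_command(source: str, destination: str) -> list:
    """Build (but do not run) a gsutil command that copies one file."""
    return ["gsutil", "cp", source, destination]


# Assumption: the environment sets WORKSPACE_BUCKET; otherwise use a placeholder.
bucket = os.environ.get("WORKSPACE_BUCKET", "gs://example-workspace-bucket")

# Hypothetical workflow output path inside the workspace bucket.
cmd = build_copy_command(f"{bucket}/outputs/sample1.vcf.gz", "./sample1.vcf.gz")
print(shlex.join(cmd))
```

Swapping the source and destination gives the reverse direction, e.g., saving notebook results to workspace storage so colleagues can reach them.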

Security is built-in and enforced via user-defined permissions and roles

  • Workspace Owners precisely control who has access to data and tools when sharing workspaces.
  • You can grant (and revoke) permissions to individuals or groups. Terra groups help manage and automate permissions to segments - such as teammates - and are especially useful when the individuals in those groups change. 
  • For additional protection (and to prevent accidentally sharing with unauthorized people), you can assign an Authorization Domain (AD) to a workspace. Especially useful when working with controlled data, ADs protect both primary and generated data in the workspace and any copies. In the case of controlled data, the AD is restricted to the access list. 
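The interplay between sharing and an Authorization Domain can be pictured with a deliberately simplified model: a user gets in only if the workspace was shared with them and they belong to every group in the AD. This is an illustrative sketch, not Terra's implementation; the function, group name, and email addresses are all hypothetical.

```python
def can_access(user: str, shared_with: set, auth_domain_groups: list,
               group_members: dict) -> bool:
    """Return True if `user` may open the workspace in this simplified model."""
    # Sharing comes first: the workspace must be shared with the user.
    if user not in shared_with:
        return False
    # The AD adds a second check: membership in every AD group.
    return all(user in group_members.get(group, set())
               for group in auth_domain_groups)


# Hypothetical group and users.
group_members = {"study-approved-users": {"alice@example.org"}}
shared_with = {"alice@example.org", "bob@example.org"}
auth_domain = ["study-approved-users"]

print(can_access("alice@example.org", shared_with, auth_domain, group_members))
print(can_access("bob@example.org", shared_with, auth_domain, group_members))
```

In this model, accidentally sharing with bob@example.org does not expose the data, because the AD check still fails for anyone outside the approved group.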

Terra's security extends to controlled data stored in external repositories

  • Although you can "point to" any data with a URI (including DRS URIs), Terra can only access data you're authorized to work with. 
  • Terra will only be able to pull controlled data into the VM if you have linked your authorization (e.g., NIH or dbGaP) to Terra.

See Terra's Security Posture for more details.

Understanding Terra components
For an in-depth look at how cloud resources work together for a seamless platform experience, see Terra architecture and where your files live in it.

Analysis tools on Terra

Workflows (pipelining) | Jupyter Notebooks | RStudio | Galaxy

If you have additional software needs, please join our community to request a feature.

Project data and tools - together in a Terra workspace 

Terra is designed to let you get your work done without the complexities of dealing directly with cloud vendors (Google or Azure). Whether you're interested in running pipelines, performing statistics, or plotting and visualizing your data, you can access and manage all the tools and data you need in a Terra workspace dedicated to your research project while Terra does the heavy lifting of interfacing with Google on your behalf.

Things you can do in a Terra workspace

  • Store data with dedicated cloud storage and Cloud Environment Persistent Disks. 
  • Organize and track data from multiple sources in the cloud in spreadsheet-like data tables.
  • Access and store bulk analysis workflow tools from Dockstore and the Broad Methods Repository.
  • Analyze data with both batch and interactive analysis modes.
  • Collaborate in a shared space with built-in security features.

Workspaces function like a (very powerful) desktop computer, except the working parts are all in the cloud, and you operate it from your browser. 

Data in the cloud and in your workspace

One advantage of working in the cloud is that you can access data beyond what's stored on local disks. Terra can use any data that has a Uniform Resource Identifier (URI) - including data files in workspace storage or an external repository like Gen3 - in an analysis.

Diagram of a Terra workspace plus an external bucket, all existing in the cloud. The workspace comprises a Cloud Environment with a compute engine and persistent disk storage, a workspace storage bucket, and data tables.

Three ways to store/access data in a Terra workspace

  1. Workspace cloud storage (i.e., Google Bucket)
  2. Cloud Environment Persistent Disk storage
  3. Data tables that include cloud location metadata to access files in external locations

Where you store your data depends on what data you have and how you will analyze it (i.e., workflow versus interactive analysis). Read on to learn more.

1. Workspace cloud storage (i.e., Google Bucket)

Each workspace has a dedicated Google bucket for cloud storage.

The workspace bucket

  • Is accessible from outside Terra (by anyone with sufficient workspace permissions, or by anyone at all if the bucket is publicly accessible)
  • Is created/deleted when the workspace is created/deleted
  • Has the same access roles as the workspace
  • Is integrated with workflows (you can pull workflow inputs directly from a table)
  • Stores generated data from a workflow by default

How is a workspace bucket different from any other Google bucket? Unlike external Google buckets, the workspace bucket is covered by Terra's built-in security (see Terra security posture to learn more) and interfaces with Terra directly so you can see what’s inside it and manipulate it from the Files section of the workspace Data page. You can access data in a Google bucket - workspace or external - from one or more data tables, which integrate directly with workflows in Terra.

2. Cloud Environment Persistent Disk

The Cloud Environment Persistent Disk is created when you create a Cloud Environment for the first time. It stores data generated from an interactive analysis running in the Cloud Environment. You can customize the disk size and type when you launch Jupyter, RStudio, or Galaxy. 

Cloud Environment Persistent Disks:

  • Are unique to each user (collaborators in the same workspace cannot access each other's PD)
  • Can be customized (size and type) when you launch Galaxy, Jupyter, or RStudio
  • Exist until you explicitly delete them (even if you delete the Cloud Environment)

Galaxy versus Jupyter and RStudio PDs
There is one Cloud Environment (and one PD) for Jupyter Notebooks and RStudio and a second distinct Cloud Environment (and PD) for Galaxy. Note that this means each collaborator can have two Cloud Environments (and two PDs) at the same time in the same workspace.

3. Data tables (in the workspace data tab)

Data tables are part of the Terra infrastructure. They function as built-in spreadsheets to help keep track of metadata and keep primary and generated data organized.

Types of data in a table

  • Primary data (structured or tabular data) such as phenotypic or demographic data or personal health records
  • URIs (metadata) for large input data files in external or workspace cloud storage. Note that a workspace table can reference primary data files stored in external cloud storage or data repositories for analysis
  • Metadata, including any details you need to associate with large data files (e.g., genomic data)
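As a sketch of what such a table might hold, the snippet below builds a small tab-separated table that combines a URI column with a tabular phenotype column. The `entity:sample_id` header follows the convention commonly used for Terra load files, but treat it as an assumption here; the bucket paths and column names are hypothetical.

```python
import csv
import io

# Each row pairs an ID with a link to a large file plus a piece of tabular data.
rows = [
    {"entity:sample_id": "sample1",
     "cram": "gs://example-bucket/sample1.cram",
     "phenotype": "case"},
    {"entity:sample_id": "sample2",
     "cram": "gs://example-bucket/sample2.cram",
     "phenotype": "control"},
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["entity:sample_id", "cram", "phenotype"],
    delimiter="\t",
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(rows)

table_tsv = buffer.getvalue()
print(table_tsv)
```

The table itself stays tiny no matter how large the CRAM files are, because it stores only their locations.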

Datasets in the Terra Data Library

Streamlined access to large numbers of large datasets is one advantage of working in Terra. You can use Terra to search and access many public and controlled-access datasets. While you can upload any data to your workspace, you can save money on data storage and egress by analyzing data from an existing repository without re-copying it.

Summary: Where's the data? In Terra, "importing" data is a bit of a misnomer. When you "import" data from an existing repository, you import links to the data in the cloud, not the actual data.

This metadata tells your workspace tools where those data files are located. You don't need to copy the raw data into your workspace to analyze it.

Understanding data storage options in Terra
To learn more about the Terra ecosystem and where your data are stored, see Terra architecture and where your files live in it.

Pipelining with workflows

You can run whole pipelines on Terra - from preprocessing and trimming sequencing data to alignment and downstream analyses - using workflows. Workflows on Terra are written in the human-readable Workflow Description Language (WDL).
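WDL workflows conventionally receive their inputs as a JSON object whose keys are qualified with the workflow name. The sketch below builds such an object in Python; the workflow name `AlignReads`, its input names, and the file path are hypothetical, and real workflows define their own inputs.

```python
import json


def build_inputs(workflow_name: str, inputs: dict) -> str:
    """Prefix each input with the workflow name and serialize to JSON."""
    qualified = {f"{workflow_name}.{key}": value for key, value in inputs.items()}
    return json.dumps(qualified, indent=2, sort_keys=True)


# Hypothetical workflow and inputs, for illustration only.
inputs_json = build_inputs(
    "AlignReads",
    {
        "input_cram": "gs://example-bucket/sample1.cram",  # file input given as a URI
        "threads": 4,                                      # plain value input
    },
)
print(inputs_json)
```

File inputs are passed as URIs, so the workflow can read data from workspace storage or an external bucket without a separate copy step.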

Finding workflows in curated workspaces

You’ll find workflows for analyzing and processing different types of sequencing data in Terra’s Showcase and Tutorials Library. Check out available workflows in these curated workspaces to identify tools matching your research interests.

You can also search for and import workflows from Dockstore or the Broad Methods Repository into your workspace.

Interactive analysis - Jupyter Notebooks, RStudio, and Galaxy

Integrated analysis apps allow you to run complex statistics in real time on large amounts of data and visualize results immediately. Interactive analysis tools run on a Cloud Environment, which includes a virtual machine (VM), storage (VM memory plus a Detachable Persistent Disk), and software.

Jupyter Notebooks

Document and share analyses with collaborators inside the Terra platform. Integrated Jupyter Notebooks contain code cells to run interactive analysis (in R or Python) and markdown cells to enable detailed documentation of your analyses and data. 

RStudio

If you're looking for a richer IDE experience for R development than Jupyter Notebooks, try RStudio. 

RStudio on Terra includes

  • Variable explorer, R Markdown editor, debugger, terminal
  • Support for launching RShiny apps
  • First-class Bioconductor support
  • Git integration

Galaxy on Terra

Looking for additional tools that are accessible, reproducible, transparent, and community-centered?

See Galaxy interactive environments to learn more about

  • How to launch a Galaxy instance
  • Navigating the Galaxy interface
  • How to import data to your Galaxy instance
  • How to install additional tools in the tools panel

Interactive analysis customization options

Many components of your Cloud Environment can be fully customized. 

Choose the VM software you need

You can customize the software installed on your VM by selecting one of the preinstalled Cloud Environment Application Configurations. You will find included versions and libraries in each preconfigured option by clicking the "What’s installed on this environment?" link below the dropdown.

Standardize software with a custom Docker

Using the same software application configurations ensures everyone has the same computational environment and gets the same results (when inputting the same data and using the same analysis tools, of course!). 

You can select from several preconfigured software (Jupyter application) setups in the drop-down menu. The software application configurations in the dropdown are curated and up to date, so if you can use one, it's an easy way to keep collaborators on the same page. If none of the preconfigured application options meets your needs, you can make your own custom application configuration (i.e., preinstall software and dependencies in the VM) with a Docker image or startup script.

The right performance at a cost that's right for you

To balance compute efficiency and cost, Terra lets you choose your Cloud Environment VM's compute power (size and type) and persistent disk storage. The default environment is sufficient for many bioinformatics analyses. Running especially large computations? Choose a Spark Cluster under “Custom” and run in parallel on the machines you specify. 

To learn more about virtual Cloud Environment options, read this guide.

Four steps to get started on Terra on GCP

1. Register your account

Register your Google or institutional account at app.terra.bio (this part is free). If you have a Google account, click on the menu at the top left, or follow these step-by-step instructions. If you do not have a Google account, see how to set up a Google account with a non-gmail address.

2. Claim $300 in Google credits 

$300 in Google Cloud credits helps you explore Terra before committing your own grant dollars. See this step-by-step guide.

3. Explore showcase and tutorial workspaces

Look through the entire Showcase and Tutorials Library of more than 30 examples for a variety of curated use cases. Showcase workspaces include descriptions, downsampled data, and cost estimates so you can try different tools and gain the confidence to run on your own data.

4. Join our community

Join our community and see how a cloud-native platform - built by the Broad Institute of MIT and Harvard Data Sciences Platform, and Verily Life Sciences - can transform the way you do bioinformatics research. Use the forum to post questions or search Terra Support for tutorials.

Additional resources

Practice with these data-focused Showcase workspaces

Genomic analyses (GATK4 Best Practices workflows)

Single-cell RNA-seq analyses

Epigenomic analyses

Example notebooks in Showcase workspaces

Explore Jupyter Notebooks-based analyses in Terra's Showcase workspaces. To see a read-only copy, select a workspace below and click the workspace Analyses tab.

To run the notebook, make your own copy (clone) of the workspace.

  • Hail-Notebook-Tutorials: Practice genomic analysis with Hail
  • Bioconductor: Explore two Notebooks dedicated to RNA-seq Bioconductor packages 
  • Cumulus: Try a Notebook featuring Pegasus software for single-cell analysis
