Understanding data in the Cloud

Allie Hajian

Managing and organizing large amounts of data stored in multiple locations in the cloud can be challenging. Where is data stored? How do you access the data for analysis? How do you analyze it, and share results with colleagues? How do you organize and track original and generated data and where they are stored? This article helps answer some of these questions if you're working in Terra. Understanding Terra's data-in-the-cloud model can help you work more efficiently. 

Data in the Cloud - A new vision for bioinformatics

Enabling researchers to take advantage of large datasets in the cloud is a fundamental driver of the Terra platform. In a cloud-based model, data are stored in central locations for easier access, rather than each researcher storing and analyzing their own copy. Bringing researchers to the data in the cloud helps:

  • Facilitate true data sharing
  • Minimize storage costs and copying errors (not everyone needs their own copy)
  • Streamline data privacy and security administration (the platform maintains and centralizes security protocols)

Cloud-data-model.png

Where's the data in Terra?

Data that isn't stored and analyzed locally can seem distant and unintuitive. When we talk about data in a Terra workspace, we really mean data that is linked in some way to your workspace, not data that is actually "in" your workspace. In many cases, you won't copy the data to your workspace bucket at all when you analyze it: all the analysis is done in the cloud, and only (some of) the generated data may be deposited in the workspace bucket.

Generally, data you and your colleagues analyze in a Terra workspace will be in one or more of three locations:

Data_in_cloud_3_locations.jpeg

  1. Your Virtual Machine or cluster - The workspace "cloud environment" is a virtual computer requested and set up by Terra (using Broad's Cromwell engine and GCP's API). When you spin up a cloud environment runtime, you'll set the amount of storage on your virtual disk. When doing interactive analyses (Jupyter notebooks or RStudio, for example), the generated output is stored here. Any data you want to keep should be moved to more permanent storage (i.e., Google Cloud Storage). See this article to learn more.

  2. Your Workspace bucket - Data generated by a workflow analysis (WDLs) are stored by default in the workspace Google bucket. You can also move local data or data generated in an interactive analysis to your workspace bucket. If you need to upload data to your workspace bucket, see this article.

  3. Other storage - Ideally the bulk of data you work with will be in some other data storage in the cloud, which Terra can access as long as you have the right permissions and authorization. Examples include data in public- or controlled-access Google Cloud Storage buckets, data repository platforms such as Gen3 Data Commons, or data hosted elsewhere and accessed through the Data Library. 
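To make the three locations concrete, here is a minimal Python sketch that sorts file paths into them. It is only an illustration, not part of Terra: the function name `describe_location`, the bucket names, and the file paths are all hypothetical, and it assumes that anything without a URL scheme lives on the cloud environment's disk.

```python
from urllib.parse import urlsplit


def describe_location(path, workspace_bucket):
    """Classify a path as one of the three locations described above."""
    parts = urlsplit(path)
    if parts.scheme == "":
        # No URL scheme: a plain file path on the cloud environment's disk.
        return "cloud environment disk"
    if parts.scheme == "gs" and parts.netloc == urlsplit(workspace_bucket).netloc:
        return "workspace bucket"
    # Anything else: another bucket, a data repository, and so on.
    return "other storage"


# Hypothetical bucket names and file paths, for illustration only.
WORKSPACE_BUCKET = "gs://fc-example-workspace"
paths = [
    "/home/jupyter/results/summary.csv",
    "gs://fc-example-workspace/outputs/sample1.vcf.gz",
    "gs://example-public-data/crams/sample1.cram",
]
locations = [describe_location(p, WORKSPACE_BUCKET) for p in paths]
for p, location in zip(paths, locations):
    print(f"{location}: {p}")
```

Only the first kind of path goes away with the cloud environment, which is why output you want to keep should be moved to a bucket.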

How to access, organize, and interact with data in Terra

Manage data with workspace tables

Data tables, which can be found in the DATA page of a workspace, provide an interface for organizing data and its metadata, including links to Google buckets. They're like giant, expandable spreadsheets that can keep track of study participants, participant IDs, phenotypes, links to samples in Google buckets, and more. They make it easier to collaborate, and make your work more reproducible. And a flexible design helps you keep as much metadata as you need in one place.

Show me more: screenshots of genomic and phenotypic data in a table

Example 1: Genomic data table

Tables can help keep track of genomic data, both original and generated data files, no matter where the data are physically located. A table of genomic data must have at least two columns: 1) the unique ID for each distinct entity and 2) a link to the data file (the "cram_path" column below is a link to a CRAM file in a Google bucket). The table can include as many other columns as you need, for example for additional metadata (such as the data type, see below, or when and how the data were collected):

Data-QuickStart_Part1_sample-table.png
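A table like this can be written out as a tab-separated file. The sketch below is a minimal example, assuming Terra's load-file convention in which the first column header has the form `entity:<table name>_id`. The sample IDs, bucket paths, `data_type` column, and the helper name `to_load_file` are made up for illustration.

```python
import csv
import io

# Hypothetical rows: a unique ID, a link to the file, and optional metadata.
sample_rows = [
    {"sample_id": "NA12878", "cram_path": "gs://example-bucket/NA12878.cram", "data_type": "WGS"},
    {"sample_id": "NA12891", "cram_path": "gs://example-bucket/NA12891.cram", "data_type": "WGS"},
]


def to_load_file(rows, table_name, columns):
    """Render rows as TSV text whose first header is 'entity:<table_name>_id'."""
    id_column = f"{table_name}_id"
    ids = [row[id_column] for row in rows]
    if len(set(ids)) != len(ids):
        raise ValueError("IDs must be unique within a table")
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Header row: the ID column first, then the link and metadata columns.
    writer.writerow([f"entity:{id_column}"] + columns)
    for row in rows:
        writer.writerow([row[id_column]] + [row[c] for c in columns])
    return buffer.getvalue()


tsv_text = to_load_file(sample_rows, "sample", ["cram_path", "data_type"])
print(tsv_text)
```

Note that the table stores only the `gs://` link to each CRAM file, so the files themselves stay where they are.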

Example 2: Phenotypic data table

You can store phenotypic data directly in a workspace table. A shared unique ID (such as the participant_id) links a participant's phenotypic data to genomic data in a different table:

Data-Quickstart_Part1_participant-table.png
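The link between the two tables works like a join on the shared ID. The following sketch uses plain Python dictionaries as stand-ins for the two tables. All IDs, phenotype values, and paths are hypothetical, and `join_on_participant` is an illustrative helper, not a Terra function.

```python
# Hypothetical phenotypic table, keyed by participant_id.
participants = {
    "P001": {"age": 54, "sex": "F"},
    "P002": {"age": 61, "sex": "M"},
}

# Hypothetical genomic table: each sample row points back to a participant.
genomic_rows = [
    {"sample_id": "S-100", "participant_id": "P001", "cram_path": "gs://example-bucket/S-100.cram"},
    {"sample_id": "S-101", "participant_id": "P002", "cram_path": "gs://example-bucket/S-101.cram"},
    {"sample_id": "S-102", "participant_id": "P999", "cram_path": "gs://example-bucket/S-102.cram"},
]


def join_on_participant(sample_rows, participant_table):
    """Attach each participant's phenotypes to that participant's sample rows."""
    joined = []
    for sample in sample_rows:
        phenotypes = participant_table.get(sample["participant_id"])
        if phenotypes is None:
            continue  # Skip samples with no matching participant.
        joined.append({**sample, **phenotypes})
    return joined


joined_rows = join_on_participant(genomic_rows, participants)
for row in joined_rows:
    print(row["sample_id"], row["participant_id"], row["age"], row["sex"])
```

The sample with the unknown ID `P999` is dropped, which shows why the shared ID has to match exactly in both tables.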

 

Why use tables?

Setting up workflows and notebooks to write output metadata to the table allows you to keep intermediate and other output files associated with the input files in the same table, no matter where the files are physically stored.
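As a rough illustration of the idea, the sketch below adds a link to a generated file to the same row as its input. It uses a plain Python dictionary as a stand-in for a workspace table, not Terra's own mechanism for writing outputs. The helper `record_outputs`, the `gvcf_path` column, and all paths are hypothetical.

```python
import copy

# Hypothetical sample table: one row per sample, keyed by its unique ID.
sample_table = {
    "NA12878": {"cram_path": "gs://example-bucket/NA12878.cram"},
}


def record_outputs(table, entity_id, outputs):
    """Return a copy of the table with output columns added to one row."""
    if entity_id not in table:
        raise KeyError(f"no row with ID {entity_id!r}")
    updated = copy.deepcopy(table)
    updated[entity_id].update(outputs)
    return updated


# A workflow (or notebook) produced a new file for this sample. Store its link
# in the same row as the input it was generated from.
updated_table = record_outputs(
    sample_table,
    "NA12878",
    {"gvcf_path": "gs://example-workspace-bucket/outputs/NA12878.g.vcf.gz"},
)
print(updated_table["NA12878"])
```

The input and output files sit in different buckets here, yet one row tracks both.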

Though they require a bit of initial setup time, data tables can be enormously useful, especially as the amount of data grows. Imagine keeping track of hundreds or thousands of participants and their data as easily as one or two!

For more details about populating the workspace data table, see this article.


Data-QuickStart: hands-on practice creating and using workspace tables

The Terra-Data-Quickstart workspace includes hands-on exercises to help you understand how to generate and use workspace tables to organize data in your project analysis.

Browse integrated Data Libraries

Terra's Data Library features large datasets already in the cloud, including both public-access sets like 1000 Genomes and restricted-access sets like UK Biobank. Accessing data in the Library is streamlined because the datasets are integrated with the platform. There's usually a simple "Import" button that adds data in a pre-formatted data table to the workspace you choose.

Tell me more!

The Data Biosphere is one of Terra's key offerings. The Terra Library links to a number of datasets with a variety of clinical and genomic information, wrapped in convenient Data Explorers. Each Data Explorer is managed by the data host.

To learn more about the datasets accessible through the Terra Library, see this article. For instructions on using these data in your own workspaces, see this article.

To learn more about hosting your own Data Explorer, read here.

Manipulate and move data with your virtual cloud environment

For another means of transferring files between online data stores (such as your workspace bucket, your Docker-based virtual machine, and other external storage), look to the virtual cloud environment that powers Terra's built-in interactive analysis tools (i.e., Jupyter Notebooks and RStudio). To learn more, see this article.
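As a minimal sketch of such a transfer, the snippet below builds the `gsutil cp` command that would copy a file from the cloud environment's disk into a bucket. It only constructs the command and does not run it. It assumes the bucket URL is available in an environment variable named `WORKSPACE_BUCKET` (treat that as an assumption about your environment). The fallback bucket name, the `notebook-outputs` folder, and the file path are hypothetical.

```python
import os
import posixpath
import shlex


def build_copy_command(local_path, bucket_url, folder="notebook-outputs"):
    """Return a gsutil command (as a list) that copies local_path into the bucket."""
    if not bucket_url.startswith("gs://"):
        raise ValueError(f"expected a gs:// URL, got {bucket_url!r}")
    file_name = os.path.basename(local_path)
    # Join with forward slashes, since bucket object names are not OS paths.
    destination = posixpath.join(bucket_url.rstrip("/"), folder, file_name)
    return ["gsutil", "cp", local_path, destination]


# Hypothetical fallback bucket name, used when the variable is not set.
bucket = os.environ.get("WORKSPACE_BUCKET", "gs://fc-example-bucket")
command = build_copy_command("results/summary.csv", bucket)
print(shlex.join(command))
```

Keeping the command as a list makes it straightforward to pass to `subprocess.run` later, once you have checked that the destination is the one you intend.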

Next steps and additional resources

We have several tutorials to help you understand Terra's cloud structure and practice interacting with data!

Terra-Workflows-QuickStart

Practice setting up, launching, and monitoring workflows to analyze genomic data in Terra. The tutorial uses two file format conversion workflows, which run quickly and inexpensively on downsampled data we provide.

Terra-Notebooks-QuickStart

Learn how to access and analyze data from the Data Library in an interactive Jupyter notebook. The workspace includes step-by-step instructions from start to finish:

  • Step 1: Browse 1000 Genomes data in the Data Library and define a subset of data (cohort) for analysis
  • Step 2: Save cohort data from the Terra Data Library to the workspace
  • Step 3: Set up a Jupyter notebook virtual application to analyze the data
  • Step 4: Analyze the data in an interactive Jupyter notebook

Terra-Data-QuickStart 

Learn first-hand how to use workspace tables to help organize, access, and analyze data - and sets of data - in the cloud.

  • Part 1: Examine tables and run a workflow on a single specimen
  • Part 2: Make your own data table and add to the workspace
  • Part 3: Understanding sets of data - Analyze sets of single entities
  • Part 4: Sets again! Workflows that take sets (arrays) of entities as input
