Understanding data in the Cloud

Allie Hajian

Managing and organizing large amounts of data stored in multiple locations in the cloud can be challenging. Where is data stored? How do you access the data for analysis? How do you analyze it, and share results with colleagues? How do you organize and track original and generated data and where they are stored? This article helps answer some of these questions if you're working in Terra. Understanding Terra's data-in-the-cloud model can help you work more efficiently. 

Data in the Cloud - A new vision for bioinformatics

Enabling researchers to take advantage of large datasets in the cloud is a fundamental driver of the Terra platform. Traditionally, each researcher copied and stored their own data in local repositories. In a cloud-based model, data are stored in central locations for easier access, reduced storage costs and copying errors, and streamlined centralized data privacy and security administration.

Traditional Bioinformatics
(each researcher has own copy of data)

Data-Traditional-model_diagram.png

  • Data copying, not data sharing
  • High data storage costs
  • Non-reproducible results
  • Individual security implementations

Cloud-based bioinformatics
(bring researchers to the data)

Data_Cloud-based-bioinformatics_diagram.png

  • True, immediate data sharing
  • Minimized storage costs and copying errors
  • Streamlined data privacy and security administration 

Where's the data in Terra?

Data that isn't stored and analyzed locally can seem distant and unintuitive. When we talk about data in a Terra workspace, we're really talking about data that is linked in some way to your workspace, not data that's actually "in" your workspace. In many cases, when you analyze the data, you won't copy it to your workspace bucket at all - the analysis is done in the cloud, and only (some of) the generated data may be deposited in the workspace bucket.

Generally, data you and your colleagues analyze in a Terra workspace will be in one (or more) of three locations.

Data_in_cloud_3_locations.jpeg

1. Your interactive analysis app disk (PD)

The workspace Cloud Environment is a virtual computer (or computers) requested and set up by Terra (using Broad's Cromwell engine and GCP's API). When you spin up a Cloud Environment, you'll set the amount of memory for your virtual machine and the size of a detachable persistent disk (PD). When doing interactive analyses (Jupyter notebooks or RStudio, for example), the generated output is stored on your PD by default. Any data you want to share with colleagues or use as input for a workflow should be moved to more permanent storage (that is, Google Cloud Storage). See How (and when) to save data generated in a notebook to the Workspace bucket to learn more.
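As a rough illustration of moving a notebook output from the PD to the workspace bucket, here is a minimal Python sketch. It assumes the Cloud Environment exposes the bucket address in a WORKSPACE_BUCKET environment variable and that the gsutil command-line tool is installed; the helper name, the "notebook-outputs" folder, and the file path are made up for the example.

```python
import os
import shlex


def bucket_copy_command(local_path, bucket, prefix="notebook-outputs"):
    """Return the gsutil argument list that copies local_path into the bucket."""
    bucket = bucket.rstrip("/")
    if not bucket.startswith("gs://"):
        bucket = "gs://" + bucket
    destination = f"{bucket}/{prefix}/{os.path.basename(local_path)}"
    return ["gsutil", "cp", local_path, destination]


# Assumption: WORKSPACE_BUCKET is set by the Cloud Environment.
# The fallback value is only a placeholder so the sketch runs anywhere.
bucket = os.environ.get("WORKSPACE_BUCKET", "gs://example-workspace-bucket")
command = bucket_copy_command("results/summary.tsv", bucket)

# Print the command for review before running it.
print(shlex.join(command))
```

To perform the copy, you could pass the list to subprocess.run(command, check=True). Building the command first and printing it makes it easy to check the destination path before any data moves.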

2. Your workspace cloud storage (Google bucket)

Data generated by a workflow analysis (WDLs) are stored by default in workspace cloud storage (i.e. Google bucket). You can move local data or data generated in an interactive analysis to your workspace storage. If you need to upload data to your workspace bucket, see Moving data to/from a Google bucket (workspace or external).

3. Other (external) storage

Ideally, the bulk of the data you work with will be in some other data storage in the cloud, which Terra can access for you as long as you have the right permissions and authorization. Examples include data in public- or controlled-access Google Cloud Storage buckets, data repository platforms such as Gen3 Data Commons, or data hosted elsewhere and accessed through the Terra Data Library. For more information, see Linking authorization/accessing controlled data on external servers.

How to manage data with workspace tables

Data tables, which can be found in the Data page of a workspace, provide an integrated way to organize data and metadata, including links to Google buckets. They're like giant, expandable spreadsheets where you can store study participant IDs and phenotypes - really any kind of structured data. You can associate genomic data stored in the workspace bucket or elsewhere with the participant ID by including links to the data right in the table. Tables make it easier to collaborate and make your work more reproducible, and their flexible design lets you keep as much metadata as you need in one place.

Example 1: Genomic data table

Tables can help keep track of genomic data - both original and generated data files - no matter where the data are physically located. A table of genomic data must have at least two columns containing the following information:

  • The unique ID for each distinct entity (the "sample_id" below)
  • A link to the data file (the "cram_path" column below is a link to a CRAM file in a Google bucket)

The table can include as many other columns as you need - for example, for additional metadata (such as the data type - see below - or when and how the data were collected).
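One way to picture such a table is as a tab-separated file with the ID column first. The sketch below builds that text in Python. It assumes the load-file convention of naming the first column "entity:sample_id"; the sample IDs, bucket paths, and the "data_type" column are invented for illustration.

```python
import csv
import io


def build_sample_tsv(samples):
    """Return tab-separated text with one row per sample, ID column first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Assumption: the "entity:<table>_id" header marks the unique ID column.
    writer.writerow(["entity:sample_id", "cram_path", "data_type"])
    for sample in samples:
        writer.writerow(
            [sample["sample_id"], sample["cram_path"], sample["data_type"]]
        )
    return buffer.getvalue()


# Hypothetical samples; the gs:// paths are placeholders, not real files.
samples = [
    {
        "sample_id": "sample_01",
        "cram_path": "gs://example-bucket/crams/sample_01.cram",
        "data_type": "WGS",
    },
    {
        "sample_id": "sample_02",
        "cram_path": "gs://example-bucket/crams/sample_02.cram",
        "data_type": "WGS",
    },
]

print(build_sample_tsv(samples))
```

Note that the table holds only the link to each CRAM file, not the file itself, so the data can stay wherever it is stored.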

Genomic-data-in-a-data-table_Screen_shot.png

Example 2: Phenotypic data table

You can store phenotypic data directly in a workspace table. A shared unique ID (such as the subject_id) links a participant's phenotypic data to genomic data in a different table.
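To show how a shared ID ties the two tables together, here is a small Python sketch that joins phenotypic rows to genomic rows on subject_id. The rows, the "age" column, and the helper name are hypothetical; this is only a picture of the relationship, not how Terra stores or joins tables internally.

```python
def join_on_subject(phenotypes, samples):
    """Attach each sample row to the phenotype row with the same subject_id."""
    by_subject = {row["subject_id"]: row for row in phenotypes}
    joined = []
    for sample in samples:
        phenotype = by_subject.get(sample["subject_id"])
        if phenotype is None:
            continue  # skip samples that have no phenotypic record
        joined.append({**phenotype, **sample})
    return joined


# Hypothetical rows from a phenotypic table and a genomic table.
phenotypes = [
    {"subject_id": "subj_A", "age": 54},
    {"subject_id": "subj_B", "age": 61},
]
samples = [
    {
        "sample_id": "sample_01",
        "subject_id": "subj_A",
        "cram_path": "gs://example-bucket/crams/sample_01.cram",
    },
    {
        "sample_id": "sample_03",
        "subject_id": "subj_C",
        "cram_path": "gs://example-bucket/crams/sample_03.cram",
    },
]

for row in join_on_subject(phenotypes, samples):
    print(row)
```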

Phenotypic-data-in-a-workspace-table_Screen_shot.png 

Why use tables?

Though they require a bit of initial setup time, data tables can be enormously useful, especially as the amount of data grows. Imagine keeping track of hundreds or thousands of participants and their data as easily as one or two! In particular, tables can help you do the following:

  • Organize data. Keep track of data in the cloud no matter what kind of data, how much you have, or where it is in the cloud. You can set up an analysis to write output data to the data table, associating generated data automatically with the right input.
  • Automate and scale your analysis. Set up your bulk workflow or notebook analysis once and run it as many times as you need - on batches of almost any size - without any additional work. Setting up workflows and notebooks to write output metadata to the table allows you to keep intermediate and other output files associated with the input files in the same table, no matter where the files are physically stored.
  • Streamline further analysis. Results are automatically associated with input data in the table, which means you can submit the generated data as input without hunting for file paths in the workspace bucket. 

For more details about populating the workspace data table, see Managing data with workspace tables.

Data-QuickStart: Hands-on practice creating and using workspace tables

The Terra-Data-Tables-Quickstart workspace includes hands-on exercises to help you understand how to generate and use workspace tables to organize, access, and analyze data in the cloud.  

  • Part 1: Examine tables and run a workflow on a single specimen
  • Part 2: Make your own data table and add to the workspace
  • Part 3: Understanding sets of data - Analyze sets of single entities
  • Part 4: Sets again! Workflows that take sets (arrays) of entities as input
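The idea of a set in Parts 3 and 4 can be sketched as a second, two-column table that lists which entities belong to which set. The Python below builds that text. It assumes the load-file convention of a "membership:sample_set_id" header followed by a "sample" column; the set name and sample IDs are invented.

```python
def build_membership_tsv(set_id, sample_ids):
    """Return tab-separated text with one row per (set, member) pair."""
    # Assumption: "membership:<table>_set_id" marks a set-membership file.
    lines = ["membership:sample_set_id\tsample"]
    lines.extend(f"{set_id}\t{sample_id}" for sample_id in sample_ids)
    return "\n".join(lines) + "\n"


# Hypothetical set grouping two samples that already exist in a sample table.
print(build_membership_tsv("pilot_batch", ["sample_01", "sample_02"]))
```

Each member gets its own row, so a workflow that takes the set as input can receive all of its members as an array.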

Browse integrated Data Libraries

Terra's Data Library features large datasets already in the cloud, including both public-access sets like 1000 Genomes and restricted-access sets like UK Biobank. Accessing datasets in the Library is streamlined because they're integrated with the platform: there's usually an option to add data in a pre-formatted data table to the workspace you choose.

Tell me more!

The Data Biosphere is one of Terra's key offerings. The Terra Library links to a number of datasets with a variety of clinical and genomic information, wrapped in convenient Data Explorers. Each Data Explorer is managed by the data host.

To learn more about the datasets accessible through the Terra Library, see Working with workspaces: Building workspaces using the Terra Library. For instructions on using these data in your own workspaces, see Accessing and analyzing custom cohorts with the data explorer.

To learn more about hosting your own Data Explorer, see the Data Biosphere GitHub.

Manipulate and move data with your virtual cloud environment

For another means of transferring files between online data stores - such as your workspace bucket, your Docker-based virtual machine, and other external storage - look to the virtual Cloud Environment that powers Terra's built-in interactive analysis tools (Jupyter Notebooks and RStudio, for example). To learn more, see Managing data and automating workflows with the FISS API.
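As a hedged sketch of what reading a workspace table from a notebook might look like, the Python below separates a pure helper (which runs anywhere) from the network call. It assumes the FISS Python package (firecloud) is installed, that its get_entities call takes the workspace namespace, workspace name, and table name, and that each returned record has "name" and "attributes" fields; check the FISS documentation before relying on any of these. The example record is made up.

```python
def entities_to_rows(entities):
    """Flatten entity records into one plain dict per table row."""
    rows = []
    for entity in entities:
        row = {"id": entity["name"]}
        row.update(entity.get("attributes", {}))
        rows.append(row)
    return rows


def fetch_table(namespace, workspace, table):
    """Fetch one workspace table (needs the firecloud package and credentials)."""
    # Imported here so the helper above works even where FISS is not installed.
    from firecloud import api as fapi

    response = fapi.get_entities(namespace, workspace, table)
    response.raise_for_status()
    return entities_to_rows(response.json())


# Assumed record shape, shown with a made-up sample.
example = [
    {
        "name": "sample_01",
        "entityType": "sample",
        "attributes": {"cram_path": "gs://example-bucket/crams/sample_01.cram"},
    }
]
print(entities_to_rows(example))
```

In a Cloud Environment you would call fetch_table with your own workspace namespace, workspace name, and a table name such as "sample".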

Next steps and additional resources

Check out these tutorials to help you understand Terra's cloud structure and practice interacting with data.

Terra-Workflows-QuickStart

Practice setting up, launching, and monitoring workflows to analyze genomic data in Terra. The tutorial uses two file format conversion workflows, which run quickly and inexpensively on downsampled data we provide.

Workflows-Quickstart_Diagram-of-flow.png

  • Part 1: Run a pre-configured workflow on a single sample (included in workspace data table)
  • Part 2: Configure and run a workflow on a single sample (that you add to the workspace data table)
  • Part 3: Run a downstream (workflow) analysis on a set of samples from the previous parts. 

Terra-Notebooks-QuickStart

Learn how to access and analyze data from the Data Library in an interactive Jupyter notebook. The workspace includes step-by-step instructions from start to finish:

  • Step 1: Browse 1000 Genomes data in the Data Library and define a subset of data (cohort) for analysis
  • Step 2: Save cohort data from the Terra Data Library to the workspace
  • Step 3: Set up a Jupyter notebook virtual application to analyze the data
  • Step 4: Analyze the data in an interactive Jupyter notebook
