Data in the cloud

Allie Cliffe

Where is data "in the cloud" actually stored and analyzed? How do you organize and access data stored in external repositories, and track data you generate in an analysis on Terra? How do you share with colleagues? This article helps you understand Terra's data-in-the-cloud model so you can work more efficiently.

What does “data in a Terra workspace” actually mean?

“Data in Terra” is actually stored in public cloud infrastructure that is integrated with Terra so that you can analyze and organize it without leaving the platform. Terra takes care of bringing input data (large data files, TSVs, or CSVs) from wherever it is stored to the virtual machine (VM) running your analysis. Tabular data is stored in Terra infrastructure and displayed as workspace data tables. Built-in data security features let Terra access controlled data you are authorized to use.

Data in the Cloud - A new vision for bioinformatics

The Terra platform is designed to let you take advantage of large datasets in the cloud. Traditionally, each person copied and stored their own data locally. In a cloud-based model, data files are stored in central locations for easier access, reduced storage costs and copying errors, and streamlined centralized data privacy and security administration.

Traditional Bioinformatics

(everyone has their own copy of data)

diagram of traditional model with each researcher making their own copy of a large dataset

  • Data copying, not data sharing
  • High data storage costs
  • Non-reproducible results
  • Individual security implementations

Cloud-based bioinformatics

(each person accesses data in a central place)

diagram of a cloud-based bioinformatics model where datasets are stored in a central repository and individuals access the primary data but don't copy it

  • True, immediate data sharing
  • Minimized storage costs and copying errors
  • Streamlined data privacy and security administration 

The Terra platform lets you take advantage of large datasets in the cloud without making or paying to store new copies.

How does Terra access data in the cloud?

Just because the primary data is stored outside of Terra doesn't mean you need to copy it to workspace storage to analyze it. When you analyze these large data files, you give the cloud location as input data and Terra takes care of localizing the data files to the virtual machine (VM) that runs the analysis. To access controlled data, you must link your authorization to Terra.  

All analysis is done on virtual machines in the cloud, and only the generated data (or some of it) is deposited in workspace storage.
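As a rough illustration of what "giving the cloud location as input" means, the sketch below treats an input as a plain string that points at a file in a bucket, rather than as a copy of the file. The bucket name, file path, input name, and the helper `split_gs_uri` are all hypothetical and are not part of Terra; the only assumption about the platform is that Google Cloud Storage locations are written as `gs://bucket/object` URIs.

```python
from urllib.parse import urlparse


def split_gs_uri(uri):
    """Split a gs:// URI into (bucket name, object name).

    Hypothetical helper for illustration only; it is not a Terra API.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URI: {uri!r}")
    # netloc holds the bucket; the path (minus its leading slash) is the object
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical workflow inputs: the value is a cloud location, not file contents.
workflow_inputs = {
    "input_cram": "gs://example-bucket/samples/sample1.cram",
}

if __name__ == "__main__":
    for name, location in workflow_inputs.items():
        bucket, object_name = split_gs_uri(location)
        print(f"{name}: bucket={bucket} object={object_name}")
```

The point of the sketch is that the same location string can be shared by any number of analyses without anyone making a new copy of the underlying file.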

Costs of storing and accessing data in Terra

To learn more about the cloud costs for storing and accessing data in the cloud, and how you are charged, see Overview: Costs and billing in Terra.

Where does data analyzed in Terra live?

When you work on a local machine or data cluster, you know exactly where your data lives: on physical hard drives attached to the computer or cluster where you do your analysis. Data in the cloud, on the other hand, can seem distant and unintuitive. Where is data “in the cloud” that you analyze in Terra actually stored? How do you pay for it?

Where your data is stored depends on what kind of data it is

Data “in a Terra workspace” falls into two types: tabular data and unstructured data. Each type has its own cloud storage mechanism in Terra.

Unstructured data (e.g., large genomic data files, images, TSVs)

Generally, large data files you and your colleagues analyze in Terra, and other unstructured data files you want to keep, will be in one or more of three locations.

diagram of data in the cloud in Terra. The cloud is public GCP infrastructure - noted by a cloud shape and Google logo. An external Google bucket is labeled with a one. Within the Google cloud is a Terra workspace, which includes workspace storage (dedicated Google bucket) labeled two. A persistent disk, labeled three, is also located inside the workspace perimeter.

1. External cloud storage

Ideally, the bulk of data you work with will be in some other data storage in the cloud, which Terra can access for you as long as you have the right permissions and authorization. Examples include data in public- or controlled-access Google Cloud Storage buckets, data repository platforms such as Gen3 Data Commons, or data hosted elsewhere and accessed through the Terra Data Library.

See Linking authorization/accessing controlled data on external servers.

2. Workspace cloud storage (Google bucket)

Each Terra workspace comes with a dedicated Google bucket, a storage container optimized for unstructured data (data that doesn't adhere to a particular data model or definition, such as text or binary data) in Google Cloud.

You can upload primary data stored locally to your workspace storage for analysis in Terra. If you need to upload data to workspace storage, see Overview: Bring your own data to Terra (Azure).

Data generated by a workflow analysis (WDLs) are stored by default in workspace cloud storage (i.e., Google bucket). You can move local data or data generated in an interactive analysis to your workspace storage. If you need to upload data to your workspace bucket, see Moving data to/from a Google bucket (workspace or external).

All newly created GCP workspace buckets will have Autoclass turned on by default. Autoclass automatically moves data to colder storage classes to reduce storage costs using a predefined lifecycle policy. There are no early deletion charges, no retrieval charges, and no charges for storage class transitions. For more information, see Google's documentation on Autoclass.

3. Your interactive analysis app disk (PD)

The workspace Cloud Environment is a virtual computer or computers requested and set up by Terra. When you spin up a cloud environment runtime, you'll set the size and type of your detachable persistent disk (PD). When doing interactive analyses (e.g., Jupyter Notebooks or RStudio), the generated output is stored in your PD by default. Any data you want to share with colleagues or use as input for a workflow should be moved to workspace storage (i.e., Google bucket). See How (and when) to save data generated in a notebook to Workspace storage to learn more.
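Moving a notebook output from the PD to workspace storage could look something like the minimal sketch below. It only builds the destination path and a `gsutil cp` command; it does not run anything. The `WORKSPACE_BUCKET` environment variable, the `notebook-outputs` folder, the file paths, and both helper functions are assumptions made for illustration and are not described in this article, so check the linked article for the recommended steps.

```python
import os


def workspace_destination(local_path, bucket_uri, prefix="notebook-outputs"):
    """Build a gs:// destination for a file that lives on the persistent disk.

    Hypothetical helper; the prefix is an arbitrary folder name.
    """
    file_name = os.path.basename(local_path)
    return f"{bucket_uri.rstrip('/')}/{prefix}/{file_name}"


def copy_command(local_path, bucket_uri, prefix="notebook-outputs"):
    """Return the gsutil command (as an argument list) that would copy the file."""
    destination = workspace_destination(local_path, bucket_uri, prefix)
    return ["gsutil", "cp", local_path, destination]


if __name__ == "__main__":
    # Assumption: the notebook environment exposes the workspace bucket in an
    # environment variable named WORKSPACE_BUCKET; a placeholder is used otherwise.
    bucket = os.environ.get("WORKSPACE_BUCKET", "gs://example-workspace-bucket")
    command = copy_command("/home/jupyter/results/summary.csv", bucket)
    # Printing rather than executing keeps the sketch free of side effects.
    print(" ".join(command))
```

To actually perform the copy, the argument list could be passed to `subprocess.run(command, check=True)` from a notebook cell.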

Keeping track of data files and metadata in the cloud

Access to vast amounts of data files stored in different cloud locations is great, if you can keep it organized. A Terra workspace includes built-in spreadsheet-like "tables" to help keep track of unstructured data files and associated metadata. Sample data and associated metadata for participants in a study, such as sample collection dates, sequencing and processing details, and cloud locations can be stored in a sample table. You can link the sample data to participant data in a separate table.

The payoff of investing time to set up data tables

Tables do take time to set up. But once set up, they will help you:

  • Organize large amounts of data from different cloud locations
  • Track and associate data generated in a workflow with the original sample
  • Scale and automate a workflow analysis

This built-in organization is especially useful as studies and analyses become more complex. You won't have to worry about keeping track of data (original data files and analysis outputs) manually.

Tabular data (i.e., clinical, demographic, or phenotypic data)

You'll store and organize tabular data in integrated, spreadsheet-like data tables.

diagram of Google cloud with a Terra workspace inside. The data table - highlighted with a circle - is inside the Terra workspace perimeter, along with the workspace storage (Google bucket) and persistent disk storage

Data stored in a table in Terra

  • Primary data in tabular format including clinical data, demographics, or phenotypic data
  • Input data file locations (e.g., URLs for files in your workspace cloud storage or in external storage locations)
  • Input data file metadata (e.g., dates of sample collection, or details about sample preparation) 
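To make the three kinds of table content above more concrete, here is a small sketch that writes a tab-separated sample table with one row per sample. Everything in it is illustrative: the sample IDs, participant IDs, dates, and bucket paths are made up, the column names are hypothetical, and the `entity:sample_id` header for the first column is an assumption about the load-file format that this article does not describe (see Managing data with tables for the actual requirements).

```python
import csv
import io

# Hypothetical column names; only the idea of "one ID column plus metadata
# and file locations" comes from the article.
FIELDNAMES = ["entity:sample_id", "participant_id", "collection_date", "cram_path"]


def build_sample_table_tsv(samples):
    """Serialize a list of sample dictionaries as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDNAMES, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for sample in samples:
        writer.writerow(sample)
    return buffer.getvalue()


# Made-up rows: metadata plus the cloud location of each data file.
example_samples = [
    {
        "entity:sample_id": "sample1",
        "participant_id": "participant1",
        "collection_date": "2020-01-15",
        "cram_path": "gs://example-bucket/samples/sample1.cram",
    },
    {
        "entity:sample_id": "sample2",
        "participant_id": "participant2",
        "collection_date": "2020-02-03",
        "cram_path": "gs://example-bucket/samples/sample2.cram",
    },
]

if __name__ == "__main__":
    print(build_sample_table_tsv(example_samples), end="")
```

Note that the table holds only the location of each large file, so the files themselves can stay in external or workspace storage.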

Where is tabular data stored?

Data tables are hosted in a relational database that is owned and managed by Terra. 

To learn more, see Managing data with tables.

Next step: Try the Data Tables Quickstart tutorial

The Terra (GCP) Quickstart 1: Data tables tutorial includes everything you need to get hands-on using workspace data tables to organize, access, and analyze data - including sets of data - in the cloud. Copy the Terra on GCP Quickstart workspace to your own billing account and work through the three exercises following the step-by-step guide.

Terra on GCP Quickstart tutorial workspace | Step-by-step guide
