Understanding data in the Cloud

Where is data "in the cloud" actually stored and analyzed? How do you organize and access data stored in external repositories, and track data you generate in an analysis on Terra? How do you share with colleagues? This article helps you understand Terra's data-in-the-cloud model so you can work more efficiently.

Big data in the Cloud: a new vision for bioinformatics

The Terra platform is designed to let you take advantage of large datasets in the cloud without having to make and store your own local copy, as you would with traditional bioinformatics (below left).

Why data in the cloud?

In a cloud-based model (diagram below - right), data files are stored in central locations for easier access, reduced storage costs, fewer copying errors, and streamlined centralized data privacy and security administration.

Traditional Bioinformatics

(everyone has their own copy of data)

diagram of traditional model with each researcher making their own copy of a large dataset

- Data copying, not data sharing
- High data storage costs
- Non-reproducible results
- Individual security implementations

Cloud-based bioinformatics

(everyone accesses data from a central place)

diagram of a cloud-based bioinformatics model where datasets are stored in a central repository and individuals access the primary data but don't copy it

- True, immediate data sharing
- Minimized storage costs & copying errors
- Streamlined data privacy and security administration

Work in Terra to take advantage of large datasets in the cloud without paying to make or store new copies.

How does Terra access data in the cloud?

Just because large primary data files are stored outside of Terra doesn't mean you need to copy them to workspace storage for analysis. You just store the cloud location (Uniform Resource Identifier - URI), which you specify as input data for your analysis. Terra takes care of localizing the data files to the virtual machine (VM) that runs the analysis.
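As a rough illustration of what a cloud location looks like, the sketch below splits a Google Cloud Storage URI into its bucket name and object path using only the Python standard library. The bucket and file names are hypothetical, and `split_gs_uri` is an illustrative helper, not part of Terra.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_gs_uri(uri: str) -> Tuple[str, str]:
    """Split a gs:// URI into (bucket name, object path)."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a Google Cloud Storage URI: {uri!r}")
    # The netloc is the bucket; the path (minus its leading slash) is the object.
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical bucket and file names, for illustration only.
example_uri = "gs://example-public-bucket/samples/NA12878.cram"
bucket, object_path = split_gs_uri(example_uri)
print(bucket)
print(object_path)
```

Only this short string needs to live in your workspace; the file it points to stays where it is.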

All analysis runs on VMs in the cloud; only the generated data (or a subset of it) is deposited in workspace storage.

Costs of storing and accessing data in Terra

To learn more about the costs of storing and accessing data in the cloud, and how you are charged, see Overview: Terra costs and billing.

Where does data analyzed in Terra live?

When you work on a local machine or data cluster, you know exactly where your data lives: on physical hard drives attached to the computer or cluster where you do your analysis. Data in the cloud, on the other hand, can seem distant and non-intuitive.

Where is data “in the cloud” that you analyze in Terra actually stored? How do you pay for it?

Data is in the cloud - but with Terra-specific integrations and security

“Data in Terra” is actually stored in public cloud infrastructure. Large data files can be stored in workspace storage, external cloud storage, or data repositories. Tabular data is stored in Terra infrastructure and displayed as workspace data tables. It's integrated in a way that lets you organize and analyze it without leaving Terra. Terra takes care of bringing data from wherever it's stored to the VM running your analysis. Built-in data security features let Terra access controlled data you are authorized to use.

Where your data is stored depends on what kind of data it is

Data “in a Terra workspace” falls into two types: unstructured data and tabular data. Each type has its own cloud storage mechanism in Terra.

Unstructured data (e.g., large genomic data files, images, TSVs)

Generally, large data files you and your colleagues analyze in Terra - and other unstructured data files you want to keep - will be in one of three locations.

diagram of data in the cloud in Terra. The cloud is public GCP infrastructure - noted by a cloud shape and Google logo. An external Google bucket is labeled with a one. Within the Google cloud is a Terra workspace, which includes workspace storage (dedicated Google bucket) labeled two. A persistent disk, labeled three, is also located inside the workspace perimeter.

1. External cloud storage

Ideally, the bulk of large primary data you work with will be in cloud storage external to Terra: data in public- or controlled-access Google Cloud Storage buckets, data repository platforms such as Gen3 Data Commons, or data hosted elsewhere and accessed through the Terra Data Library. As long as you have the right permission and authorization, Terra can access it for you when you run an analysis. To access controlled data, you must link your authorization to Terra.

You don't pay to store this data (though you do pay to store any data generated by your analyses that you choose to keep).

2. Workspace cloud storage (Google bucket)

Each Terra workspace comes with a dedicated storage container (Google bucket), optimized for storing unstructured objects (data that doesn't adhere to a particular data model or definition, such as text or binary data) in Google Cloud.

You can upload primary data stored locally to your workspace storage for analysis in Terra. If you need to upload data to workspace storage, see How to move data to/from a Google bucket.

Data generated by a workflow analysis (WDL) is stored by default in workspace cloud storage (Google bucket). You can also move data generated in an interactive analysis to your workspace storage.

You pay the Google storage cost for data in your workspace storage bucket (learn more about Google Cloud storage costs here).

All newly created GCP workspace buckets will have Autoclass turned on by default. Autoclass automatically moves data to colder storage classes to reduce storage costs using a predefined lifecycle policy. There are no early deletion charges, no retrieval charges, and no charges for storage class transitions. For more information, see Google's documentation on Autoclass.

3. Your persistent disk (PD)

The workspace Cloud Environment is a virtual computer or computers requested and set up by Terra. When you spin up a cloud environment VM, you'll set the size and type of your detachable persistent disk (PD). When running Galaxy, Jupyter Notebooks, or RStudio, the generated output is stored in your PD by default.

You pay the GCP cost (per month) of the PD you select. You can see how much you are paying for persistent disk storage in your Cloud Environments page (Profile > Cloud Environments).
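As a back-of-the-envelope sketch, the monthly cost of a persistent disk scales with its provisioned size. The per-GB price below is a placeholder assumption, not a quoted rate; check Google Cloud's current pricing for your disk type and region.

```python
def monthly_disk_cost(size_gb: float, price_per_gb_month: float) -> float:
    """Estimate the monthly cost of a persistent disk from its provisioned size."""
    return size_gb * price_per_gb_month


# ASSUMPTION: placeholder price in USD per GB per month, for illustration only.
HYPOTHETICAL_PRICE_PER_GB_MONTH = 0.04

estimate = monthly_disk_cost(50, HYPOTHETICAL_PRICE_PER_GB_MONTH)
print(f"Estimated cost for a 50 GB disk: ${estimate:.2f} per month")
```

Note that the estimate depends on the disk size you selected, not on how much of the disk you have filled.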

Any data you want to share with colleagues or use as input for a workflow should be moved to workspace storage (i.e., Google bucket). See How (and why) to save data generated in a notebook to workspace storage to learn more.
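A minimal sketch of that move, assuming it runs in a notebook where the `gsutil` command-line tool is available and where an environment variable named `WORKSPACE_BUCKET` holds the workspace bucket's `gs://` address (an assumption about the environment; the fallback bucket name, file path, and destination folder are hypothetical). The sketch only builds and prints the copy command so you can review it before running it.

```python
import os
import posixpath
import shlex
from typing import List


def build_copy_command(local_path: str, bucket_uri: str,
                       dest_folder: str = "notebook-outputs") -> List[str]:
    """Build a `gsutil cp` command that copies a local file into a bucket folder."""
    file_name = posixpath.basename(local_path)
    destination = f"{bucket_uri.rstrip('/')}/{dest_folder}/{file_name}"
    return ["gsutil", "cp", local_path, destination]


# ASSUMPTION: WORKSPACE_BUCKET is set by the environment; the fallback is a
# hypothetical bucket name used only so the sketch runs anywhere.
workspace_bucket = os.environ.get("WORKSPACE_BUCKET", "gs://example-workspace-bucket")
command = build_copy_command("results/variants.tsv", workspace_bucket)
print(shlex.join(command))
```

To actually perform the copy, the command list could be passed to `subprocess.run(command, check=True)`.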

Using tables to keep track of data files and metadata in the cloud

Access to vast amounts of data files stored in different cloud locations is great if you can keep it organized. A Terra workspace includes built-in spreadsheet-like "tables" to help keep track of unstructured data files and associated metadata, as well as store primary tabular data (e.g., clinical, demographic, or phenotypic data). Sample data and associated metadata for participants in a study - such as sample collection dates, sequencing and processing details, and cloud locations - can be stored in a sample table. You can link the sample data to the participant data in a separate table.

The payoff of investing time to set up data tables
Tables that keep track of large data files in cloud storage and their metadata take time to set up. But the tables can store not only each file's cloud location (URI) but also an unlimited amount of useful metadata. Once set up, they will help you:
- Organize large amounts of data from different cloud locations
- Track and associate data generated in a workflow with the original sample
- Scale and automate a workflow analysis

This built-in organization is especially useful as studies and analyses become larger and more complex. You won't have to worry about keeping track of data (original data files and analysis outputs) manually.
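To make the idea concrete, here is a minimal sketch that assembles a tab-separated sample table in which each row pairs a file's cloud location with its metadata. The column names, sample IDs, and bucket paths are hypothetical, and the `entity:sample_id` header is an assumption about the upload format; confirm the expected header in Managing data with tables before loading a file.

```python
import csv
import io
from typing import Dict, List

# ASSUMPTION: illustrative column names; the first column identifies the row.
FIELDNAMES = ["entity:sample_id", "participant", "collection_date", "cram_uri"]


def build_sample_table_tsv(rows: List[Dict[str, str]]) -> str:
    """Serialize sample rows (metadata plus file URI) as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical samples: the table stores the URI, not the file itself.
sample_rows = [
    {"entity:sample_id": "S001", "participant": "P001",
     "collection_date": "2021-03-14",
     "cram_uri": "gs://example-bucket/crams/S001.cram"},
    {"entity:sample_id": "S002", "participant": "P002",
     "collection_date": "2021-04-02",
     "cram_uri": "gs://example-bucket/crams/S002.cram"},
]

print(build_sample_table_tsv(sample_rows))
```

Because each row carries its own file location, a workflow can be pointed at the whole table rather than at individual files.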

Tabular data (e.g., clinical, demographic, or phenotypic data)

You'll store and organize tabular data in integrated, spreadsheet-like data tables.

diagram of Google cloud with a Terra workspace inside. The data table - highlighted with a circle - is inside the Terra workspace perimeter, along with the workspace storage (Google bucket) and persistent disk storage

Data stored in a table in Terra

- Primary data in tabular format, including clinical, demographic, or phenotypic data
- Input data file locations (e.g., URIs for files in your workspace cloud storage or in external storage locations)
- Input data file metadata (e.g., dates of sample collection, or details about sample preparation)

Data tables are hosted in a relational database that is owned and managed by Terra.
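The sketch below illustrates the relational idea of linking a sample table to a participant table through a shared identifier. Plain Python dictionaries stand in for the tables; this is not how Terra implements them, and all IDs, fields, and bucket paths are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical participant table, keyed by participant ID.
participants = {
    "P001": {"age": 54, "sex": "F"},
    "P002": {"age": 61, "sex": "M"},
}

# Hypothetical sample table; each row refers to a participant by ID.
samples = [
    {"sample_id": "S001", "participant": "P001", "cram_uri": "gs://example-bucket/S001.cram"},
    {"sample_id": "S002", "participant": "P001", "cram_uri": "gs://example-bucket/S002.cram"},
    {"sample_id": "S003", "participant": "P002", "cram_uri": "gs://example-bucket/S003.cram"},
]


def samples_by_participant(sample_rows: List[dict]) -> Dict[str, List[str]]:
    """Group sample IDs under the participant each sample refers to."""
    grouped = defaultdict(list)
    for row in sample_rows:
        grouped[row["participant"]].append(row["sample_id"])
    return dict(grouped)


def join_sample_with_participant(sample_row: dict, participant_table: dict) -> dict:
    """Return a copy of a sample row enriched with its participant's fields."""
    merged = dict(sample_row)
    merged.update(participant_table[sample_row["participant"]])
    return merged


print(samples_by_participant(samples))
print(join_sample_with_participant(samples[0], participants))
```

Keeping participant details in one table and referring to them by ID avoids repeating the same demographic values on every sample row.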

Data tables video

To learn more, see Managing data with tables.

Next step: Try the T101 Data Tables Quickstart

The T101 Data Tables Quickstart is a self-guided tutorial to help you learn more about data tables in Terra. You'll get hands-on practice exploring and manipulating data tables in a workspace to understand how tables can help when working with data in the cloud.

You'll need to copy the T101 Data Tables Quickstart workspace to your own billing account and work through the three exercises following the step-by-step guide.