Where is data "in the cloud" actually stored and analyzed? How do you organize and access data stored in external repositories, and track data you generate in an analysis on Terra? How do you share with colleagues? This article helps you understand Terra's data-in-the-cloud model so you can work more efficiently.
Big data in the Cloud - A new vision for bioinformatics
The Terra platform is designed to let you take advantage of large datasets in the cloud, without having to make and store your own copy locally (traditional bioinformatics - below left).
Why data in the cloud?
In a cloud-based model (diagram below - right), data files are stored in central locations. This makes access easier, reduces storage costs and copying errors, and streamlines centralized administration of data privacy and security.
Diagram (left): traditional bioinformatics, where everyone has their own copy of data.
Diagram (right): cloud-based model, where everyone accesses data from a central place.
Work in Terra to take advantage of large datasets in the cloud without making or paying to store new copies.
How does Terra access data in the cloud?
Just because large primary data files are stored outside of Terra doesn't mean you need to copy them to workspace storage for analysis. You just store the cloud location (Uniform Resource Identifier - URI) as input data and Terra takes care of localizing the data files to the virtual machine (VM) that runs the analysis.
All the analysis is done on virtual machines in the cloud and only (some of) the generated data may be deposited in workspace storage.
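As a minimal sketch of what a cloud location looks like, the snippet below splits a Google Cloud Storage URI into its bucket and object path. The bucket and file names are hypothetical, and the helper `split_gcs_uri` is an illustration only, not part of Terra; Terra handles localization for you when the URI is supplied as workflow input.

```python
from urllib.parse import urlparse


def split_gcs_uri(uri: str) -> tuple[str, str]:
    """Split a gs:// URI into (bucket name, object path)."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a Google Cloud Storage URI: {uri!r}")
    # The bucket is the "host" part; the object path follows the first slash.
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI, used only for illustration.
example_uri = "gs://example-bucket/samples/sample_001.cram"
bucket, object_path = split_gcs_uri(example_uri)
print(bucket)
print(object_path)
```

The point of the sketch is that the URI is just a short string: storing it as input data costs nothing compared with copying the file it points to.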
Costs of storing and accessing data in Terra
To learn more about the cloud costs for storing and accessing data in the cloud, and how you are charged, see Overview: Costs and billing in Terra.
Where does data analyzed in Terra live?
When you work on a local machine or data cluster, you know exactly where your data lives: on physical hard drives attached to the computer or cluster where you do your analysis. Data in the cloud, on the other hand, can seem distant and non-intuitive.
Where is data “in the cloud” that you analyze in Terra actually stored? How do you pay for it?
Data is in the cloud - but with Terra-specific integrations and security
“Data in Terra” is actually stored in public cloud infrastructure. Large data files can be stored in workspace storage, external cloud storage, or data repositories. Tabular data is stored in Terra infrastructure and displayed as workspace data tables. It's integrated in a way that lets you organize and analyze it without leaving Terra. Terra takes care of bringing data from wherever it's stored to the VM running your analysis. Built-in data security features let Terra access controlled data you are authorized to use.
Where your data is stored depends on what kind of data it is
Data “in a Terra workspace” falls into two different data types - unstructured data and tabular data. Each type has its own cloud storage mechanism in Terra.
Unstructured data (e.g., large genomic data files, images, TSVs)
Generally, large data files you and your colleagues analyze in Terra - and other unstructured data files you want to keep - will be in one of three locations.
1. External cloud storage
Ideally, the bulk of large primary data you work with will be in cloud storage external to Terra: data in public- or controlled-access Google Cloud Storage buckets, data repository platforms such as Gen3 Data Commons, or data hosted elsewhere and accessed through the Terra Data Library. As long as you have the right permission and authorization, Terra can access it for you when you run an analysis. To access controlled data, you must link your authorization to Terra. See Linking authorization/accessing controlled data on external servers.
You don't pay to store this data (though you pay for generated data that you keep).
2. Workspace cloud storage (Google bucket)
Each Terra workspace comes with a dedicated storage container (Google bucket) in Google Cloud, optimized for storing unstructured data as objects (data that doesn't adhere to a particular data model or definition, such as text or binary data).
You can upload primary data stored locally to your workspace storage for analysis in Terra. If you need to upload data to workspace storage, see Overview: Bring your own data to Terra (Azure).
Data generated by a workflow analysis (WDLs) are stored by default in workspace cloud storage (Google bucket). You can move local data or data generated in an interactive analysis to your workspace storage. If you need to upload data to your workspace bucket, see Moving data to/from a Google bucket (workspace or external).
You pay the Google storage cost for data in your workspace storage bucket (learn more about Google Cloud storage costs here).
3. Your Interactive analysis app disk (PD)
The workspace Cloud Environment is a virtual computer or computers requested and set up by Terra. When you spin up a cloud environment VM, you'll set the size and type of your detachable persistent disk (PD). When running Galaxy, Jupyter Notebooks, or RStudio, the generated output is stored in your PD by default.
You pay the GCP cost (per month) of the PD you select. You can see how much you are paying for persistent disk storage on your Cloud Environments page (Profile > Cloud Environment).
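The arithmetic behind a monthly disk charge can be sketched as disk size multiplied by a per-GB monthly rate. The rate below is a placeholder, not a real Google Cloud price; check current pricing and your Cloud Environments page for actual figures.

```python
def monthly_disk_cost(size_gb: float, price_per_gb_month: float) -> float:
    """Estimate the monthly cost of a disk from its size and a per-GB rate."""
    if size_gb < 0 or price_per_gb_month < 0:
        raise ValueError("size and price must be non-negative")
    return size_gb * price_per_gb_month


# Placeholder rate for illustration only (NOT an actual GCP price).
HYPOTHETICAL_PRICE_PER_GB_MONTH = 0.04

estimate = monthly_disk_cost(50, HYPOTHETICAL_PRICE_PER_GB_MONTH)
print(f"Estimated monthly cost for a 50 GB disk: ${estimate:.2f}")
```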
Any data you want to share with colleagues or use as input for a workflow should be moved to workspace storage (i.e., the Google bucket). See How (and when) to save data generated in a notebook to Workspace storage to learn more.
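A hedged sketch of that move from a notebook: the code below only builds a `gsutil cp` command that would copy a file from the persistent disk to the workspace bucket. It assumes the bucket URI is available in a `WORKSPACE_BUCKET` environment variable (falling back to a hypothetical bucket name otherwise); the file path, the `notebook-outputs` folder, and the helper `build_copy_command` are illustrative names, not Terra conventions.

```python
import os
import shlex


def build_copy_command(local_path: str, bucket_uri: str,
                       dest_folder: str = "notebook-outputs") -> list[str]:
    """Build a gsutil command that copies one local file into a bucket folder."""
    filename = os.path.basename(local_path)
    destination = f"{bucket_uri.rstrip('/')}/{dest_folder.strip('/')}/{filename}"
    return ["gsutil", "cp", local_path, destination]


# Assumption: WORKSPACE_BUCKET holds the workspace bucket URI (gs://...).
# The fallback bucket name is hypothetical.
bucket_uri = os.environ.get("WORKSPACE_BUCKET", "gs://example-workspace-bucket")
command = build_copy_command("results/summary.tsv", bucket_uri)

# Print the command for review instead of running it.
print(shlex.join(command))
```

Printing the command first lets you confirm the destination before anything is copied; once it looks right, it could be passed to `subprocess.run(command, check=True)`.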
Using tables to keep track of data files and metadata in the cloud
Access to vast amounts of data files stored in different cloud locations is great if you can keep it organized. A Terra workspace includes built-in spreadsheet-like "tables" to help keep track of unstructured data files and associated metadata, as well as store primary tabular data (e.g., clinical, demographic, or phenotypic data). Sample data and associated metadata for participants in a study, such as sample collection dates, sequencing and processing details, and cloud locations, can be stored in a sample table. You can link the sample data to the participant data in a separate table.
The payoff of investing time to set up data tables
Tables that keep track of large data files in cloud storage and their metadata take time to set up. But the tables can store not only the file cloud location (URI) but also as much useful metadata as you need. Once set up, they will help you:
- Organize large amounts of data from different cloud locations
- Track and associate data generated in a workflow with the original sample
- Scale and automate a workflow analysis
This built-in organization is especially useful as studies and analyses become larger and more complex. You won't have to worry about keeping track of data (original data files and analysis outputs) manually.
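To make the idea concrete, here is a minimal sketch of a sample table written as tab-separated text, with one row per sample, a column for the file's cloud location, and columns for metadata. All IDs, dates, and URIs are made up. The `entity:sample_id` header follows what we understand to be Terra's load-file convention for the ID column; treat that, and the `participant` column name, as assumptions to check against Managing data with tables.

```python
import csv
import io

# Hypothetical sample records: metadata plus the cloud location of each file.
samples = [
    {"sample_id": "sample_001", "participant": "participant_A",
     "collection_date": "2021-03-14", "cram_path": "gs://example-bucket/sample_001.cram"},
    {"sample_id": "sample_002", "participant": "participant_B",
     "collection_date": "2021-03-20", "cram_path": "gs://example-bucket/sample_002.cram"},
]


def to_table_tsv(rows: list[dict], id_column: str) -> str:
    """Render rows as TSV text with the ID column first, prefixed by 'entity:'."""
    other_columns = [c for c in rows[0] if c != id_column]
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow([f"entity:{id_column}"] + other_columns)
    for row in rows:
        writer.writerow([row[id_column]] + [row[c] for c in other_columns])
    return buffer.getvalue()


print(to_table_tsv(samples, "sample_id"))
```

Because each row carries the sample ID alongside its file location, outputs generated later by a workflow can be recorded as new columns in the same row, which keeps them associated with the original sample.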
Tabular data (e.g., clinical, demographic, or phenotypic data)
You'll store and organize tabular data in integrated, spreadsheet-like data tables.
Data stored in a table in Terra
- Primary data in tabular format including clinical data, demographics, or phenotypic data
- Input data file locations (e.g., URLs for files in your workspace cloud storage or in external storage locations)
- Input data file metadata (e.g., dates of sample collection, or details about sample preparation)
Data tables are hosted in a relational database that is owned and managed by Terra.
To learn more, see Managing data with tables.
Next step: Try the T101 Data Tables Quickstart
The T101 Data Tables Quickstart is a self-guided tutorial to help you learn more about data tables in Terra. You'll get hands-on practice exploring and manipulating data tables in a workspace to understand how tables can help when working with data in the cloud.
You'll need to copy the T101 Data Tables Quickstart workspace to your own billing account and work through the three exercises following the step-by-step guide.