Data in the Cloud
To make the best use of Terra, it helps to understand where your data are stored. When you use Terra to import or generate any type of data (code, results, etc.), the data reside in one or more of the following three locations:
- Your Virtual Machine - The VM is a computer cluster requisitioned by Broad's Cromwell engine and GCP's API. When you spin up a cluster, one of the parameters you set is the amount of memory on your VM. When doing Notebook computations, for example, the output is first generated here before it can be sent to another location.
- Your Workspace's dedicated Google Bucket - For quick interaction, such as with Jupyter-based analysis, data can be sent from the VM to a separate cloud-based location called the Workspace Bucket. Data can be sent to the Workspace Bucket either from a Notebook or from a Workflow, making it easy to keep track of what's happening with your data.
- Other storage - For large outputs, Terra will organize your data into Google Buckets as specified by a given task. You can then query or download these data through Terra. If you need to upload data to a Google bucket, see this article.
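As a rough illustration of moving a file from the VM to the Workspace Bucket, the sketch below assembles a gsutil copy command from inside a notebook. It assumes the notebook environment exposes the bucket address in an environment variable named WORKSPACE_BUCKET and that gsutil is installed on the VM. The helper name build_copy_command, the notebook-outputs folder, and the fallback bucket name are made up for this example.

```python
import os


def build_copy_command(local_path, bucket_url, dest_folder="notebook-outputs"):
    """Return the gsutil argument list that would copy local_path into the bucket.

    Hypothetical helper: it only builds the command and does not run it.
    """
    if not bucket_url.startswith("gs://"):
        raise ValueError("bucket_url must start with gs://")
    file_name = os.path.basename(local_path)
    destination = f"{bucket_url.rstrip('/')}/{dest_folder.strip('/')}/{file_name}"
    return ["gsutil", "cp", local_path, destination]


# Assumption: WORKSPACE_BUCKET is set by the notebook environment.
# The fallback is a placeholder so the sketch also runs elsewhere.
bucket = os.environ.get("WORKSPACE_BUCKET", "gs://example-workspace-bucket")
command = build_copy_command("results/summary.csv", bucket)
print(" ".join(command))

# To perform the copy on the VM, pass the list to subprocess:
#   import subprocess
#   subprocess.run(command, check=True)
```

Building the command separately from running it makes it easy to check the destination path before any data leave the VM.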
How to access, organize, and interact with data in Terra
Terra's cloud-based model for data is built on the idea that you shouldn't need to copy or store data on your own local machine. Terra streamlines interaction with data in several ways:
- Terra's public data section features large data sets already in the cloud, including both public-access sets like 1000 Genomes and restricted-access sets like UK Biobank.
- The data model, which can be found in the Data tab of a workspace, provides a simple interface for organizing metadata and linking it to Google buckets.
- Terra's built-in Jupyter Notebooks provide another way to transfer files between online data stores, such as your dedicated workspace bucket, your Docker-based virtual machine, and other external storage.
Accessing Public Data
The data biosphere is one of Terra's key offerings. The Terra library contains links to a number of datasets wrapped in convenient interfaces called Data Explorers. Each Data Explorer is managed by the entity hosting the data, and the data include a variety of clinical and genomic information.
If you are part of an organization that is interested in hosting your own Data Explorer, read here.
The Data Model
You can use the in-app workspace data table to organize and keep track of data in the cloud. It's like a giant, expandable spreadsheet that coordinates participants, participant IDs, phenotypes, metadata for samples, and more. Its flexible design lets you keep as much metadata as you need in one place, which helps with collaboration and makes your work more reproducible. If you configure workflows and notebooks to write output metadata to the table, intermediate and other output files are associated with the input files by default, no matter where the files are physically stored. Though data tables take a bit of setup time at the beginning, they can be enormously useful, especially as the amount of data grows. Imagine keeping track of hundreds or thousands of participants and their data as easily as one or two!
For more details about populating the workspace data table, see this article.
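To make the spreadsheet analogy concrete, here is a minimal sketch that renders a small sample table as tab-separated text, with each row linking an ID and its metadata to a file path in a Google bucket. It assumes the table is loaded from a tab-separated file whose first column header has the form entity:<type>_id. Check the article referenced above for the exact format Terra expects. The function name build_table_tsv, the column names, and the bucket paths are illustrative only.

```python
import csv
import io


def build_table_tsv(entity_type, rows):
    """Render rows as tab-separated text with an 'entity:<type>_id' first column.

    Hypothetical helper: each row is a dict with an 'id' key plus any metadata.
    """
    id_column = f"entity:{entity_type}_id"
    # Collect every other column name in first-seen order.
    columns = []
    for row in rows:
        for key in row:
            if key != "id" and key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow([id_column] + columns)
    for row in rows:
        # Missing metadata is written as an empty cell.
        writer.writerow([row["id"]] + [row.get(col, "") for col in columns])
    return buffer.getvalue()


# Made-up sample rows; the gs:// paths are placeholders.
samples = [
    {"id": "sample_001", "participant": "participant_A",
     "cram": "gs://example-bucket/sample_001.cram"},
    {"id": "sample_002", "participant": "participant_B",
     "cram": "gs://example-bucket/sample_002.cram"},
]
print(build_table_tsv("sample", samples))
```

Because the table stores only paths and metadata, the files themselves can stay wherever they already live in the cloud.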
Interacting with Data
The last step in understanding Terra's cloud structure is to practice interacting with data. To help with this, the Terra Quickstart featured workspace guides you through a roughly one-hour set of exercises to familiarize you with:
- How to analyze data interactively with a Jupyter Notebooks primer
- How to access and interact with a cohort of data hosted in Terra's Data Library
- How to import cohort data to a Terra workspace and run a quick analysis in a notebook
- How to link unformatted genomics data in a Google bucket to your workspace data table for processing and analysis
- How to set up, launch, and monitor an analysis workflow using the linked data as input
Once you are comfortable with these basics, you can browse the other offerings in our Showcase and Tutorials section to learn more about many of our popular workflows.