Learn how cloud storage and compute resources are integrated in the Terra platform, and how to store and work with files and analysis tools across the different resources from within your workspace.
Overview: Terra's cloud-native platform architecture
It may seem like the data and tools for your analysis are all parts of the Terra platform, like a computer at your fingertips in a browser. But Terra isn't a single physical computer. It's actually multiple virtual computer systems - all existing in the cloud - working together behind a browser interface.
Terra integrates a range of Terra-specific and external Azure and Google Cloud resources to create a unified analysis experience. All of these resources exist in public Azure and Google cloud infrastructure. Terra interfaces with these resources behind the scenes so you can use them seamlessly in your workspace.
In this article, we review the basic Terra architecture, focusing on how your files are referenced and stored across these locations, and how they're accessed when running a workflow or interactive analysis.
- Cromwell runs workflows (top green section in the table below)
- The Cloud Environment runs interactive analysis apps (in blue).
- Workspace Cloud storage (linked to workflow analyses)
- Cloud Environment Persistent Disk (associated with interactive analyses)
- Data tables, which are part of the Terra infrastructure (in green) that keep track of tabular data. Tables are also integrated with the workflow VM in that large files referenced in a table can be used as workflow inputs.
The diagram below shows how a Terra workspace and cloud resources (VMs and storage systems like persistent disks and Google Cloud Storage buckets) are related. This background information can be important as you work with all of these in Terra.
Components in green (workspace, data table) are part of Terra's infrastructure. Parts in white (workspace storage and Cloud Environment VM, Persistent Disk, and VM boot disk) are Google Cloud resources. The Cloud Environment (in blue) runs interactive analysis apps. Note that Terra sets up a separate VM (not shown) when you run a workflow.
The heart of Terra is the workspace - where you can organize and access data and tools and perform analyses. It includes built-in interfaces for documentation (the dashboard), data, interactive analysis and bulk analysis (pipelining/workflows) tools, and provenance (a Job History, i.e. a record of all submitted workflows). To learn more, see the Workspaces section in Terra Support.
Each workspace has its own unique Google bucket for storage (white rectangle with storage icon inside the pale green workspace above). The workspace bucket exists on the Google Cloud, and is created/deleted when the workspace is created/deleted.
How is a workspace bucket different from any other Google bucket? Unlike external Google buckets, the workspace bucket is covered by Terra's built-in security (see Terra security posture to learn more) and is exposed in the Terra user interface: you can see what’s inside it and manipulate it from the Files section of the workspace Data page. You can point to data in a Google bucket - workspace or external - from one or more data tables. Workflows can use the metadata in the table to pull the data files into the virtual compute engine for analysis.
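To make the idea of a table "pointing to" a file more concrete, here is a minimal Python sketch that splits a `gs://` URI, of the kind a data table cell might hold, into its bucket name and object path. The helper name `split_gs_uri` and the example bucket and file names are hypothetical. The sketch only illustrates the URI structure and is not a description of how Terra resolves files internally.

```python
from urllib.parse import urlparse


def split_gs_uri(uri: str) -> tuple[str, str]:
    """Split a gs:// URI into (bucket_name, object_path)."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Not a Google Cloud Storage URI: {uri!r}")
    # netloc is the bucket; path is the object name with a leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical bucket and file names, for illustration only.
bucket, obj = split_gs_uri("gs://example-workspace-bucket/samples/sample1.cram")
print(bucket)  # example-workspace-bucket
print(obj)     # samples/sample1.cram
```

The same URI shape applies whether the file lives in the workspace bucket or in an external Google bucket, which is why a table can reference either.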
The bucket is "attached" to the workspace. If you delete the workspace, you delete the bucket and whatever data is stored there. And when you share workspace ownership with colleagues, you also share the workspace Google bucket and its contents with the same permission levels. See Best practices for sharing and protecting data resources to learn more.
All newly created GCP workspace buckets will have Autoclass turned on by default. Autoclass automatically moves data to colder storage classes to reduce storage costs using a predefined lifecycle policy. There are no early deletion charges, no retrieval charges, and no charges for storage class transitions. For more information, see Google's documentation on Autoclass.
The Cloud Environment consists of a Google Cloud virtual machine (VM or cluster) loaded with software, a boot disk and Persistent Disk storage. You can delete and update the Cloud Environment as your software and computational power needs evolve.
Your Cloud Environment is unique to you
While workspace co-owners have access to the same workspace Google bucket, they do not have access to each other's Cloud Environment (VM and PD) - they won't see your Cloud Environment files and you can't see theirs.
To learn more about managing and using your workspace Cloud Environment, see the Cloud Environments Analysis section in Terra Support.
Your Cloud Environment's virtual machine (VM) is like a personal computer in the cloud. Just like your computer, it has memory and a file system. In every workspace, each individual user can create a unique Cloud Environment with a personal VM and Persistent Disk.
You can choose to keep or delete your PD when you recreate or delete your Cloud Environment VM. To learn more, see Understanding and adjusting your Cloud Environment.
Cloud Environment VM versus Workflow VM
Your custom Cloud Environment VM is used for interactive analysis applications like Notebooks, RStudio and Galaxy. It is separate from the Google Cloud VM used for running workflows (read more about it here).
Your workflow VM(s) (not shown in the diagram above) are managed by execution tools like Cromwell, which automatically creates the VM(s) and transfers your workflow data.
The persistent disk (PD) storage is a VM component that can be detached and reattached to the Cloud Environment VM (like a USB drive) when you recreate it. If you need to update your VM with new software, for example, your PD allows you to save files for later use.
Because it's part of your VM, the persistent disk is also personal to you, the user; nobody else has access to it, including collaborators sharing the same workspace.
Just like your local computer, your VM and its accompanying persistent disk storage can be accessed using a Terminal (black box in diagram above) and bash terminal commands, like "pwd," "ls," "ftp," "curl," "ssh," and more.
You can use the terminal to move files from a VM directory into the persistent disk. You can also use the terminal to move files from external cloud resources into the VM, and vice versa (see below).
To learn more, see Using the Terminal and interactive analysis shell in Terra.
You, your colleagues or organization may also have data stored in external cloud storage (shown in grey, outside Terra, above). This could be a Google bucket, an Amazon Web Services data lake, Azure blob, etc. Although external cloud storage integrates with Terra - in the sense that you can import data for an analysis - you can't peek inside an external bucket using the Terra platform (i.e. see what's inside from Files in the Terra Data page). You can, however, see the data and its metadata in a workspace table, no matter where in the cloud it is stored.
To learn more about cloud-agnostic Uniform Resource Identifiers (URIs), see Data access with the GA4GH Data Repository Service (DRS)
The different storage systems have different life cycles
Workspace storage (Google bucket): The workspace bucket is created with the workspace, and is deleted when the workspace is deleted.
Persistent disk: The persistent disk is created with your first cloud environment. Unless you intentionally delete it, it exists as long as you are a member of the billing project. If you choose to delete it when you delete your cloud environment, any data on it is lost, and a new, empty persistent disk is created the next time you create a cloud environment.
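The persistent disk rules above can be summarized in a small Python model. This is only an illustrative sketch: the class and function names are hypothetical, and the code models the behavior described in this article rather than calling any Terra API.

```python
from dataclasses import dataclass


@dataclass
class UserResources:
    """One user's Cloud Environment resources in a workspace (illustrative)."""
    has_vm: bool = False
    has_pd: bool = False
    pd_files: tuple = ()


def create_cloud_environment(res: UserResources) -> UserResources:
    # An existing PD is reattached with its files; otherwise a new, empty one is made.
    if res.has_pd:
        return UserResources(True, True, res.pd_files)
    return UserResources(True, True, ())


def delete_cloud_environment(res: UserResources, delete_pd: bool) -> UserResources:
    # The VM always goes away; the PD and its files survive unless deleted too.
    if delete_pd:
        return UserResources(False, False, ())
    return UserResources(False, True, res.pd_files)


env = create_cloud_environment(UserResources())
env = UserResources(env.has_vm, env.has_pd, ("results.csv",))  # save a file on the PD

kept = create_cloud_environment(delete_cloud_environment(env, delete_pd=False))
lost = create_cloud_environment(delete_cloud_environment(env, delete_pd=True))
print(kept.pd_files)  # ('results.csv',)
print(lost.pd_files)  # ()
```

Keeping the PD when recreating a Cloud Environment preserves the file, while deleting both leaves you with a fresh, empty disk.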
What happens when multiple people spin up cloud environments in the same workspace?
Cloud Environments are unique to each user. This means that two people in the same workspace working in a Jupyter or RStudio cloud environment will each have a separate Cloud Environment running, each with its own VM and Persistent Disk.
Colleagues cannot see generated data stored on the persistent disk because each user has their own PD. Note that everyone will be able to see data displayed in notebook cells or standard output, since .ipynb files sync with workspace storage. To share generated data stored in the Cloud Environment PD, you will need to copy to workspace storage (see instructions below).
- Avoiding overwriting: When a second user launches the same Jupyter notebook, they will be in "Playground" mode. This prevents users from overwriting each other, since .ipynb files are saved in the workspace storage (Google bucket).
Cloud component organization
The left column in the table below outlines cloud components for data storage. The right column specifies the compute resources.
Where your data is stored depends on what kind of data it is
Data "in a Terra workspace" falls into two different data types - tabular data and unstructured data. Each type has its own cloud storage mechanism in Terra.
- Unstructured data can be stored in the workspace's dedicated cloud storage (Google Bucket) or the Cloud Environment Persistent Disk.
- Tabular data (including primary data like clinical records or lab results, or links to large files in cloud storage) is stored in data tables.
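As an illustration of tabular data that links to large files, the following Python sketch builds a small tab-separated table whose rows pair a sample ID with a `gs://` path. The sample IDs, bucket name, and `cram` column are hypothetical. The `entity:sample_id` header is an assumption based on the convention Terra load files use for the ID column, so check the data table documentation for the exact format before uploading anything.

```python
import csv
import io

# Hypothetical sample IDs and gs:// paths; replace with your own.
rows = [
    {"entity:sample_id": "sample1", "cram": "gs://example-bucket/sample1.cram"},
    {"entity:sample_id": "sample2", "cram": "gs://example-bucket/sample2.cram"},
]


def to_tsv(table_rows):
    """Render a list of dicts as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(table_rows[0]),
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(table_rows)
    return buffer.getvalue()


tsv_text = to_tsv(rows)
print(tsv_text)
```

The table itself stays small because it stores only the links; the large files remain in cloud storage.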
Where your data is stored may depend on your analysis
- Generated data from a workflow analysis is stored in workspace blob storage by default.
- Data generated in JupyterLab is stored in the Cloud Environment Persistent Disk by default.
| Data Storage in a Terra Workspace | Virtual Machines in a Terra Workspace |
| --- | --- |
| Workspace cloud storage (Google bucket): created when a workspace is created or cloned; exists until the workspace is deleted | Workflow VM: configured in WDL; can be one or many VMs, as needed, for parallel processing jobs; exists only while the workflow is running |
| Cloud Environment persistent disk: exists until you delete the PD | Cloud Environment VM: exists until you delete the Cloud Environment |
| Data tables (Terra infrastructure): store tabular primary data (such as phenotypic or demographic data or personal health records); reference URIs (metadata) for data files in external or workspace cloud storage | |
Communicating between cloud resources
You might be wondering when to move files between the different storage options and VMs and how to do it.
Reasons to move data from the PD to workspace storage (Google bucket)
- To share data generated in Galaxy, Jupyter, or RStudio with a colleague
- To use generated data as workflow input
- To archive data generated in a Cloud Environment app
How to move files between different Terra storage options
There are a number of different ways to move files; which you use depends on the size and number of files you are moving and how comfortable you are with each tool.
To learn more about why and how to move files from your Cloud Environment PD to the Workspace bucket, see How (and why) to save data generated in a notebook to a Workspace bucket.
A note about collaborator access
While the cloud environment VM, boot disk, and persistent disk are accessible only by you, the workspace storage (Google bucket) is accessible by all colleagues with access to the workspace. Any data you do not want to share must be stored on the persistent disk (or the Compute Engine instance boot disk, which is not a good practice because the boot disk is deleted every time the Compute Engine instance is deleted and recreated).
The opposite use-case is true, as well: to share data generated in an interactive analysis (and stored by default on your Cloud Environment PD), you will need to copy it to the workspace bucket. This is because each user's Cloud Environment (and associated PD) is unique.
Unfamiliar with gsutil? See the gsutil tutorial to learn how to set up and run gsutil.
You can download files generated in a notebook right in your workspace Cloud Environment by clicking on the Jupyter logo in an open notebook, selecting the file you want to download, and clicking on Download.
For small numbers of small files, you can upload to your Workspace bucket from the Data page, by clicking on the Files icon (at the bottom of the left column) and the Upload button (at the top of the page).
When it comes to moving files between Google Cloud storage (including your Workspace bucket) and your VM/PD, you'll often want to use gsutil running in a Cloud Environment terminal instance. gsutil is Google Cloud Platform's utility package for manipulating cloud data on the GCP infrastructure.
The Cloud Environment's default software (or application configuration) includes gsutil pre-installed on your VM/Terminal. You can use gsutil commands to move files from your workspace/external Google bucket into your PD, and vice versa. You can also use this tool to move files between different external Google buckets.
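As a rough illustration of what such transfers look like, the Python sketch below assembles `gsutil cp` commands without running them. The helper name `gsutil_cp_command`, the file paths, and the fallback bucket name are hypothetical. The sketch also assumes that the Cloud Environment exposes the workspace bucket's `gs://` address in a `WORKSPACE_BUCKET` environment variable, so confirm that in your own environment.

```python
import os
import shlex


def gsutil_cp_command(source: str, destination: str, recursive: bool = False) -> list[str]:
    """Build (but do not run) a gsutil copy command as an argument list."""
    command = ["gsutil"]
    if recursive:
        # -m parallelizes the transfer; -r copies a whole directory tree.
        command += ["-m", "cp", "-r"]
    else:
        command += ["cp"]
    return command + [source, destination]


# Assumption: WORKSPACE_BUCKET holds the workspace bucket's gs:// URI.
# The fallback bucket name is a placeholder.
bucket = os.environ.get("WORKSPACE_BUCKET", "gs://example-workspace-bucket")

# PD -> workspace bucket (for example, to share a result with colleagues).
upload = gsutil_cp_command("results/summary.csv", f"{bucket}/results/")
# Workspace bucket -> PD (for example, to pull a folder of inputs).
download = gsutil_cp_command(f"{bucket}/inputs", "./inputs", recursive=True)

print(shlex.join(upload))
print(shlex.join(download))
```

To actually perform a copy from a notebook, you could pass one of these lists to `subprocess.run(..., check=True)`, or type the printed command directly into the terminal.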
Behind the curtain, every workspace has its own Google project
Google projects are a Google infrastructure tool for managing and deploying cloud resources. Terra uses the workspace Google project to track spending (cloud storage, compute and egress costs) in the workspace. Because Google uses projects to make sure cloud resources are distributed efficiently, they are sometimes subject to quotas. This can limit the number and size of VMs and disks that can run in a single workspace.
For more details, see Are resource quotas slowing your analysis down?
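To see how a quota can limit parallel work, consider the following back-of-the-envelope Python sketch. All numbers are hypothetical (a 24-CPU quota, 4-CPU VMs, and 100 parallel tasks), the function names are invented for illustration, and real quotas cover more than CPUs, so treat this only as an intuition aid.

```python
import math


def max_concurrent_vms(cpu_quota: int, cpus_per_vm: int) -> int:
    """How many identical VMs fit under a CPU quota at the same time."""
    if cpus_per_vm <= 0:
        raise ValueError("cpus_per_vm must be positive")
    return cpu_quota // cpus_per_vm


def waves_needed(num_tasks: int, cpu_quota: int, cpus_per_vm: int) -> int:
    """Number of sequential batches needed when each task needs its own VM."""
    slots = max_concurrent_vms(cpu_quota, cpus_per_vm)
    if slots == 0:
        raise ValueError("a single VM already exceeds the quota")
    return math.ceil(num_tasks / slots)


# Hypothetical numbers: a 24-CPU quota, 4-CPU VMs, and 100 parallel tasks.
print(max_concurrent_vms(24, 4))  # 6
print(waves_needed(100, 24, 4))   # 17
```

In this made-up case only six VMs can run at once, so the 100 tasks run in 17 batches rather than all in parallel.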
Terra Billing Project
All Google Cloud costs are passed through a Terra Billing project (the green rectangle in the diagram below). Billing projects are linked to a Google Cloud billing account and can encompass many Terra workspaces.
- Cloud Environment FAQs: commonly asked questions and answers about the Cloud Environment
- Understanding and Customizing your Cloud Environment: a guide to setting up the software and compute for your analysis needs
- Using the Terminal and Interactive Analysis in Shell in Terra: a guide to using the Terminal
- Moving data to/from a Google bucket (workspace or external): useful instructions and code for moving files