Learn how cloud storage (Workspace buckets) and the Cloud Environment (virtual machine, application configuration, and persistent disk) integrate with the Terra workspace and how to store and work with files across the different resources.
If you've just started using Terra, your head might be swimming in terms like "Google buckets," "cloud compute," and "persistent disks." You might be wondering, "What do all these Terra workspace components do and how do they fit together?" or "Where are my files actually located when I'm using these features?"
Glancing at a Terra workspace, it may seem like all the components for your analysis are just parts of the Terra platform. You can access them all right at the tip of your fingers (or rather, your browser).
In reality, many of Terra's features use external cloud resources such as Google buckets and virtual machines (VMs) on the Google Cloud Platform, workflows in Dockstore or the Broad Methods Repository, and data from an external server such as Gen3 (AnVIL instance). Terra interfaces with these resources behind the scenes so you can use them in your workspace.
In this article, we review the basic Terra architecture, focusing on how Terra manages cloud storage and Cloud Environments (VMs) and how your files are stored across these resources.
Terra - the basic components
The diagram below shows how a Terra workspace, GitHub, and the various storage systems (such as disks and Google Cloud Storage buckets) are related. This is useful background, since you interact with all of these during source control activities. Read on to learn more about each part, who has access to it, and how the parts communicate. Components in green (Terra Billing project, workspace) are part of Terra's infrastructure. The Google project, workspace bucket, and external cloud storage are built on GCP infrastructure.
To create a workspace, you need a Terra billing project, which you can also choose to share with your colleagues. The same billing project can be used for more than one workspace. Terra uses the billing project as a pass-through to pay GCP for storage, egress and compute costs.
To learn more, see the Managing Cloud Costs section in Terra Support.
When you create a workspace, Terra generates a Google project to organize all your cloud resources. The Google project is mostly behind the scenes, unless you want to access features not (yet) built into Terra, such as itemized workspace billing or setting up budget alerts.
Note that Terra Billing projects and Google projects are separate entities. Terra creates the Google project for a workspace, and this works in only one direction: it is not possible to create a GCP project yourself and move or assign it to a Terra workspace.
The heart of Terra is the workspace - where you can organize and access data and tools and perform analyses. It includes built-in interfaces for documentation (the dashboard), data, interactive analysis and bulk analysis (pipelining/workflows) tools, and provenance (a Job History - i.e. record of all submitted workflows).
To learn more, see the Workspaces section in Terra Support.
Each workspace has its own unique Google bucket for storage (white rectangle with storage icon inside the pale green workspace above). The workspace bucket exists on the Google Cloud Platform, and is created/deleted when the workspace is created/deleted.
How is a workspace bucket different from any other Google bucket? Unlike external Google buckets, the workspace bucket is covered by Terra's built-in security (see Terra security posture to learn more) and interfaces with Terra directly, so you can see what’s inside it and manipulate it from the Files section of the workspace Data page. You can link to, and access, data in any Google bucket (workspace or external) from one or more data tables (the table icon in the top right of each workspace above), which integrate directly with workflows in Terra.
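As a rough illustration of how a data table can point to files in a bucket, the sketch below prints a small tab-separated file in which each row pairs an ID with a gs:// path. It assumes the table is loaded from a TSV whose first column header has the form entity:<table name>_id. The function name, the file_path column, the bucket name, and the file names are all hypothetical placeholders, not values from your workspace.

```shell
#!/usr/bin/env bash
# Sketch only: the bucket, table, column, and file names are hypothetical.

# make_table_tsv TABLE BUCKET_PREFIX FILE...
# Prints a TSV with one row per file, linking each row to its gs:// path.
make_table_tsv() {
  local table="$1" prefix="${2%/}"   # drop a trailing slash from the prefix
  shift 2
  # Header row: ID column (assumed entity:<table>_id form) plus one data column.
  printf 'entity:%s_id\tfile_path\n' "$table"
  local f
  for f in "$@"; do
    # Row ID is the file name without its extension.
    printf '%s\t%s/%s\n' "${f%.*}" "$prefix" "$f"
  done
}

# Example call with placeholder names; redirect to a .tsv file to save it.
make_table_tsv sample "gs://my-workspace-bucket/bams" NA12878.bam NA12891.bam
```

The table stores only the paths; the files themselves stay in the bucket.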
The bucket and Google project are "attached" to the workspace. If you delete the workspace, you delete the bucket and the Google project. And when you share workspace ownership with colleagues, you also share the workspace Google bucket (see Best practices for sharing and protecting data resources to learn more) with the same permission levels.
The Cloud Environment consists of a Google Cloud virtual machine (VM or cluster) loaded with software, a boot disk and Persistent Disk storage. You can delete and update the Cloud Environment as your software needs evolve for your analysis.
Your Cloud Environment is unique to you
While workspace co-owners have access to the same workspace Google bucket, they do not have access to each other's Cloud Environment (VM and PD) - they won't see your Cloud Environment files and you can't see theirs.
To learn more about managing and using your workspace Cloud Environment, see the Cloud Environments Analysis section in Terra Support.
Your Cloud Environment's virtual machine (VM) is like a personal computer in the cloud. Just like your computer, it has memory and a file system. In every workspace, each individual user can create a unique Cloud Environment with a personal VM and Persistent Disk.
You can choose to keep or delete your PD when you recreate or delete your Cloud Environment VM. To learn more, see Understanding and adjusting your Cloud Environment.
Cloud Environment VM versus Workflow VM
Your custom Cloud Environment VM is used for interactive analysis applications like Notebooks, RStudio and Galaxy. It is separate from the Google Cloud VM used for running workflows.
Your workflow VM(s) (not shown in the diagram above) interact with execution tools like Cromwell, which automatically create your VM and transfer your workflow data.
The persistent disk (PD) storage is a VM component that can be detached and reattached to the Cloud Environment VM (like a USB drive) when you recreate it. If you need to update your VM with new software, for example, your PD allows you to save files for later use.
Because it's part of your VM, the persistent disk is also personal to you, the user; nobody else has access to it, including collaborators sharing the same workspace.
Saving files (PD directory structure)
If you're saving to the PD, the folder location you use depends on the application you're running. When using Notebook applications, move files to:
Run !echo $HOME from within your notebook to figure out the name of the home directory.
When using RStudio, move files to:
Just like your local computer, your VM and its accompanying persistent disk storage can be accessed using a Terminal (black box in diagram above) and bash terminal commands, like "pwd," "ls," "ftp," "curl," "ssh," and more.
You can use the terminal to move files from a VM directory into the persistent disk. You can also use the terminal to move files from external cloud resources into the VM, and vice versa (see below).
To learn more, see Using the Terminal and interactive analysis shell in Terra.
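As a minimal sketch of moving a file within the VM from the terminal, the function below copies a file into a folder under your home directory. It assumes that the folder you save to lives under the directory reported by echo $HOME. The function name save_to_home and the folder name saved_results are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: the folder and file names are hypothetical.

# save_to_home SRC_FILE [DEST_DIR]
# Copies SRC_FILE into DEST_DIR (default: $HOME/saved_results).
save_to_home() {
  local src="$1"
  local dest_dir="${2:-$HOME/saved_results}"
  mkdir -p "$dest_dir" || return 1          # create the folder if it is missing
  cp -- "$src" "$dest_dir/" || return 1     # copy, keeping the file name
  echo "$dest_dir/$(basename -- "$src")"    # print where the copy landed
}

# Show the current directory and the name of the home directory.
pwd
echo "$HOME"
```

For example, save_to_home /tmp/results.csv would copy that file and print the new location.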
You, your colleagues or organization may also have data stored in external cloud storage (shown in grey, outside Terra, above). This could be a Google bucket, an Amazon Web Services data lake, Azure blob, etc. Although external cloud storage integrates with Terra - in the sense that you can import data for an analysis - you can't peek inside an external bucket using the Terra platform (i.e. see what's inside from Files in the Terra Data page). You can, however, see the data and its metadata in a workspace table, no matter where in the cloud it is stored.
To learn more about cloud-agnostic Uniform Resource Identifiers (URIs), see Data access with the GA4GH Data Repository Service (DRS).
The different storage systems have different life cycles
Workspace bucket: The workspace bucket is created with the workspace, and is deleted when the workspace is deleted.
Persistent disk: The persistent disk is created with the first cloud environment. Unless you intentionally delete it, it exists as long as you are a member of the billing project. If you delete it at the same time as you delete your cloud environment, a new persistent disk is created the next time you create a cloud environment. Any data that was on the persistent disk before deletion is lost.
VM boot disk: The boot disk of a Compute Engine instance has the exact same life cycle as the Compute Engine instance: it gets created with the Compute Engine instance and deleted once the Compute Engine instance is deleted. Any data on a boot disk is lost when the Compute Engine instance is deleted. It is best practice to not store data on the boot disk that you cannot completely recreate.
Communicating between cloud resources
Now that you've been introduced to the (very) basic pieces of the Terra architecture, you might be wondering how you move files between them. If you want to archive data generated in a Cloud Environment app, share it with a colleague, or use it as workflow input, you will need to copy it to the workspace bucket or another Google bucket. There are a number of different ways to move files; which one you use depends on the size and number of files you are moving and how comfortable you are with each tool.
To learn more about why and how to move files from your Cloud Environment PD to the Workspace bucket, see How (and why) to save data generated in a notebook to a Workspace bucket.
A note about collaborator access
While the VM, its boot disk, and the persistent disk in your cloud environment are accessible only by you, the workspace bucket is accessible by every user who has access to the workspace. Keep any data that you do not want to share on the persistent disk. (You could also keep it on the Compute Engine instance boot disk, but that is not good practice, because the boot disk is deleted every time the Compute Engine instance is deleted and recreated.)
The opposite use-case is true, as well: to share data generated in an interactive analysis (and stored by default on your Cloud Environment PD), you will need to copy it to the workspace bucket. This is because each user's Cloud Environment (and associated PD) is unique.
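To make the sharing step concrete, here is a hedged sketch of copying a single file from your PD to the workspace bucket with gsutil cp. It assumes an environment variable named WORKSPACE_BUCKET holds your bucket's gs:// address; if it is not set in your environment, set it yourself. The function names and the shared folder are hypothetical, and setting DRY_RUN=1 prints the command instead of running it.

```shell
#!/usr/bin/env bash
# Sketch only. Assumes WORKSPACE_BUCKET is set to a gs:// address and that
# gsutil is on the PATH. Function and folder names are hypothetical.

# bucket_uri_for FILE [PREFIX]: print the gs:// destination for FILE.
bucket_uri_for() {
  local file="$1"
  local prefix="${2:-shared}"
  local bucket="${WORKSPACE_BUCKET:?set WORKSPACE_BUCKET to your gs:// bucket address}"
  echo "${bucket%/}/${prefix}/$(basename -- "$file")"
}

# share_file FILE [PREFIX]: copy FILE into the bucket so that everyone
# with access to the workspace can reach it.
share_file() {
  local file="$1"
  local dest
  dest="$(bucket_uri_for "$file" "${2:-shared}")" || return 1
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "gsutil cp $file $dest"     # only show what would run
  else
    gsutil cp "$file" "$dest"
  fi
}
```

Running DRY_RUN=1 share_file ./plot.png first is a cheap way to check the destination before anything is copied.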
Downloading generated files from the PD (small numbers, small files)
You can download files generated in a notebook right in your workspace Cloud Environment by clicking on the Jupyter logo in an open notebook and right-clicking on a file to download.
Uploading from local storage to the Workspace bucket (small numbers, small files)
For small numbers of small files, you can upload to your Workspace bucket from the Data page, by clicking on the File icon (at the bottom of the left column) and the "+" icon (at the bottom of the page).
Moving large numbers or large data files between Workspace bucket and PD (gsutil)
When it comes to moving files between Google Cloud storage (including your Workspace bucket) and your VM/PD, you'll often want to use gsutil running in a Cloud Environment terminal instance. gsutil is Google Cloud Platform's utility package for manipulating cloud data on the GCP infrastructure.
The Cloud Environment's default software (or application configuration) includes gsutil pre-installed on your VM/Terminal. You can use gsutil commands to move files from your workspace/external Google bucket into your PD, and vice versa. You can also use this tool to move files between different external Google buckets.
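For larger transfers, a sketch along these lines may help. It wraps gsutil -m cp -r, which copies a folder recursively with parallel transfers, and works in either direction (bucket to PD, PD to bucket, or bucket to bucket). The bucket names and function names are placeholders, and with DRY_RUN=1 the commands are printed, not run.

```shell
#!/usr/bin/env bash
# Sketch only: bucket names are placeholders; gsutil must be on the PATH
# for real (non-dry-run) transfers.

# run_or_print CMD...: run the command, or only print it when DRY_RUN=1.
run_or_print() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# copy_tree SRC DST: recursive (-r), parallel (-m) copy between a bucket
# and a local directory, or between two buckets.
copy_tree() {
  run_or_print gsutil -m cp -r "$1" "$2"
}

# Bucket -> PD, then external bucket -> external bucket (printed only).
DRY_RUN=1 copy_tree "gs://my-workspace-bucket/inputs" "$HOME/inputs"
DRY_RUN=1 copy_tree "gs://external-bucket-a/data" "gs://external-bucket-b/data"
```

Dropping DRY_RUN=1 runs the same commands for real, so it is worth checking the printed source and destination first.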
- Cloud Environment FAQs: commonly asked questions and answers about the Cloud Environment
- Understanding and Customizing your Cloud Environment: a guide to setting up the software and compute for your analysis needs
- Using the Terminal and Interactive Analysis in Shell in Terra: a guide to using the Terminal
- Moving data to/from a Google bucket (workspace or external): useful instructions and code for moving files