An overview of how cloud storage (Google buckets) and the Cloud Environment (virtual machine, application configuration, and persistent disk) integrate with the Terra workspace and how your files live across the different resources.
If you've just started using Terra, your head might be swimming in terms like "Google buckets," "cloud compute," and "persistent disks." You might be wondering, "What do all these Terra workspace components do and how do they fit together?" or "Where are my files ACTUALLY located when I'm using these features?"
Glancing at a Terra workspace, it seems like everything from your Google bucket to your Cloud Environment is just part of the Terra platform. You can access it all at your fingertips (or rather, from your workspace Dashboard).
In reality, many of these features use external cloud resources such as Google buckets or virtual machines (VMs) on the Google Cloud Platform. Terra integrates these resources behind the scenes so you can use them in your workspace UI.
In this article, we review the basic Terra architecture, focusing on how Terra interacts with cloud storage and cloud environments (virtual machines) and how your files are located across these resources.
Terra: the basic components
The diagram below shows the basic Terra components and how they relate to one another. Read on to learn more about each part, who has access to them, and how they communicate.
The Workspace and its Google Bucket
The heart of Terra is the workspace, where you can organize and access data and tools and perform analyses. To create a workspace, you need a Terra billing project, which you can also choose to share with your colleagues. The same billing project can be used for more than one workspace (see the diagram above). Terra uses the billing project to pay GCP for storage, egress, and compute costs. Storage includes the workspace Google bucket (green cylinder above), which lives in the cloud.
Each workspace you create has its own unique Google bucket. This bucket is separate from your workspace: it exists on the Google Cloud Platform. However, unlike external Google buckets, the workspace bucket is special. It interfaces with Terra so you can see what's inside it and manipulate it from the Files section of the workspace Data page. Additionally, you can point (link) to, and access, data in a Google bucket from data tables (the table icon in the middle of each workspace above).
The bucket is "attached" to the workspace - if you delete the workspace, you delete the bucket. And when you share workspace ownership with colleagues, you also share the workspace Google bucket (see more on sharing here).
Other (external) cloud storage
You, your colleagues, or your organization may additionally have data stored in external cloud storage (shown in grey above). This could be a Google bucket, Amazon Web Services, Azure, etc. Although external cloud storage integrates with Terra in the sense that you can import data, you can't peek inside an external bucket using the Terra platform (i.e., see what's inside from the Terra Data page).
On Terra, you can create and access a Cloud Environment, which allows you to run interactive applications like Jupyter Notebooks, RStudio, or Galaxy. This environment consists of a Google Cloud virtual machine (VM; shown in dark green above) or cluster loaded with software and persistent disk storage. You can delete and update the Cloud Environment as your software needs evolve for your analysis (read more about updating the Cloud Environment in this article).
Virtual Machine (VM)
Your Cloud Environment's virtual machine (VM) is kind of like a personal computer in the cloud. Just like your computer, it has memory and a file system. For every billing project, an individual user can create a unique Cloud Environment with a personal VM. While workspace co-owners have access to the same workspace Google bucket, they do not have access to each other's VM - they won't see your VM files and you can't see theirs.
When you create a second workspace on the same billing project, the two workspaces share the same Cloud Environment/VM (see the top diagram). This will be changing in the near future, but for now, it means if you launch and save files to the VM using one workspace, you’ll be able to see the files stored in that VM from the second workspace.
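One way to picture this sharing rule is as a lookup keyed by user and billing project rather than by workspace. The Python sketch below is only a mental model under that assumption, not how Terra is implemented; `ToyTerra`, `CloudEnvironment`, and all field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CloudEnvironment:
    """A user's personal VM; `files` stands in for its file system."""
    owner: str
    billing_project: str
    files: set = field(default_factory=set)


class ToyTerra:
    """Toy model: one Cloud Environment per (user, billing project)."""

    def __init__(self):
        self._environments = {}        # (user, billing project) -> environment
        self._workspace_projects = {}  # workspace name -> billing project

    def create_workspace(self, name, billing_project):
        self._workspace_projects[name] = billing_project

    def environment_for(self, user, workspace):
        # The workspace only determines which billing project to look under.
        project = self._workspace_projects[workspace]
        key = (user, project)
        if key not in self._environments:
            self._environments[key] = CloudEnvironment(user, project)
        return self._environments[key]


terra = ToyTerra()
terra.create_workspace("workspace-a", "billing-project-1")
terra.create_workspace("workspace-b", "billing-project-1")

# A file saved to the VM from one workspace...
terra.environment_for("alice", "workspace-a").files.add("results.tsv")

# ...is visible from the second workspace, but only to the same user.
alice_sees = terra.environment_for("alice", "workspace-b").files
bob_sees = terra.environment_for("bob", "workspace-a").files
```

In this model, `alice_sees` contains `results.tsv` while `bob_sees` is empty, which mirrors the two rules above: workspaces on one billing project share a VM, and each VM is personal to its user.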
Even if your workspaces share a billing project, they may be designed for different kinds of analyses, with each one requiring different software or compute settings. As you switch between these workspaces, you might decide to update or recreate your Cloud Environment/VM with the new settings. Any time you update the Cloud Environment, it's a good idea to back up files to either the persistent disk storage or a Google bucket, as updates to the Cloud Environment might delete any files you have stored on the VM (read more about updating the Cloud Environment in this article).
Cloud Environment VM vs. Workflow VM
While your custom Cloud Environment VM is used for analysis on applications like Notebooks, it is separate from the Google Cloud VM that is used for running Terra workflows (read more about it here). Your workflow VM(s) (not shown in the diagram above) interact with execution tools like Cromwell, which automatically create your VM and transfer your workflow data.
Detachable Persistent Disk
The persistent disk (PD) storage is part of your VM, but you can detach it and reattach it, kind of like you would a USB drive. If you need to update your VM with new software or delete it for different use-cases across workspaces, your PD allows you to save files for later use. Because it's part of your VM, the persistent disk is also personal to you, the user; nobody else has access to it.
Just like your local computer, your VM and its accompanying persistent disk storage can be accessed using a Terminal (black box in diagram above) and bash terminal commands, like "pwd," "ls," "ftp," "curl," "ssh," and more.
You can use the terminal to move files from a VM directory into the persistent disk. You can also use the terminal to move files from external cloud resources into the VM (and vice versa; see below!).
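If you prefer to script the first kind of move (VM directory to persistent disk) from a notebook rather than typing `mv` in the terminal, a minimal Python sketch might look like the following. The function name `move_to_persistent_disk` is hypothetical, and both directories are passed in as parameters because the persistent disk folder depends on the application you are running.

```python
import shutil
from pathlib import Path


def move_to_persistent_disk(source_dir, pd_dir, pattern="*"):
    """Move files matching `pattern` from source_dir into pd_dir.

    Both paths are supplied by the caller; nothing here assumes where
    the persistent disk is mounted.
    """
    source = Path(source_dir)
    target = Path(pd_dir)
    target.mkdir(parents=True, exist_ok=True)

    moved = []
    # Collect matches first so moving files does not disturb the iteration.
    for path in sorted(source.glob(pattern)):
        if path.is_file():
            destination = target / path.name
            shutil.move(str(path), str(destination))
            moved.append(destination)
    return moved
```

You would call it with the scratch directory on your VM and the persistent disk folder for your application, for example `move_to_persistent_disk(scratch_dir, pd_dir, pattern="*.csv")`, where both directory variables are placeholders you fill in yourself.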
Communicating between cloud resources
Now that you've seen the (very) basic pieces of the Terra architecture, you might be wondering how you move files between them. While there are multiple ways the different Terra components communicate, when it comes to moving files between Google Cloud storage and your VM/PD, you'll need to use gsutil, Google's command-line tool for working with data in Cloud Storage.
When using the Cloud Environment's default software (or application configuration), gsutil will be preinstalled on your VM/Terminal. Using gsutil, you can move files from your workspace/external Google bucket into your VM and vice versa. You can also use this tool to move files between different external Google buckets.
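As a rough illustration of those three directions (bucket to VM, VM to bucket, and bucket to bucket), the Python sketch below builds `gsutil cp` command lines and prints them. It assumes `gsutil` is on your PATH; the bucket names and file paths are made-up placeholders, and `gsutil_cp` and `run` are hypothetical helpers, not part of Terra or gsutil. By default nothing is executed, so you can inspect each command first.

```python
import shlex
import subprocess


def gsutil_cp(source, destination, recursive=False, parallel=False):
    """Build the argument list for a `gsutil cp` call."""
    command = ["gsutil"]
    if parallel:
        command.append("-m")  # top-level gsutil flag: parallel transfers
    command.append("cp")
    if recursive:
        command.append("-r")  # cp flag: copy directories recursively
    command += [source, destination]
    return command


def run(command, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(command))
    if not dry_run:
        subprocess.run(command, check=True)


# Placeholder bucket names; substitute your own.
WORKSPACE_BUCKET = "gs://example-workspace-bucket"
EXTERNAL_BUCKET = "gs://example-external-bucket"

download = gsutil_cp(f"{WORKSPACE_BUCKET}/data/sample.txt", ".")
upload = gsutil_cp("results", f"{WORKSPACE_BUCKET}/backup/",
                   recursive=True, parallel=True)
transfer = gsutil_cp(f"{EXTERNAL_BUCKET}/data/sample.txt",
                     f"{WORKSPACE_BUCKET}/data/sample.txt")

for command in (download, upload, transfer):
    run(command)  # dry run: prints the command without executing it
```

Once the printed commands look right, you can paste them into the terminal, or call `run(command, dry_run=False)` to execute them from Python.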
Since your VM is customized with software to meet your analysis needs (GATK, Bioconductor, etc.), you might find yourself often updating or deleting your Cloud Environment/VM as you work across different workspaces and data. This is why it's important to back up VM files and/or save them to your PD storage or, for a more permanent option, a Google bucket.
If you're saving to the PD, the folder location you use depends on the application you're running (see grey box in diagram above). When using Notebook applications, move files to:
When using RStudio, move files to:
For more information on each of these features, check out the suggested support articles in the Resources section.
- Cloud Environment FAQs: commonly asked questions and answers about the Cloud Environment
- Understanding and Customizing your Cloud Environment: guide to setting up the software and compute for your analysis needs
- Using the Terminal and Interactive Analysis in Shell in Terra: a guide to using the Terminal
- Moving data to and from a workspace Google bucket: useful instructions and code for moving files