Terra architecture and where your files live in it

Liz Kiernan

Learn how cloud storage (Workspace buckets) and the Cloud Environment (virtual machine, application configuration, and persistent disk) integrate with the Terra workspace and how to store and work with files across the different resources. 

Background

If you've just started using Terra, your head might be swimming in terms like "Google buckets," "cloud compute," and "persistent disks." You might be wondering, "What do all these Terra workspace components do and how do they fit together?" or "Where are my files **actually** located when I'm using these features?" 

Glancing at a Terra workspace, it seems like all the components for your analysis are just parts of the Terra platform. You can access them all right at the tip of your fingers (or rather, your browser).

In reality, many of Terra's features use external cloud resources such as Google buckets and virtual machines (VMs) on the Google Cloud Platform. Terra integrates these resources behind the scenes so you can use them in your workspace UI.

In this article, we review the basic Terra architecture, focusing on how Terra interacts with cloud storage and Cloud Environments (virtual machines) and how your files are located across these resources.

Terra: the basic components

The diagram below shows the basic Terra components. Read on to learn more about each part, who has access to them, and how they communicate.

Cloud_Environment_architecture_Copying-data-from-bucket-to-PD_diagram.png

The Workspace and its dedicated Google Bucket

The heart of Terra is the workspace - where you can organize and access data and tools and perform analyses. To create a workspace, you need a Terra billing project, which you can also choose to share with your colleagues. The same billing project can be used for more than one workspace (see the diagram above). Terra uses the billing project as a pass-through to pay GCP for storage, egress and compute costs. 

Each workspace has its own unique Google project (gray rectangle) for workspace GCP services and Google bucket for storage (white rectangle with storage icon inside the pale green workspace above). The workspace bucket exists on the Google Cloud Platform. However, unlike external Google buckets, the workspace bucket is special - it is covered by Terra's built-in security (see this article to learn more about Terra's security posture) and interfaces with Terra directly so you can see what’s inside it and manipulate it from the Files section of the workspace Data page. Additionally, you can point/link to - and access - data in a Google bucket from one or more data tables (the table icon in the top right of each workspace above), which integrate directly with workflows in the UI. 

The bucket and Google project are "attached" to the workspace - if you delete the workspace, you delete the bucket and the Google project. And when you share workspace ownership with colleagues, you also share the workspace Google bucket (see more on sharing here) with the same permission levels. 

Other (external) cloud storage

You, your colleagues, or your organization may also have data stored in external cloud storage (shown in gray, outside Terra, above). This could be a Google bucket, Amazon Web Services, Azure, etc. Although external cloud storage integrates with Terra in the sense that you can import data, you can't peek inside an external bucket using the Terra platform (i.e., see what's inside from the Terra Data page). You can, however, see the data and its metadata in a workspace table, no matter where in the cloud it is stored.

Cloud Environment

On Terra, you can create and access a Cloud Environment, which allows you to run interactive applications like Jupyter Notebooks, RStudio, or Galaxy. This environment consists of a Google Cloud virtual machine (VM; shown in dark green above) or cluster loaded with software and persistent disk storage. You can delete and update the Cloud Environment as your software needs evolve for your analysis (read more about adjusting the Cloud Environment in this article). 

Virtual Machine (VM)

Your Cloud Environment's virtual machine (VM) is like a personal computer in the cloud. Just like your computer, it has memory and a file system. In every workspace, each individual user can create a unique Cloud Environment with a personal VM and persistent disk (PD). While workspace co-owners have access to the same workspace Google bucket, they do not have access to each other's Cloud Environment (VM and PD) - they won't see your Cloud Environment files and you can't see theirs.

Any time you update the Cloud Environment, it's a good idea to back up files to either the persistent disk storage or the Workspace bucket (read more about updating the Cloud Environment in this article). You can choose to keep or delete your PD when you recreate or delete your Cloud Environment VM.

Cloud Environment VM versus Workflow VM

Your custom Cloud Environment VM is used for interactive analysis applications like Notebooks, RStudio and Galaxy. It is separate from the Google Cloud VM used for running workflows (read more about it here). Your workflow VM(s) (not shown in the diagram above) interact with execution tools like Cromwell, which automatically create your VM and transfer your workflow data. 

Detachable Persistent Disk

The persistent disk (PD) is the part of your VM's storage that can be detached from and reattached to the Cloud Environment VM (like a USB drive) when you recreate it. If you need to update your VM with new software or delete it for different use cases across workspaces, your PD allows you to save files for later use. Because it's part of your VM, the persistent disk is also personal to you, the user; nobody else has access to it.

If you're saving to the PD, the folder location you use depends on the application you're running. When using Notebook applications, move files to:

 /home/jupyter-user/notebooks

When using RStudio, move files to:

/home/RStudio 
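As a rough sketch of what saving to the PD can look like from a terminal, the snippet below copies a file into a subfolder of the PD location listed above. The helper name `save_to_pd`, the file name, and the subfolder name are all hypothetical; only the two folder paths come from this article.

```shell
# Sketch: copy a file into a folder on the persistent disk (PD).
# PD_DIR defaults to the Jupyter location given above; set it to
# /home/RStudio when working in RStudio.
PD_DIR="${PD_DIR:-/home/jupyter-user/notebooks}"

# save_to_pd FILE [SUBFOLDER]   (hypothetical helper, not a Terra command)
save_to_pd() {
  src="$1"
  dest="$PD_DIR/${2:-}"
  mkdir -p "$dest" || return 1       # create the target folder if needed
  cp -- "$src" "$dest/" || return 1  # copy, leaving the original in place
  ls -l "$dest"                      # list the folder to confirm the copy
}

# Example call (hypothetical file and folder names):
#   save_to_pd results.csv analysis_outputs
```

Because the helper copies rather than moves, the original file stays where it was until you decide to remove it.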

Terminal

Just like your local computer, your VM and its accompanying persistent disk storage can be accessed using a Terminal (black box in diagram above) and bash terminal commands, like "pwd," "ls," "ftp," "curl," "ssh," and more.

You can use the terminal to move files from a VM directory into the persistent disk. You can also use the terminal to move files from external cloud resources into the VM, and vice versa (see below).
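For orientation, here is a minimal example of the kind of commands you might run first in a Cloud Environment terminal. These are standard bash commands; nothing here is specific to Terra, and the output will depend on your own VM.

```shell
pwd        # print the directory you are currently in
ls -lh     # list the files in that directory, with human-readable sizes
df -h .    # show used and free space on the disk that holds this directory
```

Checking free space with `df` before copying large files in can save a failed transfer later.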

Communicating between cloud resources

Now that you've been introduced to the (very) basic pieces of the Terra architecture, you might be wondering how you move files between them. If you want to archive data generated in a Cloud Environment app, share it with a colleague, or use it as workflow input, you will need to copy it to the workspace bucket or another Google bucket. There are a number of different ways to move files; which one you use depends on the size and number of files you are moving and how comfortable you are with each tool.

To learn more about why and how to move files from your Cloud Environment PD to the Workspace bucket, see this article.

Downloading generated files from the PD (small numbers, small files)

You can download files generated in a notebook right in your workspace Cloud Environment by clicking on the Jupyter logo in an open notebook and right-clicking on a file to download it.
Download-from-Cloud-Environment_Click-Jupyter-logo_Screen_shot.png

Uploading from local storage to the Workspace bucket (small numbers, small files)

For small numbers of small files, you can upload to your Workspace bucket from the Data page, by clicking on the File icon (at the bottom of the left column) and the "+" icon (at the bottom of the page).
Upload-files-from-the-Data-page_Screen_shot.png

Moving large numbers or large data files between Workspace bucket and PD (gsutil) 

When it comes to moving files between Google Cloud storage (including your Workspace bucket) and your VM/PD, you'll often want to use gsutil running in a Cloud Environment terminal instance. gsutil is Google Cloud Platform's utility package for manipulating cloud data on the GCP infrastructure.

Cloud_Environment_architecture_Copying-data-from-bucket-to-PD_diagram.png

The Cloud Environment's default software (or application configuration) includes gsutil, preinstalled on your VM and available from the Terminal. You can use gsutil commands to move files from your workspace bucket or an external Google bucket into your PD, and vice versa. You can also use this tool to move files between different external Google buckets.
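The transfers described above might look something like the following sketch. The bucket URLs, file names, and folder names are placeholders (assumptions, not values from this article), and the small `run` wrapper is a hypothetical convenience that only prints each command by default so you can review it before anything is copied.

```shell
# Placeholders: replace these with your own bucket URL and PD folder.
BUCKET="${BUCKET:-gs://your-workspace-bucket}"
PD_DIR="${PD_DIR:-/home/jupyter-user/notebooks}"

# With DRY_RUN=1 (the default here) each command is only printed.
# Set DRY_RUN=0 to actually run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# List what is in the bucket.
run gsutil ls "$BUCKET"

# Bucket -> PD: copy a single file.
run gsutil cp "$BUCKET/data/sample.bam" "$PD_DIR/"

# PD -> bucket: copy a whole folder. -m copies in parallel, -r recurses.
run gsutil -m cp -r "$PD_DIR/results" "$BUCKET/"

# Bucket -> bucket: copy between two Google buckets.
run gsutil cp "$BUCKET/data/sample.bam" gs://another-bucket/data/
```

The `-m` option is most useful when there are many files; for a single file, plain `gsutil cp` is enough.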
