Working with non-US data in Terra

Allie Hajian

To support researchers with data outside of the US, Terra allows you to provision Cloud resources (Workspace buckets and VMs) in non-US regions. Given the complex nature of regionality rules and restrictions, it's important to understand policies around your data before using Terra. This article describes key elements of Terra architecture to assist you.

Source material content for this article was contributed by Matt Bookman and the Verily Life Sciences solutions team as part of the design and engineering rollout of Terra support for data regionality. 

Terra architecture

The following overview of Terra components shows which elements' locations are under user control and which are managed by Terra.

[Diagram: Terra infrastructure regions]

Terra Control Plane

The Terra Control Plane today runs on Google Cloud Platform (GCP) services in the United States. This includes the components that store and transmit workspace metadata, such as workspace data tables.

Workspace buckets

Workspace buckets are used as durable storage for:

  • Jupyter notebooks
  • Workflow execution logs and generated data (i.e. task outputs)

They may also be the place where you store your own files, such as:

  • Research data
  • Intermediate results

Default and user-controlled workspace bucket regions
The default location for workspace buckets is US multi-region. When creating a workspace, you can choose a cloud storage region appropriate for your data. See Customizing where your data are stored and analyzed for more information. 

Workflow VMs

When a workflow runs, orchestration is provided by the Terra Control Plane (see above). Specific tasks in your workflow are executed on VMs whose location is determined by you.

Default and user-controlled VM regions
Every WDL workflow allows you to specify a list of Compute Engine zones as the default for the VMs in that workflow. In addition, every task in a WDL workflow can specify its own list of Compute Engine zones for that task's VM, which overrides the workflow default.

Default zones for workflow VMs in Terra 

The default list of zones for VMs running a workflow in Terra is based on the workspace bucket location:

  • Regional bucket: all zones in the workspace bucket region
  • US multi-regional bucket: all zones in us-central1
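The defaulting rule above can be sketched in Python. This is a hypothetical helper for illustration, not Terra's actual implementation; the zone suffixes are assumptions (us-central1's a/b/c/f list matches the WDL example later in this article, but suffixes vary by region).

```python
# Known zone suffixes per region, assumed for illustration.
# us-central1 has zones a, b, c, and f; many regions have only a-c.
ZONE_SUFFIXES = {"us-central1": ["a", "b", "c", "f"]}

def default_workflow_zones(bucket_location: str) -> list[str]:
    """Mirror Terra's defaulting rule: a regional bucket defaults to all
    zones in its region; a US multi-regional bucket defaults to us-central1."""
    if bucket_location.upper() == "US":      # US multi-region bucket
        region = "us-central1"
    else:                                    # regional bucket, e.g. "europe-west1"
        region = bucket_location.lower()
    suffixes = ZONE_SUFFIXES.get(region, ["a", "b", "c"])
    return [f"{region}-{s}" for s in suffixes]

print(default_workflow_zones("US"))
print(default_workflow_zones("europe-west1"))
```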

However, individual workflow WDLs can override the default, either by:

  • Hard-coding a list of zones in the workflow WDL, or
  • Accepting a user-input value as the list of zones.

For example, here is a hard-coded list of zones:

runtime {
   docker: "python:slim"
   disks: "local-disk 200 HDD"
   memory: "4G"
   cpu: 1
   zones: "us-central1-a us-central1-b us-central1-c us-central1-f"
}

and here is a workflow input, called runtime_zones:

workflow MyWorkflow {
   String runtime_zones

   ...

   runtime {
      docker: "python:slim"
      disks: "local-disk 200 HDD"
      memory: "4G"
      cpu: 1
      zones: runtime_zones
   }
}

This input can then be set on the workflow submission page:

[Screenshot: specifying runtime zones on the workflow submission page]
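Alternatively, if you run the workflow with a JSON inputs file, the same value can be supplied there. This sketch follows standard WDL inputs-file conventions; the workflow and input names match the MyWorkflow example above, and the zone list is an illustrative assumption:

```json
{
  "MyWorkflow.runtime_zones": "europe-west1-b europe-west1-c europe-west1-d"
}
```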

Warning: Check the workflow WDL and inputs carefully for the zones used! Moving data out of its storage region may violate the policies governing that data and can incur network egress charges.

Cloud Environment VMs

Cloud Environments host interactive analysis services such as Jupyter notebooks, RStudio, and Galaxy.

The location of the Cloud Environment VM is determined by you, but will default based on workspace bucket location.

  • Regional bucket: any zone in the workspace bucket region
  • US multi-regional bucket: us-central1-a

Note that all Cloud Environments provide a detachable Persistent Disk for faster storage of inputs and outputs for your notebook, R, and Galaxy-based analyses. (You must copy files to your workspace bucket to access them outside the Cloud Environment.) By default, the Persistent Disk is created in the same location as the VM.

Caveats when choosing storage and VM regions

With the above architecture, you can store your data in a regional workspace bucket, run workflows on VMs, and run analyses on VMs in your region of choice. However, there are pitfalls around both data policy and cloud costs that you'll want to be aware of.

Control plane elements policy considerations

Review the list of control plane elements above to be sure you do not store data that must remain "in region" in these locations.

In particular, take care when using workspace tables to drive Terra workflows (i.e., using the "run workflow(s) with inputs defined by data table" option). These tables are stored in data centers in the US. Your data policies may allow you to load de-identified sample identifiers and paths to Cloud Storage files into a workspace table, whereas loading other participant-level information into workspace tables may not comply with your regional data policies. Terra today is not aware of such policies and will not prevent you from uploading data into your workspace tables.

Cloud Environment VM policy considerations

It is recommended that you create a separate Terra Billing project for each region in which you have data.

By default, Terra will create your Cloud Environment in the region of the workspace you're in when you create the Cloud Environment. However:

  • Each individual user has one Cloud Environment per Terra Billing project, and
  • Each Terra Billing project can have multiple workspaces

Because a single Cloud Environment accesses data from all workspaces under its Billing project, you may need to move data out of at least one region if workspaces with storage in different regions share a single Billing project.
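The constraint above can be expressed as a small sketch. This is a hypothetical helper, not a Terra API; it simply reports the distinct bucket regions one Cloud Environment would have to reach under a single Billing project:

```python
def regions_served_by_one_environment(workspace_bucket_regions: list[str]) -> list[str]:
    """Given the bucket regions of all workspaces under one Terra Billing
    project, return the distinct regions a single Cloud Environment must
    reach. More than one region implies cross-region data access."""
    return sorted({r.lower() for r in workspace_bucket_regions})

# Two regional buckets plus a US multi-regional bucket under one project:
regions = regions_served_by_one_environment(["europe-west1", "US", "europe-west1"])
print(regions)
print(len(regions) > 1)  # cross-region data access would be needed
```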

Cloud costs - Network egress charges

If you can, keep all of your buckets and VMs in the same region to avoid network egress charges.

Network egress charges occur when moving data out of a Google Cloud region, most commonly when moving data from:

  • Cloud Storage bucket to Cloud Storage bucket
  • Cloud Storage bucket to Compute Engine VM

Examples of where egress charges may occur

When running analyses and workflows, be aware of data and VM locations. In particular:

  • Public resource files are often in US multi-regional buckets
    Accessing files stored in the US from a non-US VM will incur egress charges

  • Workflow VMs may hard-code or have default lists of zones that are in the US
    Accessing non-US files from VMs in the US will incur egress charges

  • Workspaces created before September 27, 2021: Terra Billing projects can have a single Cloud Environment serving multiple workspaces
    If workspace buckets have different locations, running analyses from a single Cloud Environment VM will incur egress charges when fetching data from outside the VM's region

Final notes

Google Cloud has additional storage and regionality capabilities that we hope to one day add to Terra.

  • Nearline
  • Coldline
  • Archive
  • Dual-region
  • Multi-region EU or ASIA

Additional resources

If you are comfortable working in the GCP console, see Accessing advanced GCP features in Terra to take advantage of these capabilities in non-Terra GCP projects.

To learn more about regional selections, see Best practices for Compute Engine regions selection.

To learn more about Cloud pricing, see Understanding and controlling Cloud costs.

To learn more about Terra storage and compute location controls, see 
Customizing where your data are stored and analyzed.

 
