Working with non-US data in Terra

Allie Hajian

To support researchers with data outside the US, Terra lets you provision cloud resources (workspace buckets and virtual machines, or VMs) in non-US regions. This article reviews how key elements of Terra's architecture may interact with regionality constraints on your data.

Source material for this article was contributed by Matt Bookman and the Verily Life Sciences solutions team as part of the design and engineering rollout of Terra support for data regionality.

Terra architecture

This overview of Terra components illustrates which elements' locations are under user control and which are managed by Terra.

Diagram illustrating which components of Terra have locations controlled by the user: these include workspace buckets, workflows, and cloud environments.

Terra Control Plane

Today, the Terra Control Plane runs on Google Cloud services in the United States.

What is stored and transmitted by the control plane components?

Workspace storage (Google buckets)

Workspace buckets are used as durable storage for:

  • Jupyter Notebooks
  • Workflow execution logs and generated data (i.e., task outputs)

You may store your own files in workspace buckets, such as:

  • Research data
  • Intermediate results

Default and user-controlled workspace storage regions

The default location for workspace buckets is us-central1.

When creating a workspace, you can choose an alternate cloud storage region appropriate for your data. See Customizing where your data are stored and analyzed for more information. 

Choosing a workspace bucket location outside of the US and Canada

You can only select from a limited number of workspace bucket regions from the Terra web interface. However, you can choose from many more regions by creating your workspace through the createWorkspace API endpoint. See How to customize your workspace bucket storage region for more information.

Note that you should still double-check that your workflow and Cloud Environment VMs are in the same region as your workspace bucket, and not in the default region.

Workflow virtual machines (VMs)

When a workflow runs, orchestration is provided by Terra (see above). Specific tasks in your workflow are executed on VMs whose location is determined by you and specified in the Workflow Description Language (WDL) code.

Default and user-controlled VM regions

Every workflow written in WDL allows you to specify a list of Compute Engine zones as a default for the VMs in that workflow. In addition, every task in a WDL workflow allows you to specify a list of Compute Engine zones for that task's VM, which will override the default for the workflow.

Default zones for workflow VMs in Terra 

The default list of zones for VMs running a workflow in Terra is based on the workspace bucket location.

Workspace bucket location | Default list of workflow VM zones
Regional | All zones in the workspace bucket region
US multiregional | All zones in us-central1
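
The table above can be read as a simple lookup. The following Python sketch is illustrative only and is not part of Terra: the function name `default_workflow_region` is hypothetical, and it assumes the bucket location is given the way Google Cloud Storage reports it (for example, `US` for the US multi-region or `EUROPE-WEST2` for a single region).

```python
def default_workflow_region(bucket_location: str) -> str:
    """Return the region whose zones workflow VMs would use by default.

    Mirrors the table above: a US multi-regional bucket maps to
    us-central1, and a regional bucket maps to its own region.
    """
    location = bucket_location.strip().lower()
    if location == "us":
        # US multi-regional bucket: default zones come from us-central1
        return "us-central1"
    # Regional bucket: default zones come from the bucket's own region
    return location
```

For example, `default_workflow_region("EUROPE-WEST2")` returns `europe-west2`, while `default_workflow_region("US")` returns `us-central1`.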

How to override the default (individual workflow WDLs)

  • Use a hard-coded list in the workflow WDL, or   
  • Allow a user-input (attribute) to be used as the list of zones.

Example override (hard-coded list of zones)

runtime {
   docker: "python:slim"
   disks: "local-disk 200 HDD"
   memory: "4G"
   cpu: 1
   zones: "us-central1-a us-central1-b us-central1-c us-central1-f"
}

Example override (workflow input option, runtime_zones)

workflow MyWorkflow {
   # Space-separated list of Compute Engine zones, supplied as a workflow input
   String runtime_zones

   ...

   runtime {
      docker: "python:slim"
      disks: "local-disk 200 HDD"
      memory: "4G"
      cpu: 1
      zones: runtime_zones
   }
}
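
If you prefer to supply the `runtime_zones` input from a file, one option is to generate an inputs JSON. The Python sketch below assumes the common `WorkflowName.input_name` key convention for WDL inputs files. The helper name `zones_input` and the example zones are hypothetical; substitute the zones of your own bucket's region.

```python
import json


def zones_input(workflow_name: str, zones: list[str]) -> str:
    """Build an inputs JSON string that sets <workflow>.runtime_zones.

    The zones are joined with spaces, matching the space-separated
    string used in the hard-coded example above.
    """
    key = f"{workflow_name}.runtime_zones"
    return json.dumps({key: " ".join(zones)}, indent=2)


if __name__ == "__main__":
    # Example: keep workflow VMs in europe-west1 (illustrative zones)
    print(zones_input("MyWorkflow", ["europe-west1-b", "europe-west1-c"]))
```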

Example overrides (set on the workflow submission page)
Screenshot showing an example of how to specify a workflow's VM region in the workflow configuration page.

Be careful to check the workflow WDL and inputs for the zones used! Moving data out of its storage region may violate the policies governing that data and can incur network data transfer charges.
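
One way to make this check routine is to compare a `zones` string against the workspace bucket region before you submit. The Python sketch below is a minimal illustration and not a Terra feature. The function names are hypothetical, and it assumes that a Compute Engine zone name is its region name followed by a hyphen and a short suffix (for example, `us-central1-a` belongs to `us-central1`).

```python
def zone_region(zone: str) -> str:
    """Return the region part of a zone name, e.g. 'us-central1-a' -> 'us-central1'."""
    return zone.rsplit("-", 1)[0]


def zones_outside_region(zones: str, bucket_region: str) -> list[str]:
    """List the zones in a space-separated WDL zones string that are
    not in the given bucket region."""
    region = bucket_region.strip().lower()
    return [z for z in zones.split() if zone_region(z.lower()) != region]


if __name__ == "__main__":
    wdl_zones = "us-central1-a us-central1-b europe-west1-b"
    # Any zone printed here would run VMs outside the bucket's region
    print(zones_outside_region(wdl_zones, "us-central1"))
```

An empty list means every listed zone is in the bucket's region.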

Cloud Environment VMs

Cloud Environments hosted services

The location of the Cloud Environment VM is determined by you, but will default based on workspace bucket location.

Workspace bucket location | Default Cloud Environment VM locations
Regional | All zones in the workspace bucket region
US multiregional | All zones in us-central1-a

Cloud Environment storage (persistent disk) location

All Cloud Environments provide Detachable Persistent Disks (PDs) for faster storage of inputs and outputs for your notebook, R, and Galaxy-based analyses. To access these files outside of the Cloud Environment, you must copy them to your workspace bucket. By default, the PD is created in the same location as the VM.

Caveats when choosing storage and VM regions

With the above architecture, you can store your data in a regional workspace bucket, run workflows on VMs, and run analyses on VMs in your region of choice. However, there are pitfalls around both data policy and cloud costs to consider.

Control plane elements policy considerations

Review the list of control plane elements above to be sure you do not store data that must remain "in region" in these locations.

In particular, take care when using workspace tables to drive Terra workflows (i.e., using the "run workflow(s) with inputs defined by data table" option). These tables are stored in data centers in the US. Your data policies may allow you to load de-identified sample identifiers and paths to Cloud Storage files into a workspace table, whereas loading other participant-level information into workspace tables may not comply with your regional data policies. Terra is not aware of such policies and will not prevent you from uploading data into your workspace tables.

Cloud Environment VM policy considerations

Recommendation: Create a separate Terra Billing project for each region in which you have data.

By default, Terra will create your Cloud Environment in the region of the workspace you're in when you create the Cloud Environment. However:

  • Each individual user has one Cloud Environment per Terra Billing project, and
  • Each Terra Billing project can have multiple workspaces

Because your single Cloud Environment accesses data from every workspace under its Billing project, having workspaces with storage in different regions under one Billing project means you may need to move data out of at least one of those regions.
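
To spot Billing projects affected by this, you can group your workspaces by Billing project and look for projects whose buckets span more than one region. The Python sketch below is illustrative only. The function name, the tuple layout, and the workspace and project names are all hypothetical, and you would need to gather the workspace list yourself.

```python
from collections import defaultdict


def projects_with_mixed_regions(workspaces):
    """Return {billing_project: [regions]} for projects whose workspace
    buckets are in more than one region.

    `workspaces` is an iterable of (workspace, billing_project, bucket_region).
    """
    regions = defaultdict(set)
    for _workspace, project, region in workspaces:
        regions[project].add(region.lower())
    return {p: sorted(r) for p, r in regions.items() if len(r) > 1}


if __name__ == "__main__":
    example = [
        ("ws-uk", "billing-a", "europe-west2"),
        ("ws-us", "billing-a", "us-central1"),
        ("ws-eu", "billing-b", "europe-west1"),
    ]
    # billing-a mixes two regions, so it is a candidate for splitting
    print(projects_with_mixed_regions(example))
```

Any project that appears in the result is a candidate for splitting into one Billing project per region, as recommended above.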

Cloud costs - Network data transfer charges

If you can, keep all of your buckets and VMs in the same region to avoid network data transfer charges.

Network data transfer charges occur when moving data out of a Google Cloud region, most commonly when moving data from:

  • Cloud Storage bucket to Cloud Storage bucket
  • Cloud Storage bucket to Compute Engine VM

Examples of where data transfer charges may occur

When running analyses and workflows, be aware of data and VM locations, in particular:

  • Public resource files are often in US multi-regional buckets
    Accessing files stored in the US from a non-US VM will incur data transfer charges

  • Workflow VMs may hard-code or have default lists of zones that are in the US
    Accessing non-US files from VMs in the US will incur data transfer charges

  • Workspaces created before September 27, 2021: Terra Billing projects can have a single Cloud Environment serving multiple workspaces
    If workspace buckets have different locations, doing analyses from a single Cloud Environment VM will incur data transfer charges fetching data from outside the VM's region

