Understanding data in the Cloud

Allie Hajian

Where are large amounts of cloud data stored, accessed, analyzed, and shared with colleagues? How do you organize and track original and generated data? This article helps you understand Terra's data-in-the-cloud model so you can work more efficiently.

Data in the Cloud - A new vision for bioinformatics

The Terra platform enables researchers to take advantage of large datasets in the cloud. Traditionally, each researcher copied and stored their own data in local repositories. In a cloud-based model, data are stored in central locations, which makes access easier, reduces storage costs and copying errors, and centralizes the administration of data privacy and security.

Traditional Bioinformatics
(each researcher has own copy of data)

Data-Traditional-model_diagram.png

  • Data copying, not data sharing
  • High data storage costs
  • Nonreproducible results
  • Individual security implementations

Cloud-based bioinformatics
(bring researchers to the data)

Data_Cloud-based-bioinformatics_diagram.png

  • True, immediate data sharing
  • Minimized storage costs and copying errors
  • Streamlined data privacy and security administration 

Where do large data files used in Terra live?

Data that are not stored and analyzed locally can seem distant and nonintuitive. When we talk about "data in a Terra workspace", we're often talking about data that are linked in some way to your workspace, not data that are actually "stored in" your workspace.

In many cases, you won't copy the data you analyze to your workspace bucket at all. The analysis runs in the cloud, and only (some of) the generated data are deposited in workspace storage or on your Cloud Environment persistent disk.

Generally, data you and your colleagues analyze in a Terra workspace will be in one (or more) of three locations.

Data_in_cloud_3_locations.jpeg

1. Your Interactive analysis app disk (PD)

The workspace Cloud Environment is a virtual computer (or computers) requested and set up by Terra. When you spin up a Cloud Environment, you set the size and type of its detachable persistent disk (PD). When you do interactive analyses (e.g., in Jupyter Notebooks or RStudio), the generated output is stored on your PD by default. Move any data you want to share with colleagues or use as input for a workflow to workspace storage (i.e., the workspace's Google bucket). See How (and when) to save data generated in a notebook to Workspace storage to learn more.
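As a rough illustration of moving a file from the PD to workspace storage, the sketch below only builds (and prints) a gsutil cp command; it does not run it. It assumes your Cloud Environment exposes the workspace bucket in a WORKSPACE_BUCKET environment variable and that gsutil is installed. The bucket name, the "notebook-outputs" folder, and the build_copy_command helper are hypothetical placeholders, so check the linked article for the recommended procedure.

```python
import os
import shlex


def build_copy_command(local_path, workspace_bucket, dest_prefix="notebook-outputs"):
    """Build a gsutil command (as an argument list) that would copy a file
    from the persistent disk to workspace storage. Nothing is executed here."""
    bucket = workspace_bucket.rstrip("/")
    if not bucket.startswith("gs://"):
        raise ValueError("workspace bucket must be a gs:// URL")
    filename = os.path.basename(local_path)
    # dest_prefix is a hypothetical folder name inside the bucket.
    destination = f"{bucket}/{dest_prefix.strip('/')}/{filename}"
    return ["gsutil", "cp", local_path, destination]


# Assumption: WORKSPACE_BUCKET is set in the Cloud Environment.
# The fallback value is a made-up placeholder, not a real bucket.
bucket = os.environ.get("WORKSPACE_BUCKET", "gs://example-workspace-bucket")
command = build_copy_command("results/summary.tsv", bucket)

# Print the command for review instead of running it.
print(shlex.join(command))
```

To actually perform the copy, you could pass the returned list to subprocess.run(command, check=True) from a notebook, or paste the printed command into a terminal.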

2. Your workspace cloud storage (Google bucket)

Data generated by a workflow analysis (WDL) are stored by default in workspace cloud storage (i.e., the workspace's Google bucket). You can also move local data, or data generated in an interactive analysis, to your workspace storage. If you need to upload data to your workspace bucket, see Moving data to/from a Google bucket (workspace or external).

3. Other (external) storage

Ideally, the bulk of data you work with will be in some other data storage in the cloud, which Terra can access for you as long as you have the right permissions and authorization. Examples include data in public- or controlled-access Google Cloud Storage buckets, data repository platforms such as Gen3 Data Commons, or data hosted elsewhere and accessed through the Terra Data Library. For more information, see Linking authorization/accessing controlled data on external servers.
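To make the three locations concrete, here is a minimal sketch that labels a file path by where it lives. It is an illustration only, not part of Terra: the classify_location helper, the bucket names, and the file paths are all hypothetical, and it assumes that plain file paths sit on the persistent disk while URL-style paths outside the workspace bucket are external storage.

```python
def classify_location(path, workspace_bucket):
    """Label a path with one of the three locations described above."""
    bucket = workspace_bucket.rstrip("/")
    if path == bucket or path.startswith(bucket + "/"):
        return "workspace storage"
    if "://" in path:
        # Assumption: any other URL-style path points to external storage.
        return "external storage"
    # Assumption: plain file paths live on the persistent disk.
    return "persistent disk"


# Hypothetical bucket name and paths, for illustration only.
workspace_bucket = "gs://example-workspace-bucket"
paths = [
    "/home/jupyter/results/summary.tsv",
    "gs://example-workspace-bucket/outputs/sample1.vcf",
    "gs://example-public-dataset/reference/genome.fasta",
]

for path in paths:
    print(f"{path} -> {classify_location(path, workspace_bucket)}")
```

A check like this does not say whether you have permission to read an external file; access still depends on the permissions and authorization described above.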

Next step: Try the T101 Data Tables Quickstart

The T101 Data Tables Quickstart is a self-guided tutorial that includes everything you need to get hands-on experience using workspace data tables to organize, access, and analyze data (including sets of data) in the cloud. Copy the T101 Data Tables Quickstart workspace to your own billing account and work through the three exercises following the step-by-step guide.

T101 Data Tables Quickstart tutorial workspace | Step-by-step guide

 
