Managing and organizing large amounts of data stored in multiple locations in the cloud can be challenging. Where is data stored? How do you access the data for analysis? How do you analyze it, and share results with colleagues? How do you organize and track original and generated data and where they are stored? This article helps answer some of these questions if you're working in Terra. Understanding Terra's data-in-the-cloud model can help you work more efficiently.
Data in the Cloud - A new vision for bioinformatics
Enabling researchers to take advantage of large datasets in the cloud is a fundamental driver of the Terra platform. Traditionally, each researcher copied and stored their own data in local repositories. In a cloud-based model, data are stored in central locations, which makes access easier, reduces storage costs and copying errors, and streamlines the centralized administration of data privacy and security.
Where's the data in Terra?
Data that's not stored and analyzed locally can seem distant and unintuitive. When we talk about data in a Terra workspace, we're really talking about data that is linked in some way to your workspace, not data that's actually "in" your workspace. In many cases, when you analyze the data, you won't copy it to your workspace bucket at all: the analysis is done in the cloud, and only (some of) the generated data may be deposited in the workspace bucket.
Generally, data you and your colleagues analyze in a Terra workspace will be in one (or more) of three locations.
1. Your interactive analysis app disk (PD)
The workspace Cloud Environment is a virtual computer, or group of computers, requested and set up by Terra (using Broad's Cromwell engine and GCP's API). When you spin up a Cloud Environment, you'll set the size of your virtual disk and of a detachable persistent disk (PD). When you do interactive analyses (in Jupyter notebooks or RStudio, for example), the generated output is stored on your PD by default. Any data you want to share with colleagues or use as input for a workflow should be moved to more permanent storage (i.e., Google Cloud Storage, or GCS). See How (and when) to save data generated in a notebook to the Workspace bucket to learn more.
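As a rough illustration of moving notebook output from the PD to workspace storage, the sketch below builds a `gsutil cp` command for a single file. It assumes the notebook environment exposes the workspace bucket in a `WORKSPACE_BUCKET` environment variable and that `gsutil` is installed; the helper name `build_copy_command`, the `notebook-outputs` folder, and the example file path are hypothetical choices made for this example.

```python
import os
import posixpath


def build_copy_command(local_path, bucket, prefix="notebook-outputs"):
    """Return the gsutil argument list that copies one local file to a bucket.

    `prefix` is a hypothetical folder name inside the bucket.
    """
    bucket = bucket.rstrip("/")
    if not bucket.startswith("gs://"):
        raise ValueError(f"expected a gs:// bucket URI, got {bucket!r}")
    # Keep only the file name so the object lands directly under the prefix.
    destination = posixpath.join(bucket, prefix, os.path.basename(local_path))
    return ["gsutil", "cp", local_path, destination]


if __name__ == "__main__":
    # Assumption: WORKSPACE_BUCKET is set by the notebook environment.
    # The fallback value is a made-up placeholder, not a real bucket.
    bucket = os.environ.get("WORKSPACE_BUCKET", "gs://example-workspace-bucket")
    command = build_copy_command("results/summary.tsv", bucket)
    print(" ".join(command))
    # To perform the copy, pass `command` to subprocess.run(command, check=True).
```

The function only assembles the command, so you can inspect the destination path before anything is copied.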
2. Your workspace cloud storage (Google bucket)
Data generated by a workflow analysis (WDLs) are stored by default in workspace cloud storage (i.e., the workspace Google bucket). You can also move local data, or data generated in an interactive analysis, to your workspace storage. If you need to upload data to your workspace bucket, see Moving data to/from a Google bucket (workspace or external).
3. Other (external) storage
Ideally, the bulk of the data you work with will be in other cloud storage, which Terra can access on your behalf as long as you have the right permissions and authorization. Examples include data in public- or controlled-access Google Cloud Storage buckets, data repository platforms such as Gen3 Data Commons, or data hosted elsewhere and accessed through the Terra Data Library. For more information, see Linking authorization/accessing controlled data on external servers.
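To make the three locations above concrete, here is a minimal sketch that labels a file reference by where it lives. The rules are simplifying assumptions for illustration, not Terra's own logic: anything under the workspace bucket URI counts as workspace storage, any other reference with a URI scheme (such as `gs://` or `drs://`) counts as external, and a plain file path is treated as living on the persistent disk. The function name and example paths are hypothetical.

```python
def classify_location(path, workspace_bucket):
    """Label a path as 'workspace bucket', 'external', or 'persistent disk'."""
    workspace_bucket = workspace_bucket.rstrip("/")
    # Require an exact match or a trailing slash so that a bucket whose name
    # merely starts with the same characters is not mistaken for this one.
    if path == workspace_bucket or path.startswith(workspace_bucket + "/"):
        return "workspace bucket"
    # Assumption: any other URI with a scheme points to external storage.
    if "://" in path:
        return "external"
    # Assumption: a plain file path refers to a file on the app's disk.
    return "persistent disk"


if __name__ == "__main__":
    bucket = "gs://example-workspace-bucket"  # made-up placeholder
    for item in (
        "/home/jupyter/results.tsv",
        "gs://example-workspace-bucket/outputs/sample.vcf",
        "gs://some-public-bucket/reference.fasta",
    ):
        print(item, "->", classify_location(item, bucket))
```

A check like this can help when you track which inputs still need to be copied to shareable storage before a colleague or a workflow can use them.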