Where is data stored, accessed and analyzed when working in the cloud? How do you organize and track original and generated data? How do you share with colleagues? This article helps you understand Terra's data-in-the-cloud model so you can work more efficiently.
What does “data in a Terra workspace” actually mean?
“Data in Terra” is stored in public cloud infrastructure that is integrated so you can organize, access, and analyze it without leaving Terra. Large unstructured files are stored in dedicated Azure blob storage containers. Tabular data is stored in a private relational database and displayed as workspace data tables.
Terra takes care of bringing input data from wherever it is to the VM running your analysis. Built-in data security features let Terra access controlled data you are authorized to use.
Data in the Cloud - A new vision for bioinformatics
Traditionally, each researcher copied and stored their own data in local repositories. In a cloud-based model, data is stored in central locations, which makes access easier, reduces storage costs and copying errors, and centralizes the administration of data privacy and security.
Traditional bioinformatics (bring data copies to each person)
Cloud-based bioinformatics (bring each person to the data)
The Terra platform lets you take advantage of large datasets in the cloud without making or paying to store new copies.
How does Terra access data in the cloud?
Just because the primary data is stored outside of Terra doesn't mean you need to copy it to workspace storage to analyze it. When you analyze these large data files, you give the cloud location as input data and Terra takes care of localizing the data files to the virtual machine (VM) that runs the analysis. For controlled-access data, Terra verifies your linked authorization before localizing the data.
All analysis runs on virtual machines in the cloud, and only the generated data (or some of it) is deposited in workspace storage.
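To make "giving the cloud location as input" concrete, here is a minimal sketch of what such a location encodes. It uses only the Python standard library and assumes the usual Azure Blob Storage URL layout (`https://<account>.blob.core.windows.net/<container>/<blob path>`). The function name `parse_blob_url` and the example URL are hypothetical; this is not how Terra itself localizes files.

```python
from typing import NamedTuple
from urllib.parse import urlparse

BLOB_HOST_SUFFIX = ".blob.core.windows.net"


class BlobLocation(NamedTuple):
    account: str    # storage account name
    container: str  # blob container name
    blob_path: str  # path of the file inside the container


def parse_blob_url(url: str) -> BlobLocation:
    """Split an Azure Blob Storage URL into account, container, and blob path."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if parsed.scheme != "https" or not host.endswith(BLOB_HOST_SUFFIX):
        raise ValueError(f"not an Azure Blob Storage URL: {url}")
    account = host[: -len(BLOB_HOST_SUFFIX)]
    # The first path segment is the container; the rest is the blob path.
    container, _, blob_path = parsed.path.lstrip("/").partition("/")
    if not account or not container or not blob_path:
        raise ValueError(f"URL is missing an account, container, or blob path: {url}")
    return BlobLocation(account, container, blob_path)


# Hypothetical example location (not a real dataset).
example_url = (
    "https://exampleaccount.blob.core.windows.net/"
    "example-container/samples/sample-001.cram"
)
location = parse_blob_url(example_url)
print(location.account, location.container, location.blob_path)
```

The point of the sketch is that the input to an analysis is a pointer to the file, not a copy of it; the file is only fetched to the VM when the analysis runs.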
Costs of storing and accessing data in Terra
You pay for cloud resources you consume to analyze, store, and move data. To learn more about the cloud costs for storing and accessing data in the cloud, and how you are charged, see Overview: Costs and billing in Terra.
Where does data analyzed in Terra live?
When you work on a local machine or data cluster, you know exactly where your data lives: on physical hard drives attached to the computer or cluster where you do your analysis. Data in the cloud, on the other hand, can seem distant and non-intuitive. Where is data “in the cloud” that you analyze in Terra actually stored? How do you pay for it?
Where your data is stored depends on what kind of data it is
Data “in a Terra workspace” falls into two different data types: tabular data and unstructured data. Each type has its own cloud storage mechanism in Terra.
- Unstructured data can be stored in the workspace's dedicated cloud storage (blob container) or the Cloud Environment Persistent Disk.
- Tabular data (including primary data like clinical records or lab results, or links to large files in cloud storage) is stored in data tables.
Where your data is stored may depend on your analysis
- Generated data from a workflow analysis is stored in workspace blob storage by default.
- Data generated in JupyterLab is stored in the Cloud Environment Persistent Disk by default.
Learn more about
- Unstructured data storage (workspace blob storage or Cloud Environment Persistent Disk)
- Tabular data storage (data tables)
WARNING: You should not bring any controlled-access data to Terra on Azure preview
It is a violation of US Federal Policy to store any Unclassified Confidential Information (e.g., FISMA, FIPS-199) in this platform at this time. Do not put this data in this platform unless you are explicitly authorized to do so by the manager of the dataset or you have your own agreements in place.
Unstructured data (e.g., large genomic data files, images, TSVs)
Generally, large data files you analyze in Terra - and other unstructured data files you want to keep - will be stored in one or more of three cloud-based locations.
- Workspace cloud storage (dedicated blob container)
- Cloud Environment Persistent Disk (PD)
- External blob storage containers (e.g., Azure’s Genomics Data Lake)
1. Workspace cloud storage (dedicated Azure Blob storage container)
Each Terra workspace comes with a dedicated container in Azure Blob storage, Microsoft’s object storage solution for unstructured data in the cloud. Blob storage containers are optimized for storing massive amounts of unstructured data (data that doesn't adhere to a particular data model or definition, such as text or binary data; see this Microsoft support doc for reference).
Workspace storage caveats
- Data generated in a workflow analysis (WDLs) is stored in workspace cloud storage (Blob storage container) by default.
- Data generated in JupyterLab is stored in the Cloud Environment persistent disk, not in the workspace blob container. For more permanent storage, move data generated in an interactive analysis to your workspace storage, because generated data will be deleted when the JupyterLab VM is deleted or recreated.
- You can upload primary data stored locally to your workspace (blob) storage or persistent disk for analysis in Terra. If you need to upload data to workspace storage, see Overview: Bring your own data to Terra (Azure).
2. Cloud Environment Persistent Disk (PD)
Terra attaches a persistent disk (PD) to your JupyterLab Cloud Environment VM. On the PD you can save generated data, installed libraries, and other files you want to retain even if you delete or update your Cloud Environment VM.
You choose the PD size when you create the Azure cloud environment. Maintaining the disk incurs a small monthly cost (see Azure disk pricing). You will pay this cost even when the Cloud Environment is paused or deleted.
Persistent Disk (PD) storage caveats
- Data generated in JupyterLab is stored in the PD by default.
- Data stored in the PD is not accessible by your colleagues, even in a shared workspace. To allow colleagues to see data in your PD, or to use it as input for a workflow, you must copy the data to workspace blob storage.
To learn more, see Cloud Environment (Persistent Disk) storage.
3. Other (external) storage (e.g., open-access blob storage containers)
Ideally, the bulk of primary data you work with will be in external cloud storage (blob storage containers), which Terra can access for you as long as you have the right permissions and authorization. One example is data hosted by Azure’s Genomics Data Lake.
To access data in a private location, you will use a temporary Shared Access Signature (SAS) token. Open data does not require a SAS token. To learn more about this data access model, see Introduction to SAS Tokens.
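As a rough illustration of the SAS model, the sketch below appends a SAS token to a blob URL and reads the token's expiry time. It relies on two general Azure conventions: a SAS token is a URL query string, and its `se` field carries the signed expiry time in UTC. The helper names, the blob URL, and the token value (including the `sig` placeholder) are made up for the example and will not grant access to anything.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs


def with_sas(blob_url: str, sas_token: str) -> str:
    """Return the blob URL with the SAS token attached as its query string."""
    return f"{blob_url}?{sas_token.lstrip('?')}"


def sas_expiry(sas_token: str) -> datetime:
    """Read the expiry time from the token's `se` (signed expiry) field."""
    params = parse_qs(sas_token.lstrip("?"))
    raw = params["se"][0]  # e.g. "2030-01-01T00:00:00Z"
    return datetime.fromisoformat(raw.replace("Z", "+00:00"))


def is_expired(sas_token: str, now: datetime) -> bool:
    """True if the token's expiry time is at or before `now`."""
    return sas_expiry(sas_token) <= now


# Placeholder token: the field names are standard, the values are invented.
token = "sv=2022-11-02&se=2030-01-01T00%3A00%3A00Z&sr=c&sp=rl&sig=PLACEHOLDER"
url = with_sas(
    "https://exampleaccount.blob.core.windows.net/example-container/data.vcf",
    token,
)
print(url)
print(is_expired(token, datetime.now(timezone.utc)))
```

Because the token is temporary, checking the expiry before starting a long transfer is a cheap way to avoid a failure partway through. Open data needs no token, so its URL can be used as-is.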
Accessing blob data
To analyze files in blob storage containers, Terra needs each file's cloud location (its URL), even for data in your workspace storage. You can store the URL and other data file metadata in a data table (scroll down for more details on data tables), which functions like a spreadsheet built into your workspace.
Tabular data (e.g., clinical, demographic, or phenotypic data)
Data tables in workspaces store and organize tabular data in an integrated, spreadsheet-like format.
Data stored in a table in Terra
- Primary data in tabular format including clinical data, demographics, or phenotypic data
- Input data file locations (e.g., URLs for files in your workspace cloud storage or in external storage locations)
- Input data file metadata (e.g., dates of sample collection, or details about sample preparation)
To learn more, see Introduction to Data Tables.
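To illustrate how file locations and metadata can sit side by side in one table, here is a small sketch that builds a tab-separated table with Python's standard `csv` module. The column names (`sample_id`, `cram_url`, `collection_date`, `prep_method`) and all values are hypothetical, and the layout is generic; check the data table documentation for the exact format Terra expects on upload.

```python
import csv
import io

# Hypothetical rows: one per sample, pairing a file URL with its metadata.
rows = [
    {
        "sample_id": "sample-001",
        "cram_url": "https://exampleaccount.blob.core.windows.net/example-container/sample-001.cram",
        "collection_date": "2023-04-12",
        "prep_method": "PCR-free",
    },
    {
        "sample_id": "sample-002",
        "cram_url": "https://exampleaccount.blob.core.windows.net/example-container/sample-002.cram",
        "collection_date": "2023-04-19",
        "prep_method": "PCR-free",
    },
]


def to_tsv(table_rows: list[dict[str, str]]) -> str:
    """Serialize rows to tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(table_rows[0]),
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(table_rows)
    return buffer.getvalue()


tsv_text = to_tsv(rows)
print(tsv_text)
```

Each row keeps only a pointer to the large file, so the table stays small even when the files it references are very large.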
Where is tabular data stored?
Data tables are hosted in a private relational database that is set up when you create a workspace. This makes data tables more scalable and gives you full control over the geographic location where your data lives in Azure. When you clone a workspace, its data tables are copied to a new relational database for the clone.
Data tables are not copied to your workspace’s cloud storage
The data lives in a separate database. To learn more, see Data Tables: Additional resources.
Future vision
We are currently working towards making petabytes of genomic data available on Terra on Azure. In the near future, you will be able to browse Azure datasets available for analysis in the Terra ecosystem and hand off datasets to your workspace to analyze and combine with your own data.