Big Data - A blessing and a curse
Today there’s a treasure trove of datasets that researchers could only dream about a decade ago: vast amounts of genomic and phenotypic data from diverse participants that could help unlock the mysteries of diseases from schizophrenia to heart disease to cancer.
But dealing with vast amounts of data - whole genomes and decades of medical records for tens or hundreds of thousands of participants - can be a curse, too. The data can be too large and too expensive to store on local machines or share with collaborators, and too hard to analyze with tools that are often complex, custom-made, and challenging to configure and run at scale.
It’s like building the house of the future. You know all the materials and tools exist, and you can visualize the house you've been dreaming of for decades, the house that could be... But the lumber is at one store, the concrete and steel are at a second store, and the bulldozer is at a worksite in a different town. And who knows which "smart" appliances are the right ones to use, or how to network them all together? Without a way to combine and coordinate the pieces in a reasonable amount of time for a reasonable cost, it’s hard to build that dream house.
The cloud - Softening the curse of Big Data
Working in the cloud softens some of the curse of Big Data. Cloud storage and compute scale to any size, and you only pay for what you use. Sharing data in the cloud is easier, faster, and less error-prone than sending petabytes of data on a hard drive in the mail. And it's easier for collaborators in a dozen time zones to share analysis tools that allow them all to work together in real time.
But moving your analysis to the cloud can seem like a daunting task. It takes time and expertise to set up and administer cloud storage; to move all that data while keeping it private; to learn to configure a virtual analysis environment; and to secure collaborations in a public cloud. That is a lot to ask, especially of scientists already under pressure to publish.
Enter Terra, a modular system designed to do the heavy lifting and help move your analysis to the cloud. The platform handles the logistics that can make working in the cloud seem beyond reach. Working in Terra means you can focus on the fun stuff: discovery.
Simplify your analysis in the cloud by keeping everything in a workspace
At the heart of working on Terra is a shareable computational workspace with everything you need to complete your project.
Create a workspace in Terra to help:
- Link to data in the cloud for analysis, instead of downloading and storing it yourself
- Keep data organized, no matter where in the cloud it is, whether you're analyzing a hundred files or a hundred thousand
- Boost your statistics by combining data from different sources
- Visualize and analyze data of any size in real time
- Find and run bulk analysis tools even without (much) programming skill
- Make your results reproducible with publicly vetted analysis tools and options to standardize your virtual computational environment
- Share analysis results and collaborate while keeping control with built-in security
Workspace functions at a glance
Expand each section below to learn how a workspace helps keep your project on track by keeping all the pieces together.
Documentation in the Dashboard
The landing page is your project overview - what questions you’re trying to answer, what kind of data and analysis tools you'll use, etc. Good documentation makes your analysis easy to share (including with your future self).
Workspace information includes workspace owners (these can be changed as needed) and Authorization Domain information (used to protect access to controlled data).
Store data in the workspace bucket
Each workspace has an associated Google bucket for storing:
- Your own data (uploaded from a local system)
- Workflow outputs (stored by default in the workspace bucket)
- Notebook files
Note that data generated by an interactive analysis in a notebook is stored in the virtual application machine and not in the workspace bucket. To keep this data safe, you will need to explicitly copy the data to the workspace bucket. Learn more about that process here.
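One way to make that explicit copy from inside a Python notebook is sketched below. It assumes the `gsutil` command-line tool is installed on the notebook machine and that the bucket URL is exposed in an environment variable named `WORKSPACE_BUCKET`. Both are assumptions about your environment, not details stated in this article, and the helper names and the `notebook-outputs` folder are hypothetical.

```python
import os
import posixpath
import subprocess


def build_copy_command(local_path, bucket_url, dest_folder="notebook-outputs"):
    """Build the gsutil command that copies one local file into a bucket folder."""
    file_name = os.path.basename(local_path)
    destination = posixpath.join(bucket_url.rstrip("/"), dest_folder, file_name)
    return ["gsutil", "cp", local_path, destination]


def copy_to_workspace_bucket(local_path, dest_folder="notebook-outputs"):
    """Copy a file from the notebook machine to the workspace bucket."""
    # Assumption: the bucket URL (gs://...) is available in this variable.
    bucket_url = os.environ["WORKSPACE_BUCKET"]
    command = build_copy_command(local_path, bucket_url, dest_folder)
    # check=True raises CalledProcessError if gsutil reports a failure.
    subprocess.run(command, check=True)
    return command[-1]  # the destination URL of the copied file
```

Calling `copy_to_workspace_bucket("results.csv")` at the end of a notebook session would then place the file under the hypothetical `notebook-outputs` folder in the bucket. Building the command in a separate function makes it easy to inspect before anything is actually copied.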
To access your workspace bucket, click on the link at the bottom right in the dashboard:
Manage and organize data in the Data page
Keep track of project data in workspace tables. They're like spreadsheets built right into the workspace.
- Combine data from different studies or across datasets into one table to create a more robust dataset to analyze
- Connect data across tables with universally unique identifiers (UUIDs) or subject IDs (left column of bottom screenshot)
Genomic data - The sample table includes links to wherever large data files are in the cloud. UUIDs identify the sample data files. In this example, the collaborator ID ties a participant's phenotypic data (in a separate table) to the genomic data.
Phenotypic data - The subject table can include complete medical, population or lab data. In this example, the subject ID connects a participant's phenotypic data with genomic data in a separate table.
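The idea of connecting tables through a shared ID can be sketched in plain Python. The table contents, the column names (`sample_id`, `subject_id`, `cram_path`, `age`), and the file paths are invented for illustration and are not taken from a real workspace.

```python
def join_on_id(sample_rows, subject_rows, key="subject_id"):
    """Attach each sample's phenotypic record by matching on a shared ID."""
    # Index the subject table by ID so each lookup is a single dict access.
    subjects_by_id = {row[key]: row for row in subject_rows}
    joined = []
    for sample in sample_rows:
        subject = subjects_by_id.get(sample[key])
        if subject is None:
            continue  # skip samples with no matching phenotypic record
        joined.append({**subject, **sample})
    return joined


# Hypothetical sample table: links to genomic files stored elsewhere in the cloud.
samples = [
    {"sample_id": "s-001", "subject_id": "subj-A", "cram_path": "gs://example-bucket/s-001.cram"},
    {"sample_id": "s-002", "subject_id": "subj-B", "cram_path": "gs://example-bucket/s-002.cram"},
]
# Hypothetical subject table: phenotypic data for the same participants.
subjects = [
    {"subject_id": "subj-A", "age": 54},
    {"subject_id": "subj-B", "age": 61},
]

combined = join_on_id(samples, subjects)
```

Each row of `combined` now carries both the link to the genomic file and the participant's phenotypic fields, which is the kind of merged view the shared ID makes possible.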
Workspace Data table - This table contains workspace-level files required to analyze any sample. Examples include Docker images or reference files:
Analyze and visualize data in real time in a Notebook
Launch an in-app Jupyter Notebook to interact with and visualize the data. Code can be Python or R. Notebooks include documentation to help organize and communicate the analysis steps.
Customize your virtual application
Notebooks run on a virtual application, and you can customize the environment and compute power of the virtual application for your notebook. Terra includes several built-in environments with popular packages such as Bioconductor and Hail. Alternatively, to control exactly the packages and libraries for your analysis, choose a custom Docker environment. Specifying the compute power in the virtual machine (or cluster) lets you interact with data of any size. You can document the options you use to allow others to reproduce the analysis.
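As one possible way to document the environment from inside a Python notebook, the sketch below records the interpreter version, operating system, CPU count, and the versions of a few packages. The function name and the package list are placeholders, and it does not capture every option you may have chosen when configuring the virtual application.

```python
import json
import os
import platform
from importlib import metadata


def describe_environment(package_names):
    """Collect interpreter, OS, CPU, and package versions for a reproducibility note."""
    packages = {}
    for name in package_names:
        try:
            packages[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            packages[name] = None  # package is not installed in this environment
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "cpu_count": os.cpu_count(),
        "packages": packages,
    }


# Placeholder package names; list the ones your analysis actually imports.
environment_record = describe_environment(["numpy", "pandas"])
print(json.dumps(environment_record, indent=2))
```

Saving the printed record alongside the notebook gives collaborators a starting point for recreating a comparable environment.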
Streamline bulk pipeline analysis with Workflows
You can collect, configure (set up) and run workflows for bulk analyses in the Workflows tab. These are the sorts of repetitive analyses that can be automated, such as aligning sequencer reads or calling variants. Workflows can be set up to take input directly from a workspace table and write output metadata back to the table. Configuring this way helps keep data organized as you run your analysis.
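The table-in, table-out pattern can be pictured with a small, purely conceptual Python sketch. It does not use Terra's workflow engine or any Terra API; `run_over_table`, `fake_align`, and the column names are hypothetical stand-ins.

```python
def run_over_table(rows, input_column, output_column, task):
    """Run `task` once per row and record each result in a new column."""
    updated = []
    for row in rows:
        result = task(row[input_column])
        # Copy the row so the original table is left unchanged.
        updated.append({**row, output_column: result})
    return updated


def fake_align(fastq_path):
    """Stand-in for a real alignment step: it only derives an output path."""
    return fastq_path.replace(".fastq", ".bam")


# Hypothetical sample table with one input file per row.
sample_table = [
    {"sample_id": "s-001", "reads": "gs://example-bucket/s-001.fastq"},
    {"sample_id": "s-002", "reads": "gs://example-bucket/s-002.fastq"},
]

sample_table = run_over_table(sample_table, "reads", "aligned_bam", fake_align)
```

After the run, every row holds both its input and the location of its output, so the table itself shows which samples have been processed.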
Not a coding expert?
Browse and import published workflows in Dockstore or the Broad Methods Repository by selecting the "Find a Workflow" option:
Monitor and Troubleshoot in the Job History page
Check on the status of workflow submissions here. The Job History maintains a record of every workflow submitted in the workspace. You can troubleshoot by selecting the workflow name in the "Submission" column:
To troubleshoot errors or submission failures, go on to select "View". You can access the logs for task-level errors by clicking on the icons in the "Log Files" column:
Collaborate in a shared workspace
Click on the three vertical dots at the top right to share the workspace with other people working in Terra. You'll control how much access to give collaborators, and you can change it at any time.
Making it all work together
Even with all the data in the world, you can’t make discoveries if you can’t store it, organize it, analyze it, and share your results. Like a construction site with all the building materials and tools you need close at hand and well organized, the workspace brings together the data, tools, and cloud resources you need so you can focus on science.