Have a well-crafted workspace you'd like to share with the Terra community? Here's what you need to know to feature your workspace!
Featuring a workspace will make it publicly readable to all Terra users and listed in Terra's library of showcase workspaces. This is a great opportunity for you to broadcast your work and for users to discover workspaces with ready-to-run WDL workflows and interactive Jupyter notebooks that can be repurposed for their own research.
The featured workspaces library includes a curated selection of some of the community's finest reproducible workspaces. Designed with collaboration in mind, these well-documented workspaces include sample data and dashboards that describe the contents of the work in enough detail to allow others to try it on their own. The idea is to enable users to easily reproduce your work using public data and scripts preconfigured with the correct attributes so each workflow and/or notebook is ready to run.
Three types of Showcase Workspaces
Analysis-focused workspaces (e.g., Tetralogy of Fallot)
Generally associated with pre- or post-publication studies, these workspaces highlight a biological result and allow you to reproduce the publication's main findings on Terra. They include a thorough description of the study, general motivation for the experiment, caveats and concerns, an ordered list of the case study steps, and all the analysis tools used to generate the study's findings.
Data-focused workspaces (e.g., Target, TCGA, etc.)
These workspaces focus on introducing users to specific public or restricted-access datasets available in Terra’s Data Library. They bundle instructions for accessing and working with various cohorts and data types in the dashboard and include example workflows or notebooks that reproduce a typical analysis of the dataset. If data access is restricted, the workspace must include easy-to-follow instructions on how to gain access to the data and import the data to the workspace to run the workflows.
Workflow-focused workspaces (e.g., GATK workspaces)
These workspaces contain WDL workflows and/or notebooks, along with sample test data that are sufficiently small to be run in a reasonable time, at a small cost. These workspaces' dashboards include at least light documentation to describe the purpose, requirements, input, and output of each workflow. Both workflows and notebooks should be preconfigured and ready to execute, with sufficient instructions that a user who is new to the platform but familiar with the science can run the scripts. If there are multiple workflows that will be run back to back, they need to be named with numerical prefixes and configured to run seamlessly (for automated testing purposes). Workflows should be regularly updated to follow tool versions/evolutions.
Featured workspace requirements
These requirements are intended to ensure that users have the best experience cloning the workspace and using its batch analysis functionality (workflows) or its interactive analysis capability (Jupyter notebooks).
1. Required workspace components
All Featured Workspaces should include the following (where applicable).
-
Dashboard documentation
- Should follow the Featured-Workspace-Template.
-
WDL/JSON - workflow analysis component (if applicable)
- All relevant workflows should be imported to the designated workspace with all attributes pre-configured and ready to execute.
- If your workspace has multiple workflows that need to be run sequentially, they should be named in the order in which they need to be run, following the format #-name. For example, 1-calculateSum, 2-calculateAverage.
- Workflows can be stored in GitHub, Dockstore, or the Broad Methods Repository and exported to the workspace.
- In your workspace's dashboard, include a description of the workflow explaining what the workflow(s) does, what input it accepts, and the output produced. While not required, we strongly recommend including the approximate cost and time of running the workflow(s) on example datasets.
-
Jupyter Notebook - interactive analysis component (if applicable)
- Each cell within the notebook should be ready to execute without user intervention (e.g., to set a variable).
- The workspace's dashboard should include an adequate description of what the notebook(s) does, what input it accepts, and the expected output.
- The dashboard should include the recommended Cloud Environment, including required packages and the minimum compute resources to run the notebook on sample data.
-
Sample input data - for workflow (i.e. batch) or interactive (i.e. Notebook) analysis
- Confirm that you have permission to make the data publicly accessible.
- The data are uploaded to a publicly-accessible, external Google bucket, separate from the Workspace bucket. This is to ensure the original path to the data remains functional when a user clones the workspace to their own billing project. Note that workspace buckets are not public, even if the workspace is, and data in the original workspace bucket is not copied to cloned copies of the workspace. Using a separate Google bucket also has the benefit of enabling requester pays, where the requester pays egress fees on downloaded data (and not the data owner).
-
References/Resources - to run the analysis
- References and resources should be listed in the Workspace Data Table in the Data tab.
- All files need to be publicly accessible and have consent for public access.
- Ensure compatibility with input data. For example, if the input BAMs are aligned to hg38, the reference should be hg38.
-
Docker images - used in WDL workflows or custom notebook environment
- Must be publicly accessible.
-
Add tags to the workspace so that it can be placed in the correct showcase categories.
Filters/Categories and their tags:
- Analysis Tools: WDLs, Jupyter Notebooks, RStudio, Galaxy, Hail, Bioconductor, GATK, Cumulus, Spark
- Experimental Strategy: GWAS, Exome Analysis, Whole Genome Analysis, Fusion Transcript Detection, RNA Analysis, Machine Learning, Variant Discovery, Epigenomics, DNA Methylation, Copy Number Variation, Structural Variation, Functional Annotation
- Data Generation Technology: 10x Analysis, Bisulfite Sequencing
- Scientific Domain: Cancer, Infectious Diseases, MPG, Single-cell, Immunology
- Datasets: AnVIL, CMG, CCDG, TopMed, HCA, TARGET, ENCODE, BioData Catalyst, TCGA, 1000 Genomes, BRAIN Initiative, gnomAD, NCI, COVID-19
- Utilities: Format Conversion, Developer Tools
- Projects: HCA, AnVIL, BRAIN Initiative, BioData Catalyst, NCI
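The #-name convention for sequential workflows described under the WDL/JSON component above can be checked with a short script before you submit. The following is a minimal sketch, not part of Terra: the function name run_order and the rule that duplicate prefixes are rejected are assumptions made for illustration.

```python
import re

# "#-name": a non-negative integer, a hyphen, then the workflow name.
NAME_PATTERN = re.compile(r"^(\d+)-(.+)$")


def run_order(workflow_names):
    """Return workflow names sorted by their numeric prefix.

    Raises ValueError if a name lacks a numeric prefix or if two names
    share the same prefix, since either makes the run order ambiguous.
    """
    seen = {}
    for name in workflow_names:
        match = NAME_PATTERN.match(name)
        if match is None:
            raise ValueError(f"{name!r} does not follow the #-name format")
        step = int(match.group(1))
        if step in seen:
            raise ValueError(f"{name!r} and {seen[step]!r} share step {step}")
        seen[step] = name
    # Sort numerically so that 10-... runs after 2-..., not before it.
    return [seen[step] for step in sorted(seen)]


print(run_order(["2-calculateAverage", "1-calculateSum"]))
# prints ['1-calculateSum', '2-calculateAverage']
```

Sorting on the integer value rather than on the raw string keeps the order correct once a workspace has ten or more workflows.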
2. Test your analysis tools
The components above must run successfully and produce valid results without human intervention (e.g., no renaming of variables or manual reordering of workflows), and must do what the dashboard documentation says they do.
How to ensure that your workspace runs smoothly
- Have someone completely new to the workspace test it and provide usability feedback.
- Test your workflows and notebooks regularly to confirm that all scripts run as expected. Terra is routinely updated, so it's necessary to do this regularly to ensure that your workspace remains functional.
3. Lock your workspace
Locking a workspace is a way to prevent collaborators (or any viewers, in a public workspace!) from modifying anything in that workspace or deleting the workspace entirely. This is useful if you are showcasing a workspace and do not want any content to be deleted or modified.
Only owners can lock a workspace. You can lock your workspace by clicking the three vertical dot share icon and selecting Lock in the dropdown menu.
When the workspace is locked, you will see a closed lock icon next to the three vertical dot share icon.
Locking a workspace doesn't prevent collaborators from modifying the data
Locking your workspace does not prevent collaborators from changing the data in the workspace’s storage bucket. Anyone with “writer” or “owner” access to the workspace will still be able to access and modify its data via Google Cloud tools or a command-line interface.
To prevent collaborators from modifying the workspace’s underlying data, lock the workspace and change all collaborators’ access permissions to reader.
How to feature your workspace
If you have a workspace that fits the categories and requirements described in this article - or something different but similarly well-crafted - please sign up to have it featured! Fill out the Featured Workspace Intake Form and our team will contact you to begin the process.
What to expect
After you submit the form, we'll review your workspace to see if it meets our requirements (above). If everything checks out, we will feature the workspace; if not, we'll provide suggestions on meeting the requirements.
Note: To maintain a consistent tone on the Terra platform, we may make small editing changes to the documentation (both in the dashboard and the notebooks). We will ask you for final approval before featuring the workspace.
Are you simply interested in allowing fellow Terra users access to view your workspace? You can have your workspace made public without going through the featuring process and requirements by contacting the Terra support team: support@terra.bio.
Preventing network egress charges
Network egress charges can be incurred whenever traffic leaves a Google Cloud region, such as when copying data from a bucket or copying a Docker image from Container Registry or Artifact Registry. This can occur for copies to a VM in a different compute region or for copies out of the cloud (downloads). For more details, please see Google's network pricing documentation.
By default, network egress charges go to the owner of the Cloud Storage bucket or Docker image. For data in Cloud Storage, the Requester Pays option can be configured on a bucket to pass such charges on to the data user.
To avoid unexpected network egress charges in your Featured Workspace, we recommend the steps below.
1. Dashboard recommendations
Publish relevant Google Cloud location information in the workspace dashboard page (see suggested language below).
- "Example data for this workspace is in <bucket region> and the bucket is <requester pays or not>."
- "Reference data for this workspace is in <bucket region> and the bucket is <requester pays or not>."
- "The Docker image for this workspace is published in <image location>."
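If you maintain several workspaces, the suggested sentences above can be generated from a few values so they stay consistent. This is a minimal sketch: the function name egress_statements, its parameters, and the example region and image location are all hypothetical.

```python
def egress_statements(example_region, example_requester_pays,
                      reference_region, reference_requester_pays,
                      image_location):
    """Fill in the three suggested dashboard sentences."""

    def billing(requester_pays):
        return "requester pays" if requester_pays else "not requester pays"

    return [
        f"Example data for this workspace is in {example_region} "
        f"and the bucket is {billing(example_requester_pays)}.",
        f"Reference data for this workspace is in {reference_region} "
        f"and the bucket is {billing(reference_requester_pays)}.",
        f"The Docker image for this workspace is published in {image_location}.",
    ]


# Hypothetical values, for illustration only.
for line in egress_statements("us-central1", True, "us-central1", False,
                              "quay.io/example-org/example-tool"):
    print(line)
```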
2. Reference or sample data storage
To prevent egress charges for the bucket owner of reference or example data in Cloud Storage, you can set up controls around the featured workspace's project data, such as a VPC-SC security perimeter or a requester pays bucket.
See Configure GCS to prevent egress charges for more details.
For instructions on how to turn on the "requester pays" option on an external GCS bucket, see the Google documentation.
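As an illustration of the requester pays option, here is a minimal sketch that uses the google-cloud-storage Python client. It assumes the package is installed, that you are authenticated as an owner of the external bucket, and that the bucket name you pass in is your own; the helper names enable_requester_pays and main are hypothetical.

```python
def enable_requester_pays(bucket):
    """Turn on Requester Pays for a bucket object and save the change.

    `bucket` is expected to behave like google.cloud.storage.Bucket:
    a `requester_pays` attribute and a `patch()` method.
    """
    if not bucket.requester_pays:
        bucket.requester_pays = True
        bucket.patch()  # send the metadata update to Cloud Storage
    return bucket.requester_pays


def main(bucket_name):
    # Imported here so the helper above can be read and exercised
    # without the google-cloud-storage package installed.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.get_bucket(bucket_name)
    enable_requester_pays(bucket)
    print(f"{bucket.name}: requester pays = {bucket.requester_pays}, "
          f"location = {bucket.location}")
```

Calling main with the name of your external bucket applies the setting and prints the bucket's location, which is also the region to report in the dashboard. The Google documentation remains the reference for the setting itself.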
3. Docker image recommendations
- Don't grant broad access to images in Google Container Registry or Artifact Registry. It's fine to share these images with trusted users, but out-of-region workspace users who download Docker images from these registries will generate data egress charges for the owner of the Docker image (see this Cromwell GitHub issue for context).
- Instead, use a different Docker registry for public access to your images and update your public WDL workflows to use this registry.
To learn more, please see Docker Image Publishers Tips or Configure GCR/Artifact Registry to prevent egress charges for details.
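To find workflows that still point at Google-hosted images, you can scan the image references used in your WDL runtime sections. The sketch below assumes you have already collected those references as strings. The host patterns cover gcr.io, its us/eu/asia variants, and hosts ending in -docker.pkg.dev; the image names are hypothetical.

```python
import re

# Hosts for Google Container Registry and Artifact Registry Docker repositories.
GOOGLE_REGISTRY = re.compile(
    r"^((us|eu|asia)\.)?gcr\.io$|^[a-z0-9-]+-docker\.pkg\.dev$"
)


def registry_host(image):
    """Return the registry host of an image reference.

    Docker treats the first path component as a registry host only if it
    contains a dot or a colon, or is "localhost"; otherwise the image is
    resolved against Docker Hub.
    """
    first = image.split("/", 1)[0]
    if "/" in image and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def google_hosted_images(images):
    """Return the image references served from a Google registry."""
    return [img for img in images if GOOGLE_REGISTRY.match(registry_host(img))]


# Hypothetical image references, for illustration only.
images = [
    "gcr.io/example-project/tool:1.0",
    "us-central1-docker.pkg.dev/example-project/repo/tool:1.0",
    "quay.io/example-org/tool:1.0",
    "ubuntu:22.04",
]
print(google_hosted_images(images))
# prints ['gcr.io/example-project/tool:1.0', 'us-central1-docker.pkg.dev/example-project/repo/tool:1.0']
```

Any image the function returns is a candidate for republishing in a different registry before the workflow is made public.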
Additional Featured Workspace Resources
Want to create your own workspace but having a hard time getting started?
Use this Smartsheet project plan as a guide; it lists the tasks normally involved in creating and featuring a workspace: Workspace Featuring Project Plan.
Already have a workspace featured and need us to archive and/or replace it?
Fill out the maintenance form: Workspace Maintenance Intake Form.