Have a well-crafted workspace you'd like to share with the Terra community? Here's what you need to know to feature your workspace!
Featuring a workspace makes it publicly readable to all Terra users and lists it on the Showcases and Tutorials page under the Featured Workspaces list. This is a great opportunity for you to broadcast your work and for users to discover workspaces with ready-to-run WDL workflows and interactive Jupyter notebooks that can be repurposed for their own research.
The Showcases and Tutorials library includes a curated selection of some of the community's finest reproducible workspaces. Designed with collaboration in mind, these well-documented workspaces include sample data and dashboards that describe the contents of the work in enough detail to allow others to try on their own. The idea is to enable users to easily reproduce your work using public data and scripts preconfigured with the correct attributes so each workflow and/or notebook is ready to run.
Three types of Showcase Workspaces
Analysis-focused workspaces (e.g., Tetralogy of Fallot)
Generally associated with pre- or post-publication studies, these highlight biological meaning and implications. The workspace will have a thorough description of the study, general motivation for the scientist/analyst’s experiment, caveats and concerns, and an ordered list of the case study steps. The workspace is a reproduction of the publication, with all the analysis tools, on the Terra platform.
Data-focused workspaces (e.g., Target, TCGA, etc.)
These workspaces focus on introducing users to the specific public or restricted-access datasets available in Terra’s Data Library. They bundle instructions for accessing and working with various cohorts and data types (per project) in the dashboard and include example workflows or notebooks that reproduce a typical analysis of the dataset. If data access is restricted, the workspace must include easy-to-follow instructions on how to gain access to the data and import the data to the workspace to run the workflows.
Workflow-focused workspaces (e.g., GATK workspaces)
These workspaces contain WDL workflows and/or notebooks with sample test data sufficiently small to be run in a reasonable time for a small cost. The dashboard of this workspace has, at a minimum, light documentation on the pipelines that describes the purpose, requirements, input, and output of each. Both workflows and notebooks should be preconfigured and ready to execute, with sufficient instructions that a user new to the platform but familiar with the science is able to run the scripts. If there are multiple workflows that will be run back to back, they need to be named with numerical prefixes and configured to run seamlessly (for automated testing purposes). Workflows should be regularly updated to follow tool versions/evolutions.
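As a rough illustration of the numerical-prefix convention (for example, `1-workflow`, `2-workflow`), the sketch below orders workflow names by their leading number and flags any name that lacks one. The function names and the sample workflow names are hypothetical and are not part of Terra; this is only one way such a check could be written.

```python
import re

# Matches names such as "1-align" or "02-call-variants": digits, a hyphen, then a name.
PREFIX_PATTERN = re.compile(r"^(\d+)-(.+)$")


def split_prefix(name):
    """Return (step_number, rest) for a prefixed name, or None if there is no prefix."""
    match = PREFIX_PATTERN.match(name)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def order_workflows(names):
    """Sort workflow names by numeric prefix; raise if any name lacks a prefix."""
    missing = [n for n in names if split_prefix(n) is None]
    if missing:
        raise ValueError(f"Workflows without a numerical prefix: {missing}")
    # Sort on the integer value so that "10-..." comes after "2-...".
    return sorted(names, key=lambda n: split_prefix(n)[0])


# Hypothetical workflow names, for illustration only.
workflows = ["10-summarize", "2-call-variants", "1-align-reads"]
ordered = order_workflows(workflows)
print(ordered)
```

Sorting on the integer prefix rather than on the raw string avoids the case where a plain alphabetical sort would place step 10 before step 2.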
If you have a workspace that fits these categories, or something different but similarly well-crafted, please sign up to have it featured! Fill out the Feature Workspace Intake Form and our team will contact you to begin the process.
Are you simply interested in allowing fellow Terra users access to view your workspace? You can have your workspace made public without going through the featuring process and requirements by contacting the Terra support team: firstname.lastname@example.org.
What to expect
After you submit the form, we'll review your workspace to see if it meets our Featured Workspace requirements (below). If everything checks out, we will feature the workspace; if not, we'll provide suggestions on meeting the requirements. Note: To maintain a consistent tone on the Terra platform, we may make small editing changes to the documentation (both in the dashboard and the notebooks). We will ask you for final approval before featuring the workspace.
Featured workspace requirements
These requirements are intended to ensure that users have the best experience cloning and using the workspace batch analysis functionality (workflows) or its interactive analysis capability (Jupyter notebooks).
1. Required workspace components
All Featured Workspaces should include the following (where applicable).
- Dashboard documentation
- Should follow the Featured-Workspace-Template.
- WDL/JSON - workflow analysis component (if applicable)
- All relevant workflows should be imported to the designated workspace with all attributes pre-configured and ready to execute.
- Workspaces containing multiple workflows that need to be run sequentially should have names numbered in the order in which the workflows need to be run, using the form #-name. Example: “1-workflow, 2-workflow”.
- The workflow can be stored in Git, Dockstore, or the Terra Method Repository and exported to a Terra workspace.
- The workspace dashboard should include an adequate description of what the workflow(s) does, what input it accepts, and the output it produces. While not required, listing the cost and time of running the workflow(s) on example datasets is strongly recommended.
- Jupyter Notebook - interactive analysis component (if applicable)
- Each cell within the notebook should be ready to execute and shouldn’t require user intervention.
- The dashboard should include an adequate description of what the notebook(s) does, what input it accepts, and the expected output.
- The dashboard should include the recommended Cloud Environment, including required packages and the minimum compute resources to run the notebook on sample data.
- Sample input data - for workflow (i.e. batch) or interactive (i.e. Notebook) analysis
- Confirm that data has consent for public access.
- Needs to be uploaded to a publicly-accessible, external Google bucket, separate from the Workspace bucket. This is to ensure the original path to the data in the cloned version of the workspaces is still functional and available (note that workspace buckets are not public, even if the workspace is, and data in the original workspace bucket is not copied to cloned copies of the workspace). Using a separate Google bucket also has the benefit of enabling requester pays, where the requester pays egress fees on downloaded data (and not the data owner).
- References/Resources - to run the analysis
- References and resources should be listed in the Workspace Data Table under the Data tab.
- All files need to be publicly accessible and have consent for public access.
- Ensure compatibility with input data. For example, if input BAMs are aligned to hg38, the reference should be hg38.
- Docker images - used in WDL workflows or custom notebook environment
- Must be publicly accessible.
- Add tags to the workspace so that they can be properly placed in the correct showcase categories.
WDLs, Jupyter Notebooks, RStudio, Galaxy, Hail, Bioconductor, GATK, Cumulus, Spark
GWAS, Exome Analysis, Whole Genome Analysis, Fusion Transcript Detection, RNA Analysis, Machine Learning, Variant Discovery, Epigenomics, DNA Methylation, Copy Number Variation, Structural Variation, Functional Annotation
Data Generation Technology
10x Analysis, Bisulfite Sequencing
Cancer, Infectious Diseases, MPG, Single-cell, Immunology
AnVIL, CMG, CCDG, TopMed, HCA, TARGET, ENCODE, BioData Catalyst, TCGA, 1000 Genomes, BRAIN Initiative, gnomAD, NCI, COVID-19
Format Conversion, Developer Tools
HCA, AnVIL, BRAIN Initiative, BioData Catalyst, NCI
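For the requirement that sample data and reference files be publicly accessible, one possible spot check is an unauthenticated request against the object's `storage.googleapis.com` URL. The sketch below is a minimal example under that assumption; the helper names are hypothetical, it does not cover requester-pays buckets (which reject anonymous requests) or Docker images, and a passing check says nothing about consent for public access.

```python
import urllib.parse
import urllib.request


def gs_to_public_url(gs_uri):
    """Convert gs://bucket/object into its storage.googleapis.com HTTPS form."""
    if not gs_uri.startswith("gs://"):
        raise ValueError(f"Not a gs:// URI: {gs_uri}")
    bucket, _, obj = gs_uri[len("gs://"):].partition("/")
    if not bucket or not obj:
        raise ValueError(f"Expected gs://bucket/object, got: {gs_uri}")
    # Percent-encode the object name but keep "/" as the path separator.
    encoded = urllib.parse.quote(obj, safe="/")
    return f"https://storage.googleapis.com/{bucket}/{encoded}"


def is_publicly_readable(gs_uri, timeout=10):
    """Send an unauthenticated HEAD request; True only if it returns HTTP 200."""
    request = urllib.request.Request(gs_to_public_url(gs_uri), method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # HTTP errors (e.g. 403, 404), network failures, and timeouts
        # are all OSError subclasses.
        return False


# Hypothetical path, for illustration only.
print(gs_to_public_url("gs://example-bucket/data/sample 1.bam"))
```

Calling `is_publicly_readable` on each path listed in the workspace data table, from a machine that is not signed in to Google Cloud, gives a quick indication of whether a cloned workspace would still be able to read the files.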
2. Test your analysis tools
The components above must run successfully with valid results without human intervention (i.e., no renaming of variables, and workflows already in run order), and do what the dashboard documentation instructs.
Suggestion: have someone completely new to the workspace test it and provide usability feedback.
Terra is routinely updated, so we ask owners of the workspace to regularly test their workflows and notebooks to confirm all scripts run as expected.
3. Lock your workspace
Locking a workspace is a way to prevent collaborators (or any viewers, in a public workspace!) from modifying anything in that workspace. This is useful if you are showcasing a workspace and do not want any content to be deleted or modified.
You can lock your workspace by clicking the three vertical dot share icon and selecting the Lock workspace option in the dropdown menu.
Ready to feature? Contact Frontline
The Frontline Support team can “feature” the workspace and will do so once the workspace has been tested and is operating to the collaborators' and support lead’s satisfaction. This will be confirmed before posting.
Preventing network egress charges
Network egress charges can be incurred whenever traffic leaves a Google Cloud region, such as copying data from a bucket or copying a Docker image from Container Registry or Artifact Registry. This can occur for copies to a VM in a different compute region or copies out of cloud (downloads). For more details, please see Google's network pricing documentation.
By default, network egress charges go to the owner of the Cloud Storage bucket or Docker image. For data in Cloud Storage, the Requester Pays option can be configured on a bucket to pass such charges on to the data user.
To avoid unexpected network egress charges in your Featured Workspace, we recommend the steps below.
1. Dashboard recommendations
Publish relevant Google Cloud location information to end users on the workspace dashboard page (see suggested language below).
- "Example data for this workspace is in <bucket region> and the bucket is <requester pays or not>."
- "Reference data for this workspace is in <bucket region> and the bucket is <requester pays or not>."
- "The Docker image for this workspace is published in <image location>."
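If the same statements need to be kept in sync across several workspaces, they could be generated from a few values instead of being typed by hand. The sketch below fills in the suggested language above; the function name, parameter names, and sample values are hypothetical.

```python
def dashboard_egress_notes(example_region, example_requester_pays,
                           reference_region, reference_requester_pays,
                           image_location):
    """Fill in the suggested dashboard language for data and image locations."""

    def billing(requester_pays):
        return "requester pays" if requester_pays else "not requester pays"

    return [
        f"Example data for this workspace is in {example_region} "
        f"and the bucket is {billing(example_requester_pays)}.",
        f"Reference data for this workspace is in {reference_region} "
        f"and the bucket is {billing(reference_requester_pays)}.",
        f"The Docker image for this workspace is published in {image_location}.",
    ]


# Hypothetical values, for illustration only.
notes = dashboard_egress_notes(
    example_region="us-central1",
    example_requester_pays=True,
    reference_region="us-central1",
    reference_requester_pays=False,
    image_location="Artifact Registry, us-central1",
)
for note in notes:
    print(note)
```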
2. Reference or sample data storage
To prevent egress charges for the owner of a bucket that holds reference or example data in Cloud Storage, you can set up controls around the featured workspace's project data, such as a VPC-SC security perimeter or a requester pays bucket.
See Configure GCS to prevent egress charges for more details.
For instructions on how to turn on the "requester pays" option on an external GCS bucket, see the Google documentation.
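As a minimal sketch of what turning on Requester Pays might look like from Python, the example below assumes the `google-cloud-storage` client library is installed and that Application Default Credentials with permission to update the bucket are available. The helper names and the bucket name passed to `main` are hypothetical; the Google documentation remains the authoritative reference.

```python
def enable_requester_pays(bucket):
    """Turn on Requester Pays for a google.cloud.storage.Bucket-like object."""
    bucket.requester_pays = True
    bucket.patch()  # Sends the changed property to Cloud Storage.
    return bucket.requester_pays


def main(bucket_name):
    """Look up a bucket by name and enable Requester Pays on it."""
    # Imported here so the helper above can be read and exercised
    # without the library installed.
    from google.cloud import storage

    client = storage.Client()  # Uses Application Default Credentials.
    bucket = client.get_bucket(bucket_name)
    print(enable_requester_pays(bucket))


# Example call (not executed here); the bucket name is a placeholder:
# main("my-external-sample-data-bucket")
```

Once Requester Pays is on, users reading from the bucket need to supply a billing project with their requests, so the dashboard should state that the bucket is requester pays.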
3. Docker images recommendations
- Don't grant broad access to images in Google Container Registry or Artifact Registry (use these registries internally and grant access only to trusted users). Out-of-region workspace users who download Docker images from these registries will generate data egress charges for the owner of the Docker image (see this Cromwell GitHub issue for context).
- Alternatively, use a different Docker registry for public access to your images and update your public WDL workflows to use this registry.
To learn more, please see Docker Image Publishers Tips or Configure GCR/Artifact Registry to prevent egress charges for details.
Additional Featured Workspace Resources
Want to create your own workspace but having a hard time getting started? Use this smartsheet project plan that contains several tasks normally involved in creating and featuring a workspace as a guide: Workspace Featuring Project Plan.
Already have a workspace featured and need us to archive and/or replace the workspace? Fill out the maintenance form: Workspace Maintenance Intake Form.