This step-by-step tutorial uses the Intro-to-HCA-data-on-Terra workspace to import, access, and analyze Human Cell Atlas (HCA) data with community-supported single-cell tools like Bioconductor, Cumulus, Pegasus, Scanpy, and Seurat.
The Human Cell Atlas (HCA) project is a global initiative aiming to create comprehensive reference maps of all human cells. Researchers can share and find HCA single-cell data through the HCA's Data Coordination Platform (DCP), the project's data contribution, access, and analysis service.
The DCP maintains the HCA Data Portal where you can browse HCA projects and find data of interest (see the portal's guides to get started).
The Data Portal hosts multiple data types, including metadata files, raw sequencing files (FASTQs), contributor-generated matrices that contain cell by gene count data from the project contributors, and pre-processed data (BAMs and cell by gene count matrices in Loom format) generated using standardized pipelines approved by HCA Analysis Working Group.
This tutorial provides step-by-step instructions for finding HCA data, importing it into a Terra workspace, and analyzing it with workflows and Jupyter Notebooks. Although this tutorial uses an example DCP-generated project count matrix (read more in the Data Portal's matrix overview), you can use the import steps for all data types.
Clone the Intro-to-HCA-data-on-Terra workspace
Before you begin, create your own editable copy (clone) of the Intro-to-HCA-data-on-Terra workspace by following the instructions in the "How to clone a workspace" guide. For this tutorial, you do not require an authorization domain.
After cloning, take a few minutes to explore the workspace.
Look at the Data page and explore the participant data table. This data table is preloaded with this tutorial's example data.
When you import any HCA data to Terra (as you'll do in the next section), this participant data table is automatically created. The name "participant" was chosen by the DCP team, but you can (re)name Terra data tables as you need.
Data tables do not actually contain the raw data, but rather they contain links to the data in a cloud location, like a Google bucket.
Next, select the Workflows page which contains the Cumulus workflow for large-scale single-cell analysis- we'll run this workflow in a later section.
Lastly, glance at the Notebooks page; it has notebooks for single-cell analysis using four community-supported tools: Bioconductor, Pegasus, Scanpy, and Seurat.
Import HCA data from the HCA Data Portal
HCA data is imported from the DCP's Data Portal. For this tutorial, we'll show you how to import DCP-generated cell by gene matrices (Loom file format) for the 10x single-cell RNA sequencing project SingleCellLiverLandscape.
We've already imported this data for you in the Intro-to-HCA-data-on_Terra workspace, but you can either repeat these steps or use them to find and import any additional data files.
- Navigate to the Data Portal's Data Browser
At the top of the page, you'll notice a faceted search box that allows you to filter HCA data by project title, donor information, tissue types, disease states, and more. At the bottom of the page, you'll see a list of HCA projects.
- In the Search all filters field of the faceted search box, start typing the project's short name ("SingleCellLiverLandscape...")
A list of project titles that match the search terms will appear in the search box. Select the tutorial project (SingleCellLiverLandscape).
- In the project list, select the check box left of the project name
- Choose the
Export Selected Dataicon
- Under the
Export to Terrasection, select
- Select the species (only "Homo sapiens" option available for this project)
- Select the files to export; for cell by gene count matrices, choose "loom"
You could also choose the raw FASTQ files, the standardized BAM files (one for each library preparation), or additional file types provided by the project contributor. The types of contributor-generated files will vary across projects (some will have CSV, h5ad, etc.).
It will take a few seconds to prepare the export.
- When the export is ready, select the link
You'll be redirected to the Terra platform.
- Choose Start with an existing workspace
- Type the name of your cloned workspace or choose it from the drop-down menu
- Select Import
- Refresh the workspace Data page to view your data
Terra will automatically generate a participant data table for the selected data. If you (re)import the example matrices for this tutorial's example, Terra will overwrite the existing data table. If you import additional data files, Terra will add them to the existing participant data table.
- Explore the participant data table
Look at the different columns of the participant data table. The first column contains a participant ID, which is a unique UUID for the analysis bundle.
Each row of the table represents one analysis bundle. The first row contains information and files for the whole project (combined data for all 5 donors); the subsequent rows contain information/files for each of the 5 individual donors (one donor per row).
The subsequent table columns contain information about the individual data files as well as metadata relating to the project's donors, organs, developmental stages, disease states, and data processing criteria.
Information pertaining to this tutorial's example count matrix files can be found in columns with the "_loom_" prefix. As you scroll through these columns, remember, the data table does not contain the actual loom files but rather links to the loom files in their cloud location (a Google bucket in this case). To see this, go to the __loom__file_drs_uri column and click on one of the links. It will launch a File Details window.
In the File Details, you can view the file in its Google bucket location, download the file or use command-line tools to access the file.
What are DRS URIs?
The cloud location links for each matrix file are in the form of a DRS (Data Repository Service) URI listed in the __loom__file_drs_uri column. DRS URIs are cloud-agnostic identifiers that allow Terra to pull files from diverse repositories (Amazon Web Services, Azure, Google bucket, etc.). Read more about the identifiers in the overview Data Access with the GA4GH Data Repository Service (DRS).
Filter, cluster and normalize a 10x count matrix using Cumulus
Cumulus is a cloud-based analysis workflow for large-scale single-cell and single-nucleus RNA-seq data (Li et al. 2020). The workflow takes in a raw count matrix and performs filtering, normalization, and clustering.
The Cumulus workflow in the Intro-to-HCA-data-on-Terra workspace is already set up to read the tutorial's project matrix (sc-landscape-human-liver-10XV2.loom) which is listed in the workspace participant data table. This project matrix contains raw counts for 10x liver data processed with the Optimus pipeline.
To learn more about the different types of matrices, read the Data Portal's Matrix Overview.
- Go to the participant data table on the Data page and select the checkbox left of the participant_ID
- Select the three vertical dot icon and choose “Open with”
- Select Workflow
- Select the Cumulus workflow
You will be redirected to the workflow setup page.
- Explore the workflow Input and Output tabs
The workflow is set up to run Cumulus on the Loom file listed in the __loom__file_drs_uri column of the participant table.
Additionally, there are multiple Cumulus parameters set up for you which you can learn in the Cumulus overview.
- Go to the workflow Inputs tab
- Make sure the input_file variable is specified as the column with the Optimus loom file
For this tutorial, the input_file attribute should be “this.__loom__file_drs_uri,” which points to the cloud location for the example project matrix.
- Change the input attribute for the output_directory variable to a Cumulus folder in your Workspace Google bucket
The attribute is preset to “gs://PASTE_BUCKET_ID_HERE/Cumulus/”. Just paste your Google bucket ID over the PASTE_BUCKET_ID_HERE section.
You can find the Google bucket location on the right of your workspace Dashboard.
For example, if your Google bucket ID is fc-c3efdef7-76f2-4ffb-8e29-479bf9758df9, you would use the following string for the attribute:
- Check the output_name attribute
The output_name is used to name the output files and an output file folder. By default, it is set to "Liver" because this tutorial uses example liver data. You can change this value to any string that is meaningful to your selected data.
- Select Save
- Do not specify any output attributes in the Workflow Outputs tab
Cumulus will place outputs in the directory specified on the inputs tab, which in this example will write to the Files subsection of the Other Data section on the workspace Data page. Cumulus is not currently configured to write outputs to a Terra data table.
- Select Run Analysis and then Launch
You will be redirected to the Job History page. When the workflow successfully completes, you will see a green checkmark and the word “Succeeded”.
- Go to the Data page and select Files in the Other Data section
- Select the “Cumulus” folder and then the “Liver” folder to see the outputs
Take a look at the different file types Cumulus produces, which include differential expression files (de.xlsx), as well as visualization files. You can check the Cumulus documentation for a full description.
Some of the outputs will have a “.scp” suffix; these are outputs that are compatible with Single Cell Portal, a platform allowing you to visualize cell clusters identified by Cumulus.
Explore HCA data using Jupyter Notebooks
You can process and visualize single-cell data using common community-supported tools that can run in a Jupyter Notebook. The Intro-to-HCA-data-on-Terra workspace has four notebooks with example code to help you get started with single-cell analysis in R or python environments.
Each tutorial is designed to help you :
- Import a project matrix into your Cloud Environment
- Subset the project matrix to a demo size for the tutorial
- Filter cells based on quality control metrics
- Normalize cell counts
- Cluster and visualize cells
- Test clusters for differential gene expression
The Bioconductor notebook (R environment) uses a modified version of the "Orchestrating single-cell analysis with Bioconductor" (Amezquita et al., 2020) tutorial to analyze the project count matrix for SingleCellLiverLandscape. It converts the Loom matrix into a SingleCellExperiment object that can be used with Bioconductor tools.
Pegasus is a python package for single-cell analysis and also part of Cumulus. The Pegasus notebook uses a modified version of the Pegasus Tutorial to analyze the project count matrix for SingleCellLiverLandscape. It reads the Loom files using a custom function, read_input(), and then performs downstream filtering, normalizing, clustering, and differential expression analysis.
The Scanpy notebook (python environment) uses a modified version of the "Preprocessing and clustering 3k PBMCs" tutorial to convert the SingleCellLiverLandscape project Loom matrix into an AnnData object that can be analyzed with Scanpy.
The Seurat notebook (R environment) uses the sceasy package to convert the SingleCellLiverLandscape project Loom matrix to a Seurat object that can be used with a modified version of Seurat's Guided Clustering Tutorial.
- Before starting, use the table below to identify the Cloud Environment configuration needed for the tool you want to try:
Notebook Application Configuration Bioconductor R/Bioconductor Pegasus Pegasus (community-maintained environment) Scanpy R/Bioconductor Seurat R/Bioconductor
- Select the Cloud Environment widget at the top right corner of the workspace
- Select the Create custom environment
- Based on Step 1 above, select the appropriate option from the Application configuration drop-down
For example, if using the Bioconductor notebook, select the R/Bioconductor configuration.
- Customize the compute to 4 CPUs
- Select Create
- From the Notebooks page, select the tutorial notebook you want to use
- Click on the notebook's edit tab and follow the notebook instructions to analyze and visualize the single-cell data
Congratulations! You’ve completed the Intro-to-HCA-data-in-Terra Tutorial.
To learn more about how to use Terra or the different tools showcased in this tutorial, see the Resources section below.
Learn more with Terra and single-cell resources
Terra support documentation, videos, and workspaces
Get started on Terra with Terra support materials. These resources will teach you the basics of creating and using data tables, importing and running workflows, and using Jupyter Notebooks.
- Terra support documentation
- Terra support videos
- Terra Data Tables Quickstart Featured Workspace
- Terra Workflows Quickstart Featured Workspace
- Terra Notebooks Quickstart Featured Workspace
- HCA Data Coordination Platform and Data Portal for data contribution and guides to using HCA data
- Optimus Featured Workspace in Terra for preprocessing 10x single-cell data
- Cumulus Featured Workspace in Terra for filtering, normalizing, and clustering single-cell data
- Bioconductor for single-cell analysis tools in R
- Pegasus for single-cell analysis in python
- Scanpy for single-cell analysis in python
- Seurat for single-cell analysis in R
Check the workspace changelog for the latest updates to the workspace.
Want to add your tool to the workspace?
Add your comments and feedback using the Terra Featured Workspace community forum.
- Amezquita, R.A., Lun, A.T.L., Becht, E. et al. Orchestrating single-cell analysis with Bioconductor. Nat Methods 17, 137–145 (2020). https://doi.org/10.1038/s41592-019-0654-x
- Butler, A., Hoffman, P., Smibert, P. et al. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol 36, 411–420 (2018). https://doi.org/10.1038/nbt.4096
- Dissecting the human liver cellular landscape by single cell RNA-seq reveals novel intrahepatic monocyte/ macrophage populations
- Li, B., Gould, J., Yang, Y. et al. Cumulus provides cloud-based data analysis for large-scale single-cell and single-nucleus RNA-seq. Nat Methods 17, 793–798 (2020). https://doi.org/10.1038/s41592-020-0905-x
- MacParland, Sonya A et al. “Single cell RNA sequencing of human liver reveals distinct intrahepatic macrophage populations.” Nature communications vol. 9,1 4383. 22 Oct. 2018, doi:10.1038/s41467-018-06318-7
- Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM 3rd, Hao Y, Stoeckius M, Smibert P, Satija R. Comprehensive Integration of Single-Cell Data. Cell. 2019 Jun 13;177(7):1888-1902.e21. doi: 10.1016/j.cell.2019.05.031.