Gen3 is an open-source, cloud-based platform that allows researchers to store and search data hosted by consortia like NHLBI BioData Catalyst or AnVIL. This article explains how to access Gen3 data in Terra and describes how the data is structured on both the Gen3 and Terra platforms.
- Accessing the Gen3 platform - BioData Catalyst users
a. Via direct browser link to Gen3 platform
b. Via Terra Data Library
- Gen3 data structure
- How to filter Gen3 data and export to Terra
- How Gen3 data is organized in Terra
- Resources for manipulating Gen3 data in Terra
1. Accessing the Gen3 platform - BioData Catalyst users
To export Gen3 data to Terra, you will first need to access the Gen3 platform for BioData Catalyst in one of two ways:
- In a browser, via the https://gen3.biodatacatalyst.nhlbi.nih.gov/ link
- Via the Terra Data Library (step-by-step instructions below)
Once inside the Gen3 platform, follow the guidelines in Sections 2 and 3 below to find the data you need and export it to Terra.
How to access Gen3 via the Terra Data Library:
1. Click on the Terra main menu (the three-bar icon in the upper left). From the "Library" drop-down, choose "Data".
2. Scroll down the library page to the NHLBI TOPMed icon.
3. Select "TopMed presented by NHLBI BioData Catalyst". This will redirect you to the Gen3 platform. From there you can learn about the structure of Gen3 data, explore and filter the data, and export it to Terra using the information in Sections 2 and 3 below.
2. Gen3 (BioData Catalyst) data structure
Before working with Gen3 data in Terra, it is helpful to understand the structure of data on the Gen3 platform. The Gen3 data "Dictionary" shows what data are available and how they are organized, and gives an overview of the data model for the NHLBI BioData Catalyst project. You can access the complete graph structure in the Dictionary by following the link https://gen3.biodatacatalyst.nhlbi.nih.gov/DD or by selecting the Dictionary icon in the upper right of the Gen3 platform.
Below are important characteristics of Gen3 data to note.
2.1. Gen3 metadata is organized by hierarchical nodes
The graph structure includes metadata entities as individual nodes in a tree-like format. Hierarchical nodes are connected by a line. For example, in the Gen3 Dictionary for BioData Catalyst shown below, all clinical metadata nodes (in blue) are tied together by a "Subject" node. "Subject" refers to each participant in a study. The "Subject" node is further connected to participants' biospecimen and sequencing data through the "Sample" node.
2.2. Gen3 nodes are connected by UUIDs
Lines in the diagram above show how metadata nodes are connected. Each entry in a Gen3 node has a Universally Unique Identifier (UUID). The UUIDs help you connect a subject's metadata variables together (following the lines you see in the graph). When you select one of the nodes and open the property box, you will find a property that represents the node immediately upstream of the one you selected. For example, in the Laboratory Results node, you will see a metadata field named "Subjects". This field contains UUIDs for the "Subject" node that is immediately upstream of the Laboratory Results node. Every Gen3 node has a UUID field for the node immediately upstream. These UUIDs are important for manipulating Gen3 data in Terra and are discussed in more detail in Section 4.
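As a rough illustration of the upstream-UUID idea described above, the Python sketch below models two nodes as plain dictionaries and follows the link from a lab result entry to its Subject entry. Only the `subjects` field name comes from the text. Every UUID, the `uuid` key, and the `submitter_id` values are invented placeholders, and real Gen3 entries carry many more fields.

```python
# Hypothetical entries: every UUID and value below is invented for illustration.
subject_entries = {
    "uuid-subject-1": {"submitter_id": "participant-A"},
    "uuid-subject-2": {"submitter_id": "participant-B"},
}

# Each lab result entry stores the UUID of its upstream Subject in "subjects".
lab_result_entries = [
    {"uuid": "uuid-lab-1", "subjects": "uuid-subject-1"},
    {"uuid": "uuid-lab-2", "subjects": "uuid-subject-2"},
    {"uuid": "uuid-lab-3", "subjects": "uuid-subject-1"},
]


def upstream_subject(lab_result, subjects):
    """Follow the 'subjects' UUID on a lab result to its upstream Subject entry."""
    return subjects[lab_result["subjects"]]


def lab_results_for(subject_uuid, lab_results):
    """Collect every lab result entry that points at the given Subject UUID."""
    return [entry for entry in lab_results if entry["subjects"] == subject_uuid]


print(upstream_subject(lab_result_entries[0], subject_entries))
print([e["uuid"] for e in lab_results_for("uuid-subject-1", lab_result_entries)])
```

The same lookup works in either direction: from a downstream entry up to its parent, or from a parent down to all of the entries that reference it.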
2.3. Selecting Gen3 nodes allows you to explore the available metadata fields
You can explore and select each metadata node to see what types of metadata fields are available in TOPMed datasets. For example, selecting "Laboratory Results" (in the above image) will open a table (see below) of all the possible metadata fields that fall in the node, as well as descriptions of each field. Notice that some metadata fields are optional. The "Required" column of the table identifies the fields that are required whenever that node is available for a project. For example, the subjects property is required because it holds the UUIDs discussed above in 2.2.
Remember that not every study will have all clinical nodes you see in the graph. The graph represents all metadata available for the entire BioData Catalyst project.
2.4. Gen3 metadata may be harmonized
Harmonized phenotypic data variables, which are consistently defined across BioData Catalyst projects, enable unified analyses across multiple projects. The TOPMed Data Coordinating Center selects variables for harmonization with the assistance of study and phenotype experts. You can identify harmonized variables by the "HARMONIZED" tag in the "Description" column of each Gen3 metadata table. Unharmonized metadata and other various genomic files (such as multi-sample VCFs) from a project can be found in the "Reference File" node that is connected by a black line below the "Project" node in the Gen3 dictionary.
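If you have copied a node's dictionary table into Python, picking out the harmonized variables amounts to a text match on the description. The sketch below is a minimal example under stated assumptions: the row layout (`property` and `description` keys) and all field names are hypothetical, and it assumes the "HARMONIZED" tag appears as upper-case text somewhere in the description.

```python
# Hypothetical dictionary rows; the field names and descriptions are invented.
dictionary_rows = [
    {"property": "example_field_1", "description": "HARMONIZED: example description."},
    {"property": "example_field_2", "description": "Example description without the tag."},
    {"property": "example_field_3", "description": "Another example (HARMONIZED)."},
]


def harmonized_properties(rows):
    """Return the names of fields whose description carries the HARMONIZED tag."""
    return [row["property"] for row in rows if "HARMONIZED" in row["description"]]


print(harmonized_properties(dictionary_rows))
```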
3. How to access and filter Gen3 data and export to Terra - BioData Catalyst users
When you know what data you need for your analysis, you can use the information above to explore the metadata fields on the Gen3 platform. Then you can filter and export the Gen3 data to a Terra workspace using the instructions below.
Accessing Gen3 data: Step-by-step instructions
1. Navigate to the Gen3 platform: https://gen3.biodatacatalyst.nhlbi.nih.gov/ or via Terra's Data Library
2. In the top menu bar, select the Exploration tab
On the left-hand side, you will see the option to filter data using three specific metadata entities: Medical History, Diagnosis, and Subject.
3a. To filter by project
Select the Subject tab and scroll down to the Project ID for your project of interest.
To further filter by some of the harmonized variables, select the "Medical History" or "Diagnosis" tabs and apply filters there. This can help you narrow your results to specific individuals within a project who match your filter.
3b. To filter by some of the harmonized variables
Skip selecting a Project and go straight to the "Medical History" or "Diagnosis" tabs to filter by some of the harmonized variables. This is helpful for identifying which projects have the variables of interest to you.
4. Click on the checkbox to the left of the Project ID
After a slight delay (~10 seconds), an "Export All to Terra" option will become available in the top menu of the platform. You can use this button to export all data for the entire project.
5. When you are done filtering, select the "Export All to Terra" icon
An export-in-progress banner will appear at the bottom of the page, along with a progress circle in the middle. The export can take a few minutes. When it is complete, you will be redirected to the Terra website. A prompt will appear asking you to select an existing workspace or to create a new workspace.
You can select from template workspaces tailored to different Gen3 data analysis needs. The number and variety of template workspaces is growing. Current template workspaces include:
- BioData Catalyst GWAS 1000 Genomes Tutorial
- TOPMed Aligner Gen3 Data
- BioData Catalyst GWAS blood pressure trait
6. Select the "Import" icon
You will be directed to the destination Terra workspace. There will be an import notification in the upper right of the workspace.
Note that Terra data imports occur asynchronously, and data will appear in the Data tab gradually within a few minutes. A new notification will appear when the asynchronous import is complete. To see the newly imported data, refresh the Data tab.
4. How Gen3 data is organized in Terra
When you export a project's data to a Terra workspace, the available Gen3 metadata are stored as tables in the Data tab. Each node in the Gen3 Data graph is its own separate "entity" table. For example, the Gen3 platform's "Laboratory Results" node (diagram below left) is exported to a Terra data table labeled "lab_result" (diagram below right).
To view the available data, click on one of the data tables to open it, and look closely at the column headers. The table's first column has a metadata ID header composed of the metadata (node) name followed by "_id". For example, the first column header in the "lab_result" table is "lab_result_id". Similarly, the first column header in the "subject" data table is labeled "subject_id". Each table row has its own unique metadata-specific UUID, which can connect different hierarchical levels of metadata.
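The naming rule described above (node name followed by "_id") is simple enough to express as a helper. The sketch below is a minimal example, assuming the table rows have already been loaded into Python as dictionaries, however you choose to do that. The UUIDs and alias values are invented placeholders.

```python
def id_column(table_name):
    """Build the first-column header for a table: the node name followed by '_id'."""
    return f"{table_name}_id"


def index_by_uuid(rows, table_name):
    """Key a table's rows by the UUID stored in its '<table>_id' column."""
    key = id_column(table_name)
    return {row[key]: row for row in rows}


# Hypothetical rows; the UUIDs and aliases are invented for illustration.
lab_result_rows = [
    {"lab_result_id": "uuid-lab-1", "submitter_id": "example-alias-1"},
    {"lab_result_id": "uuid-lab-2", "submitter_id": "example-alias-2"},
]

lab_results = index_by_uuid(lab_result_rows, "lab_result")
print(id_column("subject"))
print(sorted(lab_results))
```

Indexing rows by UUID up front makes later lookups between tables a single dictionary access.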
Each additional column contains a project-specific Gen3 metadata field under the associated Gen3 node, in alphabetical order by header. For example, in the "lab_result" table, the column headers contain the same metadata fields listed in the Gen3 platform's "Laboratory Results" node (shown below), such as "age_at_basophil_ncnc_bid".
Note about available columns
- Because there can be many fields under a node, the data table can be quite long, even though you may only use a few of the fields in an analysis.
- Not all projects listed in the Gen3 platform have data available for each metadata field. When you export a project to a Terra workspace, data tables only include metadata fields that are available for the exported project.
- The first column for Gen3 data in Terra is a metadata UUID. For example, the subject_id in the "subject" table is not the same as the subject identifier provided by TOPMed, which is stored in the "submitter_id" column (see below for more details). Because columns are in alphabetical order, the TOPMed subject ID you are used to can be far to the right in the table.
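Because a table can be wide and "submitter_id" can sit far to the right, it can help to build a narrower view containing only the columns you need. The sketch below is one possible approach, again assuming the rows are already loaded as Python dictionaries. The `example_field_*` names and all values are hypothetical. Fields absent from a row come back as `None`, since not every project has data for every field.

```python
def compact_view(rows, table_name, fields):
    """Keep the UUID column, submitter_id, and the requested fields, in that order."""
    columns = [f"{table_name}_id", "submitter_id", *fields]
    # Missing fields are returned as None instead of raising an error,
    # because a project may not have data for every metadata field.
    return [{column: row.get(column) for column in columns} for row in rows]


# Hypothetical rows with columns in alphabetical order after the UUID column.
subject_rows = [
    {
        "subject_id": "uuid-subject-1",
        "example_field_a": 1,
        "example_field_b": "x",
        "submitter_id": "participant-A",
    },
]

narrow = compact_view(subject_rows, "subject", ["example_field_b"])
print(narrow)
```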
Connecting Gen3 data types in Terra
1. Connecting Gen3 data with metadata UUIDs
One way to link phenotypic and genotypic data, which are on different node levels on the Gen3 platform, is with metadata UUIDs. When linking data, remember that Gen3 metadata nodes are hierarchical. All metadata related to the same immediate upstream node in the Gen3 platform will share UUIDs for the upstream node in their Terra data table. In the sample Gen3 tree below, you can see that the Subject and Publication nodes share the Study UUID, while Diagnosis, Lab result, and Demographic share the Subject UUID. Linking a lab result to its study therefore requires two steps: find the Subject UUID for the lab result, then find the Study UUID associated with that Subject.
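The two-hop link from a lab result to its study can be sketched as a pair of dictionary lookups. This is a minimal example under stated assumptions: the tables are keyed by UUID, and the `subject` and `study` column names used for the upstream references are assumed for illustration, so check the actual column headers in your own data tables. All UUIDs are invented.

```python
# Hypothetical tables keyed by UUID. The "subject" and "study" reference
# columns are assumed names for the upstream-UUID columns.
study_table = {"uuid-study-1": {"submitter_id": "example-study"}}
subject_table = {
    "uuid-subject-1": {"study": "uuid-study-1"},
    "uuid-subject-2": {"study": "uuid-study-1"},
}
lab_result_table = {
    "uuid-lab-1": {"subject": "uuid-subject-1"},
    "uuid-lab-2": {"subject": "uuid-subject-2"},
}


def study_for_lab_result(lab_result_uuid):
    """Walk lab result -> Subject -> Study using the upstream UUID at each level."""
    subject_uuid = lab_result_table[lab_result_uuid]["subject"]
    study_uuid = subject_table[subject_uuid]["study"]
    return subject_uuid, study_uuid


subject_uuid, study_uuid = study_for_lab_result("uuid-lab-2")
print(subject_uuid, study_uuid, study_table[study_uuid]["submitter_id"])
```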
2. Connecting Gen3 data with submitter_ids
Another way to link BioData Catalyst biospecimen and genotypic data back to phenotypic data is by using the "submitter_id" column in the Terra data tables. Note that because columns are in alphabetical order, the "submitter_id" column can be far to the right in the table, and you may have to scroll to find it.
The submitter_id property is the alias (a nickname or calling card) that the original source gave to a unit of submission. For example, in the TOPMed project, separate aliases identify an individual's phenotypic and genotypic data. All clinical nodes in the Gen3 graph contain the phenotypic alias in the submitter_id column (for example, "DBG..." in the image below).
To connect an individual's phenotypic data to their genotypic data, you will go through the "Sample" biospecimen node in the Gen3 graph. The submitter_id in this node will hold the sequencing ID for a participant (the "NWD..." in the image below).
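To make the path through the "Sample" node concrete, the sketch below pairs each participant's phenotypic alias with the sequencing IDs of their samples. It is a hedged illustration, not a description of the actual table layout: the `subject` column holding the upstream Subject UUID is an assumed name, and every UUID and alias is a placeholder rather than a real TOPMed identifier.

```python
# Hypothetical rows. Alias values are placeholders, and the "subject" column
# holding the upstream Subject UUID is an assumed name.
subject_rows = [
    {"subject_id": "uuid-subject-1", "submitter_id": "phenotype-alias-1"},
    {"subject_id": "uuid-subject-2", "submitter_id": "phenotype-alias-2"},
]
sample_rows = [
    {"sample_id": "uuid-sample-1", "subject": "uuid-subject-1", "submitter_id": "sequencing-id-1"},
    {"sample_id": "uuid-sample-2", "subject": "uuid-subject-2", "submitter_id": "sequencing-id-2"},
    {"sample_id": "uuid-sample-3", "subject": "uuid-subject-1", "submitter_id": "sequencing-id-3"},
]


def sequencing_ids_by_alias(subjects, samples):
    """Map each subject's phenotypic alias to the sequencing IDs of its samples."""
    alias_by_uuid = {row["subject_id"]: row["submitter_id"] for row in subjects}
    linked = {alias: [] for alias in alias_by_uuid.values()}
    for sample in samples:
        alias = alias_by_uuid.get(sample["subject"])
        # Skip samples whose subject is not in the subject table.
        if alias is not None:
            linked[alias].append(sample["submitter_id"])
    return linked


print(sequencing_ids_by_alias(subject_rows, sample_rows))
```

A subject can have more than one sample, which is why each alias maps to a list rather than a single sequencing ID.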
TOPMed Researchers: "subject_id" versus "submitter_id"
The subject_id for Gen3 data in Terra is a metadata UUID, and is not the same as the Subject identifier provided by TOPMed.
TOPMed researchers may be familiar with the terms "subject ID" to identify a study's participant ID or "NWD ID" to identify a participant's sequencing data. In the Gen3 platform, these IDs are listed in the submitter_id field.
5. Resources for organizing and analyzing Gen3 data in Terra
BioData Catalyst/Gen3-focused Showcase and Tutorial Workspaces
To get hands-on experience manipulating Gen3 data, please see the Showcase Workspaces related to Gen3 data (search for "Gen3" using Control + F in the Showcase Workspaces page). These walk you through different tools for interacting with Gen3 data, including interactive Jupyter Notebooks and batch pipelining workflows.
BioData Catalyst-specific Template Workspaces
TOPMed Aligner Gen3 Data
Workspace setup to use the TOPMed aligner with data provided by the Gen3 system
BioData Catalyst GWAS blood pressure trait
Workspace setup to perform a Genome-Wide Association Study (GWAS) for the blood pressure trait using data provided by the Gen3 system
Have additional questions or feedback? Please contact us!
You can also explore the Terra knowledge base for more general questions on using data and running workflows in Terra.