Understanding and using Gen3 data in Terra

Liz Kiernan

Gen3 is an open-source, cloud-based platform that helps researchers store and search data hosted by consortia like NHLBI BioData Catalyst. Learn how to access Gen3 data in Terra and describe the structure of data on both the Gen3 and Terra platforms. 

Note that AnVIL data is now provided through the AnVIL Data Explorer. The AnVIL Gen3 data portal is no longer available.

To learn how to link your authorization to Terra and bring data from Gen3 to a Terra workspace, see this four-minute video.

Accessing the Gen3 platform (BioData Catalyst users)

To export Gen3 data to Terra, access the Gen3 platform for BioData Catalyst in one of two ways: go directly to https://gen3.biodatacatalyst.nhlbi.nih.gov/ or go through the Terra Data Library.

Once inside the Gen3 platform, follow the guidelines in Sections 2 and 3 below to find the data you need and export it to Terra.

How to access Gen3 via the Terra Data Library

1. Click on the Terra main menu (the three bar icon in the upper left).

2. From the Library drop-down, choose Data.

Data_library_datastage.png

3. Scroll down the library page to the NHLBI TOPMed icon.

4. Select TopMed presented by NHLBI BioData Catalyst. This will redirect you to the Gen3 platform. From there, you can learn about the structure of Gen3 data, explore and filter the data, and export them to Terra using the information in Sections 2 and 3 below.

Gen3 (BioData Catalyst) data structure

Before working with Gen3 data in Terra, it's helpful to understand how data is structured and organized on the Gen3 platform. The Gen3 data "Dictionary" gives an overview of the data model for the NHLBI BioData Catalyst project. You can access the complete graph structure in the Dictionary by following the https://gen3.biodatacatalyst.nhlbi.nih.gov/DD link or by selecting the Dictionary icon in the upper right of the Gen3 platform.

Gen3 metadata is organized by hierarchical nodes

The graph structure includes metadata entities as individual nodes in a tree-like format.  Hierarchical nodes are connected by a line. For example, in the Gen3 Dictionary for BioData Catalyst shown below, all clinical metadata nodes (in blue) are tied together by a "Subject" node. "Subject" refers to each participant in a study. The "Subject" node is further connected to participants' biospecimen and sequencing data through the "Sample" node. 

Subject_Clinical_Data.png

Gen3 nodes are connected by UUIDs

Lines in the diagram above show how metadata nodes are connected. Each entry in a Gen3 node has a Universally Unique Identifier (UUID). The UUIDs help you connect a subject's metadata variables together (following the lines you see in the graph). When you select one of the nodes and open the property box, you will find a property that represents the node immediately upstream of the one you selected. For example, in the Laboratory Results node, you will see a metadata field named "Subjects". This field contains UUIDs for the "Subject" node that is immediately upstream of the Laboratory Results node. All Gen3 nodes have a UUID field for the node immediately upstream. These UUIDs are important for manipulating Gen3 data in Terra and are discussed in more detail in Section 4.
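As a rough illustration of how an upstream UUID field ties entries together, the sketch below models two nodes as Python dictionaries. The UUID values, record contents, and the helper name find_upstream are hypothetical; only the idea that each Laboratory Results entry stores the UUID of its upstream Subject entry comes from the description above.

```python
# Hypothetical entries: every entry is keyed by its own UUID, and each
# lab result also stores the UUID of the Subject entry upstream of it.
subjects = {
    "11111111-aaaa-4bbb-8ccc-000000000001": {"submitter_id": "subject-A"},
    "11111111-aaaa-4bbb-8ccc-000000000002": {"submitter_id": "subject-B"},
}

lab_results = {
    "22222222-aaaa-4bbb-8ccc-000000000001": {
        "subjects": "11111111-aaaa-4bbb-8ccc-000000000001",
        "hdl": 55,
    },
    "22222222-aaaa-4bbb-8ccc-000000000002": {
        "subjects": "11111111-aaaa-4bbb-8ccc-000000000002",
        "hdl": 61,
    },
}


def find_upstream(lab_result_uuid):
    """Follow the 'subjects' UUID from a lab result to its Subject entry."""
    subject_uuid = lab_results[lab_result_uuid]["subjects"]
    return subjects[subject_uuid]


subject = find_upstream("22222222-aaaa-4bbb-8ccc-000000000001")
print(subject["submitter_id"])
```

The same lookup pattern applies one level at a time: each node only knows the UUID of the node directly above it.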

Selecting Gen3 nodes allows you to explore the available metadata fields

For example, selecting "Laboratory Results" (in the above image) will open a table (see the TOPMed dataset example below) of all the possible metadata fields that fall in the node, as well as descriptions of each field. Note: Some metadata fields are optional. The "Required" column of the table identifies those fields that are required if that node is available for a project. For example, the "Subjects" property is required because it holds the UUIDs discussed above.

Gen3_node_UUID.png

Remember - not every study will have all the clinical nodes you see in the graph. The graph represents all metadata available for the entire BioData Catalyst project.

Gen3 metadata may be harmonized

Harmonized phenotypic data variables, which are consistently defined across BioData Catalyst projects, enable unified analyses across multiple projects. The TOPMed Data Coordinating Center selects variables for harmonization with the assistance of study and phenotype experts. You can identify harmonized variables by the "HARMONIZED" tag in the "Description" column of each Gen3 metadata table. Unharmonized metadata and other various genomic files (such as multisample VCFs) from a project can be found in the "Reference File" node that is connected by a black line below the "Project" node in the Gen3 dictionary. 
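If you copy a node's property table out of the Dictionary, the tag makes harmonized variables easy to pick out programmatically. The following is a minimal sketch under stated assumptions: the row layout, the description text, and the function name harmonized_properties are invented for illustration and are not an export format provided by Gen3.

```python
# Hypothetical rows transcribed from a Gen3 Dictionary node table.
# The description strings are placeholders, not real Dictionary text.
dictionary_rows = [
    {"property": "hdl", "description": "HARMONIZED: placeholder description"},
    {"property": "ldl", "description": "HARMONIZED: placeholder description"},
    {"property": "submitter_id", "description": "Placeholder description"},
]


def harmonized_properties(rows):
    """Return the names of properties whose description carries the tag."""
    return [row["property"] for row in rows if "HARMONIZED" in row["description"]]


print(harmonized_properties(dictionary_rows))
```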

How to access and filter Gen3 data and export to Terra

To find the data you need for your BioData Catalyst analysis, explore the metadata fields on the Gen3 platform using the directions above. Once you have identified your data, you can filter and export the Gen3 data to a Terra workspace following the steps below.

Accessing Gen3 data: Step-by-step instructions

1. Navigate to the Gen3 platform.

Go to https://gen3.biodatacatalyst.nhlbi.nih.gov/ directly or via Terra's Data Library.

2. In the top menu bar, select the Exploration tab.

On the left-hand side is the option to filter data using three specific metadata entities: Medical History, Diagnosis, and Subject. 

Exploration_Tab.png

3. Filter to select the data you want.

  • Select the Subject tab and scroll down to the Project id for your project of interest.

    Poject_ids.png

    To further filter by some of the harmonized variables, select the "Medical History" or "Diagnosis" tabs and filter. This can help you narrow your results to the specific individuals within a project who match your filter.

  • To identify which projects have variables of interest to you, you can skip selecting a project and go straight to the "Medical History" or "Diagnosis" tabs to filter by some of the harmonized variables.

4. Click on the checkbox to the left of the Project ID.

After a slight delay (~10 seconds), an "Export All to Terra" option will become available in the top menu of the platform. You can use this button to export all data for the entire project.  

Export_all_to_Terra.png

5. After filtering, select the Export All to Terra icon.

An export in progress banner will appear at the bottom of the page, as well as a progress circle in the middle. This can take a few minutes. When the export is complete, you will be redirected to the Terra website. A prompt will ask you to select an existing workspace or to create a new workspace.

Import_to_Terra.png

6. Choose a destination workspace or create a new one.

Covering costs and protecting data in a new workspace

If you're setting up a workspace for the first time, you need to assign a Billing project to cover Google Cloud charges incurred in the new workspace.

If the data are controlled, you need to set up an Authorization Domain that includes those users who have the authorization to access the data. Note: You need to assign an Authorization Domain before creating the workspace.

To learn more about billing projects and authorization domains, see these Terra articles on billing information and authorization domains. 

Template workspace

You can select from template workspaces tailored to different Gen3 data analysis needs. The number and variety of template workspaces are growing.

Current template workspaces

  • BioData Catalyst GWAS 1000 Genomes Tutorial
  • TOPMed Aligner Gen3 Data
  • BioData Catalyst GWAS blood pressure trait

7. Select the Import icon.

You will be directed to the destination Terra workspace. There will be an import notification in the upper right of the workspace. 

When will data appear?

Terra data imports occur asynchronously, and data will appear in the Data tab gradually within a few minutes. A new notification will appear when the asynchronous import is complete. To see the newly imported data, refresh the Data tab.

How Gen3 data is organized in Terra

When you export a project's data to a Terra workspace, the available Gen3 metadata are stored as tables in the Data tab. Each node in the Gen3 Data graph is its own separate "entity" table. For example, the Gen3 platform's "Laboratory Results" node (diagram below left)  is exported to a Terra data table labeled "lab_result" (diagram below right).

Metadata_to_Terra.png

Explore data

To view available data, click on one of the data tables to open it, and look closely at the column headers. The table's first column has a metadata ID header composed of the metadata (node) name followed by "_id". For example, the first column header in the "lab_result" table is "lab_result_id". Similarly, the first column header in the "subject" data table is labeled "subject_id".  Each table row has its own unique metadata-specific UUID, which can connect different hierarchical levels of metadata. 

Each additional column contains a project-specific Gen3 metadata field under the associated Gen3 node, with columns in alphabetical order by header. For example, in the lab_result table, the column headers contain metadata fields listed in the Gen3 platform's "Laboratory Results" node (shown below), such as "hdl" and "ldl":

Understanding-Gen3-data_lab-results-table_Screen_shot.png

Note the "pfb" namespace prefix in the column headers

To learn more about how Terra uses the pfb namespace to increase interoperability, and how to run workflows on data in a data table with namespaces, see Data table attribute namespace support (pfb prefix).
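When you work with these column headers in a notebook, it can be convenient to compare them against the plain Gen3 field names. The sketch below assumes the prefix is written as "pfb:" in front of the field name (for example "pfb:hdl"); check the headers in your own table, since that exact form is an assumption here. The function name strip_namespace and the sample headers are hypothetical.

```python
def strip_namespace(header, namespace="pfb"):
    """Remove a leading '<namespace>:' from a column header, if present."""
    prefix = namespace + ":"
    if header.startswith(prefix):
        return header[len(prefix):]
    return header


# Hypothetical headers from a "lab_result" data table.
headers = ["lab_result_id", "pfb:hdl", "pfb:ldl", "pfb:submitter_id"]
plain_headers = [strip_namespace(h) for h in headers]
print(plain_headers)
```

Headers without the prefix pass through unchanged, so the helper is safe to apply to every column.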

Note about available columns

  • Not all projects listed in the Gen3 platform have data available for each metadata field. When you export a project to a Terra workspace, data tables include only the metadata fields available for the exported project. 
  • The first column for Gen3 data in Terra is a metadata UUID. For example, the subject_id in the "subject" table is not the same as the Subject identifier provided by TOPMed (which is the "submitter_id" - see below for more details on this!). The TOPMed subject ID you are used to can be quite far along to the right end of the table.  
  • Because there can be many fields under a node, the data table can be quite long, even though you may only use a few of the fields in an analysis.

    metadata_fields.png

Connecting Gen3 data in different workspace data tables

1. Connecting Gen3 data with metadata UUIDs

Metadata UUIDs can link phenotypic data to genotypic data - which are on different node levels on the Gen3 platform. When linking data, remember that Gen3 metadata nodes are hierarchical. All metadata related to the same immediate upstream node in the Gen3 platform share UUIDs for the upstream node in their Terra data table. In the sample Gen3 tree below, you can see that the Subject and Publication nodes share the Study UUID, while Diagnosis, Lab result, and Demographic share the Subject UUID. To link the Lab result to the study requires knowing both the Subject UUID and the Study UUID associated with that Subject. 

Tree_diagram4.png
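The two-step lookup described above (Lab result to Subject, then Subject to Study) can be sketched in a few lines of Python. This is an illustration only: the rows, UUID placeholders, and the column names "subject" and "study" used for the upstream references are assumptions, so substitute the headers you see in your own data tables.

```python
# Hypothetical rows from two Terra data tables. The "subject" and "study"
# columns are assumed names for the upstream-UUID columns.
lab_result_table = [
    {"lab_result_id": "lab-uuid-1", "subject": "subj-uuid-1", "hdl": 55},
    {"lab_result_id": "lab-uuid-2", "subject": "subj-uuid-2", "hdl": 61},
]
subject_table = [
    {"subject_id": "subj-uuid-1", "study": "study-uuid-1"},
    {"subject_id": "subj-uuid-2", "study": "study-uuid-1"},
]


def study_for_lab_results(lab_rows, subject_rows):
    """Map each lab result UUID to its Study UUID by way of the Subject UUID."""
    subject_to_study = {row["subject_id"]: row["study"] for row in subject_rows}
    return {
        row["lab_result_id"]: subject_to_study[row["subject"]]
        for row in lab_rows
    }


links = study_for_lab_results(lab_result_table, subject_table)
print(links)
```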

2. Connecting Gen3 data with submitter_ids

Another way to link BioData Catalyst biospecimen and genotypic data back to phenotypic data is by using the "submitter_id" column in the Terra data tables. Note: Because columns are in alphabetical order, the "submitter_id" column can be far to the right in the table, and you may have to scroll to find it!

The submitter_id property is the calling card/nickname/alias for a unit of submission from its original source. For example, in the TOPMed project, there are aliases given to identify an individual's phenotypic and genotypic data. At the hierarchy of clinical nodes in the Gen3 graph, all clinical nodes will contain the phenotypic alias in the submitter_id column (for example, "DBG..." in the image below).

To connect an individual's phenotypic data to their genotypic data, you will go through the "Sample" biospecimen node in the Gen3 graph. The submitter_id in this node will hold the sequencing ID for a participant (the "HG..." in the image below).

lab results table - submitter_id
Understanding-Gen3-data_Lab-reswults-table-submitter-id_Screen_shot.png

sample table - submitter_id
Understanding-Gen3-data_Sample-table-submitter-id_Screen_shot.png

TOPMed Researchers: "subject_id" versus "submitter_id"

The subject_id for Gen3 data in Terra is a metadata UUID, and is not the same as the Subject identifier provided by TOPMed.

TOPMed researchers may be familiar with the terms "subject ID" to identify a study's participant ID or "NWD ID" to identify a participant's sequencing data. In the Gen3 platform, these IDs are listed in the submitter_id field.
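To make the submitter_id relationship concrete, the sketch below pairs each phenotypic alias with the sequencing IDs of the same participant. It assumes that rows in both the clinical table and the "sample" table carry the upstream Subject UUID in a column called "subject"; that column name, the placeholder IDs, and the function name alias_to_sequencing_ids are all hypothetical.

```python
# Hypothetical rows. In the clinical table, submitter_id holds the
# phenotypic alias; in the sample table, it holds the sequencing ID.
lab_result_rows = [
    {"submitter_id": "phenotype-alias-1", "subject": "subj-uuid-1"},
    {"submitter_id": "phenotype-alias-2", "subject": "subj-uuid-2"},
]
sample_rows = [
    {"submitter_id": "sequencing-id-1", "subject": "subj-uuid-1"},
    {"submitter_id": "sequencing-id-2", "subject": "subj-uuid-2"},
]


def alias_to_sequencing_ids(clinical_rows, samples):
    """Pair each phenotypic alias with the sequencing IDs of the same subject."""
    samples_by_subject = {}
    for row in samples:
        samples_by_subject.setdefault(row["subject"], []).append(row["submitter_id"])
    return {
        row["submitter_id"]: samples_by_subject.get(row["subject"], [])
        for row in clinical_rows
    }


pairs = alias_to_sequencing_ids(lab_result_rows, sample_rows)
print(pairs)
```

A list is used for the sequencing IDs because a participant may have more than one sample; a subject with no sample rows maps to an empty list.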

Resources for organizing and analyzing Gen3 data in Terra

BioData Catalyst/Gen3-focused Showcase and Tutorial Workspaces 

To get hands-on experience working with Gen3 data, see the Showcase Workspaces related to Gen3 data (search for "Gen3" using Control + F in the Showcase Workspaces page). These walk you through different tools for interacting with Gen3 data, including interactive Jupyter Notebooks and batch pipelining workflows. 

  • TOPMed Aligner Gen3 Data
    Workspace setup to use TOPMed aligner with data provided by the Gen3 system

  • BioData Catalyst GWAS blood pressure trait
    Workspace setup to perform a Genome-Wide Association Study (GWAS) for the blood pressure trait using data provided by the Gen3 system

Have additional questions or feedback? Please contact us!

Explore Terra Support for more general questions on using data and running workflows in Terra.
