Data table attribute namespace support (pfb prefix)

Allie Hajian

All data table attributes created by handing off data from Gen3 to Terra now include a namespace prefix. This document explains what changed, who is impacted, what impacted users need to do, and why the change was made.

Compliance when using controlled-access (PFB) data

PFB imports of controlled-access data (including from external repositories) to Terra on Google workspaces will now add the Additional Security Monitoring policy to the workspace.

New namespace attributes for Gen3 data imported to Terra

To increase interoperability, some data exported directly to a workspace data table after October 21, 2020 looks a little different: attribute names include a namespace prefix. This change currently applies to all data table attributes created by handing off data from Gen3 to Terra. It affects only the attribute name (i.e., the data table column header); the values in the column are unchanged.

Example: submitter_id attribute in data table before/after namespace

Previous column header (no namespace)    New column header (with pfb namespace)
submitter_id                             pfb:submitter_id
NA19455                                  NA19455
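
As a rough illustration of the renaming shown in the table above, the mapping from an old column header to its namespaced form can be sketched in Python. The helper name `add_pfb_namespace` is hypothetical and is not part of Terra or FISS; it only mirrors the before/after example.

```python
PFB_PREFIX = "pfb:"  # namespace prefix described in this article


def add_pfb_namespace(attribute_name: str) -> str:
    """Return the attribute name with the pfb prefix, adding it only if missing."""
    if attribute_name.startswith(PFB_PREFIX):
        return attribute_name
    return PFB_PREFIX + attribute_name


# Only the column header changes; cell values such as NA19455 stay the same.
old_header = "submitter_id"
new_header = add_pfb_namespace(old_header)
print(old_header, "->", new_header)
```

The check for an existing prefix keeps the helper safe to apply to a list of headers that already contains some namespaced names.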

Greater interoperability with namespaces

This change supports the NIH Cloud Platform Interoperability effort (NCPI). The pfb namespace prefix identifies attributes imported via Portable Format for Biomedical data (PFB). Using the namespace prefix prevents name conflicts and can reduce potential confusion when data comes from multiple sources.

The Portable Format for Biomedical data (PFB) namespace

The Portable Format for Biomedical Data (PFB) - developed at the University of Chicago Center for Translational Data Science as part of their ongoing partnership with the Data Commons - is an efficient and portable way to serialize complex data.  It is used by multiple institutions/programs to exchange biomedical data and is the exchange format of the current NIH Cloud Platform Interoperability effort.

How to use Gen3 pfb namespaces in an analysis

Most of the work of implementing namespaces in Terra - such as making sure workflows and notebooks recognize the prefix - happens behind the scenes. The only difference you will see when running workflows or interactive analyses is the new attribute name (as it appears in column headings in data tables in a Terra workspace) when referencing data attributes.

Note that you will need to use the namespace when analyzing any data handed off from Gen3 after 10-21-2020.  

Example: Terra workflow inputs configuration

If you're running a workflow on Gen3 data exported to a data table, you must include the pfb prefix when configuring the workflow inputs.

[Screenshots: Workflow configuration before and after namespace. Panels: Previous attribute (no namespace); New attribute (with pfb prefix)]

Example: Reading data from a Terra data table using FISS

FireCloud Service Selector (FISS) is a Python module that lets a notebook make API (Application Programming Interface) calls to the workspace. Namespace support changes the attribute name you pass when using FISS to read attributes from a data table.

  • Without namespace
    Reading a specific attribute from a data table using FISS was previously written as
    from firecloud import api as fapi
    response = fapi.get_entities_tsv(BILLING_PROJECT, WORKSPACE, "sample", "submitter_id", model="flexible")

  • With namespace
    The same read is now written as (notice the colon after pfb!)
    response = fapi.get_entities_tsv(BILLING_PROJECT, WORKSPACE, "sample", "pfb:submitter_id", model="flexible")
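
Once the table text comes back, the namespaced header is used like any other column name. The sketch below parses a small hand-written TSV string standing in for the body of the response. The sample row, the `entity:sample_id` column, and the helper name `read_column` are assumptions made for illustration; they are not output from a real workspace.

```python
import csv
import io


def read_column(tsv_text: str, column: str) -> list:
    """Return the values of one column from tab-separated text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    if column not in (reader.fieldnames or []):
        raise KeyError(f"column {column!r} not found; is the 'pfb:' prefix missing?")
    return [row[column] for row in reader]


# Hand-written stand-in for a data table export (illustrative only).
sample_tsv = (
    "entity:sample_id\tpfb:submitter_id\n"
    "sample-1\tNA19455\n"
)

print(read_column(sample_tsv, "pfb:submitter_id"))
```

Asking for the old name (`submitter_id`) against namespaced data raises a `KeyError` here, which is the kind of failure to expect when a notebook written before the change is run on newly imported data.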

How to combine data imported from Gen3 before and after October 21, 2020

We recommend against mixing data handed off before the namespace change with data that includes the namespace, as it will leave a messy mix of duplicated data with and without the namespace prefix in different columns of the same data table.
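
To check whether a table has already ended up in this mixed state, one option is to compare its column headers. This is a minimal sketch, assuming you already have the headers as a list of strings; `find_mixed_attributes` and the example headers are hypothetical and not part of any Terra library.

```python
PFB_PREFIX = "pfb:"


def find_mixed_attributes(headers):
    """Return attribute names that appear both with and without the pfb prefix."""
    plain = {h for h in headers if not h.startswith(PFB_PREFIX)}
    namespaced = {h[len(PFB_PREFIX):] for h in headers if h.startswith(PFB_PREFIX)}
    return sorted(plain & namespaced)


# Hypothetical headers from a table that received data before and after the change.
headers = ["sample_id", "submitter_id", "pfb:submitter_id", "pfb:object_id"]
print(find_mixed_attributes(headers))
```

An empty result means no attribute is duplicated across the two naming schemes in that table.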

Am I impacted by this change?

If you have handed off data from Gen3 to your workspace prior to the namespace change and will be adding additional data from Gen3 in the future, you will be impacted by this change!

Please contact us if you have any questions or would like assistance with the migration process. We are happy to help!

What do you need to do?

Going forward, any old data needs to be migrated to consistently use the namespace prefix. There are two different approaches to ensure all attributes handed off from Gen3 (before and after the change) use the namespace.

Avoid losing data as you migrate

With either approach, first clone your existing workspace to have the original tables available as a backup if needed!

Option 1: Modify an existing workspace

In this case, you would delete data tables in your workspace created by a previous handoff from Gen3 and hand off data from Gen3 to this workspace again. An advantage to this approach is you would maintain all data in the workspace bucket (such as data generated by a previous workflow analysis or any data you copied into the bucket).

Challenges to consider

The challenge with this approach is that deleting very large tables in Terra is very slow. (You have to delete existing data tables with the old attribute names because there is no way to modify the table column headers.) For large tables (a few thousand rows or more), deletion can take many minutes. Large numbers of very large tables can take hours. Additionally, you cannot use the Terra UI to delete large tables.

How to delete large data tables

Note that tables can be deleted fairly conveniently using the `terra_data_table_util` notebook if you are willing to wait. This notebook (terra_data_table_util.ipynb) is included in the BioData Catalyst Collection workspace.

The name of the function to use for deleting all workspace tables is delete_all_gen3_tables.

Option 2: Create a new workspace

Using this option, you would hand off the desired data from Gen3 to a new workspace, then manually migrate any additional data you need from your current workspace to the new workspace.

Avoid losing data in the original workspace bucket

Remember that any data you generated in the original workspace will be stored in the original workspace bucket.

If you think you will want any generated data from the original workspace in the future, make sure you do not delete the original workspace.

Challenges to consider

The challenge with this option is you will need to move data that didn’t come from Gen3 (i.e., data you uploaded to the original workspace bucket as well as any generated data) to the new workspace. However, this may be the simplest, cleanest, and fastest approach, depending on how much of your own data you have in the original workspace data tables and workspace bucket.

Any data you have in the “Cloud Environment” virtual machine or its Persistent Disk will continue to be available in both the current and new workspaces if both workspaces are in the same billing project.

Gen3 users not affected by the namespace change 

If you fit any of the categories below, you do not have to worry about the change to Gen3 data exported to a workspace.

  • You brought all your own data and have not imported any data from Gen3 to your workspace
    This change only applies to data imported via PFB, and everything will continue to work as before.
  • You handed off data from Gen3 to your workspace prior to 10-21-2020, have all the data you need, and will not be handing off any data from Gen3 in the future
    Data handed off prior to this change is not affected by it. Everything will continue to work as it did before, without the namespace prefix.
  • You imported (or will import) Gen3 data to a workspace after the namespace change
    In this case, all the data from Gen3 consistently uses the namespace. Note there may be some issues with the current user content (documentation, tutorial workspaces, notebooks, etc.) until we complete the content migration.

What else is changing related to this?

Updating Terra resources to reflect namespace support

The implementation of namespacing for data imported via PFB was fast-tracked to support the broader NCPI effort. The Terra team is still in the process of making corresponding updates to content in Terra including:

  • User documentation

  • Workspaces

  • Notebooks

  • Python utility libraries

We are working through this review/update process as quickly as we can. If you notice a resource is not working the way it should, please let us know and we will prioritize it appropriately.

Looking forward: Common PFB Attributes

To better support interoperability between systems - i.e., working with data from multiple sources and datasets - the NCPI has defined a small set of PFB Common Attributes to be used consistently across systems. The Gen3 team is in the process of adding these attributes to the Gen3 BioData Catalyst data model. In some cases, these common attributes represent data that is already in the Gen3 BioData Catalyst data model with a different name. In such cases, both the existing name and the new common name will be present, both with the same value. A list of the NCPI PFB Common attributes is available here

Future namespace benefits

Where and how the namespace feature evolves depends on feedback from you, our users.

One possibility is to allow a more distinctive and descriptive namespace value than “pfb”: the user could specify the namespace value in the Terra import form as part of the handoff process.

Another option is to make the namespace value the name of the program/portal from which the data came (“bdcat”, “anvil”, etc.) or the name of a specific data model and version, etc. This may help to identify the data when working with it, and could potentially facilitate advanced use cases such as having data from multiple programs/portals and data models in the same Terra workspace, with the data for each namespaced appropriately.
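
To make the multi-source idea concrete, here is a rough sketch of how column headers could be grouped by namespace if such values were ever assigned. Nothing here reflects an implemented Terra feature: the helper names are hypothetical, and "bdcat" and "anvil" are just the example values mentioned above.

```python
from collections import defaultdict


def split_namespace(attribute_name):
    """Split 'namespace:name' into (namespace, name); namespace is None if absent."""
    namespace, sep, name = attribute_name.partition(":")
    if not sep:
        return None, attribute_name
    return namespace, name


def group_by_namespace(headers):
    """Group column headers by their namespace prefix."""
    groups = defaultdict(list)
    for header in headers:
        namespace, name = split_namespace(header)
        groups[namespace].append(name)
    return dict(groups)


# Hypothetical headers; these namespaces are not ones Terra assigns today.
headers = ["bdcat:submitter_id", "anvil:submitter_id", "notes"]
print(group_by_namespace(headers))
```

With per-source namespaces, two attributes that share a name (here `submitter_id`) stay distinguishable in the same table.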

Please let us know what you think, or if you have questions or would like help with migration.
