All data table attributes created by handing off data from Gen3 to Terra now include a namespace prefix. This document explains what changed, who is affected, what affected users need to do, and why the change was made.
New namespace attributes for Gen3 data imported to Terra
How to use new namespace attributes in an analysis
- Example: Terra workflow input parameter configuration
- Example: Reading data from a Terra data table using FISS
Gen3 data users impacted by new Gen3 data namespaces
Why are there new namespaces for Gen3 data?
How could these changes benefit me in the future?
New namespace attributes for Gen3 data imported to Terra
To increase interoperability, some data exported directly to a workspace data table after October 21, 2020 looks a little different: attribute names include a namespace prefix. This change currently applies to all data table attributes created by handing off data from Gen3 to Terra. It affects the attribute name (i.e., the data table column header).
Example: submitter_id attribute in data table before/after namespace
No namespace (previous attribute)
With pfb namespace (new attribute)
Greater interoperability with namespaces
This change was made to support the NIH Cloud Platform Interoperability effort (NCPI). The pfb namespace prefix identifies attributes imported via Portable Format for Biomedical Data (PFB). Using the namespace prefix prevents name conflicts and can reduce potential confusion when data comes from multiple sources.
The Portable Format for Biomedical data (PFB) namespace
The Portable Format for Biomedical Data (PFB), developed at the University of Chicago Center for Translational Data Science as part of their ongoing partnership with the Data Commons, is an efficient and portable way to serialize complex data. It is used by multiple institutions and programs to exchange biomedical data and is the exchange format of the current NIH Cloud Platform Interoperability effort.
How to use new Gen3 namespaces in an analysis
Most of the work of implementing namespaces in Terra, such as making sure workflows and notebooks recognize the prefix, happens behind the scenes. The only difference you will see when running workflows or interactive analyses is the new attribute name (as it appears in the Terra UI table column headings) when referencing data attributes. Note that you will need to use the namespace prefix when analyzing any data handed off from Gen3 after October 21, 2020.
Example: Terra workflow input parameter configuration
If you're running a workflow on Gen3 data exported to a data table, you must include the pfb prefix when configuring workflow inputs.
Workflow configuration before and after namespace
Previous attribute (no namespace)
New attribute (with pfb prefix)
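The before/after difference can be sketched in Python as a small helper that rewrites input expressions. This is an illustration only, not part of Terra: `namespace_inputs`, the workflow input names, and the attribute set are hypothetical, and the sketch assumes inputs reference table attributes using the `this.<attribute>` expression form.

```python
def namespace_inputs(inputs, gen3_attributes, prefix="pfb:"):
    """Return a copy of a workflow input mapping with Gen3 attributes namespaced.

    inputs          -- mapping of workflow input name to its expression
    gen3_attributes -- attribute names that were handed off from Gen3
    prefix          -- namespace prefix, including the trailing colon
    """
    updated = {}
    for name, expression in inputs.items():
        # Only plain "this.<attribute>" references are candidates for rewriting.
        attribute = expression[len("this."):] if expression.startswith("this.") else None
        if attribute in gen3_attributes:
            updated[name] = f"this.{prefix}{attribute}"
        else:
            # Literals and non-Gen3 attributes are left exactly as they were.
            updated[name] = expression
    return updated


# Hypothetical workflow configuration from before the namespace change.
old_inputs = {
    "my_workflow.sample_name": "this.submitter_id",
    "my_workflow.threads": "4",
}
new_inputs = namespace_inputs(old_inputs, {"submitter_id"})
print(new_inputs["my_workflow.sample_name"])  # this.pfb:submitter_id
```

Passing the Gen3 attribute names explicitly keeps attributes you added yourself from being renamed by accident.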
Example: Reading data from a Terra data table using FISS
FireCloud Service Selector (FISS) is a Python module that lets a notebook make API (Application Programming Interface) calls to the workspace. Namespace support changes how attribute names must be written when using FISS to read attributes from a data table.
- Without namespace
Reading a specific attribute from a data table using FISS was previously written as
response = fapi.get_entities_tsv(BILLING_PROJECT, WORKSPACE, "sample", "submitter_id", model="flexible")
- With namespace
The same read is now written as (notice the colon after pfb!)
response = fapi.get_entities_tsv(BILLING_PROJECT, WORKSPACE, "sample", "pfb:submitter_id", model="flexible")
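Once the table has been fetched, the namespaced column name is also the key to use when picking values out of the result. The sketch below is a minimal illustration that assumes the response body is tab-separated text (for example `response.text`). To stay runnable without a workspace it parses a hard-coded sample string instead of a live call; `read_attribute` and the sample values are made up for this example.

```python
import csv
import io


def read_attribute(tsv_text, attribute):
    """Return the values of one column from tab-separated table text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    columns = reader.fieldnames or []
    if attribute not in columns:
        # A missing "pfb:" prefix is the most likely cause after the change.
        raise KeyError(f"column {attribute!r} not found; available: {columns}")
    return [row[attribute] for row in reader]


# Stand-in for the text of a table handed off after the namespace change.
sample_tsv = (
    "entity:sample_id\tpfb:submitter_id\n"
    "s1\tsubject-A\n"
    "s2\tsubject-B\n"
)

values = read_attribute(sample_tsv, "pfb:submitter_id")
print(values)  # ['subject-A', 'subject-B']
```

Asking for `"submitter_id"` without the prefix raises a `KeyError` that lists the available columns, which makes a forgotten prefix easy to spot.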
Combining data imported from Gen3 before and after namespace change (October 21) - what you need to know
We recommend against mixing data handed off before the namespace change with data that includes the namespace, as it will leave a messy mix of duplicated data with and without the namespace prefix in different columns of the same data table.
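To check whether a table already contains this kind of mix, you can compare its column headers. The following is a rough sketch, not a Terra feature: `find_mixed_attributes` is a hypothetical helper, and the header list is invented for illustration.

```python
def find_mixed_attributes(headers, prefix="pfb:"):
    """Return attribute names that appear both with and without the prefix."""
    plain = {h for h in headers if not h.startswith(prefix)}
    namespaced = {h[len(prefix):] for h in headers if h.startswith(prefix)}
    return sorted(plain & namespaced)


# Hypothetical column headers from a table that received both kinds of hand-off.
headers = ["entity:sample_id", "submitter_id", "pfb:submitter_id", "pfb:project_id"]

mixed = find_mixed_attributes(headers)
print(mixed)  # ['submitter_id']
```

An empty result means no attribute is duplicated across the two naming styles in that table.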
If you have handed off data from Gen3 to your workspace
Please contact us if you have any questions or
What do you need to do?
Going forward, any old data needs to be migrated to consistently use the namespace prefix. There are two different approaches to ensure all attributes handed off from Gen3 (before and after the change) use the namespace.
With either of these approaches, first clone your existing
Option 1: Modify existing workspace
Challenges to consider
The challenge with this approach is that you have to delete the existing data tables that use the old attribute names, because there is no way to modify the table column headers, and deleting very large tables in Terra is very slow. For large tables (a few thousand rows or more), deletion can take many minutes. Large numbers of very large tables can take hours. Additionally, you cannot use the Terra UI to delete large tables.
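Because column headers cannot be edited in place, one way to prepare a replacement table is to rewrite the header row of a downloaded TSV before loading it again. The sketch below assumes that workflow, which is only one possible way to carry out this option. `prefix_header` and the column names are hypothetical, and the set of Gen3 attribute names has to be supplied by you.

```python
def prefix_header(header_line, gen3_attributes, prefix="pfb:"):
    """Add the namespace prefix to the Gen3 columns of a TSV header row."""
    columns = header_line.rstrip("\n").split("\t")
    renamed = [
        # Only columns known to come from Gen3 are renamed; the entity ID
        # column and your own columns are passed through unchanged.
        f"{prefix}{column}" if column in gen3_attributes else column
        for column in columns
    ]
    return "\t".join(renamed)


# Hypothetical header row from a table handed off before the change.
old_header = "entity:sample_id\tsubmitter_id\tmy_notes"

new_header = prefix_header(old_header, {"submitter_id"})
print(new_header.split("\t"))  # ['entity:sample_id', 'pfb:submitter_id', 'my_notes']
```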
How to delete large data tables
Note that tables can be fairly conveniently deleted using the
The name of the function to use for deleting all workspace
Option 2: Create a new workspace
Don't lose data in the original workspace bucket
Remember that any data you generated in the original
If you think you will want any generated data from the
Challenges to consider
The challenge with this option is moving data that didn’t come from Gen3 (i.e. data you brought to the original workspace bucket and generated data) to the new workspace. However, this may be the simplest, cleanest, and fastest approach, depending on how much of your own data you have in the original workspace data tables and workspace bucket.
Any data you have in the “Cloud Environment” virtual machine or its Persistent Disk will continue to be available in both the current and new workspaces if both workspaces are in the same billing project.
Gen3 users not affected by the namespace change
If you fit any of the categories below, you do not have to worry about the change to Gen3 data exported to a workspace.
- Brought all your own data and have not imported any data from Gen3 to your workspace
This change only applies to data imported via PFB, and everything will continue to work as before.
- Handed off data from Gen3 to your workspace prior to October 21, 2020, have all the data you need, and will not be handing off any data from Gen3 in the future
Data handed off prior to this change is not affected by this change. Everything will continue to work as it did before, without the namespace prefix.
- Imported (or will import) Gen3 data to a workspace after the namespace change
In this case, all the data from Gen3 consistently uses the namespace. Note that there may be some issues with current user content (documentation, tutorial workspaces, notebooks, etc.) until we complete the content migration.
What else is changing related to this?
Updating Terra resources to reflect namespace support
The implementation of namespacing for data imported via PFB was fast-tracked to support the broader NCPI effort. The Terra team is still in the process of making corresponding updates to content in Terra including:
- User documentation
- Python utility libraries
We are working through this review/update process as quickly as we can. If you notice a resource is not working the way it should, please let us know and we will prioritize it appropriately.
Looking forward: Common PFB Attributes
To better support interoperability between systems - i.e. working with data from multiple sources and datasets - the NCPI has defined a small set of PFB Common Attributes to be used consistently across systems. The Gen3 team is in the process of adding these attributes to the Gen3 BioData Catalyst data model. In some cases, these common attributes represent data that is already in the Gen3 BioData Catalyst data model with a different name. In such cases, both the existing name and the new common name will be present, both with the same value. A list of the NCPI PFB Common attributes is available here.
Future namespace benefits
Where and how the namespace feature evolves depends on feedback from you, our users!
One possibility is to enable the use of a more unique and descriptive namespace value than “pfb”: the user could specify the namespace value in the Terra import form as part of the hand off process.
Another option is to make the namespace value the name of the program/portal from which the data came (“bdcat”, “anvil”, etc.) or the name of a specific data model and version, etc. This may help to identify the data when working with it, and could potentially facilitate advanced use cases such as having data from multiple programs/portals and data models in the same Terra workspace, with the data for each namespaced appropriately.
Please let us know what you think!
Thank you, and again, please let us know if you have questions or would like help with migration!