Heads up if you use the TARGET and/or TCGA data workspaces: we're planning an update that will change the paths to the data files. This will disable access to the original files in any clones of the original workspaces that retain the old paths, and it will affect call caching in any new clones of the original workspaces that you create after the update rolls out in a few weeks.
Read on to understand what this change entails and why it's an important improvement that is worth making.
The file paths will have a new URL structure. After the update, file paths will no longer start with "gs://" followed by a bucket path. Instead, they will follow this structure: "dos://<UUID>".
What are the consequences for your work?
- If you are working in clones of the original master workspaces, the original data files will no longer be available at the old "gs://" paths. So if you want to run new analyses in those previously existing workspaces, you will need to update the metadata to use the new "dos://" paths. However, any output files you previously derived from analyzing those datasets will be unaffected and will remain accessible.
- The first time you run workflows on the data using the new paths, in any workspace (old or new), call caching will NOT kick in even if you have previously run those workflows on the same data. The workflows will have to run in full; this is because call caching uses the file paths as part of the algorithm that identifies whether a given computation has already been run with the same starting conditions.
Why are we making this change?
We understand that this update has the potential to disrupt your work, so please rest assured we are not making this decision lightly. In a nutshell, the switch in file path structure brings substantial benefits that we believe are worth risking some disruption.
In the past, the Genomic Data Commons (GDC) only released new datasets a few times a year, but the frequency at which we receive data has been increasing steadily, which has been making it more difficult to manage the new content and make it available in a timely way. This update will enable us to respond more efficiently and in a standardized manner to new datasets released by the GDC. As a result, you can expect to see updates to TARGET and TCGA workspaces happen much more quickly than in the past, and we'll also be able to provide access to new GDC datasets from our platform as they become available.
For context, this move is part of larger shift within the GDC towards location-agnostic URLs, which allow the physical data to relocate without changing references to those data. This follows changes in standards of how datasets are stored across the Data Commons Framework (DCF), which we expect will provide a more standardized and streamlined experience going forward. We are keen to align with the GDC on this effort and we look forward to seeing its benefits materialize for everyone.