Updated hg19 TCGA BAM links causing error in multiple workflows
I have been attempting to work with the TCGA hg19-aligned BAMs that have updated paths, and I have been receiving an error in two different workflows. One workflow uses samtools to get statistics from the BAM index file. The other workflow uses GATK DepthOfCoverage to get coverage information from the BAM. I've attached a picture of the specific error that I received in both workflows. This is an example of one BAM URL that has been causing issues: gs://gdc-tcga-phs000178-controlled/PRAD/DNA/WXS/BI/ILLUMINA/C529.TCGA-ZG-A9NI-10A-01D-A41N-08.3.bam
This error may point to a sync issue for the eRA Commons, DCP, and DCF framework services. Would you be willing to relink eRA Commons, DCP, and DCF to ensure that all of the validation is synced up, and then try running the workflow again? You can find these links in your profile page at https://app.terra.bio/#profile.
I went back and relinked my eRA Commons account (I do not use the DCP or DCF framework services). When I ran the workflow again, I received the same error. Anything else I should try?
Can you share the workspace where you are seeing this issue with GROUP_FireCloud-Support@firecloud.org? To do so, click the Share button, either from the icon of your workspace in the workspace list or inside the workspace dashboard (see the icon with the three dots). Let us know the workspace name and the submission and workflow IDs. We'll be happy to take a closer look.
I just shared the workspace with you.
Thanks for that! Since I see that you are working with TCGA hg19 data, I believe you will need to link with the DCF Framework Services. A communication with the subject line "Update Regarding Access to TCGA Datasets (hg19 Version)" was sent out on December 13 with more details. I've copied the contents of the communication below for your convenience.
As a follow up to our previous messages, we wanted to update you on the status of the restoration of access to the HG19 TCGA workspaces and data. We are happy to announce that, as of today, you are now able to access the workspaces and data again.
Over the last several months, we and our collaborative partners have been working to restore access as quickly as possible. To that end, several iterations of infrastructure fixes, along with new scripting efforts to maintain the data, had to be completed. All of that work is now done.
As you begin utilizing the workspaces again, you’ll notice that most of the URLs are now DRS URLs (they begin with drs://), which is a system that allows the NCI’s Genomic Data Commons (GDC) to relocate physical data without changing the URLs to those data. For the remainder of the URLs, they are Google bucket URLs, which are external buckets that we have direct access to. As a result of the changes with DRS, there are a few important things to note:
In order to access the TCGA data, you will need to link via eRA Commons, in the same way that you previously did:
as well as at the bottom where it says “DCF Framework Services by University of Chicago”:
By linking in both places, you gain access to the files that are maintained by GDC via DRS URLs, as well as the files that are located in the Google buckets.
For the HG19 aligned data, which is considered to be legacy, not every file path is present in the GDC. Although we have coverage for the vast majority of the data in the GDC, you may notice a very small portion of files missing. If you do, please submit a ticket, and we will make a request with NCI for the data to be made available within the GDC. However, we cannot guarantee that the GDC will make this missing data available, so where possible we encourage users to switch to HG38.
Streaming of files using NIO, as the GATK supports, is not currently supported with DRS URLs. We are in the midst of planning how tools like GATK will work with DRS URLs.
As you begin using the workspaces again, we are here to support you. You can always submit a ticket or leave a comment on the forum. More than anything, we want to thank you for your consideration and patience as we’ve worked to restore access to the workspace. We’ve had the privilege of interacting with many of you through this process, and are truly grateful for your understanding.
The Terra Team
Customer Success | Data Sciences Platform
Broad Institute of MIT and Harvard
You can also link the DCP Framework if you so desire for the sake of completeness. All three will bring you to the same portal and use the same credentials. If you continue to run into the same error, please provide a screenshot of your profile page confirming the links and we'll investigate further. If you have any questions, please let us know.
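As a quick way to tell which of the two URL kinds described in the communication above a given file path is, here is a minimal Python sketch. It only inspects the URL scheme (drs:// versus gs://) and does not check whether your account is linked or whether the file exists. The function names and the example DRS URL are hypothetical, made up for illustration; the gs:// URL is the one from the original question.

```python
from urllib.parse import urlparse


def classify_data_url(url: str) -> str:
    """Return 'drs', 'gcs', or 'other' based only on the URL scheme."""
    scheme = urlparse(url).scheme
    if scheme == "drs":
        return "drs"   # resolved through the GDC via DRS
    if scheme == "gs":
        return "gcs"   # direct Google bucket path
    return "other"


def is_drs_url(url: str) -> bool:
    """True for DRS URLs, which per the communication above cannot
    currently be streamed with NIO (the mode GATK uses)."""
    return classify_data_url(url) == "drs"


if __name__ == "__main__":
    urls = [
        # URL taken from the original question
        "gs://gdc-tcga-phs000178-controlled/PRAD/DNA/WXS/BI/ILLUMINA/"
        "C529.TCGA-ZG-A9NI-10A-01D-A41N-08.3.bam",
        # Hypothetical DRS URL, for illustration only
        "drs://example.org/hypothetical-object-id",
    ]
    for u in urls:
        print(classify_data_url(u), is_drs_url(u), u)
```

A gs:// result means the file lives in a Google bucket, so access depends on the account links discussed above; a drs:// result means the file is resolved through the GDC, and a tool that relies on NIO streaming may need the file localized first.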