hg19 and hg38 TCGA workspaces

Post author
Sabrina Camp

Hi!

I was wondering if there is any documentation on the curation of the hg19 and hg38 controlled-access TCGA workspaces.

I have found some samples in the GDC legacy archive that aren't listed in the hg19 workspace. Were these samples excluded because of bad QC metrics, or because they weren't sequenced at the Broad?

 

All the best,

Sabrina Camp 

Comments

13 comments

  • Comment author
    Sabrina Camp

    On another note, I am trying to use BAM files from the hg38 cohort, and I see this note on the workspace dashboard: "hg38 TCGA and TARGET workspaces reference files by their GDC UUIDs. In order to run analyses on the referenced data files, you will need to run workflows that retrieve the files from the GDC and copy them to your workspace bucket. See this forum post for instructions on the running of these workflows."

    The link to the forum post (https://gatkforums.broadinstitute.org/firecloud/discussion/10382/populating-hg38-tcga-and-target-workspaces-with-data-files#latest) is broken. Is there an updated link? 

  • Comment author
    Samantha (she/her)

    Hi Sabrina Camp,

     

    Thanks for writing in. Unfortunately, we do not have any documentation on the curation of the hg19 and hg38 controlled-access TCGA workspaces. The CGA team did the original pull of the data but has since given up ownership of the workspaces. It seems that the hg19 data, which is now in the legacy archive, was difficult to parse programmatically because the metadata was not always consistently present. The missing samples may have been excluded because of QC metrics, or because their metadata was missing or incorrectly formatted.

    To your second question, the link to the forum post points to our legacy GATK forum, which is now defunct. We do not have an updated link, but we are working on new documentation for these hg38 and hg19 workspaces. I will be sure to keep you updated on that.

    Please let me know if you have any questions.

     

    Best,

    Samantha

  • Comment author
    Maha Shady
    • Edited

    Hi Samantha,

    I wanted to follow up on this thread because I also need to use some hg38 BAMs. Are there any updates regarding workflows to retrieve those files from the GDC portal? The documentation link on the hg38 workspaces is still broken, and the data table columns do not include BAM files.

    Thank you!

  • Comment author
    Emil Furat
    • Edited

    Hi Maha,

     

    Thanks for writing in. I'm reaching out on behalf of the Terra support team to help answer your question. If you want to retrieve data from the GDC portal, you'll need to convert the GDC UUIDs in your data tables to DRS URIs, which you can then access from your notebook and/or workflow. To do so you will need to:

    Please note: You must have your NIH account + CRDC Framework Services linked to your Terra account to access the TCGA DRS data.
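
As a rough illustration of what converting a GDC UUID into a DRS URI can look like, here is a minimal Python sketch. It is not the data model update notebook itself. It assumes that GDC objects are addressed with the compact identifier prefix `dg.4DFC`, and the helper names `uuid_to_drs_uri` and `convert_rows` are hypothetical, chosen only for this example.

```python
import re

# Assumption: GDC objects are addressed as drs://dg.4DFC:<uuid>.
# Verify the prefix against current Terra/GDC documentation before relying on it.
GDC_DRS_PREFIX = "drs://dg.4DFC:"

# Matches a canonical 8-4-4-4-12 hexadecimal UUID.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def uuid_to_drs_uri(value):
    """Return a DRS URI if value looks like a UUID, else return it unchanged."""
    text = value.strip()
    if UUID_RE.fullmatch(text):
        return GDC_DRS_PREFIX + text.lower()
    return value


def convert_rows(rows):
    """Apply uuid_to_drs_uri to every cell of a list of row dictionaries."""
    return [
        {column: uuid_to_drs_uri(cell) for column, cell in row.items()}
        for row in rows
    ]
```

Cells that are not UUIDs (participant IDs, file names, and so on) pass through untouched, so the function can be applied to a whole table without first knowing which columns hold UUIDs.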

    If you have any other questions please let us know!

     

    Kind regards,

    Emil

  • Comment author
    Maha Shady

    Hi Emil, 

    Thank you for the clarifications. I followed the steps you mentioned. However, I still have the problem that the data tables do not include any columns for BAM files. How can I modify the workspace so that I can also retrieve BAM files from the GDC portal?

    Thanks so much!

  • Comment author
    Emil Furat

    Hi Maha,

     

    Could you please share with me the TCGA workspace that you are using? Not your workspace specifically, but the original workspace that you cloned to create your workspace. 

     

    Kind regards,

    Emil

  • Comment author
    Maha Shady

    Sure, I was using this one: https://app.terra.bio/#workspaces/broad-firecloud-tcga/TCGA_SKCM_hg38_ControlledAccess_GDCDR-12-0_DATA 

    Thanks!
     
    Maha
  • Comment author
    Emil Furat

    Hi Maha,

     

    Sorry for the delay in getting back to you. Our support staff has lost access to TCGA protected workspaces, so we are unable to troubleshoot your issue at this time. We hope to regain access and get back to you with a solution as soon as possible.

     

    Kind regards,

    Emil

  • Comment author
    Emil Furat

    Hi Maha,

     

    We have not yet been given TCGA access, so let's see if we can find another solution.

     

    Just to confirm:

    Are you able to find the column containing the GDC UUIDs in the original version of the workspace you are using?

    The data model update script should have found the columns in your data table automatically. Can you confirm whether the data model update notebook ran successfully?

    Could you send a screenshot of your data table containing all the headers and a row or two of the contents? 
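
For anyone who wants to run the same checks without a screenshot, here is a minimal Python sketch that scans a data table exported as TSV and reports which columns hold UUID-like values and which look BAM-related. The function name `describe_columns` and the column names in the usage example are hypothetical, and the BAM check (column name contains "bam" or value ends in `.bam`) is only a heuristic assumed for this example.

```python
import csv
import io
import re

# Matches a canonical 8-4-4-4-12 hexadecimal UUID.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def describe_columns(tsv_text):
    """Return (uuid_columns, bam_columns) for a tab-separated table.

    A column counts as a UUID column if any cell is a UUID. It counts as a
    BAM column if its name contains "bam" or any cell ends with ".bam".
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    uuid_columns, bam_columns = set(), set()
    for row in reader:
        for column, cell in row.items():
            if column is None:
                continue  # skip overflow cells from rows longer than the header
            text = (cell or "").strip()
            if UUID_RE.fullmatch(text):
                uuid_columns.add(column)
            if "bam" in column.lower() or text.lower().endswith(".bam"):
                bam_columns.add(column)
    return sorted(uuid_columns), sorted(bam_columns)


# Usage with a made-up two-column table (column names are illustrative only).
example = (
    "entity:participant_id\tmaf__uuid\n"
    "TCGA-01\t0027045b-9ed6-45af-a68e-f55037b5184c\n"
)
uuid_cols, bam_cols = describe_columns(example)
```

An empty list of BAM columns would be consistent with a workspace whose data tables simply do not reference BAM files.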

     

    Kind regards,

    Emil

  • Comment author
    Maha Shady

    Hi Emil, 

    Thanks for following up! I do think the update notebook ran successfully. I think the problem is that the original workspace itself doesn't have a column for BAMs. Here's a screenshot of my participants data table; the other data tables have too many columns to fit in a screenshot.

     

    Thank you!

    Maha

  • Comment author
    Emil Furat

    Hi Maha,

     

    We have regained our TCGA access and were able to take a closer look at your workspace. We can confirm that there is no column for BAM files anywhere in your data tables. We also noticed that, on the dashboard of your workspace, the Data File Formats row does not list BAM:

     

    It seems that each TCGA workspace defines which file types are available for its cohort, and that other TCGA workspaces may have BAM files available. For example, the following workspace has BAM listed under Data File Formats: https://app.terra.bio/#workspaces/broad-firecloud-tcga/TCGA_SKCM_ControlledAccess_V1-0_DATA

    If you have any other questions please let us know.

     

    Kind regards,

    Emil

  • Comment author
    Emil Furat

    Hi Maha,

     

    We haven't heard from you in a couple of days, so I just wanted to check in and see how things are going. Have you been able to find a TCGA workspace that suits your needs?

    If you have any other questions please let us know!

     

    Kind regards,

    Emil

  • Comment author
    Maha Shady

    Hi Emil, 

    Thanks for following up. Yes, the hg38 workspaces do not have BAMs included; only the hg19 workspaces do. I will reassess the needs of my project and may end up just using the hg19 files. I will get back in touch if I still need the hg38-aligned BAMs.

    Thank you for your help!

    Best, 

    Maha 

