How to access TCGA Data from GDC in Firecloud

Gabriella Senior

TCGA controlled- and open-access workspaces come with many, but not all, data types pre-loaded. To access a data type that is not pre-loaded, go to the data commons website of interest, select the samples or files you need, export a manifest, and then use that manifest to import the data into Terra by DRS URI, following the steps below. These instructions use the GDC website as an example.

Save time and money

This may seem challenging to set up, but once you do, you save countless hours (and $$$) by not temporarily copying data from GDC to Terra.

Access to all datasets on the GDC portal

This approach also opens up all datasets on the GDC data portal, not just the open-access CPTAC data in this tutorial. You can use it for any open-access data as well as any datasets for which you have dbGaP-approved access.

At the time of writing this tutorial, GDC has ~1.6 PB of data available. For details, see the table below.

Statistic       Value
Dataset Size    1.57 PB
Programs        16
Projects        67
Primary Sites   68
Cases           84,392
Files           596,758

Table 1: GDC statistics as of March 2021

Linking authorization is required even for open-access data

To access any DRS URI file, you will need to link your Terra account with NCI CRDC Framework services using your eRA Commons account in the External Identities section of the Terra Profile page. See "Access controlled data files by linking your NIH account in Terra".

1. Select and download data of interest

1.1. Go to the GDC website cohort builder and/or repository tabs and select data of interest. You can select data from multiple sources. See sources and examples below.

  • From the cohort builder

    Example: General -> Project -> TCGA-UCS

  • From the repository

    Example: Access -> open access

  • From the repository

    Example: Data category -> copy number variation

1.2. Click the cart icon at left to add desired files to the cart.

1.3. When you've added all the files you want, go to the cart (top of screen).

1.4. Download the Manifest by clicking the Download card and selecting Manifest from the dropdown.

What to expect

Your TSV manifest will look like this.

id    filename    md5    size    state
1d50ef40-b726-48f7-b81d-ae0e4dab714b    d9124538-347c-483e-aee9-83c462e87976.FPKM-UQ.txt.gz    04d6c247dada7cf0dad93a839f6b7437    438083    released
4b1c6ee1-b46a-4b9d-bb74-d303719c729f    dc383158-c7c8-4fa5-a0fc-f9b2f9d619e3.FPKM-UQ.txt.gz    b9dbd7f7b417cdc8b978f708508dd7cb    442855    released
f4f165ef-15d1-4cf4-909b-0c9b80b295c4    5a5e2ef7-89c2-455d-8d11-84b86fed0b7b.FPKM-UQ.txt.gz    0307ce18caefee845629b53bb37181d0    445557    released
7b30dc0f-017d-42de-8ec7-d9748add2c9c    077d2d30-e631-4f36-9832-04a7f1f451b6.FPKM-UQ.txt.gz    8e4fc7054339d05062c06d797bdd376d    446449    released
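
Before moving on, it can help to confirm that the downloaded file has the columns shown above. The following is a minimal Python sketch and is not part of the tutorial workspace. The function name `summarize_manifest` is hypothetical, the inline sample reuses the first two rows above, and the sketch assumes the manifest is tab-separated.

```python
import csv
import io

# Column names as they appear in the sample manifest above.
EXPECTED_COLUMNS = ["id", "filename", "md5", "size", "state"]


def summarize_manifest(handle):
    """Check the header row and return (file_count, total_bytes)."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"Unexpected manifest columns: {reader.fieldnames}")
    rows = list(reader)
    return len(rows), sum(int(row["size"]) for row in rows)


# Illustrative input: the first two rows of the sample manifest.
sample = "\n".join([
    "\t".join(EXPECTED_COLUMNS),
    "\t".join([
        "1d50ef40-b726-48f7-b81d-ae0e4dab714b",
        "d9124538-347c-483e-aee9-83c462e87976.FPKM-UQ.txt.gz",
        "04d6c247dada7cf0dad93a839f6b7437",
        "438083",
        "released",
    ]),
    "\t".join([
        "4b1c6ee1-b46a-4b9d-bb74-d303719c729f",
        "dc383158-c7c8-4fa5-a0fc-f9b2f9d619e3.FPKM-UQ.txt.gz",
        "b9dbd7f7b417cdc8b978f708508dd7cb",
        "442855",
        "released",
    ]),
]) + "\n"

file_count, total_bytes = summarize_manifest(io.StringIO(sample))
print(f"{file_count} files, {total_bytes} bytes")
```

For a real download, pass an open file handle (for example `open(path, newline="")`, where `path` points at your manifest) instead of the in-memory sample.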

1.5. Download associated data by going to Download Associated Data and selecting Sample Sheet.

2. Transform the manifest to FireCloud/Terra format

Terra/FireCloud doesn't recognize the GDC manifest format, so this step transforms it into a format that Terra/FireCloud can use. Instead of doing this manually, use the notebook in this workspace to do it for you. You can also follow the instructions in the CRDC workspace to update the manifest to DRS URIs.

After this step, the manifest file from step 1 looks like this:

entity:drs_id    drs_uri    filename
1d50ef40-b726-48f7-b81d-ae0e4dab714b    drs://dg.4DFC:1d50ef40-b726-48f7-b81d-ae0e4dab714b    d9124538-347c-483e-aee9-83c462e87976.FPKM-UQ.txt.gz
...
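
To make the transformation concrete, here is a minimal Python sketch of the mapping between the two formats shown above. It is not the notebook's actual code. The function name `manifest_to_terra_tsv` is hypothetical, the `drs://dg.4DFC:` prefix is copied from the example row, and the input is assumed to be tab-separated.

```python
import csv
import io

# DRS prefix copied from the example row above (assumed to apply to GDC files).
DRS_PREFIX = "drs://dg.4DFC:"


def manifest_to_terra_tsv(manifest_handle, out_handle):
    """Write a Terra-style TSV from a GDC manifest and return the row count."""
    reader = csv.DictReader(manifest_handle, delimiter="\t")
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["entity:drs_id", "drs_uri", "filename"])
    count = 0
    for row in reader:
        # The GDC file id becomes both the entity id and the DRS object id.
        writer.writerow([row["id"], DRS_PREFIX + row["id"], row["filename"]])
        count += 1
    return count


# Illustrative input: the header and first row of the sample manifest.
gdc_manifest = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "1d50ef40-b726-48f7-b81d-ae0e4dab714b\t"
    "d9124538-347c-483e-aee9-83c462e87976.FPKM-UQ.txt.gz\t"
    "04d6c247dada7cf0dad93a839f6b7437\t438083\treleased\n"
)
terra_tsv = io.StringIO()
converted = manifest_to_terra_tsv(io.StringIO(gdc_manifest), terra_tsv)
print(terra_tsv.getvalue())
```

The md5, size, and state columns are dropped here only because the target format above does not show them.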

2.1. Start a Jupyter Environment

 a. Click on the environment configuration cloud icon.

 b. Click on the Settings button (gear icon).

 c. Click on the blue Create button.

2.2. While your Cloud Environment spins up (this takes a few minutes), navigate to the Data tab of your workspace.

2.3. Click on the Files icon in the right-hand column.

2.4. Click the icon on the bottom right and upload the manifest file you downloaded from GDC to your workspace.

2.5. Navigate to the Analyses tab of your workspace and run the Upload GDC Manifest to Workspace Data table notebook.

2.6. Follow the instructions in the notebook, making sure to run each cell in order, from top to bottom.

3. Run a Test Workflow

Now that you have a data table (drs) created and loaded with a few GDC DRS URIs (and other information), you can run a workflow to test that everything is working properly. Normally, you would go to Dockstore.org and pick the md5sum WDL workflow. That step has already been done for you, as has binding this.drs_uri from the drs table to the input file of the md5sum workflow.

Don't run on all data when testing

If you are following this guide to test your data access and to learn how to take a search result from GDC and work with it in FireCloud, you do not need to run the md5sum workflow on all CPTAC data. Before you launch the workflow, choose "SELECT DATA", then "Choose specific rows to process", and select the first row. This runs md5sum on a single file, which should be sufficient to test your data access.

If you run this workflow on your drs table and it finishes successfully, congratulations, you have accessed the GDC data you found in the GDC portal through the DRS standard in Terra.

4. Run your own analysis

If the workflow from step 3 runs successfully on one of the data files from the drs table then you should be ready to use this technique with whatever data you're interested in from the GDC portal and with your own analysis workflows.

You can choose to use this workspace and add additional data by searching in the GDC portal, downloading the manifest, and running the notebook again to load it.

Alternatively, if you want a clean copy to start your analysis work in, you can delete this workspace you used for the tutorial and clone the original workspace again, this time uploading a manifest from GDC that corresponds to the search you are interested in.

Notebook details

More information about the notebook Upload GDC Manifest to Workspace Data table.

What does it do?

Converts a GDC manifest file to a Terra-readable TSV that can be uploaded to a workspace data table.

Runtime            Value
Environment        Default (GATK 4.1.4.1, Python 3.7.8)
CPU Minimum        2
Disk Size Minimum  10 GB
Memory Minimum     15 GB

Workflow details

More information about the workflow: dockstore-wdl-workflow-md5sum.

What does it do?

This is a simple workflow used to show how to call a workflow via Dockstore.

What does it require as input?

  • inputFile (the file that will have its md5sum retrieved)

What does it return as output?

  • Value (the md5sum value)

File Name            Time  Cost $
htseq_counts.txt.gz  4m    <0.01
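
For reference, the same kind of checksum can be computed locally with Python's standard `hashlib` module. This is a minimal sketch and is not the workflow's WDL. The helper name `md5sum` is hypothetical, and comparing its result with the md5 column of the GDC manifest is a suggested check, not a step of the workflow.

```python
import hashlib
import io


def md5sum(stream, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Illustrative input: an in-memory stream standing in for a downloaded file.
example_digest = md5sum(io.BytesIO(b"hello"))
print(example_digest)
```

For a real file, open it in binary mode (`open(path, "rb")`) and pass the handle to `md5sum`. Reading in chunks keeps memory use flat for large files.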

License 

Copyright Broad Institute, 2021 | BSD-3

All code provided in this workspace is released under the WDL open source code license (BSD-3) (full license text at https://github.com/openwdl/wdl/blob/master/LICENSE). Note, however, that the programs called by the scripts may be subject to different licenses. Users are responsible for checking that they are authorized to run all programs before running these tools.
