Accessing GTEx data when running a workflow

Post author
Tanya Phung

I have access to GTEx version 8 data on Google Cloud. When I run the haplotypecaller-gvcf-gatk4 workflow, I get an error that says: "Bucket is requester pays bucket but no user project provided." How do I specify this?

In my Tables, I included the following:

gs://bucket_name/GTEx_Analysis_2017-06-05_v8_WES_BAM_files/bam_file

I know that I'm missing the part that specifies the user project, but I'm not sure how to do this.

Thank you for the help. 

15 comments

  • Comment author
    Q DI

    I am having the exact same problem.  It appears that I can download directly from the Google Cloud Platform console, but using the command line utility gsutil, I get the following:

    gsutil cp gs://fc-secure-ff8156a3-ddf3-42e4-9211-0fd89da62108/GTEx_Analysis_2017-06-05_v8_RNAseq_BAM_files/GTEX-1117F-0226-SM-5GZZ7.Aligned.sortedByCoord.out.patched.md.bam.bai ./
    BadRequestException: 400 Bucket is requester pays bucket but no user project provided.

    I set up my credentials with gcloud auth list, and I have my account set to the free AnVIL project.  I also had pass_credentials_to_gsutil = True.

    Thanks for your assistance
  • Comment author
    Sushma Chaluvadi

    Hello Tanya and Q,

    When using the command line, can you try passing in --gcs-project-for-requester-pays with the billing project you want charged, to see if that works?

    I will get back to you about options for when using the Workflow in Terra.

    Sushma
  • Comment author
    Q DI

    Hi Sushma,

    Thanks for your response.  I think --gcs-project-for-requester-pays is only an option for gatk?  I am using the gsutil command directly, as specified in the "File Details" of a BAM file on the AnVIL_GTEx_V8_hg38 data page.

    Not sure if this is related, but I see this outstanding issue on the gatk github page that references the same error? https://github.com/broadinstitute/gatk/issues/6179

    Thanks for your help.

  • Comment author
    Tanya Phung

    Hi Sushma, 

    Thank you so much for looking into this. Any updates on how I can access GTEx data when using the workflow in Terra? 


    Thank you!

    Tanya

  • Comment author
    Sushma Chaluvadi

    Q DI,

    When you run your gsutil cp command, can you try adding the -u parameter followed by your Google billing project, as follows:

    gsutil -u [billing-project] cp gs://fc-secure-ff8156a3-ddf3-42e4-9211-0fd89da62108/../GTEX-1117F-0226-SM-5GZZ7.Aligned.sortedByCoord.out.patched.md.bam.bai ./

    Passing in your billing project with the -u parameter should allow you to download files with gsutil when the bucket has requester-pays enabled.
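
    As a minimal sketch of the -u usage described above, the wrapper below puts the billing project before the cp subcommand, since -u is a top-level gsutil option. The function name rp_copy, the DRY_RUN switch, and the project and bucket names are hypothetical placeholders, not part of gsutil or Terra.

```shell
# rp_copy: hypothetical helper that copies one object from a
# requester-pays bucket and bills the given project for the access.
# Usage: rp_copy <billing-project> <gs://source> <destination>
rp_copy() {
  if [ "$#" -ne 3 ]; then
    echo "usage: rp_copy <billing-project> <gs://source> <destination>" >&2
    return 2
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it (DRY_RUN is an assumption
    # of this sketch, useful for checking the command before paying for it).
    echo "gsutil -u $1 cp $2 $3"
  else
    # -u <project> must come before the cp subcommand.
    gsutil -u "$1" cp "$2" "$3"
  fi
}

# Example dry run with placeholder names:
DRY_RUN=1 rp_copy my-billing-project gs://example-bucket/sample.bam.bai ./
```

    With DRY_RUN unset, the same call would run gsutil for real and charge the named project for the transfer.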

  • Comment author
    Sushma Chaluvadi

    Tanya,

    Apologies for the delay! I spoke with some of our WDL developers and they said that the way to circumvent this is to manually modify the tasks in the WDL that are localizing files from the requester-pays bucket. You would need to pass in the --gcs-project-for-requester-pays parameter so that the command knows which billing project to charge for accessing files in the requester-pays bucket.

    I am currently working to modify one of the tasks in the haplotypecaller WDL for you to test out! I will be in touch shortly!
  • Comment author
    Tanya Phung

    Hi Sushma, 

    Thank you so much for the help. 

    Best,
    Tanya

  • Comment author
    Q DI

    Hi Sushma,

    Thanks much, the -u parameter was exactly what I needed!

    Just a suggestion: it may be good to mention that parameter here: https://support.terra.bio/hc/en-us/articles/360029251091-Broad-Genomics-Downloading-data-from-a-Terra-workspace#gsutildownload

    Unfortunately, none of the Google gsutil documentation mentions this parameter (or at least if it does, I have not been able to find it :) )

    Thanks again!

  • Comment author
    Sushma Chaluvadi

    Q DI,

    I will let our team know to add this information! For reference on this thread, here is the Google documentation on the -u parameter: https://cloud.google.com/storage/docs/using-requester-pays#using

    Sushma
  • Comment author
    Q DI

    Thanks for the Google documentation link!

  • Comment author
    Tanya Phung

    Hi Sushma, 

    I just wanted to check in to see if there is a test version of the HaplotypeCaller WDL that allows me to specify the billing project. Thank you so much for your help.

    Best,
    Tanya

  • Comment author
    scalvo

    Hi Sushma,

    I encountered the exact same problem trying to access GTEx data from a Terra workflow.  You suggested "modify the tasks in the WDL that are localizing files from the requester-pays bucket. You would need to pass in the --gcs-project-for-requester-pays parameter so that the command knows to accept the billing-project you want to bill for accessing files in the requester-pays bucket."  Could you please give an example of a WDL fragment that uses a billing project? 

    Here is the fragment of my WDL that is failing (the input_bam and input_bai parameters refer to files in the Requester Pays bucket):

    Float ref_size = if defined(ref_fasta) then size(ref_fasta, "GB") + size(ref_fasta_index, "GB") + size(ref_dict, "GB") else 0
    Int disk_size = ceil(size(input_bam, "GB") + ref_size) + 20

    meta {
      description: "Subsets a whole genome bam to just Mitochondria reads"
    }
    parameter_meta {
      ref_fasta: "Reference is only required for cram input. If it is provided ref_fasta_index and ref_dict are also required."
      input_bam: {
        localization_optional: true
      }
      input_bai: {
        localization_optional: true
      }
    }

    command <<<
      set -e
      export GATK_LOCAL_JAR=~{default="/root/gatk.jar" gatk_override}

      gatk PrintReads \
        ~{"-R " + ref_fasta} \
        -L ~{contig_name} \
        --read-filter MateOnSameContigOrNoMappedMateReadFilter \
        --read-filter MateUnmappedAndUnmappedReadFilter \
        -I ~{input_bam} \
        -O ~{basename}.bam
    >>>
    runtime {
      memory: "3 GB"
      disks: "local-disk " + disk_size + " HDD"
      docker: "us.gcr.io/broad-gatk/gatk:4.1.1.0"
      preemptible: select_first([preemptible_tries, 5])
    }

    Thanks!

    Sarah
  • Comment author
    Jason Cerrato

    Hi Sarah,

    Here is the relevant information for the gatk PrintReads tool: https://gatk.broadinstitute.org/hc/en-us/articles/360037592891-PrintReads

    So your command block should look something like:

    command <<<
      set -e
      export GATK_LOCAL_JAR=~{default="/root/gatk.jar" gatk_override}

      gatk PrintReads \
        ~{"-R " + ref_fasta} \
        -L ~{contig_name} \
        --gcs-project-for-requester-pays <project_to_bill> \
        --read-filter MateOnSameContigOrNoMappedMateReadFilter \
        --read-filter MateUnmappedAndUnmappedReadFilter \
        -I ~{input_bam} \
        -O ~{basename}.bam
    >>>
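
    Rather than hard-coding the project, the flag could be added only when a billing project is supplied. The sketch below shows that logic in plain shell, printing the resulting gatk command line instead of running it. The function name build_printreads_cmd and all argument values are hypothetical, and the read filters are left out for brevity. In the WDL itself, the same effect might be achieved with an optional String input (say, a hypothetical requester_pays_project) interpolated the same way the task already handles ~{"-R " + ref_fasta}.

```shell
# build_printreads_cmd: hypothetical helper that prints a gatk PrintReads
# command line, adding the requester-pays flag only when a billing
# project is given as the optional fourth argument.
# Usage: build_printreads_cmd <input_bam> <contig> <output_bam> [billing-project]
build_printreads_cmd() (
  # The function body is a subshell, so these variables stay local.
  input_bam="$1"
  contig="$2"
  output_bam="$3"
  billing_project="${4:-}"

  cmd="gatk PrintReads -L $contig"
  # Only needed when GATK reads gs:// paths in a requester-pays bucket itself.
  if [ -n "$billing_project" ]; then
    cmd="$cmd --gcs-project-for-requester-pays $billing_project"
  fi
  cmd="$cmd -I $input_bam -O $output_bam"
  echo "$cmd"
)

# Example with placeholder names:
build_printreads_cmd gs://example-bucket/sample.bam chrM sample.chrM.bam my-billing-project
```

    Leaving off the fourth argument produces the original command without the flag, which keeps the same task usable for buckets that are not requester pays.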

    I hope this helps!

    Kind regards,

    Jason

  • Comment author
    Joe Brown

    What does one do when not using GATK but needing to access these GTEx data in a requester-pays bucket?

  • Comment author
    Jason Cerrato

    Hi Joe Brown,

    If you are accessing data from a requester-pays bucket using a workflow in Terra, Cromwell (the workflow management system) will automatically bill your billing project for access to the resources. You can read a little more about this here: https://cromwell.readthedocs.io/en/stable/filesystems/GoogleCloudStorage/#requester-pays

    When an input is marked localization_optional, so that GATK reads it directly from the bucket instead of Cromwell localizing it, you would need to specify your project using the aforementioned flag.

    Kind regards,

    Jason
