Interval problem with haplotypecaller-gvcf-gatk4 (suggested workflow)

Comments

11 comments

  • Jason Cerrato

    Hi Matthew,

    Can you share your interval file and let us know how it was made? Can you confirm that it fits one of the mandatory formats listed in this document?

    https://gatk.broadinstitute.org/hc/en-us/articles/360035531852-Intervals-and-interval-lists

    Kind regards,

    Jason

  • Matthew Miller

    Jason: I accidentally responded to a different post. I made one interval list by hand. It runs fine locally on my computer with the -L flag.

    Here are the contents of the GATK-style list:

    CM012145.1
    CM012146.1

    This works on my local GATK. That's why I think it is a Cromwell issue.

     

    I made a second list using ScatterIntervalsByNs.

    Here are a few lines of that file. I'm realizing that the "UR:" field is wrong for Terra:

    @HD VN:1.6 SO:coordinate
    @SQ SN:CM012114.1 LN:197551010 M5:92def859021d7c0d4466408ad15c22a3 UR:file:/home/ornithology/anna_chromosome.fasta
    @SQ SN:CM012115.1 LN:151342139 M5:c184c92dc94c91404739ecf6f6899ced UR:file:/home/ornithology/anna_chromosome.fasta
    @SQ SN:CM012116.1 LN:114810999 M5:4457415245bdf6f612e1706f51bec42b UR:file:/home/ornithology/anna_chromosome.fasta
    @SQ SN:CM012117.1 LN:18597117 M5:ab515dae2e2b4fcd20a485c0f8116503 UR:file:/home/ornithology/anna_chromosome.fasta
    @SQ SN:CM012118.1 LN:16645885 M5:ebc880ff0aed1141f631443972eca55b UR:file:/home/ornithology/anna_chromosome.fasta
    @SQ SN:CM012119.1 LN:35401958 M5:2de3cd2e5ca23c172fb2fbeeda66314e UR:file:/home/ornithology/anna_chromosome.fasta
    @SQ SN:CM012120.1 LN:39139214 M5:62d596d414b8601f0be908d3d123cac8 UR:file:/home/ornithology/anna_chromosome.fasta
    @SQ SN:CM012121.1 LN:31090148 M5:cafc8a27b387908e1e25a89810ea90c4 UR:file:/home/ornithology/anna_chromosome.fasta
    @SQ SN:CM012122.1 LN:25686456 M5:fdfc8d2d78653af8228f36943d3ae47a UR:file:/home/ornithology/anna_chromosome.fasta
    @SQ SN:CM012123.1 LN:22664390 M5:2dd29b7e52af7e0e6ddf5baa822d58f9 UR:file:/home/ornithology/anna_chromosome.fasta
    @SQ SN:CM012124.1 LN:20302349 M5:718cae94ea134a321b8d3749cebf0934 UR:file:/home/ornithology/anna_chromosome.fasta
    @SQ SN:CM012125.1 LN:21352500 M5:b9f95aa6419b1012081a422f024c0456 UR:file:/home/ornithology/anna_chromosome.fasta
    @SQ SN:CM012126.1 LN:17696115 M5:bd24b46fb794e857aafdf0b42bbdca87 UR:file:/home/ornithology/anna_chromosome.fasta
    @SQ SN:CM012127.1 LN:15497061 M5:b2eab356457120f75acc44fb7aba375a UR:file:/home/ornithology/anna_chromosome.fasta
    @SQ SN:CM012128.1 LN:13887164 M5:a38c11dd85dd4010c6dd5da40fc47a71 UR:file:/home/ornithology/anna_chromosome.fasta

     

     

     

  • Jason Cerrato

    Hi Matthew,

    The variable scattered_calling_intervals_list in the workflow expects a file that lists interval files stored in Google buckets. Here is an example of what it could look like: gs://gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals.txt

    That file contains a list of interval files stored in Google buckets. The workflow localizes each of those files and uses them in the command. What you seem to be providing, on the other hand, is an interval file itself rather than a list of interval files. The workflow tries to localize each line in the file, but each line is an interval instead of a file path, so it doesn't work correctly.

    If you want to use this workflow, you can either use the default scattered interval file linked above, which is normally used for whole genomes, or you could try creating a one-line file whose single line is the Google bucket path to the interval file you want to use.
     
    Kind regards,
    Jason
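To make the distinction above concrete, here is a minimal Python sketch that checks whether a file looks like the list of bucket paths the workflow input expects, or like an interval file itself. The function name `classify_scatter_list` and the rule that every non-empty line should start with `gs://` are assumptions made for illustration; they are not part of the workflow.

```python
def classify_scatter_list(text):
    """Classify the text of a candidate scattered_calling_intervals_list file.

    Assumption (illustrative only): a valid list has one gs:// path per
    non-empty line, each pointing at an interval file in a bucket.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return "empty"
    if all(line.startswith("gs://") for line in lines):
        return "list_of_files"
    # Lines such as contig names or @HD/@SQ header lines end up here.
    return "not_a_list_of_files"


# The hand-made GATK-style list from this thread is an interval file,
# so it is not accepted as a list of files.
print(classify_scatter_list("CM012145.1\nCM012146.1\n"))
```

A check like this only looks at the shape of each line. It does not confirm that the bucket paths exist or are readable.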
  • Matthew Miller

    Jason:

    Thank you for your help.  How do I navigate to gs://gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals.txt ?

    There doesn't seem to be any way to access it.

  • Sushma Chaluvadi

    Hi Matthew Miller! You should be able to get to the Google bucket where that file is hosted with this link. The bucket is publicly available, so you can either download the file or just point to it in your Workflow Configuration with the String path "gs://gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals.txt".

     

     

  • Matthew Miller

    Ok. I found out how to access the file (a Google Cloud tutorial video would help us noobs).

    So if I understand correctly, in the workflow, I would generate a file in a Google bucket that simply points to the actual list file.

    That file could have a single line, for example:

    gs://my_buckets/test_bucket/my_scattered.interval_list

    And the file my_scattered.interval_list would simply be the Picard-generated file:

    @HD VN:1.6 SO:coordinate
    @SQ SN:CM012114.1 LN:197551010 M5:92def859021d7c0d4466408ad15c22a3 UR:file:/home/ornithology/anna_chromosome.fasta

    ... etc.

    I'll try that!
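The excerpt above shows only `@HD` and `@SQ` header lines. A Picard-style interval list also carries tab-separated interval records (contig, start, end, strand, name) below the header. The following Python sketch separates the two parts so both can be inspected. The function name `split_interval_list` is hypothetical, and the interval record in the usage example is made up for illustration.

```python
def split_interval_list(text):
    """Split Picard-style interval list text into header lines and records.

    Header lines start with '@'. Every other non-empty line is treated as a
    tab-separated record: contig, start, end, strand, name.
    """
    header, intervals = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("@"):
            header.append(line)
        else:
            contig, start, end, strand, name = line.split("\t")[:5]
            intervals.append((contig, int(start), int(end), strand, name))
    return header, intervals


# Illustrative input: the coordinates and name below are invented.
example = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:CM012114.1\tLN:197551010\n"
    "CM012114.1\t1\t1000\t+\texample\n"
)
header, intervals = split_interval_list(example)
print(len(header), len(intervals))  # prints: 2 1
```

A file whose record list comes back empty contains only a header and defines no intervals.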

     

  • Matthew Miller

    Yes, I do need to create a new interval file, as I am not using a human genome. I am hoping that at least some of the Terra workflows will work on non-model organisms. There are a lot of us who want to use the GATK pipeline to map to birds, mosquitoes, fish, etc.!

  • Matthew Miller

    Ok. So I have it running. But it didn't spark the run across multiple intervals. Instead, only one interval is running. I guess I don't really understand what is going on.

    In your system, you have a list with 50 lines, each pointing to a file in a bucket. However, the 50 files that are pointed to appear to be identical.

    Should I do the same thing, i.e. put the same file in 50 buckets and create a `my_scattered.interval_list` file that points to those 50 identical files?

    Thanks for your help!

  • Jason Cerrato

    Hi Matthew,

    The hg38_wgs_scattered_calling_intervals.txt file contains 50 lines. Each line points to a different interval list, and the contents of each are different. You will see that each bucket path contains a temp_00**_of_50 component, showing that there are 50 interval list files in total. If you run the command
    gsutil cat gs://gcp-public-data--broad-references/hg38/v0/scattered_calling_intervals/temp_0002_of_50/scattered.interval_list
    and look through the output, you should see that the contents of this file differ from the others. Each of the scattered.interval_list files is a Picard-style interval list.

    You are correct that the interval list input in Terra points to a file in a Google bucket that contains a path to the actual Picard-style interval file you want to use. And yes, you will need to generate one for your organism of choice.

    From your explanation it looks like you have a file named my_scattered.interval_list in the bucket gs://my_buckets/test_bucket. How many lines are in your my_scattered.interval_list? If there are n lines in the file, you should get n scatters from the workflow. If you would like to share your my_scattered.interval_list, we are happy to take a look and see if it was set up correctly! As a test, you can break your interval list into two files, which should generate two scatters.

    Kind regards,

    Jason
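The "n lines, n scatters" rule described above can be checked before launching the workflow. The short Python sketch below counts the entries in a list file. The function name `expected_scatter_count` is hypothetical, and skipping blank lines is an assumption about how the list is read, not something confirmed in this thread.

```python
def expected_scatter_count(list_text):
    """Count the non-empty lines of a scattered-intervals list file.

    Per the explanation above, each line (one gs:// path) should produce
    one scatter. Ignoring blank lines is an assumption of this sketch.
    """
    return sum(1 for line in list_text.splitlines() if line.strip())


one_line = "gs://my_buckets/test_bucket/my_scattered.interval_list\n"
print(expected_scatter_count(one_line))  # prints: 1
```

A result of 1 for the one-line file matches the single running interval reported earlier in the thread.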

  • Matthew Miller

    Jason:  You are so correct.

     

    I only looked at the header of the file (the part that is the SAM file style header) and didn't look at the bottom, interval section of that file. And I didn't realize that I needed to run Picard IntervalListTools after generating an interval list using Picard ScatterIntervalsByNs.  

    Doing those two steps allowed me to generate a folder of interval lists, and HaplotypeCaller is happily sparking to all of them now. I'm very happy about this!  Thank you so much!
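Once a folder of interval lists exists and has been uploaded to a bucket, the list file the workflow expects can be generated rather than typed by hand. The Python sketch below assumes the shards keep the temp_000X_of_N/scattered.interval_list layout seen in the public hg38 example and that they sit under a single bucket prefix. The function name `build_scatter_list` is hypothetical, and the bucket path is the placeholder used earlier in this thread.

```python
def build_scatter_list(bucket_prefix, shard_count):
    """Return list-file text with one gs:// path per interval list shard.

    Assumes shards are named temp_0001_of_N ... temp_000N_of_N, each holding
    a file called scattered.interval_list (the layout of the hg38 example).
    """
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    prefix = bucket_prefix.rstrip("/")
    paths = [
        f"{prefix}/temp_{i:04d}_of_{shard_count}/scattered.interval_list"
        for i in range(1, shard_count + 1)
    ]
    return "\n".join(paths) + "\n"


# Placeholder bucket; replace with the location of your own shards.
print(build_scatter_list("gs://my_buckets/test_bucket", 2), end="")
```

The returned text can be saved to a file, uploaded to a bucket, and its path supplied as scattered_calling_intervals_list.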

  • Jason Cerrato

    Hi Matthew,

    Glad to hear it! If we can be of any further assistance, please let us know!

    Kind regards,

    Jason

