Interval problem with haplotypecaller-gvcf-gatk4 (suggested workflow)

Post author
Matthew Miller

I am trying to run the Suggested Workflow: haplotypecaller-gvcf-gatk4

I have created an interval list using the Picard ScatterIntervalsByNs tool.

However, when I try to run the Workflow I get an error:

Caused by: java.lang.IllegalArgumentException: Could not build the path "RRCD01000051.1 1 24628 + ACGTmer". It may refer to a filesystem not supported by this instance of Cromwell. Supported filesystems are: Google Cloud Storage, DRS. Failures:
Google Cloud Storage: Path "RRCD01000051.1 1 24628 + ACGTmer" does not have a gcs scheme (IllegalArgumentException)

This error is thrown for each of the intervals in my interval list.

I'm guessing that the workflow is coded wrong, but I can't tell. Any suggestions?
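For reference, the string quoted in the error has the shape of one body record of a Picard-style interval list (contig, start, end, strand, name) rather than a file path, which would be consistent with Cromwell complaining that it has no gs:// scheme. A minimal Python sketch, assuming whitespace-separated fields; `Interval` and `parse_interval_record` are hypothetical names used only for illustration:

```python
from typing import NamedTuple


class Interval(NamedTuple):
    contig: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str
    name: str


def parse_interval_record(line: str) -> Interval:
    """Parse one body line of a Picard-style interval list.

    Splits on any whitespace, so it also accepts the space-separated
    form quoted in the error message (real files are tab-separated).
    """
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    contig, start, end, strand, name = fields
    return Interval(contig, int(start), int(end), strand, name)


# The exact string Cromwell tried to turn into a path.
record = "RRCD01000051.1 1 24628 + ACGTmer"
print(parse_interval_record(record))
```

The record parses cleanly as an interval, which suggests the workflow input was being read line by line as a list of paths.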

Comments

11 comments

  • Comment author
    Jason Cerrato

    Hi Matthew,

    Can you share your interval file and let us know how it was made? Can you confirm that it fits one of the mandatory formats listed in this document?

    https://gatk.broadinstitute.org/hc/en-us/articles/360035531852-Intervals-and-interval-lists

    Kind regards,

    Jason

  • Comment author
    Matthew Miller

    Jason:  I accidentally responded to a different post. I made one interval list as a hand made list. It runs fine locally on my computer with the -L flag.

    Here are the contents of the GATK-style list:

    CM012145.1
    CM012146.1

    This works on my local GATK. That's why I think it is a Cromwell issue.

    I made a second list using ScatterIntervalsByNs.

    Here are a few lines of that file. I'm realizing that the "UR:" field is wrong for Terra:

    @HD VN:1.6 SO:coordinate
    @SQ SN:CM012114.1 LN:197551010 M5:92def859021d7c0d4466408ad15c22a3 UR:file:/home/ornithology/anna_chromosome.fasta
    @SQ SN:CM012115.1 LN:151342139 M5:c184c92dc94c91404739ecf6f6899ced UR:file:/home/ornithology/anna_chromosome.fasta
    @SQ SN:CM012116.1 LN:114810999 M5:4457415245bdf6f612e1706f51bec42b UR:file:/home/ornithology/anna_chromosome.fasta
    @SQ SN:CM012117.1 LN:18597117 M5:ab515dae2e2b4fcd20a485c0f8116503 UR:file:/home/ornithology/anna_chromosome.fasta
    @SQ SN:CM012118.1 LN:16645885 M5:ebc880ff0aed1141f631443972eca55b UR:file:/home/ornithology/anna_chromosome.fasta
    @SQ SN:CM012119.1 LN:35401958 M5:2de3cd2e5ca23c172fb2fbeeda66314e UR:file:/home/ornithology/anna_chromosome.fasta
    @SQ SN:CM012120.1 LN:39139214 M5:62d596d414b8601f0be908d3d123cac8 UR:file:/home/ornithology/anna_chromosome.fasta
    @SQ SN:CM012121.1 LN:31090148 M5:cafc8a27b387908e1e25a89810ea90c4 UR:file:/home/ornithology/anna_chromosome.fasta
    @SQ SN:CM012122.1 LN:25686456 M5:fdfc8d2d78653af8228f36943d3ae47a UR:file:/home/ornithology/anna_chromosome.fasta
    @SQ SN:CM012123.1 LN:22664390 M5:2dd29b7e52af7e0e6ddf5baa822d58f9 UR:file:/home/ornithology/anna_chromosome.fasta
    @SQ SN:CM012124.1 LN:20302349 M5:718cae94ea134a321b8d3749cebf0934 UR:file:/home/ornithology/anna_chromosome.fasta
    @SQ SN:CM012125.1 LN:21352500 M5:b9f95aa6419b1012081a422f024c0456 UR:file:/home/ornithology/anna_chromosome.fasta
    @SQ SN:CM012126.1 LN:17696115 M5:bd24b46fb794e857aafdf0b42bbdca87 UR:file:/home/ornithology/anna_chromosome.fasta
    @SQ SN:CM012127.1 LN:15497061 M5:b2eab356457120f75acc44fb7aba375a UR:file:/home/ornithology/anna_chromosome.fasta
    @SQ SN:CM012128.1 LN:13887164 M5:a38c11dd85dd4010c6dd5da40fc47a71 UR:file:/home/ornithology/anna_chromosome.fasta

  • Comment author
    Jason Cerrato

    Hi Matthew,

  • Comment author
    Matthew Miller

    Jason:

    Thank you for your help.  How do I navigate to gs://gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals.txt ?

    There doesn't seem to be any way to access it.

  • Comment author
    Sushma Chaluvadi

    Hi Matthew Miller! You should be able to get to the Google bucket where that file is hosted with this link. The bucket is publicly available, so you can either download the file or just point to it in your Workflow Configuration with the String path "gs://gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals.txt".

  • Comment author
    Matthew Miller

    Ok. I found how to access the file (a google cloud tutorial video would help us noobs).

    So if I understand correctly, in the workflow, I would generate a file in a google bucket that simply points to the actual list file.

    That file could have a single line, for example:

    gs://my_buckets/test_bucket/my_scattered.interval_list

    And the file: my_scattered.interval_list would simply be the picard generated file:

    @HD VN:1.6 SO:coordinate
    @SQ SN:CM012114.1 LN:197551010 M5:92def859021d7c0d4466408ad15c22a3 UR:file:/home/ornithology/anna_chromosome.fasta

    ... etc.

    I'll try that!

  • Comment author
    Matthew Miller

    Yes, I do need to create a new interval file, as I am not using a human genome. I am hoping that at least some of the terra workflows will work on non-model organisms. There are a lot of us that want to use the GATK pipeline to map to birds, mosquitoes, fish, etc!

  • Comment author
    Matthew Miller

    Ok. So I have it running. But it didn't spark the run across multiple intervals. Instead only one interval is running.  I guess I don't really understand what is going on.

    In your system, you have a list with 50 lines, each pointing to a file in a bucket.  However, the 50 files that are pointed to appear to be identical. 

    Should I do the same thing, i.e. put the same file in 50 buckets and create a `my_scattered.interval_list` file that points to those 50 identical files?

    Thanks for your help!

  • Comment author
    Jason Cerrato

    Hi Matthew,

    The hg38_wgs_scattered_calling_intervals.txt file contains 50 lines. Each line points to a different interval list, and the contents of each are different. You will see that each bucket path contains a temp_00**_of_50 notation to show that there are 50 interval list files in total. If you run the command

    gsutil cat gs://gcp-public-data--broad-references/hg38/v0/scattered_calling_intervals/temp_0002_of_50/scattered.interval_list

    and look through the output, you should see that the contents of this file differ from the others. Each scattered.interval_list is a Picard-style interval list.

    You are correct in your understanding that the interval list in Terra will point to a file in a Google bucket that contains a path to the actual picard style interval file you want to use and yes you will need to generate one for your organism of choice.

    From your explanation it looks like you have a file named my_scattered.interval_list in the bucket gs://my_buckets/test_bucket. How many lines are in your my_scattered.interval_list ? If there are n lines in the file, you should get n scatters from the workflow. If you would like to share your my_scattered.interval_list , we are happy to take a look and see if it was set up correctly! As a test, you can break your interval list into two files, which should generate two scatters.
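As a rough way to check the "n lines gives n scatters" rule before launching, the sketch below counts the non-empty lines of the file handed to the workflow and rejects anything that is not a gs:// path (for example, a Picard interval list passed in directly). It is only an illustration: `count_scatters` is a hypothetical helper, and the bucket paths in the example are placeholders based on the names used earlier in this thread.

```python
def count_scatters(list_of_paths_text: str) -> int:
    """Return the number of scatters implied by a list-of-paths file.

    Each non-empty line is expected to be a gs:// path to one
    Picard-style interval list. Any other line (a SAM-style header
    line or an interval record) raises ValueError.
    """
    lines = [ln.strip() for ln in list_of_paths_text.splitlines() if ln.strip()]
    bad = [ln for ln in lines if not ln.startswith("gs://")]
    if bad:
        raise ValueError(f"{len(bad)} line(s) are not gs:// paths, e.g. {bad[0]!r}")
    return len(lines)


# Placeholder content: two lines, so two scatters would be expected.
paths_file = (
    "gs://my_buckets/test_bucket/temp_0001_of_2/scattered.interval_list\n"
    "gs://my_buckets/test_bucket/temp_0002_of_2/scattered.interval_list\n"
)
print(count_scatters(paths_file))
```

Passing the text of a Picard interval list itself to this function raises an error on the first header line, which mirrors the situation in the original post.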

    Kind regards,

    Jason

  • Comment author
    Matthew Miller

    Jason:  You are so correct.

    I only looked at the header of the file (the part that is the SAM file style header) and didn't look at the bottom, interval section of that file. And I didn't realize that I needed to run Picard IntervalListTools after generating an interval list using Picard ScatterIntervalsByNs.  

    Doing those two allowed me to generate a folder of interval lists, and HaplotypeCaller is happily sparking to all of them now. I'm very happy about this!  Thank you so much!
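Once the folder of scattered interval lists is in a bucket, the file the workflow reads is just one gs:// path per line. The sketch below builds those lines, assuming the folder was uploaded unchanged and follows the temp_NNNN_of_N/scattered.interval_list naming seen in the hg38 example above. The bucket prefix and the `scattered_list_paths` helper are hypothetical.

```python
from typing import List


def scattered_list_paths(bucket_prefix: str, scatter_count: int) -> List[str]:
    """Build one gs:// path per scattered interval list.

    Assumes directories named temp_0001_of_N ... temp_000N_of_N, each
    holding a file called scattered.interval_list, under bucket_prefix.
    """
    prefix = bucket_prefix.rstrip("/")
    return [
        f"{prefix}/temp_{i:04d}_of_{scatter_count}/scattered.interval_list"
        for i in range(1, scatter_count + 1)
    ]


# Hypothetical bucket location; replace with your own.
lines = scattered_list_paths("gs://my_buckets/test_bucket", 3)
print("\n".join(lines))
```

The printed text can be saved as a plain text file, uploaded to the bucket, and given to the workflow as the interval list input. It is worth confirming the actual directory names in your own output folder before relying on this pattern.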

  • Comment author
    Jason Cerrato

    Hi Matthew,

    Glad to hear it! If we can be of any further assistance, please let us know!

    Kind regards,

    Jason

