Interval problem with haplotypecaller-gvcf-gatk4 (suggested workflow)
I am trying to run the Suggested Workflow: haplotypecaller-gvcf-gatk4.
I have created an interval list using the Picard ScatterIntervalsByNs tool.
However, when I try to run the Workflow I get an error:
Caused by: java.lang.IllegalArgumentException: Could not build the path "RRCD01000051.1 1 24628 + ACGTmer". It may refer to a filesystem not supported by this instance of Cromwell. Supported filesystems are: Google Cloud Storage, DRS. Failures:
Google Cloud Storage: Path "RRCD01000051.1 1 24628 + ACGTmer" does not have a gcs scheme (IllegalArgumentException)
This error is thrown for each of the intervals in my interval list.
I'm guessing that the workflow is coded wrong, but I can't tell. Any suggestions?
Comments
Hi Matthew,
Can you share your interval file and let us know how it was made? Can you confirm that it fits one of the mandatory formats listed in this document?
https://gatk.broadinstitute.org/hc/en-us/articles/360035531852-Intervals-and-interval-lists
Kind regards,
Jason
Jason: I accidentally responded to a different post. I made one interval list by hand. It runs fine locally on my computer with the -L flag.
Here are the contents of the GATK-style list:
CM012145.1
CM012146.1
This works on my local GATK. That's why I think it is a Cromwell issue.
I made a second list using ScatterIntervalsByNs.
Here are a few lines of that file. I'm realizing that the "UR:" field is wrong for Terra:
@HD VN:1.6 SO:coordinate
@SQ SN:CM012114.1 LN:197551010 M5:92def859021d7c0d4466408ad15c22a3 UR:file:/home/ornithology/anna_chromosome.fasta
@SQ SN:CM012115.1 LN:151342139 M5:c184c92dc94c91404739ecf6f6899ced UR:file:/home/ornithology/anna_chromosome.fasta
@SQ SN:CM012116.1 LN:114810999 M5:4457415245bdf6f612e1706f51bec42b UR:file:/home/ornithology/anna_chromosome.fasta
@SQ SN:CM012117.1 LN:18597117 M5:ab515dae2e2b4fcd20a485c0f8116503 UR:file:/home/ornithology/anna_chromosome.fasta
@SQ SN:CM012118.1 LN:16645885 M5:ebc880ff0aed1141f631443972eca55b UR:file:/home/ornithology/anna_chromosome.fasta
@SQ SN:CM012119.1 LN:35401958 M5:2de3cd2e5ca23c172fb2fbeeda66314e UR:file:/home/ornithology/anna_chromosome.fasta
@SQ SN:CM012120.1 LN:39139214 M5:62d596d414b8601f0be908d3d123cac8 UR:file:/home/ornithology/anna_chromosome.fasta
@SQ SN:CM012121.1 LN:31090148 M5:cafc8a27b387908e1e25a89810ea90c4 UR:file:/home/ornithology/anna_chromosome.fasta
@SQ SN:CM012122.1 LN:25686456 M5:fdfc8d2d78653af8228f36943d3ae47a UR:file:/home/ornithology/anna_chromosome.fasta
@SQ SN:CM012123.1 LN:22664390 M5:2dd29b7e52af7e0e6ddf5baa822d58f9 UR:file:/home/ornithology/anna_chromosome.fasta
@SQ SN:CM012124.1 LN:20302349 M5:718cae94ea134a321b8d3749cebf0934 UR:file:/home/ornithology/anna_chromosome.fasta
@SQ SN:CM012125.1 LN:21352500 M5:b9f95aa6419b1012081a422f024c0456 UR:file:/home/ornithology/anna_chromosome.fasta
@SQ SN:CM012126.1 LN:17696115 M5:bd24b46fb794e857aafdf0b42bbdca87 UR:file:/home/ornithology/anna_chromosome.fasta
@SQ SN:CM012127.1 LN:15497061 M5:b2eab356457120f75acc44fb7aba375a UR:file:/home/ornithology/anna_chromosome.fasta
@SQ SN:CM012128.1 LN:13887164 M5:a38c11dd85dd4010c6dd5da40fc47a71 UR:file:/home/ornithology/anna_chromosome.fasta
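A note for anyone checking their own file: a Picard-style interval list has a SAM-style header (lines starting with @) followed by tab-separated interval records (contig, start, end, strand, name). The lines pasted above are all header lines. Below is a minimal Python sketch, assuming a well-formed five-column file, that separates the two parts. The function name summarize_interval_list and the @SQ length in the example are made up for illustration; the record is the one quoted in the error message.

```python
def summarize_interval_list(text):
    """Split Picard-style interval list text into header lines and interval records."""
    header, records = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("@"):
            header.append(line)  # SAM-style header line (@HD, @SQ, ...)
        else:
            # Interval record: contig, start, end, strand, name (tab-separated)
            contig, start, end, strand, name = line.split("\t")
            records.append((contig, int(start), int(end), strand, name))
    return header, records


# Illustrative input: the @SQ length is hypothetical, the record comes from the error message.
example = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:RRCD01000051.1\tLN:24628",
    "RRCD01000051.1\t1\t24628\t+\tACGTmer",
])

header, records = summarize_interval_list(example)
print(len(header), "header lines,", len(records), "interval records")
```

If the record count is zero, the file contains only a header and there is nothing to scatter over.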
Hi Matthew,
`scattered_calling_intervals_list` in the workflow is expecting a file with a list of interval files that should be stored in Google buckets. Here is an example of what it could look like: gs://gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals.txt
Jason: Thank you for your help. How do I navigate to gs://gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals.txt?
There doesn't seem to be any way to access it.
Hi Matthew Miller! You should be able to get to the Google bucket where that file is hosted with this link. The bucket is publicly available, so you can either download the file or just point to it in your Workflow Configuration with the String path "gs://gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals.txt".
Ok. I found how to access the file (a Google Cloud tutorial video would help us noobs).
So if I understand correctly, in the workflow, I would generate a file in a google bucket that simply points to the actual list file.
That file could have a single line, for example:
gs://my_buckets/test_bucket/my_scattered.interval_list
And the file my_scattered.interval_list would simply be the Picard-generated file:
@HD VN:1.6 SO:coordinate
@SQ SN:CM012114.1 LN:197551010 M5:92def859021d7c0d4466408ad15c22a3 UR:file:/home/ornithology/anna_chromosome.fasta
... etc.
I'll try that!
Yes, I do need to create a new interval file, as I am not using a human genome. I am hoping that at least some of the Terra workflows will work on non-model organisms. There are a lot of us who want to use the GATK pipeline to map to birds, mosquitoes, fish, etc.!
Ok. So I have it running. But it didn't spark the run across multiple intervals. Instead only one interval is running. I guess I don't really understand what is going on.
In your system, you have a list with 50 lines, each pointing to a file in a bucket. However, the 50 files that are pointed to appear to be identical.
Should I do the same thing, i.e. put the same file in 50 buckets and create a `my_scattered.interval_list` file that points to those 50 identical files?
Thanks for your help!
Hi Matthew,
The hg38_wgs_scattered_calling_intervals.txt contains 50 lines. Each line points to a different intervals list - the contents of each are different. You will see that each gsutil bucket path contains a `temp_00**_of_50` notation to show that there are 50 total interval list files. If you run the command `gsutil cat gs://gcp-public-data--broad-references/hg38/v0/scattered_calling_intervals/temp_0002_of_50/scattered.interval_list` and look through, the contents of this file should be different than the others. Each of the scattered.interval_lists is a picard style interval list.
You are correct in your understanding that the interval list in Terra will point to a file in a Google bucket that contains a path to the actual picard style interval file you want to use and yes you will need to generate one for your organism of choice.
From your explanation it looks like you have a file named `my_scattered.interval_list` in the bucket gs://my_buckets/test_bucket. How many lines are in your `my_scattered.interval_list`? If there are n lines in the file, you should get n scatters from the workflow. If you would like to share your `my_scattered.interval_list`, we are happy to take a look and see if it was set up correctly! As a test, you can break your interval list into two files, which should generate two scatters.
Kind regards,
Jason
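The "n lines, n scatters" rule can be checked before launching the workflow. The following is a minimal Python sketch, not part of the workflow itself: it counts the non-empty lines in a list file and flags any line that is not a gs:// path, which is the condition behind the "does not have a gcs scheme" error at the top of this thread. The function name check_scatter_list and the bucket paths are hypothetical.

```python
def check_scatter_list(text):
    """Return (number of scatters, lines that are not gs:// paths)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    not_gcs = [ln for ln in lines if not ln.startswith("gs://")]
    return len(lines), not_gcs


# Hypothetical file-of-files: one bucket path per line, so two scatters.
good = (
    "gs://my_buckets/test_bucket/temp_0001_of_2/scattered.interval_list\n"
    "gs://my_buckets/test_bucket/temp_0002_of_2/scattered.interval_list\n"
)

# A Picard interval list supplied by mistake: every line gets treated as a path.
bad = "@HD\tVN:1.6\tSO:coordinate\nRRCD01000051.1\t1\t24628\t+\tACGTmer\n"

n_good, problems_good = check_scatter_list(good)
n_bad, problems_bad = check_scatter_list(bad)
print(n_good, problems_good)
print(n_bad, problems_bad)
```

An empty list of problem lines means every entry at least looks like a bucket path; it does not confirm that the files exist or are readable.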
Jason: You are so correct.
I only looked at the header of the file (the part that is the SAM file style header) and didn't look at the bottom, interval section of that file. And I didn't realize that I needed to run Picard IntervalListTools after generating an interval list using Picard ScatterIntervalsByNs.
Doing those two allowed me to generate a folder of interval lists, and HaplotypeCaller is happily sparking to all of them now. I'm very happy about this! Thank you so much!
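For others following the same route, here is a rough Python sketch of the last step: turning a local folder of scattered interval lists into the list file the workflow expects. It assumes the folder uses the temp_XXXX_of_N/scattered.interval_list layout seen in the hg38 example and that the folder is uploaded to the bucket with the same relative layout. The function name build_scatter_list and the bucket prefix are hypothetical, and the demo creates placeholder files in a temporary directory rather than real interval lists.

```python
import tempfile
from pathlib import Path


def build_scatter_list(local_dir, bucket_prefix, filename="scattered.interval_list"):
    """Return one gs:// path per scattered interval list found under local_dir."""
    root = Path(local_dir)
    # Paths relative to the local folder, in a stable (sorted) order
    relative = sorted(p.relative_to(root).as_posix() for p in root.rglob(filename))
    prefix = bucket_prefix.rstrip("/")
    return [f"{prefix}/{rel}" for rel in relative]


# Demo with placeholder files; a real folder would come from the Picard tools.
with tempfile.TemporaryDirectory() as tmp:
    for sub in ("temp_0001_of_2", "temp_0002_of_2"):
        folder = Path(tmp) / sub
        folder.mkdir()
        (folder / "scattered.interval_list").write_text("@HD\tVN:1.6\tSO:coordinate\n")
    # Hypothetical bucket prefix; replace with the real upload location.
    lines = build_scatter_list(tmp, "gs://my_buckets/test_bucket")

print("\n".join(lines))
```

Writing the printed lines to a text file in the bucket and pointing `scattered_calling_intervals_list` at it should give one scatter per line.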
Hi Matthew,
Glad to hear it! If we can be of any further assistance, please let us know!
Kind regards,
Jason