Raw genomics data comes in the form of many reads from the sequencer. Since it would be messy and time-consuming to type the location of every one of these data files as input to a WDL workflow, the input is often a 'list' file. This article is a step-by-step guide to creating a list file of reads for input to a workflow.
What is a list file?
A list file is just a list of all the data, where each row is a link to an unmapped BAM file in the cloud. For example, it is the expected input to 1_Processing-For-Variant-Discovery.
If you open a list file in a text editor, it looks like this:
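The original screenshot is not reproduced here; as a sketch, with hypothetical bucket and file names, the contents are simply one gs:// URI per row, with no header and no separators:

```
gs://my-workspace-bucket/sample1.unmapped.bam
gs://my-workspace-bucket/sample2.unmapped.bam
gs://my-workspace-bucket/sample3.unmapped.bam
```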
Make a list file of data in your Google bucket using gsutil
1. Open a terminal configured to run gsutil.
For detailed instructions on how to run gsutil in your terminal, see Moving data to/from a Google bucket.
2. Output a list of the bam files (in a Google bucket) to a local file.
To copy to a file named `ubams.list` use the following command:
gsutil ls gs://your_data_Google_bucket_id/ > ubams.list
Note: You need to replace `your_data_Google_bucket_id` with the path to your workspace Google bucket (or wherever your data are). You can copy your workspace bucket path to your clipboard by clicking the clipboard icon at the far right of your dashboard tab under `Google bucket`.
To save to a different list file name, replace `ubams.list` in the command above with the filename of your choice. Just remember to use that filename in the commands below!
3. Copy `ubams.list` to your workspace Google bucket from your local machine.
gsutil cp ubams.list gs://your_data_Google_bucket_id/
You can verify that the list file is in your workspace bucket by opening your Google bucket in a browser from the dashboard page (right column).
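The expected shape of the list file can also be sanity-checked locally before uploading. A minimal sketch, using a hand-written stand-in for the `gsutil ls` output (the bucket and file names here are hypothetical):

```shell
# Simulate the output of `gsutil ls gs://your_data_Google_bucket_id/`
# with hypothetical file names, then verify the resulting list file.
printf 'gs://my-bucket/sample1.unmapped.bam\ngs://my-bucket/sample2.unmapped.bam\n' > ubams.list

# Every row of a list file should be a single gs:// URI.
if grep -qvE '^gs://' ubams.list; then
  echo "list contains non-gs:// lines"
else
  echo "list OK: $(wc -l < ubams.list | tr -d ' ') files"
fi
```

This catches the most common problem with hand-edited list files: a stray blank line or local path mixed in among the gs:// URIs.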
I believe the first command in section 2 should read:
as opposed to "gs:/your_data_Google_bucket_id" written above.
Thanks for the catch, STEVEN GILHOOL! We on the User Ed team really appreciate when users help us identify errors big and small in our documentation. I updated the article with the correct command.
It is not clear how I would use this example. I took a look at https://github.com/gatk-workflows/gatk4-data-processing/blob/master/processing-for-variant-discovery-gatk4.wdl . It is not clear which input argument ubams.list maps to.
I need to process a list of files using WDL. My guess is there are at least two ways:
1. I think WDL accepts a list of files. Given I have hundreds of files, I am not sure if this will work. Assuming I have a lot of local storage, does Cromwell/WDL limit the maximum number of files in a list? My guess is I will have to define the list in a WDL inputs JSON file to make this feasible.
2. Use the approach you outlined above and pass a single file to my docker/WDL command. For this to work, I would need to use gsutil to copy the files locally. I know I can use gsutil in a Jupyter notebook. I assume to use gsutil from my docker I would need an image like https://hub.docker.com/r/google/cloud-sdk/ . It is not clear how authentication works. The documentation requires me to authenticate by logging on to the container interactively.
Kind regards Andy
Hi Andrew Davidson,
Thank you for your questions about this. As far as which input argument the ubams.list maps to in the Pre-Processing for Variant Discovery pipeline, this should correspond to the "flowcell_unmapped_bams_list" required input. As far as I know, there shouldn't be a limit to the number of files that can be included in a list, but please let us know if you run into any issues with storage/file size. As for your last question, could you explain what you mean by it not being clear how authentication works for the docker? The authorization command essentially specifies the Google account to be used for downloading the files. This allows you to download files locally that are stored in a Terra workspace by specifying your Terra account login. Please let me know if you have any further questions.
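Concretely, the gs:// path of the list file goes into the pipeline's inputs JSON under that key. A sketch, with a hypothetical bucket path (the exact key prefix comes from the workflow name in the .wdl file, so check the workflow's own inputs template):

```json
{
  "PreProcessingForVariantDiscovery_GATK4.flowcell_unmapped_bams_list": "gs://my-bucket/ubams.list"
}
```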
My understanding is our WDL task commands run in/on a docker container. We only have access to the commands and programs installed on the container. To use gsutil, it would need to be installed, and most docker containers probably do not have gsutil installed. Using Google I found a container created by Google Cloud Platform that has gsutil installed. They state that to use gsutil I would have to interactively log on to the running container. The container is launched by Terra/Cromwell; as far as I know, this is not possible.
By contrast, when I start a Terra/Jupyter notebook, the container already has gsutil installed and is configured so that gsutil is authenticated. This allows me to access the files in my bucket.
Thank you for your explanation. You're correct that in order to run gsutil in a docker, you would need to use the docker container that already has gsutil installed. However, you shouldn't have an issue logging in or authenticating your account when using this docker. In this article (https://support.terra.bio/hc/en-us/articles/4409101169051-Moving-data-to-from-a-Google-bucket-workspace-or-external-#heading-3), if you scroll down to "Step 1. Open gsutil in a terminal" and then click the "local terminal instance" option, you should see instructions on how to do this. Please let me know if this doesn't answer your question.
I really appreciate your tenacity. I realize Terra is beta. There are going to be some edge cases.
The directions you provided are for transferring files between a Terra workspace bucket and some other non-Terra system like my local computer. The directions are not for transferring files between a Terra workspace bucket and a WDL-based job. Terra/Cromwell/WDL normally does this automatically; it uses gsutil. My problem is I cannot get Array[File] to work. I am basically hacking around trying to figure out how to get Array[File] to work or find some other workaround.
Passing a single file that contains the list of files, instead of using Array[File], has a lot of merits. It would make my workflow much easier to manage and more reproducible than using Array[File].
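For what it's worth, this single-list-file pattern is how WDL workflows typically consume such inputs: the workflow takes the list as one small File input and expands it with the standard-library read_lines() function into an Array[File], which Cromwell then localizes into each task's container automatically (no gsutil call inside the task). A minimal sketch in draft-2 WDL, with hypothetical workflow and task names:

```wdl
workflow ProcessFromList {
  # A single small file: one gs:// URI per row (e.g. ubams.list)
  File list_of_files

  # read_lines() is a WDL standard-library function; each row of the
  # list file becomes one element of the array.
  Array[File] bams = read_lines(list_of_files)

  # Cromwell localizes each File before the task command runs.
  scatter (bam in bams) {
    call DoSomething { input: bam = bam }
  }
}

task DoSomething {
  File bam
  command {
    echo "processing ${bam}"
  }
  runtime {
    docker: "ubuntu:18.04"
  }
}
```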
It would be nice if there were an easy way for me to "localize" the contents of my listOfFilesFile.
I realize Terra is beta. Based on my previous "reverse engineering" experiments, I am not sure if this will work because of the authentication challenge. Is it possible to follow up with an engineer? It would save me a lot of time.
Based on my previous reverse-engineering experiments, this is what I think happens: GCP authentication happens on the VM but outside of my container. If I am correct, I will not be able to use gsutil. I guess I can hack a WDL that has a hardcoded bucket URL and see if gsutil cp works.
Okay, thank you for your explanation. I apologize; I think I was misunderstanding your use case and the struggles you're seeing. I'm going to talk to some other members of the Terra team and I will follow up with you.