Creating a list file of reads for input to a workflow

Allie Hajian

Raw genomics data comes in the form of many reads from the sequencer. Since it would be messy and time-consuming to type in the location of every one of these data files as input for a WDL, the input is often a 'list' file. This article is a step-by-step guide to creating a list file of reads for input to a workflow.

What is a list file?

A list file is just a list of all the data, where each row is a link to an unmapped BAM file in the cloud. For example, it is the expected input to the 1_Processing-For-Variant-Discovery workflow.

If you open a list file in a text editor, it looks like this:
[Screenshot: a list file open in a text editor, showing one gs:// uBAM path per line]
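For example, a list file for three unmapped BAMs might contain the following (the bucket ID and file names here are hypothetical):

gs://your_data_Google_bucket_id/NA12878_flowcell1.unmapped.bam
gs://your_data_Google_bucket_id/NA12878_flowcell2.unmapped.bam
gs://your_data_Google_bucket_id/NA12878_flowcell3.unmapped.bam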

Make a list file of data in a Google bucket using gcloud storage

1. Open a terminal configured to run gcloud storage. 

For detailed instructions on how to run gcloud storage in your terminal, see Moving data to/from a Google bucket.
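If the Google Cloud SDK is installed but your terminal is not yet signed in, you can typically authenticate with the standard command below (it opens a browser window where you log in with your Terra Google account):

gcloud auth login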

2. Output a list of the BAM files (in a Google bucket) to a local file. 

To write the list to a file named `ubams.list`, use the following command:

gcloud storage ls gs://your_data_Google_bucket_id/ > ubams.list

Note: You need to replace `your_data_Google_bucket_id` with the path to your workspace Google bucket (or wherever your data are). You can copy your workspace bucket path to your clipboard by clicking the clipboard icon at the far right of your dashboard tab under `Google bucket`. 

To save to a different list file name, replace "ubams.list" in the command above with the filename of your choice. Just remember to use that filename in the commands below!
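Note that a plain ls lists everything in the bucket, including any files that aren't uBAMs. Assuming your unmapped BAMs all share the .bam extension, you can restrict the listing with a wildcard:

gcloud storage ls gs://your_data_Google_bucket_id/*.bam > ubams.list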

3. Copy ubams.list to your workspace Google bucket from your local machine. 

gcloud storage cp ubams.list gs://your_data_Google_bucket_id/ 

You can verify that the list file is in your workspace bucket by opening your Google bucket in a browser from the dashboard page (right column). 
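You can also verify from the terminal by listing the file directly:

gcloud storage ls gs://your_data_Google_bucket_id/ubams.list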


Comments


  • STEVEN GILHOOL

    I believe the first command in section 2 should read:

    gsutil ls gs://your_data_Google_bucket_id/ > ubams.list

    as opposed to "gs:/your_data_Google_bucket_id" written above.

  • Allie Hajian

    Thanks for the catch, STEVEN GILHOOL! We on the User Ed team really appreciate it when users help us identify errors, big and small, in our documentation. I updated the article with the correct command.

  • Andrew Davidson

    Hi Allie

    It is not clear how I would use this example. I took a look at https://github.com/gatk-workflows/gatk4-data-processing/blob/master/processing-for-variant-discovery-gatk4.wdl. It is not clear which input argument ubams.list maps to.

    I need to process a list of files using WDL. My guess is there are at least two ways:

    1. I think WDL accepts a list of files. Given that I have hundreds of files, I am not sure if this will work. Assuming I have a lot of local storage, does Cromwell/WDL limit the maximum number of files in a list? My guess is I will have to define the list in a wdl.input.json file to make this feasible.

    2. Use the approach you outlined above and pass a single file to my docker/WDL command. For this to work, I would need to use gsutil to copy the files locally. I know I can use gsutil in a Jupyter notebook. I assume that to use gsutil from my docker I would need an image like https://hub.docker.com/r/google/cloud-sdk/. It is not clear how authentication works; the documentation requires me to authenticate by logging on to the container interactively:

    docker run -ti --name gcloud-config google/cloud-sdk gcloud auth login

    Kind regards Andy
  • Pamela Bretscher

    Hi Andrew Davidson,

    Thank you for your questions about this. As far as which input argument the ubams.list maps to in the Pre-Processing for Variant Discovery pipeline, this should correspond to the "flowcell_unmapped_bams_list" required input.

    As far as I know, there shouldn't be a limit to the number of files that can be included in a list, but please let us know if you run into any issues with storage/file size.

    As for your last question, could you explain what you mean by it not being clear how authentication works for the docker? The authorization command essentially specifies the Google account to be used for downloading the files. This allows you to download files locally that are stored in a Terra workspace by specifying your Terra account login. Please let me know if you have any further questions.
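    For example, an inputs JSON entry for that pipeline might look like the sketch below (the input name comes from the linked WDL; the workflow name and bucket path here are illustrative):

    {
      "PreProcessingForVariantDiscovery_GATK4.flowcell_unmapped_bams_list": "gs://your_data_Google_bucket_id/ubams.list"
    }

    Inside the WDL, a list input like this is typically read with read_lines(), which turns the file into an array of paths that Cromwell can localize as files, roughly like this (ProcessBam is a stand-in task name):

    File flowcell_unmapped_bams_list
    Array[File] flowcell_unmapped_bams = read_lines(flowcell_unmapped_bams_list)

    scatter (unmapped_bam in flowcell_unmapped_bams) {
      call ProcessBam { input: input_bam = unmapped_bam }
    }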

    Kind regards,

    Pamela

  • Andrew Davidson

    Hi Pamela

    My understanding is that our WDL task commands run in/on a docker container. We only have access to the commands and programs installed on the container. To use gsutil, it would need to be installed. Most docker containers probably do not have gsutil installed. Using Google, I found a container created by Google Cloud Platform that has gsutil installed. They state that to use gsutil I would have to interactively log on to the running container. The container is launched by Terra/Cromwell; as far as I know, that is not possible.

    By contrast, when I start a Terra Jupyter notebook, the container already has gsutil and is configured so that gsutil is authenticated. This allows me to access the files in my bucket.

    Kind regards,

    Andy
  • Pamela Bretscher

    Hi Andy,

    Thank you for your explanation. You're correct that in order to run gsutil in a docker, you would need to use the docker container that already has gsutil installed. However, you shouldn't have an issue logging in or authenticating your account when using this docker. In this article (https://support.terra.bio/hc/en-us/articles/4409101169051-Moving-data-to-from-a-Google-bucket-workspace-or-external-#heading-3), if you scroll down to "Step 1. Open gsutil in a terminal" and then click the "local terminal instance" option, you should see instructions on how to do this. Please let me know if this doesn't answer your question.

    Kind regards,

    Pamela

  • Andrew Davidson

    Hi Pamela

    I really appreciate your tenacity. I realize Terra is beta. There are going to be some edge cases.

    The directions you provided are for transferring files between a Terra workspace bucket and some other non-Terra system, like my local computer. The directions are not for transferring files between a Terra workspace bucket and a WDL-based job. Terra/Cromwell/WDL normally does this automatically; it uses gsutil. My problem is I cannot get Array[File] to work. I am basically hacking around trying to figure out how to get Array[File] to work or find some other workaround.

    Passing a single file that contains the list of files, instead of using Array[File], has a lot of merit. It would make my workflow much easier to manage and more reproducible than using Array[File].

    It would be nice if there were an easy way for me to "localize" the contents of my listOfFilesFile.

    Based on my previous "reverse engineering" experiments, I am not sure if this will work because of the authentication challenge. Is it possible to follow up with an engineer? It would save me a lot of time.

    This is what I think happens:

    1. Cromwell generates a script based on my WDL
    2. Cromwell starts a VM for my job
    3. Cromwell uses gsutil cp to "localize" my file input
    4. Cromwell executes docker with my image and -v to make the localized file available to my container

    So it looks like GCP authentication happens on the VM but outside of my container. If I am correct, I will not be able to use gsutil.

    I guess I can hack a WDL that has a hardcoded bucket URL and see if gsutil cp works.

  • Pamela Bretscher

    Hi Andy,

    Okay, thank you for your explanation. I apologize; I think I was misunderstanding your use case and the struggles you're seeing. I'm going to talk to some other members of the Terra team and will follow up with you.

    Kind regards,

    Pamela

  • Andre Rico

    Hi everyone, does anyone have a solution for this? I'm working on a WDL that takes a list of files for processing, with the list being provided via a .txt file. The files are stored in the Workspace data bucket, and I'm trying to access them within the WDL using gs://... paths in the .txt file. However, I haven't been successful so far. Any insights or suggestions would be greatly appreciated!

    The test:

    version 1.0

    workflow download_workflow {
      input {
        File urls  # Input file containing list of URLs (one per line)
      }

      call download_task {
        input:
          urls_file = urls  # Pass the file containing the list of URLs to the task
      }

      output {
        Array[File] downloaded_files = download_task.downloaded_files
      }
    }

    task download_task {
      input {
        File urls_file  # The input file containing list of URLs
      }

      command <<<
        set -e -x -o pipefail

        # Display the contents of the file for debugging purposes
        echo "Content of the URLs file:"
        cat ~{urls_file}

        # Create a directory to store the downloaded files
        mkdir -p downloaded_files

        # Read the URLs from the input file and download each one
        # '|| [[ -n "$url" ]]' also processes a final line without a trailing newline
        while read -r url || [[ -n "$url" ]]; do
          echo "Processing URL: $url"
          if [[ $url == http* ]]; then
            echo "Downloading $url via wget"
            wget -P downloaded_files "$url" --verbose
          elif [[ $url == gs://* ]]; then
            echo "Downloading $url via gsutil"
            gsutil cp "$url" downloaded_files/
          else
            echo "Unsupported URL format: $url"
          fi
        done < ~{urls_file}
      >>>

      output {
        # Capture all the downloaded files from the 'downloaded_files' directory
        Array[File] downloaded_files = glob("downloaded_files/*")
      }

      runtime {
        docker: "google/cloud-sdk:slim"  # Ensure the Docker image has both wget and gsutil
        memory: "4G"
        cpu: 2
      }
    }
