Map Array[File] to Object[samplename, Array[File]] based on filename

Post author
Matthias De Smet

Hi,

We're creating a workflow to demultiplex Illumina BCLs to uBAM using Picard. The downside here is that Picard creates one file per lane.

I'm trying to merge the per-sample, per-lane uBAMs into 1 uBAM per sample, based on the sample name in the filenames.

Currently I have a flattened array of all uBAM files for all samples and lanes (subject to change if needed), but I'd need to do something like this.

 

flattened_array = [file1_L1.ubam, file1_L2.ubam, fileX_L1.ubam, fileX_L2.ubam]

to 

[{
"samplename": file1,
"files": [file1_L1.ubam, file1_L2.ubam]
},

{
"samplename": fileX,
"files": [fileX_L1.ubam, fileX_L2.ubam]
}]

so I can merge the per-lane files into 1 file per sample to feed into the mapping/variant calling workflow.

Any tips or tricks? I've kind of fixed it with a custom python script, but that doesn't scatter to one merge task per sample, which would be the ideal case.

This is vaguely related to https://support.terra.bio/hc/en-us/community/posts/360060567131-Map-output-array-to-samples-in-set-?input_string=Convert%20Array%5BFile%5D%20to%20Object%5B I suppose.

 

Thanks

Matthias

Comments

7 comments

  • Comment author
    Jason Cerrato

    Hi Matthias,

    Thanks for writing in. We'll take a closer look at this request and get back to you as soon as we can!

    Kind regards,

    Jason

    0
  • Comment author
    Tiffany Miller

    Hi Matthias, 

    There are two workflows for fastqtoUbam in this featured workspace that I think demonstrate a way of doing what you want in Terra.

    The Method repo has examples of other people's workflows for BCL conversion, but mostly to fastq, if that is helpful?

    Thanks!

     

     

    0
  • Comment author
    Matthias De Smet

    Hi Tiffany,

     

    Thanks for the examples, but those don't really answer my question.

    I'm starting from an array of files that contains all uBAMs for all lanes of a flowcell, for all samples.

    I'm trying to figure out how to construct an object, or an array of mapped pairs, that contains all files per sample plus a sample name derived from the filenames.

    I have a script that does just that, but I'm struggling to get to a valid output definition, which I can then use later in the workflow.

    Hope this clarifies my question.

     

    Thanks again

    Matthias

     

    0
  • Comment author
    Beri
    • Edited

    I don't think there is an easy way in WDL to reorganize the flattened array into an Object[String, Array[File]]. If there is, the best place to find it is the WDL specification.

    I'm not certain this would work, but you could have a task that groups the file paths per sample and writes each group to its own file (also known as a file of file names, or fofn). The output would look something like:

    file1.txt
    file2.txt
    file3.txt


    Each of those files contains the paths to the ubam files of one sample, so for file1.txt:

    gs://sample1_L1.ubam
    gs://sample1_L2.ubam
    gs://sample1_L3.ubam

    Then scatter over those txt files, read the contents of each fileX.txt, and pass the listed ubams to the merge task:

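    # Inside the workflow block:
    call create_fofn {
        input:
            array_of_files = array_of_files
    }

    # Calls merge_ubam once for each fofn in the array of fofns
    scatter (fofn in create_fofn.array_fofn) {
        call merge_ubam {
            input:
                array_of_files = read_lines(fofn)
        }
    }

    ################################

    task merge_ubam {
        input {
            Array[File] array_of_files
        }
        command {...}
    }

    task create_fofn {
        input {
            # If the fofns must hold the original gs:// paths, Array[String] may be
            # needed here, since File inputs are localized before the command runs.
            Array[File] array_of_files
        }
        # python script in command that organises the ubam paths into one fofn per sample
        command {...}
        output {
            # one fofn per sample, e.g. file1.txt, file2.txt, ...
            Array[File] array_fofn = glob("*.txt")
        }
    }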
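
As a rough illustration of the grouping step that create_fofn would run, here is a minimal Python sketch. It is not the script used in this thread. It assumes filenames follow a <sample>_L<lane>.ubam pattern; the helper names (sample_name, write_fofns) and the gs://bucket/... paths in the usage note are hypothetical.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming scheme: <sample>_L<lane>.ubam, e.g. sample1_L1.ubam
LANE_SUFFIX = re.compile(r"_L\d+\.ubam$")


def sample_name(path: str) -> str:
    """Strip the directory part and the _L<lane>.ubam suffix from a path."""
    filename = path.rsplit("/", 1)[-1]
    return LANE_SUFFIX.sub("", filename)


def write_fofns(paths, out_dir="."):
    """Group paths by sample name and write one <sample>.txt fofn per sample."""
    groups = defaultdict(list)
    for path in paths:
        groups[sample_name(path)].append(path)

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for sample, files in sorted(groups.items()):
        fofn = out / f"{sample}.txt"
        # One path per line, so WDL's read_lines() can turn it back into an array
        fofn.write_text("\n".join(files) + "\n")
        written.append(fofn)
    return written
```

Calling write_fofns(["gs://bucket/sample1_L1.ubam", "gs://bucket/sample1_L2.ubam", "gs://bucket/sampleX_L1.ubam"]) would write sample1.txt and sampleX.txt into the working directory. A filename that does not match the assumed pattern ends up in a fofn named after the whole filename, so it may be worth checking the names first.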

     

    1
  • Comment author
    Matthias De Smet

    That’s actually a great idea. I’ll have to try it. I will get back to you tomorrow. Thanks for the tip!
    Matthias

    0
  • Comment author
    Matthias De Smet

    To get back to this, awesome idea, worked like a charm!

    1
  • Comment author
    Giulio Genovese

    The appropriate way to group/aggregate elements of an array is to use the collect_by_key() function, although this is only available in the WDL development version. As an example, the following WDL:

    version development

    workflow main {
        input {
            Array[String] flattened_array = ["file1_L1.ubam", "file1_L2.ubam", "file2_L1.ubam", "file2_L2.ubam", "fileX_L1.ubam", "fileX_L2.ubam"]
        }

        scatter (file in flattened_array) {
            String sample_array = basename(basename(file, "_L1.ubam"), "_L2.ubam")
        }

        output {
            Array[Pair[String, Array[String]]] grouped_array = as_pairs(collect_by_key(zip(sample_array, flattened_array)))
        }
    }

    This will output the following array without invoking a task:

    "main.grouped_array": [{
        "right": ["file1_L1.ubam", "file1_L2.ubam"],
        "left": "file1"
    }, {
        "right": ["fileX_L1.ubam", "fileX_L2.ubam"],
        "left": "fileX"
    }, {
        "right": ["file2_L1.ubam", "file2_L2.ubam"],
        "left": "file2"
    }]

    And it can be easily modified to group files the way needed.
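
For readers less familiar with these WDL functions, the following Python sketch mimics what the zip / collect_by_key / as_pairs chain does to the same input. It is only an analogue for illustration, not how any WDL engine implements it. The collect_by_key function below is a hypothetical stand-in, and the regular expression assumes a <sample>_L<lane>.ubam naming scheme.

```python
import re


def collect_by_key(pairs):
    """Group the values of (key, value) pairs under their key, keeping first-seen key order."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return grouped


flattened_array = [
    "file1_L1.ubam", "file1_L2.ubam",
    "file2_L1.ubam", "file2_L2.ubam",
    "fileX_L1.ubam", "fileX_L2.ubam",
]

# Derive one sample name per file (the role of the scatter block)
sample_array = [re.sub(r"_L\d+\.ubam$", "", f) for f in flattened_array]

# zip -> collect_by_key -> list of (sample, files) pairs (the role of as_pairs)
grouped_array = list(collect_by_key(zip(sample_array, flattened_array)).items())

for sample, files in grouped_array:
    print(sample, files)
```

The Python version keeps samples in first-seen order, which may differ from the order an engine returns. The regular expression covers any lane number, whereas the nested basename() calls only strip L1 and L2; WDL's sub() function, which takes a regular expression, might serve the same purpose there.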

    0
