Map Array[File] to Object[samplename, Array[File]] based on filename

October 23, 2020 09:00
7 comments

Hi,

We're creating a workflow to demultiplex Illumina BCLs to uBam using Picard. The downside is that Picard creates one file per lane.

I'm trying to merge the per-sample, per-lane uBams into one uBam per sample, based on the sample name in the filenames.

Currently I have a flattened array of all uBam files for all samples and lanes (subject to change if needed), but I need to turn it into something like this.

flattened_array = [file1_L1.ubam, file1_L2.ubam, fileX_L1.ubam, fileX_L2.ubam]

[{
"samplename": file1,
"files": [file1_L1.ubam, file1_L2.ubam]
},

{
"samplename": fileX,
"files": [fileX_L1.ubam, fileX_L2.ubam]
}]

so I can merge the files per lane into 1 file to feed into the mapping/variant calling workflow.

Any tips or tricks? I've more or less solved it with a custom Python script, but that doesn't scatter to one merge task per sample, which would be the ideal case.
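For illustration, the grouping described above could be sketched in Python roughly as follows. This is not the script mentioned in the post; the function name `group_by_sample` and the assumption that every filename ends in `_L<lane>.ubam` are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: filenames look like "<sample>_L<lane>.ubam"; adjust the pattern otherwise.
LANE_SUFFIX = re.compile(r"_L\d+\.ubam$")


def group_by_sample(paths):
    """Group paths by sample name, keeping samples in first-seen order."""
    grouped = defaultdict(list)
    for path in paths:
        filename = path.rsplit("/", 1)[-1]          # drop any directory or bucket prefix
        sample = LANE_SUFFIX.sub("", filename)      # strip the lane suffix
        grouped[sample].append(path)
    return [{"samplename": s, "files": f} for s, f in grouped.items()]


flattened_array = ["file1_L1.ubam", "file1_L2.ubam", "fileX_L1.ubam", "fileX_L2.ubam"]
grouped = group_by_sample(flattened_array)
```

On its own this only produces the grouped structure; it does not address how to scatter one merge task per sample inside the workflow.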

This is vaguely related to https://support.terra.bio/hc/en-us/community/posts/360060567131-Map-output-array-to-samples-in-set-?input_string=Convert%20Array%5BFile%5D%20to%20Object%5B I suppose.

Thanks

Matthias

Comments


Jason Cerrato
- October 23, 2020 16:48
Hi Matthias,

Thanks for writing in. We'll take a closer look at this request and get back to you as soon as we can!

Kind regards,

Jason

Tiffany Miller
- October 23, 2020 18:12
Hi Matthias,

There are two workflows for fastqtoUbam in this featured workspace that I think demonstrate a way of doing what you want in Terra.

The Method repo has examples of other people's workflows for bcl conversion, but mostly to fastq, if that is helpful?

Thanks!

Matthias De Smet
- October 24, 2020 11:23
Hi Tiffany,

Thanks for the examples, but those don't really answer my question.

I'm starting from an array of files that contains all uBams for all lanes of a flowcell, for all samples.

I'm trying to figure out how to construct an object, or an array of mapped pairs, that contains all files per sample plus a sample name derived from the filenames.

I have a script that does just that, but I'm struggling to get to a valid output definition that I can then use later in the workflow.

Hope this clarifies my question.

Thanks again

Matthias


Beri

Edited October 27, 2020 19:37

I don't think there is an easy way in WDL to reorganize the flattened array into an Object[String, Array[File]]. If there is, the best place to look for it is the WDL specification.

I'm not certain this would work, but you could have a task that groups the file paths belonging to each sample and writes them into one text file per sample (also known as a file of file names, or fofn). The task output would look something like:

file1.txt
file2.txt
file3.txt

Each of those files contains the paths to the related ubam files, so for file1.txt:

gs://sample1_L1.ubam
gs://sample1_L2.ubam
gs://sample1_L3.ubam

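Then have the task that merges the ubams take the contents of one fileX.txt as input, so the files listed in it get downloaded:

version 1.0

workflow merge_ubams_per_sample {
    input {
        Array[File] array_of_files
    }

    call create_fofn {
        input:
            array_of_files = array_of_files
    }

    # Calls merge_ubam once for each fofn in the array of fofns
    scatter (fofn in create_fofn.array_fofn) {
        call merge_ubam {
            input:
                array_of_files = read_lines(fofn)
        }
    }
}

################################

task merge_ubam {
    input {
        Array[File] array_of_files
    }
    command {...}
}

task create_fofn {
    input {
        # Assumption: if the fofns should hold gs:// paths rather than localized
        # paths, declaring this input as Array[String] may be needed instead.
        Array[File] array_of_files
    }
    # Python script in the command organises the ubam paths into one fofn per sample
    command {...}
    output {
        # Collects file1.txt, file2.txt, ... written by the command
        Array[File] array_fofn = glob("*.txt")
    }
}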
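As a rough sketch of what the Python script inside the create_fofn command might do, the snippet below writes one text file per sample with one ubam path per line. The function name `write_fofns`, the `_L<lane>.ubam` naming pattern, and the `gs://bucket/...` paths are assumptions made for illustration, not part of the suggestion above.

```python
import os
import re
import tempfile
from collections import defaultdict

# Assumption: filenames look like "<sample>_L<lane>.ubam".
LANE_SUFFIX = re.compile(r"_L\d+\.ubam$")


def write_fofns(paths, out_dir="."):
    """Write <sample>.txt per sample (one path per line) and return the fofn paths."""
    by_sample = defaultdict(list)
    for path in paths:
        sample = LANE_SUFFIX.sub("", path.rsplit("/", 1)[-1])
        by_sample[sample].append(path)

    fofns = []
    for sample, files in by_sample.items():
        fofn = os.path.join(out_dir, sample + ".txt")
        with open(fofn, "w") as handle:
            handle.write("\n".join(files) + "\n")
        fofns.append(fofn)
    return fofns


# Demo with hypothetical paths, written to a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_fofns = write_fofns(
    [
        "gs://bucket/sample1_L1.ubam",
        "gs://bucket/sample1_L2.ubam",
        "gs://bucket/sample2_L1.ubam",
    ],
    demo_dir,
)
```

Inside a task, writing the files to the working directory would let an output expression such as glob("*.txt") pick them up, and read_lines() on each fofn would then yield the per-sample list of paths.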

Matthias De Smet
- October 27, 2020 20:49
That’s actually a great idea. I’ll have to try it. I will get back to you tomorrow. Thanks for the tip!
Matthias

Matthias De Smet
- October 28, 2020 15:05
To get back to this, awesome idea, worked like a charm!


Giulio Genovese

December 23, 2020 16:40

The appropriate way to group or aggregate elements of an array is the collect_by_key() function, although this is only available in the WDL development version. As an example, the following WDL:

version development

workflow main {
  input {
    Array[String] flattened_array = ["file1_L1.ubam", "file1_L2.ubam", "file2_L1.ubam", "file2_L2.ubam", "fileX_L1.ubam", "fileX_L2.ubam"]
  }

  scatter (file in flattened_array) {
    String sample_array = basename(basename(file, "_L1.ubam"), "_L2.ubam")
  }

  output {
    Array[Pair[String, Array[String]]] grouped_array = as_pairs(collect_by_key(zip(sample_array, flattened_array)))
  }
}

will output the following array without invoking a task:

"main.grouped_array": [{
  "right": ["file1_L1.ubam", "file1_L2.ubam"],
  "left": "file1"
}, {
  "right": ["fileX_L1.ubam", "fileX_L2.ubam"],
  "left": "fileX"
}, {
  "right": ["file2_L1.ubam", "file2_L2.ubam"],
  "left": "file2"
}]

It can easily be modified to group files in whatever way is needed.
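To make the three steps in that output expression easier to follow, here is a Python sketch with stand-ins for zip(), collect_by_key() and as_pairs(). These helpers (`wdl_basename`, `wdl_zip`, `collect_by_key`, `as_pairs`) are illustrative approximations written for this example, not the WDL engine's implementation, and the order of the resulting groups may differ from what a given engine returns.

```python
import os


def wdl_basename(path, suffix=""):
    """Approximate WDL basename(): drop directories, then the suffix if present."""
    name = os.path.basename(path)
    if suffix and name.endswith(suffix):
        name = name[: -len(suffix)]
    return name


def wdl_zip(left, right):
    """Pair up two equally long arrays element by element."""
    if len(left) != len(right):
        raise ValueError("arrays must have the same length")
    return list(zip(left, right))


def collect_by_key(pairs):
    """Gather the values of all pairs that share the same key."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return grouped


def as_pairs(mapping):
    """Turn a map into an array of left/right pairs."""
    return [{"left": key, "right": value} for key, value in mapping.items()]


flattened_array = ["file1_L1.ubam", "file1_L2.ubam", "file2_L1.ubam",
                   "file2_L2.ubam", "fileX_L1.ubam", "fileX_L2.ubam"]
# Same nested suffix stripping as in the scatter block.
sample_array = [wdl_basename(wdl_basename(f, "_L1.ubam"), "_L2.ubam")
                for f in flattened_array]
grouped_array = as_pairs(collect_by_key(wdl_zip(sample_array, flattened_array)))
```

Each element of `grouped_array` pairs a sample name ("left") with the list of files for that sample ("right"), which is the shape a scatter over samples would consume.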
