Map Array[File] to Object[samplename, Array[File]] based on filename
We're creating a workflow to demultiplex Illumina BCL files to uBam using Picard. The downside here is that Picard creates one file per lane.
I'm trying to merge the per-sample, per-lane uBams into one uBam per sample, based on the sample name in the filenames.
Currently I have a flattened array of all uBam files for all samples and lanes (subject to change if needed), but I'd need to do something like this:
flattened_array = [file1_L1.ubam, file2_L2.ubam, fileX_L1.ubam, fileX_L2.ubam]
"files": [file1_L1.ubam, file2_L2.ubam, fileX_L1.ubam, fileX_L2.ubam]
"files": [fileX_L1.ubam, fileX_L2.ubam]
so I can merge the per-lane files into one file per sample to feed into the mapping/variant calling workflow.
Any tips or tricks? I've kind of fixed it with a custom Python script, but that doesn't scatter to one merge task per sample, which would be the ideal case.
This is vaguely related to I suppose.
Hi Matthias,
Thanks for writing in. We'll take a closer look at this request and get back to you as soon as we can!
Kind regards,
Hi Matthias,
There are two workflows for fastqtoUbam in this featured workspace that I think demonstrate a way of doing what you want in Terra.
The Method repo also has examples of other people's workflows for BCL conversion, though mostly to FASTQ, if that is helpful.
Hi Tiffany,
Thanks for the examples, but those don't really answer my question.
I'm starting from an array of files that contains all uBams for all lanes of a flowcell, for all samples.
I'm trying to figure out how to construct an object or an array of mapped pairs that contains all files per sample, plus a sample name derived from the filenames.
I have a script that does just that, but I'm struggling to get to a valid output definition, which I can then use later in the workflow.
Hope this clarifies my question.
Thanks again
I don't think there is an easy way in WDL to reorganize the flattened array into an Object[String, Array[File]]. If there is, the best place to find it is the WDL specification.
I'm not certain this would work, but you could have a task that groups the file paths belonging to each sample and writes them into one text file per sample (also known as a file of file names, or fofn). This would look something like:
Within one of those files is the path to the related uBam file, so for file1.txt.
Then have the task that merges the uBams accept a txt file and download the contents of fileX.txt.
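The grouping step of that suggestion could be sketched roughly as below. This is only an illustration, not the script used in this thread: the helper names `sample_name` and `write_fofns` are hypothetical, and it assumes every filename ends in a lane suffix such as `_L1.ubam`.

```python
import os
import re
from collections import defaultdict

# Assumption: the lane is encoded as "_L<number>" right before ".ubam".
LANE_SUFFIX = re.compile(r"_L\d+\.ubam$")


def sample_name(path):
    """Strip the directory and the lane suffix to get the sample name.

    A filename that does not match the assumed suffix is returned unchanged.
    """
    return LANE_SUFFIX.sub("", os.path.basename(path))


def write_fofns(paths, out_dir="."):
    """Write one <sample>.txt per sample, listing that sample's uBam paths.

    Returns the fofn paths, sorted by sample name.
    """
    groups = defaultdict(list)
    for path in paths:
        groups[sample_name(path)].append(path)

    fofns = []
    for sample, files in sorted(groups.items()):
        fofn = os.path.join(out_dir, sample + ".txt")
        with open(fofn, "w") as handle:
            # One path per line, so a later step can read it line by line.
            handle.write("\n".join(files) + "\n")
        fofns.append(fofn)
    return fofns
```

In a WDL task, the resulting text files could presumably be exposed as an Array[File] output (for example with glob("*.txt")) and scattered over, with the merge task reading each file via read_lines(). The listed paths are plain strings at that point, so the merge task has to fetch the files itself, as described above.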
That’s actually a great idea. I’ll have to try it. I will get back to you tomorrow. Thanks for the tip!
To get back to this, awesome idea, worked like a charm!
The appropriate way to group/aggregate elements of an array is to use the collect_by_key() function, although this is only available in the WDL development version. As an example, the following WDL:
Will output the following array without invoking a task:
And it can be easily modified to group files the way needed.
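For readers who have not used it: per the WDL specification, collect_by_key() takes an Array[Pair[K, V]] and returns a Map[K, Array[V]]. A rough Python analogue of that behaviour, shown only to illustrate the semantics (it is not WDL, and the sample and file names are made up), might look like this:

```python
def collect_by_key(pairs):
    """Group (key, value) pairs into a dict of key -> list of values.

    Values keep the order in which they appear in the input.
    """
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return grouped


# Hypothetical (sample name, file) pairs, e.g. derived from the filenames.
pairs = [
    ("file1", "file1_L1.ubam"),
    ("fileX", "fileX_L1.ubam"),
    ("file1", "file1_L2.ubam"),
    ("fileX", "fileX_L2.ubam"),
]
by_sample = collect_by_key(pairs)
```

Here `by_sample["file1"]` holds the two lane files for file1, which is the per-sample grouping asked for in the original question.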