Map Array[File] to Object[samplename, Array[File]] based on filename
We're creating a workflow to demultiplex Illumina BCLs to uBam using Picard. The downside here is that Picard creates files per lane.
I'm trying to merge the per-sample, per-lane uBams into one uBam per sample, based on the sample name in the filenames.
Currently I have a flattened array of all uBam files for all samples and lanes (subject to change if needed), but I'd need to do something like this.
flattened_array = [file1_L1.ubam, file2_L2.ubam, fileX_L1.ubam, fileX_L2.ubam]
"files": [file1_L1.ubam, file2_L2.ubam, fileX_L1.ubam, fileX_L2.ubam]
"files": [fileX_L1.ubam, fileX_L2.ubam]
so I can merge the files per lane into 1 file to feed into the mapping/variant calling workflow.
Any tips or tricks? I've kind of fixed it with a custom Python script, but that doesn't scatter to one merge task per sample, which would be the ideal case.
This is vaguely related to https://support.terra.bio/hc/en-us/community/posts/360060567131-Map-output-array-to-samples-in-set-?input_string=Convert%20Array%5BFile%5D%20to%20Object%5B I suppose.
I don't think there would be an easy way in WDL to reorganize the flattened array into an Object[String, Array[File]]. If there is, the best place to find it is in the WDL specification.
I'm not certain this would work, but maybe have a task that writes the file paths related to each sample into a text file per sample (also known as a file of file names, or fofn). This would look something like:
Each of those files contains the paths to the related ubam files, so for file1.txt.
Then have the task that merges the ubams accept a txt file and download the contents of fileX.txt.
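As a rough illustration of the fofn idea above, here is a minimal Python sketch of what such a grouping step could do. It is not the script used in this thread. The naming scheme `<sample>_L<lane>.ubam` is an assumption taken from the example file names in the question, and `group_by_sample` and `write_fofns` are hypothetical helper names.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumption: files are named <sample>_L<lane>.ubam, e.g. fileX_L2.ubam.
LANE_SUFFIX = re.compile(r"_L\d+\.ubam$")


def group_by_sample(paths):
    """Return {sample_name: [paths]}, keeping input order within each sample."""
    groups = defaultdict(list)
    for path in paths:
        name = Path(path).name
        if not LANE_SUFFIX.search(name):
            raise ValueError(f"unexpected file name: {name}")
        # Strip the lane suffix to recover the sample name.
        sample = LANE_SUFFIX.sub("", name)
        groups[sample].append(path)
    return dict(groups)


def write_fofns(groups, out_dir):
    """Write one <sample>.txt per sample, with one uBam path per line."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for sample, files in groups.items():
        fofn = out_dir / f"{sample}.txt"
        fofn.write_text("\n".join(files) + "\n")
        written.append(fofn)
    return written
```

If the task exposes the resulting `*.txt` files as an Array[File] output, the workflow could scatter one merge task per fofn and read the paths back with read_lines(). How the listed uBams get localized inside the merge task depends on the backend, so treat that part as untested.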
To get back to this, awesome idea, worked like a charm!
Thanks for writing in. We'll take a closer look at this request and get back to you as soon as we can!
There are two workflows for fastqtoUbam in this featured workspace that I think demonstrate a way of doing what you want in Terra.
The Method repo has examples of other people's workflows for bcl conversion, but mostly to fastq, if that is helpful?
Thanks for the examples, but those don't really fix my question.
I'm starting from an array of files which contains all uBams for all lanes of a flowcell for all samples.
I'm trying to figure out how to construct an object or an array of mapped pairs that contains all files per sample plus a sample name derived from the filenames of said files.
I have a script that does just that, but I'm struggling to get to a valid output definition, which I can then use later in the workflow.
Hope this clarifies my question.
That’s actually a great idea. I’ll have to try it. I will get back to you tomorrow. Thanks for the tip!
The appropriate way to group/aggregate elements of an array is to use the collect_by_key() function, although this is only available in the WDL development version. As an example, the following WDL:
Will output the following array without invoking a task:
And it can be easily modified to group files the way needed.
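For readers unfamiliar with collect_by_key(), its grouping behaviour can be mimicked in a few lines of Python. This is only an illustration of the semantics, not WDL and not the example from this reply: it takes (key, value) pairs and groups the values by key, in order of first appearance. The `sample_name` helper and the `_L<lane>` naming convention are assumptions based on the file names in the question.

```python
def collect_by_key(pairs):
    """Group the right-hand values of (key, value) pairs by their key."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return grouped


def sample_name(filename):
    """Hypothetical helper: drop the trailing _L<lane>.ubam part of the name."""
    return filename.rsplit("_L", 1)[0]


files = ["file1_L1.ubam", "file1_L2.ubam", "fileX_L1.ubam", "fileX_L2.ubam"]
pairs = [(sample_name(f), f) for f in files]
by_sample = collect_by_key(pairs)
```

In WDL the pairs would have to be built inside the workflow first, for instance by deriving the sample name from each file name with basename() and sub(). The resulting map can then be scattered over to run one merge per sample.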