Map Array[File] to Object[samplename, Array[File]] based on filename
Hi,
We're creating a workflow to demultiplex Illumina BCLs to uBam using Picard. The downside here is that Picard creates files per lane.
I'm trying to merge the per-sample, per-lane uBams into one uBam per sample, based on the sample name in the filenames.
Currently I have a flattened array of all uBam files for all samples and lanes (subject to change if needed), but I need to do something like this:
flattened_array = [file1_L1.ubam, file1_L2.ubam, fileX_L1.ubam, fileX_L2.ubam]
to
[{
  "samplename": file1,
  "files": [file1_L1.ubam, file1_L2.ubam]
},
{
  "samplename": fileX,
  "files": [fileX_L1.ubam, fileX_L2.ubam]
}]
so I can merge the files per lane into 1 file to feed into the mapping/variant calling workflow.
Any tips or tricks? I've kind of fixed it with a custom python script, but that doesn't scatter to one merge task per sample, which would be the ideal case.
This is vaguely related to https://support.terra.bio/hc/en-us/community/posts/360060567131-Map-output-array-to-samples-in-set-?input_string=Convert%20Array%5BFile%5D%20to%20Object%5B I suppose.
Thanks
Matthias
-
Hi Matthias,
There are two workflows for fastqtoUbam in this featured workspace that I think demonstrate a way of doing what you want in Terra.
The Method repo has examples of other people's workflows for bcl conversion, but mostly to fastq, if that is helpful?
Thanks!
-
Hi Tiffany,
Thanks for the examples, but those don't really answer my question.
I'm starting from an array of files that contains all uBams for all lanes of a flowcell, for all samples.
I'm trying to figure out how to construct an object, or an array of mapped pairs, that contains all files per sample plus a sample name derived from the filenames.
I have a script that does just that, but I'm struggling to get to a valid output definition, which I can then use later in the workflow.
Hope this clarifies my question.
Thanks again
Matthias
-
I don't think there is an easy way in WDL to reorganize the flattened array into an Object[String, Array[File]]. If there is, the best place to find it is the WDL specification.
I'm not certain this would work, but you could have a task that writes the file paths belonging to each sample into a text file, also known as a file of file names (fofn). The task would produce something like:
file1.txt
file2.txt
file3.txt
Each of those files contains the paths to the related uBam files, so for file1.txt:
gs://sample1_L1.ubam
gs://sample1_L2.ubam
gs://sample1_L3.ubam
Then have the task that merges the uBams accept one of these txt files and read its contents.
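As a rough illustration of the grouping step, here is a minimal Python sketch of what the script inside such a task might do. It assumes file names of the form <sample>_L<lane>.ubam, as in the examples in this thread; the function name write_fofns and the <sample>.txt naming are hypothetical, not something from Picard or Terra.

```python
import os
import re
from collections import defaultdict

# Assumption: file names look like <sample>_L<lane>.ubam.
LANE_SUFFIX = re.compile(r"_L\d+\.ubam$")


def write_fofns(ubam_paths, out_dir="."):
    """Group uBam paths by sample and write one <sample>.txt fofn per sample."""
    by_sample = defaultdict(list)
    for path in ubam_paths:
        # Drop the directory part, then the lane suffix, to get the sample name.
        sample = LANE_SUFFIX.sub("", os.path.basename(path))
        by_sample[sample].append(path)

    fofns = []
    for sample, paths in sorted(by_sample.items()):
        fofn = os.path.join(out_dir, sample + ".txt")
        with open(fofn, "w") as handle:
            # One path per line, in the order the paths were given.
            handle.write("\n".join(paths) + "\n")
        fofns.append(fofn)
    return fofns
```

The workflow outline that follows would run a script like this in its create_fofn task and hand each resulting txt file to merge_ubam.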
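version 1.0

workflow merge_per_sample {
  input {
    Array[File] array_of_files
  }

  call create_fofn {
    input:
      array_of_files = array_of_files
  }

  # Calls merge_ubam once for each fofn in the array of fofns
  scatter (fofn in create_fofn.fofns) {
    call merge_ubam {
      input:
        array_of_files = read_lines(fofn)
    }
  }
}

################################
task merge_ubam {
  input {
    Array[File] array_of_files
  }
  command {...}
}

task create_fofn {
  input {
    # If the fofns should hold gs:// paths rather than localized paths,
    # this input may need to be declared as Array[String] instead.
    Array[File] array_of_files
  }
  # Python script in command that organises the uBams into their respective fofns
  command {...}
  output {
    Array[File] fofns = glob("*.txt")
  }
}
-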
The appropriate way to group/aggregate elements of an array is to use the collect_by_key() function, although this is only available in the WDL development version. As an example, the following WDL:
version development
workflow main {
  input {
    Array[String] flattened_array = ["file1_L1.ubam", "file1_L2.ubam", "file2_L1.ubam", "file2_L2.ubam", "fileX_L1.ubam", "fileX_L2.ubam"]
  }
  scatter (file in flattened_array) {
    String sample_array = basename(basename(file, "_L1.ubam"), "_L2.ubam")
  }
  output {
    Array[Pair[String, Array[String]]] grouped_array = as_pairs(collect_by_key(zip(sample_array, flattened_array)))
  }
}
will output the following array without invoking a task:
"main.grouped_array": [{
  "right": ["file1_L1.ubam", "file1_L2.ubam"],
  "left": "file1"
}, {
  "right": ["fileX_L1.ubam", "fileX_L2.ubam"],
  "left": "fileX"
}, {
  "right": ["file2_L1.ubam", "file2_L2.ubam"],
  "left": "file2"
}]
And it can easily be modified to group the files as needed.
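For readers less familiar with these functions, the zip / collect_by_key / as_pairs chain can be mimicked in plain Python. This is only an analogy to show the data flow, not how any WDL engine implements it; the helper names collect_by_key and sample_of are written here for illustration, and the order of the groups may differ from what a WDL engine returns.

```python
def collect_by_key(pairs):
    """Group the values of (key, value) pairs under each distinct key."""
    grouped = {}  # plain dicts keep insertion order in Python 3.7+
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return grouped


def sample_of(name, suffixes=("_L1.ubam", "_L2.ubam")):
    """Strip a known lane suffix, much like the nested basename() calls do."""
    for suffix in suffixes:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


flattened_array = [
    "file1_L1.ubam", "file1_L2.ubam",
    "file2_L1.ubam", "file2_L2.ubam",
    "fileX_L1.ubam", "fileX_L2.ubam",
]
sample_array = [sample_of(f) for f in flattened_array]

# zip() pairs each sample with its file; items() plays the role of as_pairs().
grouped_array = list(collect_by_key(zip(sample_array, flattened_array)).items())
```

As in the WDL version, only the listed lane suffixes are stripped, so more lanes mean more suffixes (or a pattern-based key).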