Ah, the eternal question of how to deal with readgroups in a world where the data model does not acknowledge their existence. In a not-so-distant future where rainbows and unicorns roam the clouds, we'll have a more flexible data model that lets you explicitly put readgroups one level under the samples entity table.
In the meantime, here's how we recommend dealing with readgroup-level files:
1. Set up the data model
In your samples table, declare individual samples as the actual samples you expect to have once your readgroup data has been merged.
2. Attach a readgroup FoFN to each sample
For each sample, provide a "file of file names" (which we commonly call a FoFN) containing a list of paths to the readgroup files (typically FastQs or uBAMs) in your bucket. We typically use the gsutil command line utility (admittedly outside of FireCloud) to generate the FoFN of readgroup files.
For example, if your readgroup file paths all contain the sample name in their filename, then you could run a command to get a list of all file paths containing a particular sample name within a shared folder:
gsutil ls gs://bucket/path/to/readgroup_bams_folder/*sampleName* > sampleName.RG_bams.list
3. Rewire your method
Adapt the WDL you want to run to take in a FoFN input, and then use the read_lines() function to convert the contents of the FoFN into an array of readgroup file paths. In FireCloud, edit your Method Configuration to run on sample as the root entity.
The additions to your WDL would look something like this:
File file_of_filenames
Array[File] flowcell_unmapped_bams = read_lines(file_of_filenames)
4. Process your read groups
Within that WDL, you can then run a scatter across the readgroup files.
The scatter block would look something like this:
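As a minimal sketch only, here is how the FoFN input and the scatter could fit together in one draft-2 style WDL. The workflow name ProcessReadGroups, the task ProcessOneReadGroup, its echo command, the output names and the docker image are all hypothetical placeholders; only file_of_filenames, flowcell_unmapped_bams and read_lines() come from the steps above. Swap in whatever per-readgroup task your pipeline actually runs.

```wdl
# Sketch under stated assumptions: all workflow/task names are placeholders.
workflow ProcessReadGroups {
  # The FoFN attached to the sample (step 2)
  File file_of_filenames
  # One array element per line of the FoFN (step 3)
  Array[File] flowcell_unmapped_bams = read_lines(file_of_filenames)

  # One shard per readgroup file (step 4)
  scatter (unmapped_bam in flowcell_unmapped_bams) {
    call ProcessOneReadGroup {
      input:
        unmapped_bam = unmapped_bam
    }
  }

  output {
    # Outputs of a scattered call are gathered into an array
    Array[File] processed_files = ProcessOneReadGroup.processed
  }
}

# Hypothetical stand-in for the real per-readgroup processing task
task ProcessOneReadGroup {
  File unmapped_bam

  command {
    echo "${unmapped_bam}" > processed.txt
  }
  runtime {
    # Placeholder image; use the one your tools require
    docker: "ubuntu:latest"
  }
  output {
    File processed = "processed.txt"
  }
}
```

Because the call sits inside the scatter, anything downstream sees its output as an array with one element per readgroup, which is what you would hand to a merge step to get back to one file per sample.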