Data Model: How to process readgroup-level files

Learn how to run workflows on data with readgroups.

This procedure is no longer necessary. The instructions in this article were needed to run a workflow on readgroup-level data in an earlier version of Terra. You can now upload readgroup data directly to a Terra data table, so this workaround is obsolete. For examples of readgroup data tables, see the "read_group" and "read_group_set" tables in the Whole-Genome Analysis Pipeline featured workspace. For an example of a workflow that uses readgroup-level data, see the "1-WholeGenomeGermlineSingleSample" workflow.

1. Set up the data model

In your sample table, declare each sample as the actual sample you expect to have once its readgroup data is merged (one row per merged sample, not one row per readgroup).

2. Attach readgroups FoFN to each sample

For each sample, provide a "file of file names" (commonly called a FoFN) containing a list of paths to the readgroup files (typically FASTQs or uBAMs) in your bucket. We typically generate the FoFN with the gsutil command-line utility, which runs outside of FireCloud.

For example, if your readgroup file paths all contain the sample name in their filename, you can run a command to get a list of all file paths containing a particular sample name within a shared folder:
gsutil ls gs://bucket/path/to/readgroup_bams_folder/*sampleName* > sampleName.RG_bams.list
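If you have many samples, the same command can be wrapped in a loop. The following is a minimal sketch, not an official utility: the function names make_fofn and make_all_fofns are hypothetical, and it assumes, as above, that every readgroup file name contains its sample name and that gsutil is installed and authenticated.

```shell
#!/usr/bin/env bash
# Sketch: write one FoFN per sample from a bucket folder listing.
# Assumption: every readgroup file name contains its sample name.

# make_fofn SAMPLE FOLDER
# Lists the objects in FOLDER whose names contain SAMPLE and writes
# their paths, one per line, to SAMPLE.RG_bams.list in the current
# directory. The wildcard is quoted so that gsutil, not the local
# shell, expands it.
make_fofn() {
    local sample="$1"
    local folder="$2"
    gsutil ls "${folder}/*${sample}*" > "${sample}.RG_bams.list"
}

# make_all_fofns FOLDER SAMPLE...
# Calls make_fofn once for each sample name given after FOLDER.
make_all_fofns() {
    local folder="$1"
    shift
    local sample
    for sample in "$@"; do
        make_fofn "$sample" "$folder"
    done
}

# Example call (hypothetical sample names):
# make_all_fofns gs://bucket/path/to/readgroup_bams_folder sampleA sampleB
```

Each call leaves one .RG_bams.list file per sample in the current directory. Check that no list is empty before using it, since a sample name that matches nothing produces an empty file.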

3. Rewire your method

Adapt the WDL you want to run to take a FoFN as input, and then use the read_lines() function to convert the contents of the FoFN into an array of readgroup file paths. In FireCloud, edit your Method Configuration to use sample as the root entity.

The additions to your WDL would look something like this:

File file_of_filenames
Array[File] flowcell_unmapped_bams = read_lines(file_of_filenames)

4. Process your read groups

Within that WDL, you can run a scatter across the readgroup files.

The scatter block would look something like this:

# run on the readgroup files in parallel
scatter (unmapped_bam in flowcell_unmapped_bams) {
    # stick your per-readgroup calls here
    call something_that_maps_bams {
        input:
            input_bam = unmapped_bam
    }
}

5. Aggregate per sample

Optionally, you can run something that merges readgroup files per sample. The output of the scattered tasks will be arrays of whatever the task produces, so you can easily feed that to a merge operation that takes an array.

So your call would look something like this:

# output from something_that_maps_bams is automatically gathered into an array
# when the call's output is referenced from outside of the scatter block
Array[File] mapped_bams = something_that_maps_bams.output_mapped_bam

# merge processed readgroup files
call something_that_merges_bams {
    input:
        input_bams = mapped_bams
}

6. Wire up the output(s)

In FireCloud, link the final output of the WDL to a sample attribute in the method configuration, e.g., this.output. If the output of your WDL is a sample-level aggregate (e.g., a per-sample BAM), you're all set to proceed. If the output is not yet aggregated at the sample level (e.g., it is an intermediate per-readgroup file), you can glob it and run the next step on the glob. We can advise you on the particulars if needed; please ask questions in the comment thread.

