Explore options for creating arrays of files or strings in a table cell.
When to use arrays as attributes in table cells
You may want to store an array in a data table cell if multiple data files of the same type belong to one attribute of a single entity. Arrays are especially useful if your workflow takes an array as input.
That's a mouthful! What does it mean?
For a genomic example, suppose you have genotyping files in VCF format for a collection of samples. Each sample might have a total of twenty-two VCF files, one for each chromosome. Your workflow may take all twenty-two files (an array) as input and generate a single output file per sample.
Example: Sample with multiple files (an array) as input to a workflow
You don’t want a separate column in the sample data table for each input file: that is time-consuming to set up and manage, and it requires launching the workflow separately for each file. You could use an entity_set table to feed the proper input to your workflow, but that can get complicated, especially since TSVs for sets of sets are not supported in Terra.
Read on for three ways to set up a column of your sample table to include an array, making it easy to launch the workflow on the array.
Option 1: Create an array in Terra (small numbers of files)
If you only have a small number of input files for a small number of entities (e.g., samples), you can create an array in the sample table right in Terra using the "list" attribute type.
1. Upload your entity table, or copy it from another workspace.
2. Add an attribute column with a single file (as a placeholder).
3. Click on the pencil icon to edit the cell that will contain the array of input files.
4. Select type string and check Value is a list.
5. Add all files in the array, one at a time, using Add item.
6. When done, select Save changes.
Option 2: Upload a TSV file with arrays
Follow the directions here to generate a sample.tsv file. Then follow the array-formatting requirements below to add an array in the spreadsheet cell.
Array formatting requirements
The array in your spreadsheet must have the format:
["gs://file-directory/file1-name","gs://file-directory/file2-name","etc."]
- Array values must be enclosed in square brackets ([]).
- Each file URL must be in double quotes.
- File URLs must be separated by a comma.
Upload the TSV file by clicking the Import data button at the top left of the Data page and choosing Upload TSV from the menu.
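If you would rather generate the TSV with a script than type the array by hand, the formatting requirements above can be sketched in Python roughly as follows. This is a minimal sketch, not an official Terra tool: the column name vcf_files, the sample ID, and the bucket paths are hypothetical, and the entity:sample_id header assumes you are building a sample table.

```python
import json


def format_array_cell(urls):
    """Format a list of file URLs as one array cell.

    The result is wrapped in [], each URL is in double quotes, and URLs
    are separated by commas with no extra spaces.
    """
    return json.dumps(list(urls), separators=(",", ":"))


def build_sample_tsv(rows, array_column="vcf_files"):
    """Build TSV text with one row per sample and one array column.

    rows maps a sample ID to its list of file URLs.
    array_column is a hypothetical attribute name; use your own.
    """
    # Assumption: the first column header names the entity type (sample).
    lines = ["\t".join(["entity:sample_id", array_column])]
    for sample_id, urls in rows.items():
        lines.append("\t".join([sample_id, format_array_cell(urls)]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical bucket and file names, for illustration only.
    example = {
        "sample1": [
            "gs://my-bucket/sample1.chr1.vcf.gz",
            "gs://my-bucket/sample1.chr2.vcf.gz",
        ],
    }
    with open("sample.tsv", "w") as handle:
        handle.write(build_sample_tsv(example))
```

The rows are joined by hand with tab characters instead of going through a CSV writer, so the double quotes inside the array cell are written exactly as shown in the required format.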
Option 3 (advanced): Import an array with a WDL
To run on each file without manual intervention in the example above, you need 1) a WDL that takes an array of VCF files as input and 2) a way to supply that array of files.
To get the array into your data table, you can write WDL code that reads a file of file paths or strings and outputs them as an array. This requires a file with a list of file paths or strings as the input. A task in your WDL can read the lines of that file and output them to your data model as an array. Then, you can use the method configuration to assign the output to a workspace attribute (“workspace.X”) or to an attribute of the participant, sample, pair, or set that you are running on (“this.X”).
In the example above, the input would be a file that lists the VCF file paths, one per line, in “gs://” format.
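One way to produce such an input file is with a short script. The Python sketch below writes one “gs://” path per line for chromosomes 1 through 22. The bucket prefix and the file-naming pattern (sample ID, then .chrN.vcf.gz) are assumptions made for illustration; replace them with the actual locations of your files.

```python
def file_of_files_text(bucket_prefix, sample_id, chromosomes=range(1, 23)):
    """Return the text of a file that lists one gs:// path per line.

    The naming pattern <prefix>/<sample>.chr<N>.vcf.gz is hypothetical.
    """
    paths = [
        f"{bucket_prefix}/{sample_id}.chr{chrom}.vcf.gz"
        for chrom in chromosomes
    ]
    return "\n".join(paths)


if __name__ == "__main__":
    # Hypothetical bucket and sample ID, for illustration only.
    text = file_of_files_text("gs://my-bucket/vcfs", "sample1")
    with open("sample1_vcf_paths.txt", "w") as handle:
        handle.write(text)
```

The resulting text file would then be copied to a bucket your workspace can read, so that it can be supplied as the file input to the WDL.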
Example 1: Manipulating the array with a WDL
The command section of the code below is left blank so that you can manipulate the array as needed. This WDL copies your files to the virtual machine that the task spins up, which makes sense if you are manipulating the array of files further. The 50 GB disk size accounts for copying those files to the virtual machine; change it to fit your use case.
Example 1’s WDL and configuration (JSON) are published in the Methods Repository.
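workflow fof_usage_wf {
  # Input: a text file listing one file path per line
  File file_of_files

  call fof_usage_task {
    input:
      fof = file_of_files
  }

  output {
    # The array that can be assigned to this.X or workspace.X
    Array[File] array_output = fof_usage_task.array_of_files
  }
}

task fof_usage_task {
  File fof
  # Read each line of the input file as one element of the array
  Array[File] my_files = read_lines(fof)

  command {
    # do stuff with arrays below
    # ....
  }

  runtime {
    docker: "ubuntu:16.04"
    # 50 GB disk to hold the copied files; adjust for your data
    disks: "local-disk 50 HDD"
    memory: "2 GB"
  }

  output {
    Array[File] array_of_files = my_files
  }
}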
Example 2: Without manipulating the array
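workflow fof_usage_wf {
  # Input: a text file listing one file path per line
  File file_of_files
  # Read the lines directly in the workflow; no task is called
  Array[File] array_of_files = read_lines(file_of_files)

  output {
    Array[File] array_output = array_of_files
  }
}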