How to include arrays in a data table

Allie Cliffe

Explore options for creating arrays of files or strings in a table cell. 

When to use arrays as attributes in table cells

You may want to use an array in a data table cell if you have multiple data files of the same type that belong to one attribute of a single entity. Arrays are especially useful if your workflow takes an array as input.

That's a mouthful! What does it mean?

For a genomic example, suppose you have genotyping files in VCF format for a collection of samples. Each sample might have a total of twenty-two VCF files, one for each chromosome. Your workflow may take in all twenty-two files (an array) as input and generate a single output file per sample.

Example: Sample with multiple files (an array) as input to a workflow

diagram of popup to add arrays as inputs in Terra data page with list type attributes

You don’t want a separate column in the sample data table for each input file: it's time-consuming to set up and manage, and requires launching workflows repeatedly to run separately on each file. You could use an entity_set table to feed the proper input to your workflow, but that can get complicated, especially since TSVs for sets of sets are not supported in Terra.

Read on for three ways to set up a column of your sample table to include an array, making it easy to launch the workflow on the array.

Option 1: Create an array in Terra (small numbers of files)

If you only have a small number of input files for a small number of entities (e.g., samples), you can create an array in the sample table right in Terra using the "list" attribute type. 

1. Upload your entity table, or copy it from another workspace.

2. Add an attribute column with a single file (as a placeholder).

3. Click on the pencil icon to edit the cell that will contain the array of input files.

4. Select type string and check Value is a list.
Screenshot of the edit attribute popup, adding a list of input files

5. Add all files in the array, one at a time, using Add item.

6. When done, select Save changes.

Option 2: Upload a TSV file with arrays

Follow the directions here to generate a sample.tsv file. Then follow the array-formatting requirements below to add an array in the spreadsheet cell. 

Array formatting requirements

The array in your spreadsheet must have the format:

["gs://file-directory/file1-name","gs://file-directory/file2-name","etc."]
  • Array values must be between [].
  • Each file URL must be in double quotes.
  • File URLs must be separated by a comma.
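
If you have many samples, typing these cells by hand is error-prone. The formatting rules above can be sketched in Python as follows. This is a minimal illustration, not an official Terra tool: the sample ID, the `vcf_files` column name, and the file URLs are hypothetical, and the `entity:sample_id` header is assumed to match the one in your sample.tsv.

```python
import json

def array_cell(urls):
    """Format URLs as one array cell: square brackets, double quotes, commas."""
    # Compact separators drop the space json.dumps puts after each comma by default.
    return json.dumps(list(urls), separators=(",", ":"))

# Hypothetical sample ID and file URLs, for illustration only.
sample_id = "sample1"
vcf_urls = [f"gs://file-directory/sample1.chr{n}.vcf.gz" for n in range(1, 23)]

# Fields are joined with tabs by hand: the csv module's default quoting would
# wrap the cell in extra quotes because it contains double-quote characters.
header = "\t".join(["entity:sample_id", "vcf_files"])  # column name is an assumption
row = "\t".join([sample_id, array_cell(vcf_urls)])
tsv_text = header + "\n" + row + "\n"
```

Saving `tsv_text` to a file such as sample.tsv gives one row per sample, with all twenty-two URLs in a single cell.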

Upload the TSV file by clicking the Import data button at the top left of the Data page and choosing Upload TSV from the menu. 

Option 3 (advanced): Import an array with a WDL

To run on each file without manual intervention in the example above, you want 1) a WDL that takes an array of VCF files as input and 2) a way to get that array of files into your data table.

To get the array into your data table, you can write WDL code that converts a file of file paths or strings into an array. This requires an input file that lists the file paths or strings, one per line. A task in your WDL can read the lines of the file and output them to your data model as an array. Then, you can use the method configuration to assign the output to a workspace attribute (“workspace.X”) or to an attribute of the participant, sample, pair, or set that you are running on (“this.X”).

In the example above, the input would be a file that lists the VCF file paths, one per line, in “gs://” format.
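
A file of file paths like the one described above could be generated with a short script. The following is a minimal Python sketch under stated assumptions: the helper name `write_file_of_files`, the output file name, and the VCF paths are all hypothetical. It writes to a temporary directory only so that the example has no side effects.

```python
import tempfile
from pathlib import Path

def write_file_of_files(paths, dest):
    """Write one path per line to dest, rejecting anything not in gs:// format."""
    bad = [p for p in paths if not p.startswith("gs://")]
    if bad:
        raise ValueError(f"Not gs:// paths: {bad}")
    dest = Path(dest)
    dest.write_text("\n".join(paths) + "\n")
    return dest

# Hypothetical file names: one VCF per chromosome for a single sample.
vcf_paths = [f"gs://file-directory/sample1.chr{n}.vcf.gz" for n in range(1, 23)]

with tempfile.TemporaryDirectory() as tmp:
    fof = write_file_of_files(vcf_paths, Path(tmp) / "sample1_vcfs.txt")
    # Reading the file back line by line mirrors what read_lines() does in the WDL.
    lines = fof.read_text().splitlines()
```

In practice you would write the file to a permanent location and typically copy it to a Google bucket your workspace can read, so that the workflow can take it as its file input.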

Example 1: Manipulating the array with a WDL

The command section of the code below is left blank so that you can manipulate the array if you desire. This WDL will copy your files to the virtual machine the task spins up, which makes sense if you are manipulating the array of files further. The 50 GB disk size accounts for copying those files to the virtual machine. You will want to change it for your use case.

Example 1’s WDL and configuration (JSON) are published in the Methods Repository.

workflow fof_usage_wf {
  # Text file listing one file path per line
  File file_of_files

  call fof_usage_task {
    input:
      fof = file_of_files
  }

  output {
    Array[File] array_output = fof_usage_task.array_of_files
  }
}

task fof_usage_task {
  File fof
  # Read each line of the input file as one element of the array
  Array[File] my_files = read_lines(fof)

  command {
    # do stuff with arrays below
    # ....
  }

  runtime {
    docker: "ubuntu:16.04"
    disks: "local-disk 50 HDD"
    memory: "2 GB"
  }

  output {
    Array[File] array_of_files = my_files
  }
}

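Example 2: Without manipulating the array

If you don't need to manipulate the files, the workflow can read the list of paths directly and output it as an array. Because this version has no task, it does not copy the files to a virtual machine.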

workflow fof_usage_wf {
  File file_of_files
  Array[File] array_of_files = read_lines(file_of_files)

  output {
    Array[File] array_output = array_of_files
  }
}
