Your genomics data are safe and sound in a Google bucket in the cloud: perhaps in a public-access database, or a shared bucket, or in your workspace Google bucket. How do you analyze your data in your workspace?
Data tables (in the workspace Data tab) are one way to link data to your workspace. You populate the data table with metadata (type and location of data) using a tab-delimited load file. The metadata can be used by the workflows to access input data.
- Understanding where your data are... and are not
- Why use workspace data tables?
- Creating a basic load file of metadata
- Upload to your data table
- Next steps
1. Understanding where your data are... and are not
The diagram below shows how data can exist in the cloud - in the Workspace storage bucket, a data library bucket or other storage - but be separate from your workspace. The data table contains metadata to connect the files in the cloud to the rest of your workspace.
You will know your data are NOT linked to the workspace data table if the Tables column is empty:
2. Why use workspace data tables?
You can manually enter the complete path to each data file as a workflow input, but that approach does not scale beyond a handful of files. Keeping all metadata in a workspace data table can save time and headaches in the long term. Data tables enable automation of back-to-back pipelines that are configured to read from and write to the table. Tables can also be useful for organizing data as studies get more complex.
The load file contains all you need to keep track of data, including intermediate outputs: what types (or "entities") of data you are working with, where the data are, and how the entities relate to each other. This can be useful if you have many samples from one participant, and perhaps many participants in a study, for example.
3. Creating a basic load file ("sample" file) of metadata
A load file has two parts: variable names (column headers) and rows of metadata. Each row corresponds to a different sample (or lane or other entity).
Note: Your data table can contain as much information as you need, but at minimum it needs only two columns: an ID column and a data column (containing links to the input data files). You can include additional columns (to reference phenotype information, for example), and the data table will keep it all organized in one place.
Your basic load file will look like this:
See the step-by-step instructions below to create a load file from scratch.
Helpful hints for writing and reading tsv files
You may find it easiest to use a spreadsheet editor, which keeps the values in columns instead of jamming them together on one line.
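If you would rather script the file than use a spreadsheet, the same two-column layout can be written with Python's standard `csv` module. This is only a minimal sketch: the helper `write_load_file`, the file name `sample_load_file.tsv`, and the sample ID `sample_01` are illustrative assumptions, and the bucket path is a placeholder rather than a real location.

```python
import csv


def write_load_file(path, headers, rows):
    """Write column headers and metadata rows as tab-delimited text."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(headers)   # first line: column headers
        writer.writerows(rows)     # one line per sample (or other entity)


# Hypothetical values for illustration only.
headers = ["entity:sample_id", "cram"]
rows = [["sample_01", "gs://odijfo84jenf7o/filename.cram"]]

write_load_file("sample_load_file.tsv", headers, rows)
```

Opening the resulting file in a spreadsheet editor should show one header row and one row of metadata, each split across two columns.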
3.1 Open a blank spreadsheet in your favorite app
The load file has two parts: column headers, which tell you what variable is in the column, and rows of metadata. Each row is one sample or entity. You'll be filling out headers for two columns, and one row of metadata.
3.2 Fill in the first column header (the ID column)
- In Terra, this column header must have the format "entity:your_entity_name_id", where you fill in your_entity_name. The exact phrases "entity:" and "_id" are required, with no spaces in the header.
- For this example, you should use "entity:sample_id" for the column header. We'll discuss how and when and why you might change this in a later tutorial.
Note that except for the first column, the order of the columns is unimportant.
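The ID header format is easy to mistype (a stray space is enough to break it), so a quick check before uploading can help. The sketch below tests a header against the "entity:" prefix and "_id" suffix described above. The function name `is_valid_id_header` is hypothetical, and the set of characters allowed in the entity name (letters, digits, underscores, hyphens) is an assumption, not a statement of Terra's exact rules.

```python
import re

# Assumption: entity names contain only letters, digits, underscores, or hyphens.
ID_HEADER = re.compile(r"entity:[A-Za-z0-9_-]+_id")


def is_valid_id_header(header):
    """Return True if the header looks like 'entity:<your_entity_name>_id'."""
    return ID_HEADER.fullmatch(header) is not None


print(is_valid_id_header("entity:sample_id"))    # well-formed header
print(is_valid_id_header("entity: sample _id"))  # stray spaces
print(is_valid_id_header("sample_id"))           # missing "entity:" prefix
```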
3.3 Fill in the sample ID
This can be a participant number or a meaningful phrase (as in the screenshot above).
3.4 Fill in the second column (input) header
You will use the workflow variable name for the header. For example, if you are converting a CRAM file to a BAM file, the input data would be a CRAM file, and the column header could be just "cram":
3.5 Fill in the full path name for the input file
For data in a Google bucket, this will have the format `gs://odijfo84jenf7o/filename.cram`
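A path that is missing the `gs://` prefix or the file name is a common slip when filling in this column. As a rough sketch, the shape of a path can be checked with Python's standard `urllib.parse` module. The function name `looks_like_gs_path` is hypothetical, and the check only inspects the text of the path; it does not confirm that the bucket or file exists.

```python
from urllib.parse import urlsplit


def looks_like_gs_path(path):
    """Return True if path has the shape gs://<bucket>/<object name>."""
    parts = urlsplit(path)
    has_bucket = bool(parts.netloc)      # text between "gs://" and the next "/"
    has_object = len(parts.path) > 1     # more than just the leading "/"
    return parts.scheme == "gs" and has_bucket and has_object


print(looks_like_gs_path("gs://odijfo84jenf7o/filename.cram"))  # full path
print(looks_like_gs_path("filename.cram"))                      # no bucket
```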
Side note on Column C (and D, and...) - Other important or useful metadata
You can include other useful information, such as phenotype data or links to other genomic data, to keep it organized. The column header describes what goes in the column. In the screenshot below, we've included a column labeled "Participant" that links each sample to an individual in a study:
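To illustrate how an extra column relates entities to each other, the sketch below reads a small three-column load file and groups sample IDs by participant. Everything in it is hypothetical: the column name `participant`, the sample and participant IDs, the file names, and the helper `samples_by_participant` are made up for the example, and the bucket name is a placeholder.

```python
import csv
import io
from collections import defaultdict

# Hypothetical load file contents: two samples from one participant, one from another.
tsv_text = (
    "entity:sample_id\tcram\tparticipant\n"
    "sample_01\tgs://odijfo84jenf7o/filename_1.cram\tparticipant_A\n"
    "sample_02\tgs://odijfo84jenf7o/filename_2.cram\tparticipant_A\n"
    "sample_03\tgs://odijfo84jenf7o/filename_3.cram\tparticipant_B\n"
)


def samples_by_participant(text):
    """Map each participant to the list of sample IDs in the load file."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        groups[row["participant"]].append(row["entity:sample_id"])
    return dict(groups)


grouped = samples_by_participant(tsv_text)
print(grouped)
```

Because columns are looked up by header name rather than by position, the extra column can sit anywhere after the ID column.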
4. Uploading to your data table
After saving your file as "tab delimited text", it's time to upload it to your workspace and link that data! Within the Data tab, click on the small “+” (to the right of the “Tables” header in the left column) and follow the prompts.
5. Next steps (to solidify this knowledge)
Watch a quick video tutorial of this process here
Practice generating and uploading a load file and running a WDL in this practice workspace (Exercise #1)
Note: To use the practice workspace, you will need to clone it to your own billing project.