Your genomics data are safe and sound in a Google bucket in the cloud, perhaps in a public-access database, a shared bucket, or the Google bucket associated with your workspace. However, the tools in your workspace can't actually access those data until you link them. Using a data table (in your workspace Data tab) is one way to accomplish this. You can populate the data tables in your Data tab with metadata using a tab-delimited load file. The metadata directs your workflows to the type and location of your data. This article explains how to link data in a Google bucket to a data table in your workspace.
- Understanding where your data are... and are not
- Why use workspace data tables?
- Creating a basic load file of metadata
- Dissecting load files... one column at a time
- How to upload load files to your workspace
- Next steps
1. Understanding where your data are... and are not
The diagram below shows how the data live in the workspace storage bucket, a data library bucket, or other cloud storage, but are not yet connected to your workspace. The data model (or data table) contains the metadata that connects the files in the cloud to the rest of your workspace.
You will know your data are NOT connected in this way if the data table in your workspace is empty (try clicking "Tables" at the left in the Data tab). The steps below outline how to create and upload a load file of metadata to populate the Data tab:
2. Why use workspace data tables?
Metadata can be hardcoded into your workflow, but keeping all metadata in a workspace data table really helps in the long term. For example, writing outputs to the data table is useful for automation when the output of one workflow becomes the input for the next. Tables also help as studies get more complicated: they keep track of what types (or "entities") of data you are working with and where the data are, and they organize how the entities relate to each other (useful if you have many samples from one participant, and perhaps many participants in a study).
This article will show you how to populate your workspace Data model by generating and uploading a load file (in tab-separated value format) with metadata on all of the input data you will use in your study.
3. Creating a basic load file of metadata
Say you want to run a GATK Best Practices WDL. You have a workspace with the WDL and the right kind of data in a Google bucket. If your workflow looks for inputs in a table in your Data tab, you will need to generate an appropriate load file to populate that table. We'll call the table "sample".
Note: Your data table can grow as big as you need, but it can start out quite simple. Only two columns are absolutely required: an ID column and a data column (containing links to the data). You can include additional columns of data you may want to reference, and the data table will keep it all organized in one place.
Your basic upload file will look like this:
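For instance, a minimal two-column load file might look like the sketch below. The columns are separated by a single tab character; the sample name and bucket path here are hypothetical placeholders, not values from a real workspace.

```
entity:sample_id	cram
sample_01	gs://my-bucket/sample_01.cram
```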
See the step-by-step instructions below to create an upload file from scratch.
Helpful hints for writing and reading tsv files
Note: It's easiest to work with tab-separated (tab-delimited) files in a spreadsheet editor such as Excel. Although a load file is technically a text file, reading or editing it in a plain text editor makes the cells run together on one long line.
To generate your own sample.txt or sample.tsv file:
- Open a blank spreadsheet in your favorite app. The load file has two parts: column headers, which tell you what variable is in the column, and rows of metadata. Each row is one sample.
- The first column is the ID column. Use a header of the format "entity:your_name_id", where "your_name" can be whatever you like. If each row is a sample, it is often "sample"; if the data come from a particular lane, it could be "lane", or anything else meaningful to you.
- The second column usually holds the input data, and you can use a header that describes what the data are. For example, if you are converting a CRAM file to a BAM file, the input data would be a CRAM file, and the column header could simply be "cram" (see the screenshot below).
- Note that except for the first column, the order of the columns is unimportant.
- When you're done editing, save the file as a "tab delimited" file.
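If you prefer to script this instead of using a spreadsheet, the steps above can be sketched in Python. The sample names and bucket paths below are hypothetical; substitute your own IDs and gs:// links.

```python
import csv

# Hypothetical samples and bucket paths -- replace with your own.
rows = [
    {"entity:sample_id": "sample_01", "cram": "gs://my-bucket/sample_01.cram"},
    {"entity:sample_id": "sample_02", "cram": "gs://my-bucket/sample_02.cram"},
]

# Write a tab-delimited load file; the first column header must
# follow the "entity:<name>_id" convention.
with open("sample.tsv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["entity:sample_id", "cram"], delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
```

The resulting sample.tsv can be uploaded as-is, exactly as if you had exported it from a spreadsheet editor.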
4. Dissecting load files... one column at a time
Let's go through a sample load file column by column:
Column A - The entity ID
Each cell in this column is a unique identifier for the entity - a biological sample, for example. Let's say we have a CRAM file from a sample that we want to convert to a BAM file.
Header Note: The top row has to be in the format "entity:your_entity_name_id", where you fill in your_entity_name. The exact phrases "entity:" and "_id" are required. For this example, you should use "entity:sample_id" as the column header. We'll discuss how, when, and why you might change this in a later tutorial.
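As a quick sanity check before uploading, you can verify the header format with a few lines of Python. This is just an illustrative sketch of the "entity:" prefix and "_id" suffix rule described above, not an official Terra validator.

```python
import re

# The first column header must look like "entity:<name>_id",
# e.g. "entity:sample_id" or "entity:lane_id".
HEADER_PATTERN = re.compile(r"^entity:\w+_id$")

def is_valid_entity_header(header: str) -> bool:
    """Return True if the header follows the entity:<name>_id convention."""
    return bool(HEADER_PATTERN.match(header))

print(is_valid_entity_header("entity:sample_id"))  # True
print(is_valid_entity_header("sample_id"))         # False: missing "entity:" prefix
print(is_valid_entity_header("entity:sample"))     # False: missing "_id" suffix
```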
Column B - The input (data) for the workflow WDL
Each cell in this column holds the link (such as a gs:// path) to an input file in the cloud, in this example a CRAM file.
Column C (and D, and...) - Other important or useful metadata
In the screenshot below, we've included a label, "Participant," that links each sample to an individual in a study. You could include other useful information, such as phenotype data or links to other genomic data. The column header describes what goes in the column.
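A load file with such an extra metadata column might look like the hypothetical example below (columns are tab-separated; sample names, paths, and participant labels are placeholders).

```
entity:sample_id	cram	participant
sample_01	gs://my-bucket/sample_01.cram	participant_A
sample_02	gs://my-bucket/sample_02.cram	participant_A
```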
5. How to upload load files to your Terra workspace
After saving your sample file as "tab delimited text", it's time to upload it to your workspace and link your data! Within the Data tab, click the small “+” (to the right of the “Tables” header in the left column) and follow the prompts.
6. Next steps (to solidify this knowledge)
Watch a quick video tutorial of this process here
Practice generating and uploading a load file and running a WDL in this practice workspace (Exercise #1)
Note: To use the practice workspace, you will need to clone it under your own billing project.