How to use files from Terra library data in a workflow?
Hello,
I can't figure out how to use Terra library data in my workflows.
I have a workflow that takes, among other inputs, a VCF file. For example, a VCF file from the 1000 Genomes project would work well. Without Terra, I would just upload a VCF file to my Google bucket and refer to it by URL in the workflow inputs.
It appears Terra has 1000 Genomes data, but I can't see which files are actually available or how to use them. All I see is charts showing cohort statistics. I even imported the data into one of my workspaces, and now I have a table of samples there, but I can't see any related files in that workspace's Google bucket.
How do I use data files from the Terra library data in my workflows?
Thanks!
Best, Oliver
Comments
Hi Oliver,
When you import data from 1000 Genomes, for example, no physical copies of those files are made, which is why you do not see them in your Google bucket. What is copied instead is metadata pointing to the location where the actual files live.
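As a rough illustration of "metadata pointing to a location", here is a minimal Python sketch. It assumes the imported table cell holds a gs:// URI; the row, the column name `vcf`, and both bucket names are hypothetical and not taken from any real dataset. It only shows that such a value names a bucket other than your workspace bucket.

```python
from urllib.parse import urlsplit


def split_gs_uri(uri):
    """Split a gs:// URI into (bucket, object_name)."""
    parts = urlsplit(uri)
    if parts.scheme != "gs" or not parts.netloc:
        raise ValueError(f"not a gs:// URI: {uri!r}")
    return parts.netloc, parts.path.lstrip("/")


# Hypothetical table row: the cell stores a pointer (a string), not the file itself.
row = {
    "sample_id": "sample_1",
    "vcf": "gs://some-public-bucket/path/to/sample_1.vcf.gz",
}

# Hypothetical name for the bucket attached to your own workspace.
workspace_bucket = "my-workspace-bucket"

bucket, object_name = split_gs_uri(row["vcf"])

# The file lives in the source bucket, so it never shows up in the workspace bucket.
lives_in_workspace_bucket = bucket == workspace_bucket
```

Because the value is only a reference, a workflow can read the file from wherever it is hosted, provided you have access to that location.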
To use the files you would use the Attributes column of your Workflow. For example, my Workflow requires a sample input denoted in the "Process multiple workflows from: Sample" section:
If I continue to scroll down, I will see that there are some Attributes that I need to fill out. This column is where you tell your Workflow which Data table to look at and which column to read from (for each row of that table):
In the above screenshot, you will see that there are 2 variables that this Workflow looks for (amongst others that I did not screenshot). One is input_bam and the other is input_bam_index. Which files should be assigned to each of these variables? That is what the Attributes column does.
this.analysis_ready_bam points to the analysis_ready_bam column of the Sample table and is evaluated for each row (that is, each Sample) in that table.
this.analysis_ready_bam_index points to the analysis_ready_bam_index column of the Sample table and is likewise evaluated for each row, or Sample.
"this." points to the Sample table because that is what was chosen in the "Process multiple workflows from" drop-down menu. Had that listed "Participant", this.analysis_ready_bam would point to the Participant table. This may fail if the Participant table does not contain that column.
Below is a screenshot of the Sample table in this Workspace:
You can see that the columns are named analysis_ready_bam and analysis_ready_bam_index.
So depending on what the column names are in your Sample data table, you would list that in your Workflow Attributes.
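To make the "this." lookup concrete, here is a minimal Python sketch of the idea. It is not Terra's implementation: the tables are plain lists of dictionaries, the gs:// paths and IDs are made up, and `resolve_inputs` is a hypothetical helper. It only mimics how each Attribute expression is read against one row of the selected table.

```python
# Hypothetical stand-in for the Sample table (one dictionary per row).
sample_table = [
    {
        "sample_id": "sample_1",
        "analysis_ready_bam": "gs://example-bucket/sample_1.bam",
        "analysis_ready_bam_index": "gs://example-bucket/sample_1.bai",
    },
    {
        "sample_id": "sample_2",
        "analysis_ready_bam": "gs://example-bucket/sample_2.bam",
        "analysis_ready_bam_index": "gs://example-bucket/sample_2.bai",
    },
]

# Hypothetical Participant table that lacks the BAM columns.
participant_table = [{"participant_id": "participant_1"}]

# Workflow variable -> Attribute expression, as entered in the Attributes column.
workflow_attributes = {
    "input_bam": "this.analysis_ready_bam",
    "input_bam_index": "this.analysis_ready_bam_index",
}


def resolve_inputs(row, attributes):
    """Resolve every 'this.<column>' expression against a single table row."""
    resolved = {}
    for input_name, expression in attributes.items():
        prefix, _, column = expression.partition(".")
        if prefix != "this" or not column:
            raise ValueError(f"unsupported expression: {expression!r}")
        if column not in row:
            raise KeyError(f"column {column!r} not found in the selected table")
        resolved[input_name] = row[column]
    return resolved


# One set of workflow inputs per row of the selected table.
inputs_per_sample = [resolve_inputs(row, workflow_attributes) for row in sample_table]
```

Calling `resolve_inputs(participant_table[0], workflow_attributes)` raises a `KeyError` in this sketch, which mirrors the failure described above when the selected table does not contain the referenced column.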
Please note that this is a very general overview. If you would like to learn more about how to set up your Workflow and your Data Model, there is detailed documentation here: https://support.terra.bio/hc/en-us/categories/360001399872-Doing-Research-on-Terra
Sushma
Hello,
Thank you for the response! It makes sense that files are not copied to my workspace, since a reference to another workspace, if that one is public, should be sufficient.
However, my problem is that I cannot find the data files anywhere and don't know how to refer to them.
For example, the data referred to here:
https://app.terra.bio/#library/datasets/public/1000%20Genomes/data-explorer
Where are the files, and how can I refer to them?
Thanks!
Best, Oliver
Hi Oliver,
I see now that when you export data from 1000 Genomes, it adds a BigQuery query to your Data Model. Using this data is slightly different. There is a video of a talk that walks through the steps on how to do this. It is based on a Workspace that we used as an example in the workshop, which you are welcome to clone so you can follow along with the video: https://app.terra.bio/#workspaces/help-gatk/GATKTutorials-BigQuery-July2019
I hope this helps,
Sushma
Basically, there are no files from the 1000 Genomes project available in Terra's public library?
Oliver,
Yes, that is correct - Terra does not physically store any files from the 1000 Genomes project. Terra's Data Explorer allows users to explore the data that is hosted by 1000 Genomes and then export the metadata that points to the location of the physical files.