Format of the custom table for the flexible data model
I want to use custom tables for data models and wonder what the format of the .tsv file should be.
For example, when I uploaded the manifest file from GDC, I got this error:
File does not start with entity or membership definition.
Also, how can I import them into my tool as inputs? Can I subset them programmatically?
Comments
Hello Sehyun -
Your .tsv file should be a tab-delimited file whose columns represent the attributes of your data. The header of the first column has to use one of two general formats, "entity:nameOfColumn_id" or "membership:nameOfColumn_id":
a. Use "entity" if you want to upload independent samples.
b. Use "membership" if you want to make subsets of your independent samples.
c. The "_id" suffix is required in the header of the first column.
In the example .tsv from the screenshot, the data is all tab delimited. The resulting data table will be named "lane" (from a first-column header such as "entity:lane_id"), and the "_id" is a format requirement. The remaining columns will be named "bam", "bai", and "gender", with the appropriate data filled in for the 3 rows (Sample1, Sample2, and Sample3).
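As a rough illustration of the layout described above, here is a minimal Python sketch that builds such a file and checks its first header. The helper names (`build_entity_tsv`, `first_header_is_valid`) and all cell values are hypothetical placeholders, and the check only mirrors the rules stated in this thread, not the platform's actual validation.

```python
import csv
import io


def build_entity_tsv(table_name, columns, rows):
    """Return tab-delimited text whose first header is 'entity:<table_name>_id'."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow([f"entity:{table_name}_id", *columns])
    for row_id, values in rows:
        writer.writerow([row_id, *values])
    return buffer.getvalue()


def first_header_is_valid(tsv_text):
    """Check the first column header against the format described in this thread."""
    lines = tsv_text.splitlines()
    if not lines:
        return False
    first_header = lines[0].split("\t")[0]
    prefix, _, name = first_header.partition(":")
    return (
        prefix in ("entity", "membership")
        and name.endswith("_id")
        and len(name) > len("_id")
    )


# Placeholder values only; they are not taken from the original example.
lane_tsv = build_entity_tsv(
    "lane",
    ["bam", "bai", "gender"],
    [
        ("Sample1", ["1", "2", "female"]),
        ("Sample2", ["3", "4", "male"]),
        ("Sample3", ["5", "6", "female"]),
    ],
)
print(lane_tsv)
print(first_header_is_valid(lane_tsv))
```

A file whose first header is something else (for example a plain `id` column) would fail this check, which is consistent with the "File does not start with entity or membership definition" error in the question.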
Once your data model has uploaded successfully, go to your Tools tab and, under "Process multiple workflows from:", select "Lane". This tells the UI to look for inputs in the Lane table. In the Attribute column, you can direct the Tool to take inputs from the Lane table by typing "this.columnName".
For example, if you want the input for Attribute to be the numbers under the "bam" column for each sample (row), you would enter this.bam. This tells the Tool to read in the file/value for each sample from the "bam" column of the Lane table. Note: The .tsv above *is* tab delimited, though it may not appear so.
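To make the "this.columnName" idea concrete, here is a small local model of the lookup in Python. It is only a sketch of the concept, not how the Tool resolves inputs internally; `resolve_attribute` is a hypothetical helper, and the gs:// paths are made-up placeholders.

```python
def resolve_attribute(expression, row):
    """Look up a 'this.<column>' expression in one table row (a dict)."""
    prefix, _, column = expression.partition(".")
    if prefix != "this" or column not in row:
        raise KeyError(f"cannot resolve {expression!r}")
    return row[column]


# Hypothetical rows of the Lane table; the bucket and file names are invented.
lane_table = [
    {"lane_id": "Sample1", "bam": "gs://example-bucket/sample1.bam"},
    {"lane_id": "Sample2", "bam": "gs://example-bucket/sample2.bam"},
    {"lane_id": "Sample3", "bam": "gs://example-bucket/sample3.bam"},
]

# One input value per row, which is what "this.bam" expresses.
inputs = [resolve_attribute("this.bam", row) for row in lane_table]
print(inputs)
```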
Hope this helps!
Thanks for the quick response!
A couple of things to clarify... those numbers in the table will most likely be replaced with the actual file paths (gs URLs) in real life, right? Also, are custom tables only for your own data? If I use TCGA data, for example, am I using the same data models as in FC?
Yes, sorry, I should post a better example! The column values can be anything you want, including gs:// links to the location of the file you wish to use as input to your Tool. Correct, custom tables are for you to upload your own data as you see fit.
For TCGA data, assuming that you are exporting from the Data Library from FireCloud, it will come in whatever data table format it was organized in within the FireCloud workspace, so most likely the FireCloud data model structure!
Thanks! It makes sense. :)
One side note: can we use files outside of a gs bucket, like AWS S3?
Currently, only gs buckets are compatible.
Can you clarify the steps for making a "set", or collection of entities, within this flexible data model framework? What is the required format (and file / column names) for the example above? How do I make a "lane_set" with sample1 and sample2 in a collection?
Thanks!
Alisa
Hi Alisa,
If I understand correctly you want a custom table (named "lane_set") that consists of sample1 and sample2.
To set this up you would do:
col#1                     col#2
membership:lane_set_id    sample
set1                      sample1
set1                      sample2
set2                      sample3
Using the above .tsv format, you will get a table named lane_set that contains 2 rows:
Row 1 will say set1 and contain sample1 and sample2.
Row 2 will say set2 and contain sample3 (just to demonstrate; a single sample is not really a set of multiple samples).
Please note that the .tsv should be tab delimited with 2 columns.
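If you want to preview locally how the membership rows above collapse into sets, a minimal Python sketch might look like this. `group_members` is a hypothetical helper written for illustration; it only mimics the grouping described here and does not call any platform API.

```python
import csv
import io
from collections import defaultdict

# The membership .tsv from the example above, written with real tab characters.
membership_tsv = (
    "membership:lane_set_id\tsample\n"
    "set1\tsample1\n"
    "set1\tsample2\n"
    "set2\tsample3\n"
)


def group_members(tsv_text):
    """Group the second column by the set id found in the first column."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    if not header[0].startswith("membership:"):
        raise ValueError("first header must start with 'membership:'")
    sets = defaultdict(list)
    for set_id, member in reader:
        sets[set_id].append(member)
    return dict(sets)


lane_sets = group_members(membership_tsv)
for set_id, members in lane_sets.items():
    print(set_id, members)
```

Each key of `lane_sets` corresponds to one row of the resulting lane_set table.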
Hope this helps!