Efficient way to 'Select samples to process' in Terra?

Post author
Sehyun Oh

Hi! I really like that Terra offers a function to select specific samples to process; if this existed in FireCloud, I couldn't find it. I'm wondering whether there is an efficient way to select a subset of samples instead of clicking every single box. For example, I have 1200 samples from a specific disease cohort and want to run multiple workflows on only 500 of those samples. What would be the best way to do this? Thanks!

Comments

7 comments

  • Comment author
    Sushma Chaluvadi
    • Edited

    Hi Sehyun!

    There definitely is a way to set this up! While there isn't a specific button to do this, you can add a table to your Data tab that indicates the 500 samples as 1 set. This way, when you go to your Tool and select data, you will be able to select the specific set of 500 with 1 check box.

    Here is an example of how you can create a sample_set table:

    This is an example screenshot of a .tsv that uploads 2 different sets of data to the Data Model. One set is Test-cohort-A100 and the second set is Test-cohort-B500. You can see that the set name is indicated in the first column and the sample ID is listed in the second column. The second column's sample IDs should be the IDs associated with your individual samples, taken from your sample table.

    When this is uploaded to the Data Model it will show a "sample_set" table with 2 rows. The rows will be labeled with the set name and the second column will contain a link that shows how many samples are present in the set as seen below:

    Note: the above table is an example independent of what you would see if you were to use the exact .tsv above. 

    Once the sample-set table is uploaded, you can "Select Data" from the Tool page, and choose either Test-cohort-A100 or Test-cohort-B500, for example, based on which set you want to run!
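
    For a case like the original question (500 of 1200 samples), the membership file can be generated rather than typed by hand. The following is a minimal Python sketch, not an official Terra tool: the function name `membership_tsv` and the placeholder sample IDs are made up for illustration, and the header `membership:sample_set_id` / `sample` is the one from the file that uploads successfully further down in this thread.

```python
import csv
import io


def membership_tsv(sets):
    """Return membership .tsv text for a mapping of set name -> list of sample IDs.

    Each output row assigns exactly one sample to one set.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    # Header as used in the successfully uploaded file in this thread.
    writer.writerow(["membership:sample_set_id", "sample"])
    for set_name, sample_ids in sets.items():
        for sample_id in sample_ids:
            writer.writerow([set_name, sample_id])
    return buf.getvalue()


# Hypothetical IDs; in practice these must match the IDs in your sample table.
all_samples = [f"sample-{i:04d}" for i in range(1, 1201)]
subset = all_samples[:500]  # any rule for picking the 500 samples works here

tsv_text = membership_tsv({"Test-cohort-B500": subset})
with open("sample_set_membership.tsv", "w", newline="") as handle:
    handle.write(tsv_text)
```

    The resulting file has one header row plus one row per selected sample, and can then be uploaded to the Data tab like any other .tsv.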

    Hope this helps but feel free to reply back with further questions that you may have.

    0
  • Comment author
    Sehyun Oh

    This seems pretty neat, but I can't make it work for now. This is the input .tsv file I tried:

    entity.sample_id	participant	sample_type	tcga_sample_id	WXS_bai_path	WXS_bam_analysis_id	WXS_bam_path
    OV-04-1332-NB OV-04-1332 NB TCGA-04-1332-10 gs://5aa919de-0aa0-43ec-9ec3-288481102b6d/tcga/OV/WGA_RepliG/WXS/BI/ILLUMINA/C239.TCGA-04-1332-10A-01W.2.bam.bai 8dd89d83-b823-439d-a6dd-9e5386bf8887 gs://5aa919de-0aa0-43ec-9ec3-288481102b6d/tcga/OV/WGA_RepliG/WXS/BI/ILLUMINA/C239.TCGA-04-1332-10A-01W.2.bam
    OV-04-1332-TP OV-04-1332 TP TCGA-04-1332-01 gs://5aa919de-0aa0-43ec-9ec3-288481102b6d/tcga/OV/WGA_RepliG/WXS/BI/ILLUMINA/C239.TCGA-04-1332-01A-01W.2.bam.bai 46e76235-82b2-4608-857e-1dc901f69c42 gs://5aa919de-0aa0-43ec-9ec3-288481102b6d/tcga/OV/WGA_RepliG/WXS/BI/ILLUMINA/C239.TCGA-04-1332-01A-01W.2.bam
    OV-04-1335-NT OV-04-1335 NT TCGA-04-1335-11 gs://5aa919de-0aa0-43ec-9ec3-288481102b6d/tcga/OV/WGA_RepliG/WXS/BI/ILLUMINA/TCGA_MC3.TCGA-04-1335-11A-01W-0489-09.bam.bai b1e0f9f2-000f-48ac-8217-cb561791c13d gs://5aa919de-0aa0-43ec-9ec3-288481102b6d/tcga/OV/WGA_RepliG/WXS/BI/ILLUMINA/TCGA_MC3.TCGA-04-1335-11A-01W-0489-09.bam
    OV-04-1335-TP OV-04-1335 TP TCGA-04-1335-01 gs://5aa919de-0aa0-43ec-9ec3-288481102b6d/tcga/OV/WGA_RepliG/WXS/BI/ILLUMINA/TCGA_MC3.TCGA-04-1335-01A-01W-0488-09.bam.bai c8023d9d-c5f2-46d3-82c2-47b334d95c99 gs://5aa919de-0aa0-43ec-9ec3-288481102b6d/tcga/OV/WGA_RepliG/WXS/BI/ILLUMINA/TCGA_MC3.TCGA-04-1335-01A-01W-0488-09.bam
    OV-04-1336-NT OV-04-1336 NT TCGA-04-1336-11 gs://5aa919de-0aa0-43ec-9ec3-288481102b6d/tcga/OV/WGA_RepliG/WXS/BI/ILLUMINA/C239.TCGA-04-1336-11A-01W.3.bam.bai 07e956a1-1bbf-471b-97cb-71e773ce2f6b gs://5aa919de-0aa0-43ec-9ec3-288481102b6d/tcga/OV/WGA_RepliG/WXS/BI/ILLUMINA/C239.TCGA-04-1336-11A-01W.3.bam
    OV-04-1336-TP OV-04-1336 TP TCGA-04-1336-01 gs://5aa919de-0aa0-43ec-9ec3-288481102b6d/tcga/OV/WGA_RepliG/WXS/BI/ILLUMINA/C239.TCGA-04-1336-01A-01W.2.bam.bai fda47272-4eca-48f6-8b4b-13bff8601969 gs://5aa919de-0aa0-43ec-9ec3-288481102b6d/tcga/OV/WGA_RepliG/WXS/BI/ILLUMINA/C239.TCGA-04-1336-01A-01W.2.bam

    I'm still getting this error message:

    File does not start with entity or membership definition.

    Do you have any idea how to fix this? Thanks!

    0
  • Comment author
    Adelaide Rhodes

    Hi there Sehyun Oh.

    I think I see the error.  It is in the "entity.sample_id" heading.  I believe if you change it to "entity:sample_id" it will work.

    Try that and let us know!
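
    A quick local check of the header before uploading can catch this kind of typo. The sketch below only mirrors the rules described in this thread (first column starts with "entity:" or "membership:" and ends with "_id"); it is an assumption about what to check, not Terra's actual validation code, and `header_problem` is a hypothetical helper name.

```python
def header_problem(first_line):
    """Return a description of the header problem, or None if it looks fine."""
    first_col = first_line.rstrip("\n").split("\t")[0]
    prefix, sep, name = first_col.partition(":")
    if sep == "" or prefix not in ("entity", "membership"):
        return "first column must start with 'entity:' or 'membership:'"
    if not name.endswith("_id") or name == "_id":
        return "first column must end with '_id' after a table name"
    return None


# The header from the failing file uses a period instead of a colon.
print(header_problem("entity.sample_id\tparticipant\tsample_type"))
# The corrected header passes the check.
print(header_problem("entity:sample_id\tparticipant\tsample_type"))
```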

    0
  • Comment author
    Sushma Chaluvadi

    Yes!

    Assuming that the above .tsv is your sample table .tsv, I have created a sample_set .tsv example you can use to create a set (or sets). Here is how I would organize the .tsv so that it uploads successfully:

     

    The first column of the top row has to be in the format "entity:your_entity_name_id", where you fill in your_entity_name. The exact phrases "entity:" and "_id" are required, but what you choose to name your table is up to you (replace "your_entity_name").

     

    "entity" is the terminology used when you are uploading independent samples/participants etc. Because you are creating a set, you would use "membership" instead of "entity". Therefore the first column in your header must be formatted as follows:

    membership:sample_set_id

    When you upload the .tsv to your Data Model, the name of the table will be "sample_set". If you wish to name the table something other than "sample_set" you are free to do so but we recommend "sample_set" so you know what the table's contents include.

    The above .tsv screenshot would result in 1 set named Set1 containing 6 samples (as listed in the second column). The first column should be the name of the set that you want to organize the samples into, and the second column should be the sample IDs, which I took from the .tsv excerpt you provided above.
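
    Since the sample IDs already live in the sample table, the membership file can also be derived from it. This is a rough Python sketch under a few assumptions: the sample table is tab-separated with the ID in its first column, the `sample_type` column is the one from the table posted above, and the function name `sample_table_to_membership` and the set name "tumors_only" are hypothetical.

```python
import csv
import io


def sample_table_to_membership(sample_tsv, set_name, keep=lambda row: True):
    """Build membership .tsv text from sample-table .tsv text.

    `keep` decides, row by row, whether a sample goes into the set.
    """
    reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
    id_column = reader.fieldnames[0]  # expected to be "entity:sample_id"
    lines = ["membership:sample_set_id\tsample"]
    for row in reader:
        if keep(row):
            lines.append(f"{set_name}\t{row[id_column]}")
    return "\n".join(lines) + "\n"


# A shortened version of the sample table from this thread (3 columns only).
sample_table = (
    "entity:sample_id\tparticipant\tsample_type\n"
    "OV-04-1332-NB\tOV-04-1332\tNB\n"
    "OV-04-1332-TP\tOV-04-1332\tTP\n"
    "OV-04-1335-NT\tOV-04-1335\tNT\n"
    "OV-04-1335-TP\tOV-04-1335\tTP\n"
)

# Put only the TP samples into a set.
print(sample_table_to_membership(
    sample_table, "tumors_only", keep=lambda row: row["sample_type"] == "TP"
))
```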

    Here is a link to documentation that explains more details and also contains a great video: https://broadinstitute.zendesk.com/knowledge/articles/360025758392/en-us?brand_id=360000963592

    Hope this helps!

    0
  • Comment author
    Sehyun Oh

    Ok. Something is working but I'm quite confused why it's even working. :P

    So, I updated the column names as you guys explained and kept only two columns because membership accepts only two columns. Below is the actual .tsv file I could successfully upload.

    membership:sample_set_id sample
    test OV-04-1332-NB
    test OV-04-1332-TP
    test OV-04-1335-NT
    test OV-04-1335-TP
    test OV-04-1336-NT
    test OV-04-1336-TP

    And when I tried to run my tool on these, I actually needed to choose 'Process multiple workflow from: Samples' and 'Select Data > Choose an existing set' (test in this example), instead of selecting 'Process multiple workflow from: Sample Set'. I'm quite confused about how/why this is working. It would be nice if you could clarify which Data Model elements explain this procedure.

    Btw, the documentation/ tutorial video link above is not working. ;(

    0
  • Comment author
    Sushma Chaluvadi

    Hi Sehyun,

    I think a walk-through example might be the best way to explain! First, I have used the above 6 samples from your example .tsv and their details to put together an example of each .tsv and the resulting Data Model visual when you upload successfully.

    The first screenshot is of the sample.tsv file you uploaded:

    This shows the header "entity:sample_id" (the bold parts are required syntax when uploading a table with independent samples, participants etc) and contains 6 samples with the extra columns of metadata (participant, sample type, etc). If you upload this table you should see the following table with the name "sample" in the left hand bar under Tables (Note: some of the additional columns got cut off in the screenshot but exist to the right of the image):

    The following .tsv screenshot is what I would use if I wanted to split the above 6 samples into 2 groups. I have chosen to call the 2 groups "set1" and "set2":

    You can see in the .tsv that the first column names the 2 sets and the second column designates which sample belongs to which set. Each row assigns 1 sample to a set. In this case, "set1" contains the samples "1332" NB/TP and "1335" NT/TP, and "set2" contains the samples "1336" NT/TP. Note here that the header has to have "membership:sample_set_id". The "membership" indicates that you are creating a table where you are making groups or sets. The resulting table should be called "sample_set" under your list of Tables.

    If I upload this table, it should look like the following screenshot:

    You can see that it shows 2 rows, one for each set and the number of items in each set. If you click on the "4 items" link you should see the following which shows the 4 samples belonging to "set1":

    Next, if I want to run some of this data through a Tool, I would see the following options set in my Tool configuration:

    Here I have chosen to "Process multiple workflows from: Sample". I chose "Sample" versus "Sample Set" because I know that this Tool I am using runs analysis using 1 individual sample at a time. If my Tool required an input of multiple samples (for example, when you are generating PoNs, you might need to input at least 2 samples for the Tool to work), I would choose Sample Set. The root entity for my Tool is "Sample", which means that its required input is a single sample.

    So, how does this tie into the set1 and set2 I just made? If I press Select Data I will see the following pop-up:

    I have chosen "Choose an existing set", at which point my sample_set table will pop up, and then I chose "set2" because, while I am testing, I would like to run my Tool on just the 2 samples instead of the larger set1, which contains 4 samples. This choice means that I will run my Tool 2 times, once for each sample in set2, with each run independent of the other.
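
    The difference between the two root entities can be pictured with a small Python model. This is purely a conceptual sketch of the behaviour described in this walk-through, not how Terra is implemented; the names `group_sets` and `planned_runs` are invented for illustration.

```python
from collections import defaultdict

# Rows of the membership .tsv from this walk-through: (set name, sample ID).
MEMBERSHIP_ROWS = [
    ("set1", "OV-04-1332-NB"),
    ("set1", "OV-04-1332-TP"),
    ("set1", "OV-04-1335-NT"),
    ("set1", "OV-04-1335-TP"),
    ("set2", "OV-04-1336-NT"),
    ("set2", "OV-04-1336-TP"),
]


def group_sets(rows):
    """Collapse membership rows into one entry per set, as in the sample_set table."""
    sets = defaultdict(list)
    for set_name, sample_id in rows:
        sets[set_name].append(sample_id)
    return dict(sets)


def planned_runs(sets, chosen_set, root_entity):
    """List the samples handed to each workflow run (illustrative model only)."""
    members = sets[chosen_set]
    if root_entity == "sample":
        return [[sample_id] for sample_id in members]  # one run per sample
    if root_entity == "sample_set":
        return [list(members)]  # a single run that receives the whole set
    raise ValueError(f"unknown root entity: {root_entity}")


sets = group_sets(MEMBERSHIP_ROWS)
for name, members in sets.items():
    print(name, f"{len(members)} items")

# Root entity "sample" with set2 selected: two independent runs.
print(planned_runs(sets, "set2", "sample"))
# Root entity "sample_set" with set1 selected: one run with four samples.
print(planned_runs(sets, "set1", "sample_set"))
```

    In this model, choosing an existing set with the root entity "Sample" simply uses the set as a shortcut for ticking each of its samples individually.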

    I hope this is more clear! Please let me know if you have any questions or clarifications :)

    Additionally, can you test this video link to see if you are able to access it now?

    1
  • Comment author
    Sehyun Oh

    Thanks Sushma!

    This walk-through example is great. Also, the new link to the video is working. :)

     

    0
