Data Tables Quickstart Part 2 - Making a data table from scratch

Allie Hajian

Now that you've used a data table for workflow inputs and outputs, hopefully you can see how tables help you manage your project data in the cloud. You're excited to use tables, but how do you add one to your workspace? We'll explore that in this section.

Overview: How to add tables to a Terra workspace

In Part 1, you worked with an existing table. But what if your workspace doesn't include any data tables? Even if you upload your own data to the workspace bucket, you will want to add a table to help organize the data and link it to workflows for analysis. There are several ways to add a data table to your workspace. See the options below, then read on for step-by-step instructions to generate a table load file from scratch.

  • Import from another workspace

    You can copy data in a table from another workspace to your own workspace by selecting the rows of data you want, clicking on the three vertical dots at the top right, and choosing "Export to workspace".
    Data-QuickStart-Part2_Export-table-to-another-workspace.png

  • Import from Gen3, the Data Library, or other external servers

    For external data resources directly connected to Terra, you'll be able to browse, select the data subset you want, and export it to your workspace. Note that when you export data from the Data Library or an external source such as the Gen3 platform, the data will usually show up as multiple tables of predefined entities.

  • Add a table by making and uploading a "load file"

    Maybe there's no workspace with data in a table to copy, or you want to include a table for data you've just uploaded to your workspace bucket. You can create a table from scratch by generating a "load file" in a spreadsheet editor (outside of Terra) and uploading it by clicking on the blue + icon at the top of the Data page.
    Data-QuickStart-upload-tsv_Screen_Shot.png

    Learn how to create a TSV file from a template (you can find template TSV files here)

    Read on for instructions on how to create a TSV/load file to add a new workspace table.

2.1. Create a data table load file (TSV) from scratch

Workspace tables are like spreadsheets (columns and rows) built into the Data page, so it's no surprise that you can use a spreadsheet editor to create a TSV/load file to upload as a new table. Each row corresponds to a unique entity, and each column is a distinct attribute (e.g., sex, age, height, bam, fasta).

Minimum data table requirements: A workspace table must have at least two columns (an ID column and one attribute column) and two rows (the header and at least one entity).

We'll use the same workflow from Part 1, so the two columns in your most basic table will be the ID column and the input column (a FASTQ file).

Start by opening a blank file in your favorite spreadsheet editor.

Step 1: Fill in the header

Each column in a table is a different kind of data or metadata. The load file header row specifies the workspace table column headers. 

1.1. Fill in the ID (first column) 

In part 1, we used a "specimen" table. However, we aren't limited to analyzing specimens, and in Terra, tables can be called anything that makes the most sense for your project. So in this part, we will call the entity we're studying "samples" instead. Use that in the first column header. 

Terra requires a particular format for the ID column header: entity:your-entity-name_id

The prefix (entity:) and the suffix (_id) must be typed in exactly as shown. You can name the entity whatever helps you organize your data, however. For example, the first column header of a table of samples would read entity:sample_id and the first column header of a table of unicorns would read entity:unicorn_id.
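
As a minimal sketch of this naming rule (written in Python purely for illustration; the function names are hypothetical and are not part of Terra), the ID column header can be built from an entity name, and the entity name can be read back from a header:

```python
def id_column_header(entity_name: str) -> str:
    """Build the ID column header for a table of the given entity type."""
    return f"entity:{entity_name}_id"


def entity_name_from_header(header: str) -> str:
    """Recover the entity (table) name from an ID column header."""
    prefix, suffix = "entity:", "_id"
    if not (header.startswith(prefix) and header.endswith(suffix)):
        raise ValueError(f"not a valid ID column header: {header!r}")
    name = header[len(prefix):-len(suffix)]
    if not name:
        raise ValueError("the entity name is empty")
    return name


print(id_column_header("sample"))                    # entity:sample_id
print(entity_name_from_header("entity:unicorn_id"))  # unicorn
```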

1.2. Fill in the input file type (second column) 
We know that the workflow is looking for a FASTQ file, so we will use the variable fastq for the second column header. 

In your spreadsheet editor it will look (approximately) like this:  

Data-QuickStart_Part2_Spreadsheet-first-row.png

Step 2: Fill in the sample data (second row of the spreadsheet)

The rest of the table is the "data" corresponding to the headers. There is one row for each individual entity (sample, in this case) in your table. The simplest table includes one entity, but Terra tables can include an almost unlimited number of rows, each one its own entity. 

2.1. Fill in the sample ID (first column).
You can use any name you want for the sample ID. In a real analysis, this would be the unique ID of the sample. 

2.2. Fill in the full path to the input data file (second column).
This is the space where you will include the link to the input data file in the cloud. For the quickstart, you can use this downsampled FASTQ file (copy and paste the full path and file name) in a public bucket.

gs://terra-featured-workspaces/QuickStart/quickstart_reads_1.fastq

Check your data upload file format! When you're done filling in all four cells, your spreadsheet should look like this:
Data-QuickStart_Part2_Spreadsheet-complete.png
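
If you would rather check the format with a script, the requirements described so far (at least two rows, at least two columns, and a first header of the form entity:your-entity-name_id) can be sketched in Python as below. The function name is hypothetical, the file path is a placeholder, and Terra may apply additional checks on upload that are not modeled here.

```python
def check_load_file_rows(rows: list[list[str]]) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if len(rows) < 2:
        problems.append("need a header row and at least one entity row")
    if not rows:
        return problems
    header = rows[0]
    if len(header) < 2:
        problems.append("need an ID column and at least one attribute column")
    first = header[0] if header else ""
    if not (first.startswith("entity:") and first.endswith("_id")):
        problems.append("first header must look like entity:your-entity-name_id")
    return problems


table = [
    ["entity:sample_id", "fastq"],
    ["sample1", "gs://your-bucket/path/to/reads.fastq"],  # placeholder path
]
print(check_load_file_rows(table))  # []
```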

Step 3: Save file in "tab delimited text" format

Your editor may give you a warning, but we assure you, it's fine! 
Data-QuickStart_Part2_Save-as-Tab-delimited-text.png

Note that Terra will completely ignore the name you give the file. It's the root entity in the first column header (entity:your-table-name_id) that determines the table name in the workspace.
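
A spreadsheet editor is not the only way to produce a tab-delimited file. As a hedged alternative, the sketch below builds the same two-row load file with Python's standard csv module. The function name and the gs:// path are placeholders for illustration; substitute the sample ID and file path from Step 2.

```python
import csv
import io


def build_load_file(entity_name, attribute_names, rows):
    """Return load-file text: a header row followed by one row per entity."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow([f"entity:{entity_name}_id", *attribute_names])
    for entity_id, values in rows:
        writer.writerow([entity_id, *values])
    return buffer.getvalue()


sample_tsv = build_load_file(
    "sample",
    ["fastq"],
    [("sample1", ["gs://your-bucket/path/to/reads.fastq"])],  # placeholder path
)
print(sample_tsv)
```

The resulting text can be saved under any file name, since the table name comes from the header rather than from the file.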

2.2. Create a workspace-level resources table from scratch

The workspace resource data table (aka Workspace Data table) holds variables you might want to use in multiple workflow analyses, like the genomic reference sequence file or a Docker container. Using the workspace data table lets you configure the data once and point to it whenever you need it. Not only will you not need to look up the file path again, but if you update the file, you only need to update it in one place.

The workspace resource data table in the Data-Tables-QuickStart looks like this:

Data-QuickStart_Part2_Workspace-data_Screen_shot.png

The first column (circled column on the left) identifies what the file is. The other (circled on the right) includes a link to the file in a Google bucket.

To copy a workspace resources data table from another workspace, download the existing table to your local machine by clicking on the "Download TSV" link (top right). Then upload it to a different workspace by clicking on the blue "+" icon by the TABLES column.

Step-by-step instructions

To create a workspace resources table, you can create a TSV file using a spreadsheet editor, just as you did for a regular data table (above). The spreadsheet looks like this:

Data-QuickStart_Part2_Workspace-data-table-format_Screen_shot.png

Workspace resources data table formatting requirements: The first row is the "Key" row. In it you will put the name of each reference (such as ref_fasta_index for the reference FASTA index file). Note that you can only use lower-case letters, dashes, and underscores (no spaces!).

The first column header must have the format below. The workspace: prefix must be typed in exactly as shown.

workspace:file-name

The second row includes a link to each resource file in an accessible Google bucket.

You can include additional information such as workspace tags. You can see all the workspace tags in the right column of the Dashboard:
Data-QuickStart_Part2-Tags-in-dashboard_Screen_shot.png

Save your workspace resources table as a "tab delimited text" or "Tab separated values" as described above. You can use the blue "+" icon to upload to your workspace. 
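
For a table with many long bucket URLs, a script may be easier to manage than a single wide spreadsheet row. The Python sketch below writes the two-row layout described above: a key row whose first cell carries the workspace: prefix, then a row of links. The function name, the ref_fasta key, and the gs:// paths are assumptions for illustration, and the key check simply mirrors the rule stated in this article (Terra's own validation may differ).

```python
import csv
import io
import re

# Rule as stated in this article: lower-case letters, dashes, and underscores.
KEY_PATTERN = re.compile(r"[a-z_-]+")


def build_workspace_data_tsv(resources: dict[str, str]) -> str:
    """Return TSV text with a key row and a row of links, one column per resource."""
    if not resources:
        raise ValueError("at least one resource is required")
    for key in resources:
        if not KEY_PATTERN.fullmatch(key):
            raise ValueError(f"invalid key: {key!r}")
    keys = list(resources)
    key_row = [f"workspace:{keys[0]}", *keys[1:]]  # only the first cell gets the prefix
    value_row = [resources[key] for key in keys]
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(key_row)
    writer.writerow(value_row)
    return buffer.getvalue()


workspace_tsv = build_workspace_data_tsv({
    "ref_fasta": "gs://your-bucket/reference/genome.fasta",            # placeholder
    "ref_fasta_index": "gs://your-bucket/reference/genome.fasta.fai",  # placeholder
})
print(workspace_tsv)
```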

2.3. Upload your TSV file to the workspace to create a new table

Click the blue "+" icon at the top right of the TABLES column on the Data page of the workspace and follow the directions to upload both the samples and workspace resources tables.
Data-QuickStart-Part2_Upload-tsv.png

2.4. Run the workflow on data from the new table

Step 1: Select the input file

Notice there's now an additional table in your workspace (expand by clicking the name to check yours!):
Data-QuickStart_Part2_New-table.png
The four fields in the new table should look very familiar!

Step 2: Set up and run workflow

To run the workflow with the new data as input, select the data box in the new table and run 1_Single-input-workflow, just like in Part 1. You will need to set up the inputs and outputs in the workflow configuration form.

Make sure your workflow inputs match your table! The attribute in the workflow setup (configuration) form needs to match the headings in your table exactly, or the workflow will fail. Note that when a workflow fails because it cannot find the input files, it does so immediately (it's always a great thing to check, when your submission fails before it even starts!).  

  • Data-QuickStart_Part2_Configure-new-table-inputs.png

    1. The root entity type should be "sample"
    2. The attribute field that corresponds to the variable R1_fastq should be the column header you used for the fastq (input) file 
    3. The sample_id attribute should be "this.sample_id" 
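
Because the attribute names must match the table headers exactly, it can help to compare them before launching. The Python sketch below is an illustrative local check, not a Terra feature: the function name is hypothetical, and it assumes the inputs are written as this.<attribute>, with the ID column entity:sample_id referenced as this.sample_id.

```python
def unmatched_inputs(header: list[str], inputs: dict[str, str]) -> dict[str, str]:
    """Return the workflow inputs whose this.<attribute> has no matching column."""
    # The ID column header entity:sample_id is referenced as this.sample_id.
    columns = {header[0].removeprefix("entity:"), *header[1:]}
    unmatched = {}
    for variable, expression in inputs.items():
        if expression.startswith("this."):
            attribute = expression[len("this."):]
            if attribute not in columns:  # comparison is case-sensitive
                unmatched[variable] = expression
    return unmatched


header = ["entity:sample_id", "fastq"]
inputs = {"R1_fastq": "this.fastq", "sample_id": "this.sample_id"}
print(unmatched_inputs(header, inputs))                       # {}
print(unmatched_inputs(header, {"R1_fastq": "this.Fastq"}))  # {'R1_fastq': 'this.Fastq'}
```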

Then save and launch as before.

You can monitor your submission in the Job History page.

When your workflow is complete, expand the sample data table to see where the generated data are!

Thought questions about using load files

Imagine you upload a TSV file with an entity that already exists in the workspace (for example, the load file below). What do you think will happen?

Data-QuickStart_Part2_Adding-specimen-Additional-question.png

  • Answer
    Uploading a new TSV file with an entity that already exists will not generate any new tables, but will add rows to an existing table.

    For example, the file above will generate one additional row in the specimen table - corresponding to the new specimen - assuming you give it a unique ID. Note that if the file includes an ID already in the existing table, Terra will overwrite the existing row when you load the new TSV.

    Notice that Terra also adds a new column to the table if you used a different name for the FASTQ attribute:

    Data-QuickStart-Part2_Add-additional-specimen.png

A note about overwriting table rows: When your TSV file has the same entity (name) as a table already in the workspace, you may see a warning about overwriting data when you try to upload (see screenshot).

Exercise2-tsv-warning_Screen_Shot.png

Note that Terra will only overwrite data rows with the same ID. You can ignore this warning if the TSV file contains only new entities (i.e., different sample IDs); those rows will be added to the existing table.
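
To preview what an upload would change, you can compare the IDs and column names in your load file against the existing table. The Python sketch below models only the behavior described above (rows with a matching ID are overwritten, other rows are added, and unfamiliar attribute names become new columns). It runs locally on lists you supply; the function name, IDs, and column names are hypothetical.

```python
def preview_upload(existing_ids, existing_columns, load_ids, load_columns):
    """Summarize what a load file would change in an existing table."""
    existing = set(existing_ids)
    return {
        "added_rows": [i for i in load_ids if i not in existing],
        "overwritten_rows": [i for i in load_ids if i in existing],
        "added_columns": [c for c in load_columns if c not in existing_columns],
    }


summary = preview_upload(
    existing_ids=["specimen1", "specimen2"],
    existing_columns=["fastq"],
    load_ids=["specimen2", "specimen3"],
    load_columns=["fastq_file"],
)
print(summary)
```

An empty overwritten_rows list means the warning can safely be ignored.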

2.5. Additional practice with tables

Try making your own data tables! Don't worry, they're easy to delete by selecting all the rows in the table and clicking on the three vertical dots.
Data-QuickStart_Delete_tables.png

What happens to data files if you delete a table? Note that if the tables include metadata (i.e., reference the URI of data files in cloud storage), deleting a data table (whether original or copied from another workspace) will not delete the primary data files.

G0-smiley-icon.png Congratulations! You've completed Part 2 of the Data Tables Quickstart!

Comments

3 comments

  • Comment author
    Andrew Davidson
    • Edited

    I am having trouble with section 2.2. It does not look like we have a way to upload reference tables I created from scratch?

    Also, the format seems odd and difficult to use. I want to create a reference table that contains 9 data sets. Having a row with 9 very long bucket URLs is hard to deal with in a spreadsheet.

    The first column name is 'workspace:ColData.csv'. The file I am uploading is a TSV; however, the first reference file is actually a CSV.

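    ```
    # Bucket ID copied from the workspace Dashboard
    BUCKET_URL="gs://bucketIdFromDashboard"

    # List the bucket and keep only the file names (fourth "/"-separated field)
    gsutil ls "${BUCKET_URL}/" | cut -d / -f 4 > fileNames.txt

    # Save the full URLs (tee also prints them to the terminal)
    gsutil ls "${BUCKET_URL}/" | tee urls.txt

    # Join each file's lines with tabs: row 1 = file names, row 2 = URLs
    paste -s fileNames.txt urls.txt > reference_data.tsv
    ```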

     

    Kind regards

    Andy

  • Comment author
    Andrew Davidson

    I created a new workspace. As a short-term workaround, my reference files are in my workspace bucket, so I can use 'files' to add them to a workflow. This is not going to be easy to use in the future. Every time I submit a job I will see a new entry in 'files', so very quickly it will be hard for me to find my references.

    It will also make it difficult to share my references.

    Kind regards

    Andy

  • Comment author
    Allie Cliffe

    Andrew Davidson The error in your first screenshot was because the TSV file you tried to upload did not begin with "workspace:<your first reference key>" in the top left cell.

    For step-by-step instructions, see Creating workspace data tables. See if this helps!
