Step 4 - Stage data in submission workspace

Once you have prepared your omics object files and generated TSV files for each table in your data model, follow the directions below to deposit the data object files and all TSVs into the AnVIL-owned data submission workspace. Before proceeding to Step 5 - Ingest data into TDR, you'll QC the data in the submission workspace to make sure it conforms to AnVIL standards and to avoid problems during ingestion.

4.1. Log into Terra/AnVIL

4.1.1. Go to https://anvil.terra.bio/#workspaces and log in with your Terra ID. If you have logged into Terra before, you will use the same login ID.

4.1.2. The AnVIL Data Ingestion team will provide you with your submission workspace. Once logged into Terra, search for your workspace in Your Workspaces and click on the link to open it. 

4.2. Set up the submission workspace Bucket

Who can skip this step

Note that all data file objects need to exist in GCS, but not necessarily in the submission workspace storage. If your data file objects are already stored in an external GCS Bucket or a different Terra workspace, you can skip this step and proceed to 4.5. Upload tabular data to tables in the submission workspace (object files in external storage option).

Submission workspace file directory requirements

To facilitate ingestion into TDR using the Data Uploader (recommended), the submission workspace cloud storage needs a particular directory structure (a top-level “Uploads” directory). While no additional directory structure is required for data files, we generally recommend a sub-directory such as "data_files" to make navigating the submission workspace storage easier going forward.

Step-by-step instructions

4.2.1. Go to the workspace Files directory (click the data file icon in the right sidebar from the Data page).

4.2.2. Make an “Uploads” folder to upload into.

4.2.3. You can make additional sub-folders (such as “data_files”) to help manage data in the Bucket. 

[Screenshot: Bucket directory in the workspace Files screen]

Note that you can create subfolders (such as data_files in the screenshot above) under Uploads to help keep data organized. 
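
For reference, a submission workspace Bucket set up this way might look like the following (the bucket name and file names are illustrative placeholders):

    gs://<submission-workspace-bucket>/
        Uploads/
            data_files/
                NA12078.cram
                NA12078.cram.crai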

4.3. Upload data object files to the Uploads folder

Who can skip this step

You'll need to perform this step if you'll be storing your data object files (such as CRAMs, BAMs, or VCFs) in the submission workspace Bucket. Note that you can store your data in whatever GCS location you prefer, but storage costs for the submission workspace Bucket (for approved studies) are covered by AnVIL.

Considerations when uploading data to the submission workspace

  • You may use any mechanism you prefer to upload your data file objects to the submission workspace bucket, including the Data Uploader, gcloud storage, gsutil, or the in-app uploader.
  • All data file objects must have an md5 recorded in their GCS object metadata prior to pushing the data into TDR; otherwise the push will fail.
  • Certain file upload methods (such as parallel composite upload) can result in an md5 not being recorded in the GCS object metadata, which prevents a file from being properly ingested into TDR.
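
As a hedged example, if you upload with gsutil you can force composite uploads off by setting the threshold to 0 for that command (the file name and bucket name are placeholders):

    # Disable parallel composite uploads for this copy so GCS records an md5
    gsutil -o "GSUtil:parallel_composite_upload_threshold=0" cp NA12078.cram gs://<submission-workspace-bucket>/Uploads/data_files/

Recent gcloud storage releases expose a similar parallel composite upload setting; check the Cloud SDK documentation for your version. Either way, step 4.4 below is the authoritative check that an md5 was recorded.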

Use gcloud storage for uploading

If you don’t have a favorite uploading tool, we recommend the gcloud storage command line interface. See the step-by-step instructions linked below.

Step-by-step instructions

Option 1: Large numbers and/or large files (typical case)

We suggest using gcloud storage in a local terminal to upload data object files. See How to move data to/from a Google Bucket (large numbers/large files) for step-by-step instructions.
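
As a minimal sketch (the bucket name and local directory are placeholders; see the linked article for the full workflow, including how to find your workspace Bucket name):

    # Recursively copy a local directory of data files into the Uploads/data_files
    # folder of the submission workspace Bucket
    gcloud storage cp --recursive ./data_files gs://<submission-workspace-bucket>/Uploads/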

Option 2: Small numbers and small files

If you have a small number of small (<some size>) files, you can upload files directly in the submission workspace. For step-by-step instructions, go to 4.5. Upload tabular data and follow the full instructions in Upload data and populate the table with linked file paths. In this case, you will use the Data Uploader to upload both the data object files and tables (TSVs) together.

Skip to 4.5. Upload tabular data below. 

4.4. Verify the upload created an md5

To check that the md5 field is populated, you can either examine the GCS object metadata directly or run the CreateWorkspaceFileManifest workflow (in the Workflows tab). The workflow creates a file_metadata table in your submission workspace with an md5_hash column you can review to make sure it is populated for all data files.
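
To spot-check the GCS metadata for a single file directly, something like the following should work (the path is illustrative; field names in the output can vary slightly by SDK version):

    # Print the object's metadata and look for a populated md5 hash field
    gcloud storage objects describe gs://<submission-workspace-bucket>/Uploads/data_files/NA12078.cram | grep -i md5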

Note that you can then delete the file_metadata table if you don't want to include it in the dataset (if you already have a file metadata table, for instance).

4.5. Upload tabular data to tables in the submission workspace

Once your data object files are in the submission workspace or an external GCS Bucket, you will upload all tabular data (TSVs) to the workspace. Uploading the TSVs as workspace tables ensures they are correctly formatted.

Step-by-step instructions

How you upload TSV files depends on where your data object files are stored:

  • Option 1: Data object files will be stored in the submission workspace
  • Option 2: Data object files are stored in a different workspace / external GCS

Choose the option below with the correct instructions.

  • The Data Uploader tool (recommended) uploads all tables (TSV files) and changes the data object file names in the TSVs to full paths to the data object files in the submission workspace Bucket.

    To use the Data Uploader, see How to upload data and populate the data table with file links.

    If you uploaded data object files with gcloud storage in step 4.3 above

    You can skip step 3 in the step-by-step instructions linked above. As long as you uploaded files to an Uploads directory, the Data Uploader should work properly.

    If you are adding small numbers of small files

    Follow all four steps in the Data Uploader documentation.

  • If your data file objects are stored outside the submission workspace, you can upload all the TSVs in your data model right in the workspace, following the step-by-step instructions below. Make sure your TSVs include full paths to the files in GCS (e.g., gs://your-bucket-name/NA12078.cram instead of NA12078.cram). An example TSV with full paths is shown after these steps.

    4.5.1. Click the Import Data button at the top left of the workspace Data page.

    4.5.2. Select Upload TSV and follow the prompts.

    4.5.3. Repeat for all tables in your data model.
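
For illustration only, a biosample table TSV with full GCS paths might look like the two tab-separated lines below. The cram_file column and all values are hypothetical; the key points are that the first column header follows Terra's entity:<table>_id convention and that file columns contain full gs:// paths.

    entity:biosample_id    donor_id      cram_file
    SAMPLE_0001            DONOR_0001    gs://your-bucket-name/NA12078.cram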

4.6. Validate submitted data 

To validate the staged data prior to pushing to TDR, you’ll create a data dictionary of your data model and run the TerraSummaryStatistics workflow with the data dictionary as input.

What to expect

The Summary workflow checks the data in tables against the data dictionary. It will add QC columns that can be checked against the expected values (from the data dictionary). If there are flags (discrepancies), you will update the data tables, then run the workflow again to confirm the flags have been resolved.

Create a data dictionary for your data model

4.6.1. Generate a Data Dictionary in a spreadsheet editor. The data dictionary is a single TSV with a row for every column in every table in your data model. Each row includes information about what data and format to expect and other useful metadata for each table attribute.

Data Dictionary required/suggested entries

Required columns include table_name, column_name, label, description, primary_key, and required. The full set of entries is described below.

  • table_name: The name of the table the column is a part of. Example value: "biosample"
  • column_name: The name of the column being defined. Example value: "biosample_id"
  • label: A human-readable label for the column. Example value: "Sample identifier"
  • description: A text description of the column. Example value: "The unique identifier for the biosample"
  • primary_key: Indicates whether the column is the primary key of the table or not. Allowed values: TRUE, FALSE. Example value: false
  • refers_to_column: The table and column the column refers back to, if a foreign key column. Denoted as "<table>.<column>" (separated by a period). Example value: donor.donor_id
  • required: Indicates whether the column is required or not. If required, it is expected that the column will not contain null values. Allowed values: TRUE, FALSE. Example value: true
  • data_type: The expected data type of the column. Allowed values: boolean, float, int, string, fileref. Example value: string
  • multiple_values_allowed: Indicates whether the column may contain arrays or not. Allowed values: TRUE, FALSE. Example value: false
  • allowed_values_list: A comma-separated list of values that are allowed to be present in the column. If specified, it is expected that the column will not contain non-null values outside of this list. Example value: Male, Female
  • allowed_values_pattern: A regular expression pattern that the values of the column are expected to match. If specified, it is expected that the column will not contain non-null values that don't match the pattern. Example value: ^SUB[0-9]{6}$
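
Putting these together, the header row of the data dictionary TSV is a single tab-separated line containing the column names above (the order here simply follows the list above):

    table_name    column_name    label    description    primary_key    refers_to_column    required    data_type    multiple_values_allowed    allowed_values_list    allowed_values_pattern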

Example Data Dictionary

The example below corresponds to part of the Data Dictionary for the simple toy model in Step 2 - Set up Data Model. Note that this truncated example includes only the first four columns for some of the attributes in the biosample and donor tables. A real Data Dictionary will have many more rows and columns.

table_name    column_name     label                   description
biosample     biosample_id    biosample identifier    Text description of unique biosample ID
biosample     donor_id        donor_identifier        ID of donor associated with the biosample
biosample     disease         disease                 Disease diagnosis for biosample
biosample     disease_code    disease code            Ontology code corresponding to disease found in biosample
...
donor         donor_id        donor identifier        ID corresponding to the donor in the study
...

Run the validation workflow

To validate the tabular data uploaded in 4.5 above, you'll use this data dictionary as input to the TerraSummaryStatistics WDL (included in the submission workspace).

4.6.2. Upload the data dictionary TSV (from step 4.6.1) to the home directory of the workspace Bucket (click the Files icon in the right sidebar of the Data tab to expose the workspace Bucket file directory, seen below).

[Screenshot: data dictionary in the submission workspace Bucket, with an arrow pointing to the upload button]
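
If you prefer the command line to the in-app upload button, a copy along these lines should also work (the bucket name is a placeholder; your workspace's Bucket name is shown on the workspace Dashboard):

    # Copy the data dictionary to the top level (home directory) of the workspace Bucket
    gcloud storage cp data_dictionary.tsv gs://<submission-workspace-bucket>/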

4.6.3. In the Workflows tab, click on the TerraSummaryStatistics workflow and confirm that the billing_project and workspace_name variables in the workflow configuration point to the submission workspace.

4.6.4. Ensure Run workflow with inputs defined by file paths is selected (1).

4.6.5. Set the data_dictionary_file variable in the workflow configuration to the GS path of the data dictionary file uploaded in the previous step, enclosed in double quotes (e.g., “gs://fc-secure-3828f6e6-f78d-487-a649-05ae9701b6/data_dictionary.tsv”).

How to find the full path to the data dictionary

Note that you can browse the workspace storage Bucket by clicking on the files icon to the far right of the data_dictionary_file variable (2). Click on the data dictionary TSV file you just uploaded. 

4.6.6. Save the configuration (3), then click the blue Launch button to the right of the Outputs tab to kick off the workflow.

[Screenshot: TerraSummaryStatistics configuration pane with the data_dictionary_file input variable circled]

What to expect

Terra will write an output TSV to the same location as the data dictionary TSV, with “.summary_stats.<YYYYMMDD>” appended to the file name. The summary stats TSV will include useful information about the data uploaded to the submission workspace (see the screenshot below with the additional information columns showing at the right side of the table). 
[Screenshot: example summary stats TSV]

4.6.7. Download and open this TSV and review its contents. The additional columns, as well as guidance for how to address issues, are listed below.

inferred_data_type

    • Description: The data type that will be inferred for the column by the workflow that pushes the data to TDR. See the workflow README for a mapping between input data types and TDR data types.
    • Flags if: The input “data_type” does not match the “inferred_data_type” (when considering the mapping from input data types to TDR data types), OR the input “data_type” value is not in the Allowed Values list.
    • What to do if flagged: Update the input “data_type” if appropriate, or update the column values to properly reflect the input “data_type”.

inferred_multiple_values_allowed

    • Description: Whether or not the workflow that pushes the data to TDR will infer the column to allow arrays or lists.
    • Flags if: The input “multiple_values_allowed” value does not match the “inferred_multiple_values_allowed” value.
    • What to do if flagged: Update the input “multiple_values_allowed” if appropriate, or update the column values to properly reflect the input “multiple_values_allowed”.

record_count

    • Description: The count of rows or records in the table the column is a part of.
    • Flags if: N/A
    • What to do if flagged: N/A

null_value_count

    • Description: The count of rows or records in which the column contains a null value.
    • Flags if: The input “required” indicator was set to TRUE and null values were found in the column.
    • What to do if flagged: Update the input “required” indicator if appropriate, or update the column to not include null values.

unique_value_count

    • Description: The count of unique or distinct values in the column.
    • Flags if: N/A
    • What to do if flagged: N/A

value_not_in_ref_col_count

    • Description: The count of rows or records in which the column contains a value that does not properly match a value in the column referenced in the input “refers_to_column”.
    • Flags if: The input “refers_to_column” specifies a <table>.<column> combination not present in the workspace OR the column being evaluated contains values not present in the <table>.<column> specified in the input “refers_to_column”.
    • What to do if flagged: Update the input “refers_to_column” to include the appropriate column, or update the table data to ensure appropriate linkage between the column being evaluated and the column specified in the input “refers_to_column”.

non_allowed_value_count

    • Description: The count of rows or records in which the column contains a value that does not match a value specified in the input “allowed_values_list” or does not match the pattern specified in the input “allowed_values_pattern”.
    • Flags if: The column being evaluated contains values not present in the input “allowed_values_list” (if specified) OR the column being evaluated contains values that do not match the pattern specified in the input “allowed_values_pattern” (if specified).
    • What to do if flagged: Update the input “allowed_values_list” and/or “allowed_values_pattern” as appropriate, or update the column values to ensure they match the specified allowed value list and/or pattern.

flagged

    • Description: Indicates whether or not any issues with the column were flagged during validation.
    • Flags if: N/A
    • What to do if flagged: N/A

flag_notes

    • Description: If “flagged” = TRUE, a list of issues captured for the column during validation.
    • Flags if: N/A
    • What to do if flagged: N/A

4.6.8. Re-run Steps 4.6.2 to 4.6.7 as needed to clear validation errors and produce a clean data dictionary and summary statistics TSV prior to pushing the data to the Terra Data Repository.

Most submitters will not get it right on their first try. If our verification WDL is doing its job, it will give you enough information to correct any errors and try again. If you need to update or delete data in the Bucket, you can do so using gcloud storage commands.
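
For example, to remove a mis-staged file and upload a corrected one (paths are illustrative):

    # Remove the file that failed validation
    gcloud storage rm gs://<submission-workspace-bucket>/Uploads/data_files/old_file.cram
    # Upload the corrected replacement
    gcloud storage cp fixed_file.cram gs://<submission-workspace-bucket>/Uploads/data_files/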

If you have additional questions, please reach out to the AnVIL ingestion team at anvil-data@broadinstitute.org.
