Once you have prepared your omics object files and generated TSV files for each table in your data model, follow the directions below to deposit the data object files and all TSV files into AnVIL-owned data submission workspaces. Before proceeding to Step 5 - Ingest data into TDR, you'll QC the data in the submission workspace to make sure it conforms to all AnVIL standards and to avoid problems during ingestion.
4.1. Log into Terra/AnVIL
4.1.1. Go to https://anvil.terra.bio/#workspaces and log in with your Terra ID. If you have logged into Terra before, you will use the same login ID.
4.1.2. The AnVIL Data Ingestion team will provide you with your submission workspace. Once logged into Terra, search for your workspace in Your Workspaces and click on the link to open it.
4.2. Set up the submission workspace Bucket
Who can skip this step
Note that all data file objects need to exist in GCS, but not necessarily in the submission workspace storage.
If your data file objects are already stored in an external GCS Bucket or a different Terra workspace, you can skip this step and proceed to 4.4. Upload tabular data to the submission workspace (object files in external storage option).
Submission workspace file directory requirements
To facilitate ingestion into TDR using the Data Uploader (recommended), the submission workspace cloud storage needs to have a particular directory structure (a top-level “Uploads” directory). While there is no additional required directory structure for data files, we generally recommend using a sub-directory of "data_files" or something similar to make navigating the submission workspace storage a little easier going forward.
Step-by-step instructions
4.2.1. Go to the workspace Files directory (click the data file icon in the right sidebar from the Data page).
4.2.2. Make an “Uploads” folder to upload into.
4.2.3. You can make additional sub-folders (such as “data_files”) to help manage data in the Bucket.
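If you prefer the command line, note that GCS has no true folders: a "folder" is just a prefix in an object's name, so copying any object under the Uploads/ prefix creates the same structure. A minimal sketch with gcloud storage (the bucket path is a placeholder; your workspace's real gs:// path is shown on the workspace Dashboard):

```bash
# Placeholder bucket; substitute your submission workspace's gs:// path.
BUCKET="gs://fc-your-workspace-bucket"

# Copying an object under Uploads/data_files/ implicitly creates both "folders".
gcloud storage cp README.txt "$BUCKET/Uploads/data_files/README.txt"
```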
4.3. Upload data object files to the Uploads folder
Who can skip this step
You'll need to perform this step if you'll be storing your data object files (such as CRAMs, BAMs, or VCFs) in the submission workspace Bucket. Note that you can store your data in whatever GCS location you prefer, but storage costs for the submission workspace Bucket (for approved studies) are covered by AnVIL.
Considerations when uploading data to the submission workspace
- You may use any mechanism you prefer to upload your data file objects to the submission workspace bucket, including the Data Uploader, gcloud storage, gsutil, or the in-app uploader.
- All data file objects must have an md5 recorded in their GCS object metadata prior to the push of data into TDR, otherwise the push will fail.
- Certain file upload methods (such as parallel composite upload) can result in an md5 not being recorded in the GCS object metadata, which prevents a file from being properly ingested into TDR.
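One common culprit is parallel composite uploads, which some tools enable automatically for large files. If you use gcloud storage, you can turn this behavior off before copying. A minimal sketch (the storage/parallel_composite_upload_enabled property reflects our understanding of the current gcloud configuration surface, so verify it against your installed version):

```bash
# Disable parallel composite uploads so GCS computes and records an md5
# for every uploaded object.
gcloud config set storage/parallel_composite_upload_enabled False
```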
Use gcloud storage for uploading
If you don’t have a favorite uploading tool, we recommend the gcloud storage command line interface. See step-by-step instructions here.
Step-by-step instructions
Option 1: Large numbers and/or large files (typical case)
We suggest using gcloud storage in a local terminal to upload data object files. See How to move data to/from a Google Bucket (large numbers/large files) for step-by-step instructions.
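For example, a recursive upload of a local directory into the Uploads area might look like this sketch (both paths are placeholders):

```bash
# Recursively copy a local directory of data object files (CRAMs, BAMs, VCFs)
# into the workspace Uploads directory; substitute your own paths.
gcloud storage cp -r ./data_files "gs://fc-your-workspace-bucket/Uploads/"
```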
Option 2: Small numbers and small files
If you have a small number of small (<some size>) files, you can upload files directly in the submission workspace. For step-by-step instructions, go to 4.5. Upload tabular data and follow the full instructions in Upload data and populate the table with linked file paths. In this case, you will use the Data Uploader to upload both the data object files and tables (TSVs) together.
Skip to 4.5. Upload tabular data below.
4.4. Verify the upload created an md5
To check that an md5 was recorded, you can either examine the GCS object metadata directly (see the sketch below) or run the CreateWorkspaceFileManifest workflow (in the Workflows tab). The workflow creates a file_metadata table in your submission workspace with an md5_hash column you can review to make sure it is populated for all data files.
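To examine the metadata directly for a single object, something like the following works from a terminal (the object path is a placeholder):

```bash
# Print the object's metadata and filter for the md5 field; an empty result
# means no md5 was recorded (e.g., after a parallel composite upload).
gcloud storage objects describe \
  "gs://fc-your-workspace-bucket/Uploads/data_files/NA12078.cram" | grep -i md5
```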
Note that you can then delete the file_metadata table if you don't want to include it in the dataset (if you already have a file metadata table, for instance).
4.5. Upload tabular data to tables in the submission workspace
Once your data object files are in the submission workspace or external GCS Bucket, you will upload all tabular data (TSVs) to the workspace. Uploading the TSVs as workspace tables ensures they are correctly formatted.
Step-by-step instructions
How you upload TSV files depends on where your data object files are stored
- Option 1: Data file objects will be stored in the submission workspace
- Option 2: Data object files are stored in a different workspace / external GCS
Choose the option below with the correct instructions.

Option 1: Data file objects stored in the submission workspace
The Data Uploader tool (recommended) uploads all tables (TSV files) and rewrites the data object file names in the TSVs as full paths to the data object files in the submission workspace Bucket.
To use the Data Uploader, see How to upload data and populate the data table with file links.
If you uploaded data object files with gcloud storage in step 4.3 above
You can skip step 3 in the step-by-step instructions linked above. As long as you uploaded files to an Uploads directory, the Data Uploader should work properly.
If you are adding small numbers of small files
Follow all four steps in the Data Uploader documentation.
Option 2: Data object files stored in a different workspace / external GCS
If your data file objects are stored outside the submission workspace, you can upload all the TSVs in your data model right in the workspace, following the step-by-step instructions below. Make sure your TSVs include full paths to the files in GCS (e.g., gs://your-bucket-name/NA12078.cram instead of NA12078.cram). One way to add these prefixes in bulk is sketched after the steps below.
4.5.1. Click the Import Data button at the top left of the workspace Data page.
4.5.2. Select Upload TSV and follow the prompts.
4.5.3. Repeat for all tables in your data model.
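If your TSVs currently contain bare file names, one way to rewrite a column as full gs:// paths is a quick awk pass. A sketch, assuming the file name sits in a hypothetical third column (adjust the column index, prefix, and file names for your tables):

```bash
# Prepend the bucket prefix to the 3rd column of every data row (the header
# row is left untouched), writing the result to a new TSV.
awk -F'\t' -v OFS='\t' -v prefix="gs://your-bucket-name/" \
  'NR > 1 { $3 = prefix $3 } { print }' biosample.tsv > biosample_fullpaths.tsv
```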
4.6. Validate submitted data
To validate the staged data prior to pushing to TDR, you’ll create a data dictionary of your data model and run the TerraSummaryStatistics workflow with the data dictionary as input.
What to expect
The TerraSummaryStatistics workflow checks the data in tables against the data dictionary. It adds QC columns that can be checked against the expected values from the data dictionary. If there are flags (discrepancies), you will update the data tables, then run the workflow again to confirm the flags have been resolved.
Create a data dictionary for your data model
4.6.1. Generate a Data Dictionary in a spreadsheet editor. The data dictionary is a single TSV with a row for every column in every table in your data model. Each row includes information about what data and format to expect and other useful metadata for each table attribute.
Data Dictionary required/suggested entries
Required columns (shown in bold in the table below) include table_name, column_name, label, description, primary_key (TRUE/FALSE), and required.
| Column Name | Description | Example Value |
| --- | --- | --- |
| **table_name** | The name of the table the column is a part of. | "biosample" |
| **column_name** | The name of the column being defined. | "biosample_id" |
| **label** | A human-readable label for the column. | "Sample identifier" |
| **description** | A text description of the column. | "The unique identifier for the biosample" |
| **primary_key** | Indicates whether the column is the primary key of the table or not. Allowed Values: TRUE, FALSE | FALSE |
| refers_to_column | The table and column the column refers back to, if a foreign key column. Denoted as "<table>.<column>" (separated by a period). | donor.donor_id |
| **required** | Indicates whether the column is required or not. If required, it is expected that the column will not contain null values. Allowed Values: TRUE, FALSE | TRUE |
| data_type | The expected data type of the column. Allowed Values: boolean, float, int, string, fileref | string |
| multiple_values_allowed | Indicates whether the column may contain arrays or not. Allowed Values: TRUE, FALSE | FALSE |
| allowed_values_list | A comma-separated list of values that are allowed to be present in the column. If specified, it is expected that the column will not contain non-null values outside of this list. | Male, Female |
| allowed_values_pattern | A regular expression pattern that the values of the column are expected to match. If specified, it is expected that the column will not contain non-null values that don't match the pattern. | ^SUB[0-9]{6}$ |
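Before committing a pattern to allowed_values_pattern, it can be worth sanity-checking it from a terminal. A minimal sketch using grep with the example pattern above:

```bash
# Prints the ID (a match) for the first line, nothing for the second
# (only 5 digits, so it fails the pattern).
echo "SUB001234" | grep -E '^SUB[0-9]{6}$'
echo "SUB12345"  | grep -E '^SUB[0-9]{6}$'
```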
Example Data Dictionary
The example below corresponds to a part of the Data Dictionary for the simple toy model in Step 2 - Set up Data Model. Note that this truncated example includes the first four columns for some of the attributes in the biosample and donor tables. A real Data Dictionary will have many more rows and columns.
| table_name | column_name | label | description |
| --- | --- | --- | --- |
| biosample | biosample_id | biosample identifier | Text description of unique biosample ID |
| biosample | donor_id | donor identifier | ID of donor associated with the biosample |
| biosample | disease | disease | Disease diagnosis for biosample |
| biosample | disease_code | disease code | Ontology code corresponding to disease found in biosample |
| ... | ... | ... | ... |
| donor | donor_id | donor identifier | ID corresponding to the donor in the study |
| ... | ... | ... | ... |
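As a literal file, the same rows would be tab-delimited plain text. The sketch below illustrates the format with a header and two rows (the primary_key, required, and data_type values shown are illustrative, not taken from the toy model):

```
table_name	column_name	label	description	primary_key	required	data_type
biosample	biosample_id	biosample identifier	Text description of unique biosample ID	TRUE	TRUE	string
donor	donor_id	donor identifier	ID corresponding to the donor in the study	TRUE	TRUE	string
```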
Run the validation workflow
To validate the tabular data uploaded in 4.5 above, you'll use this data dictionary as input to the TerraSummaryStatistics WDL (included in the submission workspace).
4.6.2. Upload the data dictionary TSV (step 4.6.1) to the home directory in the workspace Bucket (click the Files icon in the right sidebar of the Data tab to expose the workspace Bucket file directory).
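If you prefer the command line, a one-line sketch (the bucket path is a placeholder):

```bash
# Copy the data dictionary to the top level ("home directory") of the
# workspace Bucket; substitute your workspace's real gs:// path.
gcloud storage cp data_dictionary.tsv "gs://fc-your-workspace-bucket/"
```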
4.6.3. In the Workflows tab, click on the TerraSummaryStatistics workflow and confirm the billing_project and workspace_name variables in the workflow configuration are properly pointing to the submission workspace.
4.6.4. Ensure Run workflow with inputs defined by file paths is selected.
4.6.5. Set the data_dictionary_file variable in the workflow configuration to the GS path of the data dictionary file uploaded in the previous step, enclosed in double quotes (e.g., “gs://fc-secure-3828f6e6-f78d-487-a649-05ae9701b6/data_dictionary.tsv”).
How to find the full path to the data dictionary
Note that you can browse the workspace storage Bucket by clicking on the files icon to the far right of the data_dictionary_file variable. Click on the data dictionary TSV file you just uploaded.
4.6.6. Save the configuration, then click the blue Launch button to the right of the Outputs tab to kick off the workflow.
What to expect
Terra will write an output TSV to the same location as the data dictionary TSV, with “.summary_stats.<YYYYMMDD>” appended to the file name. The summary stats TSV will include useful information about the data uploaded to the submission workspace, with the additional information columns appended at the right side of the table.
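Step 4.6.7 below asks you to download and review this TSV; from a terminal, a sketch along these lines works (the bucket path and wildcard are placeholders; match whatever file Terra actually wrote):

```bash
# Download the summary stats TSV written next to the data dictionary.
gcloud storage cp "gs://fc-your-workspace-bucket/*summary_stats*" .
```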
4.6.7. Download and open this TSV and review its contents. The additional columns, as well as guidance for how to address issues, are listed below.
inferred_data_type
- Description: The data type that will be inferred for the column by the workflow that pushes the data to TDR. See the workflow README for a mapping between input data types and TDR data types.
- Flags if: The input “data_type” does not match the “inferred_data_type” (when considering the mapping from input data types to TDR data types), OR the input “data_type” value is not in the Allowed Values list.
- What to do if flagged: Update the input “data_type” if appropriate, or update the column values to properly reflect the input “data_type”.
inferred_multiple_values_allowed
- Description: Whether or not the workflow that pushes the data to TDR will infer the column to allow arrays or lists.
- Flags if: The input “multiple_values_allowed” value does not match the “inferred_multiple_values_allowed” value.
- What to do if flagged: Update the input “multiple_values_allowed” if appropriate, or update the column values to properly reflect the input “multiple_values_allowed”.
record_count
- Description: The count of rows or records in the table the column is a part of.
- Flags if: N/A
- What to do if flagged: N/A
null_value_count
- Description: The count of rows or records in which the column contains a null value.
- Flags if: The input “required” indicator was set to TRUE and null values were found in the column.
- What to do if flagged: Update the input “required” indicator if appropriate, or update the column to not include null values.
unique_value_count
- Description: The count of unique or distinct values in the column.
- Flags if: N/A
- What to do if flagged: N/A
value_not_in_ref_col_count
- Description: The count of rows or records in which the column contains a value that does not properly match a value in the column referenced in the input “refers_to_column”.
- Flags if: The input “refers_to_column” specifies a <table>.<column> combination not present in the workspace OR the column being evaluated contains values not present in the <table>.<column> specified in the input “refers_to_column”.
- What to do if flagged: Update the input “refers_to_column” to include the appropriate column, or update the table data to ensure appropriate linkage between the column being evaluated and the column specified in the input “refers_to_column”.
non_allowed_value_count
- Description: The count of rows or records in which the column contains a value that does not match a value specified in the input “allowed_values_list” or does not match the pattern specified in the input “allowed_values_pattern”.
- Flags if: The column being evaluated contains values not present in the input “allowed_values_list” (if specified) OR the column being evaluated contains values that do not match the pattern specified in the input “allowed_values_pattern” (if specified).
- What to do if flagged: Update the input “allowed_values_list” and/or “allowed_values_pattern” as appropriate, or update the column values to ensure they match the specified allowed value list and/or pattern.
flagged
- Description: Indicates whether or not any issues with the column were flagged during validation.
- Flags if: N/A
- What to do if flagged: N/A
flag_notes
- Description: If “flagged” = TRUE, a list of issues captured for the column during validation.
- Flags if: N/A
- What to do if flagged: N/A
4.6.8. Re-run Steps 4.6.2 to 4.6.7 as needed to clear validation errors and produce a clean data dictionary and summary statistics TSV prior to pushing the data to the Terra Data Repository.
Most submitters will not get it right on their first try. If our verification WDL is doing its job, it will give enough information to help you correct any errors and try again. If you need to update or delete the data in the bucket, you can do so using gcloud storage commands, as in the sketch below.
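For example (all paths are placeholders):

```bash
# Overwrite a staged object with a corrected local copy (the md5 is re-recorded).
gcloud storage cp ./fixed/NA12078.cram \
  "gs://fc-your-workspace-bucket/Uploads/data_files/NA12078.cram"

# Remove an object that should not be ingested.
gcloud storage rm "gs://fc-your-workspace-bucket/Uploads/data_files/old_file.cram"
```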
If you have additional questions, please reach out to the AnVIL ingestion team at anvil-data@broadinstitute.org.