Step 3 - Prepare data for submission to AnVIL

Allie Cliffe
  • Updated

Use the data model (step 2) as an outline for generating TSVs for all the tables in your data model.

Summary

Before loading data into your staging workspace (step 4), you’ll organize all required data and metadata in a format compatible with AnVIL. In this step, you'll prepare for data ingestion by generating a TSV for each table in your data model using the spreadsheet editor of your choice and saving as tab separated values (TSV) format.

PrerequisitesHas your data been approved?
This step assumes that the study has been approved by the AnVIL review board to be hosted in AnVIL through dbGaP. See Step 1 -  Register Study/Obtain Approvals for more details and step-by-step instructions. 

Have you set up your Data Model? 
See Step 2: Set up a data model for help creating your Data Dictionary, which defines all the tables in your data model. For a template Data Dictionary with all required and suggested tables, click here. To download the AnVILDataSubmissionFindabilitySubsetSchema.template.xlsx file, click on the three-dot icon at the top right and then click Download

Submission data overview

AnVIL accepts two types of data

  • Data object files (omics and images)
    Examples of object files are genomic and other omics files, and images. These will be stored in Google Buckets (GCS), which can include the staging workspace Bucket, a different Terra workspace Bucket, or external GCS. Note that in addition to the data files, AnVIL requires minimal metadata for all object files, some of which is generated by the AnVIL (i.e., full path to the files in GCS).
  • Tabular data (hold clinical data including phenotypes and object files metadata)
    • Biosample, Donor and Files tables from your data model
    • Submitted in TSV or CSF format (see requirements below)

Most studies are submitting both.

To prepare data for submission, you will

  1. Make sure all object files (omics data/images) conform to AnVIL’s naming requirements
  2. Determine how to reference data object files in GCS
  3. Generate a spreadsheet-like file (TSV format) for each table in your data model (from Step 2: Set up a data model

Before you start: Register for a Terra Account

Before your data is ingested to AnVIL (TDR), you’ll organize and store all tables and object files from your data dictionary in an AnVIL data-deposit workspace on Terra. You will need a Terra Account to stage data in the data deposit workspace.

If you do not already have a Terra account, you’ll find step-by-step instructions in How to register on Terra (Google SSO

Registering for a Terra account is free and the AnVIL pays all costs associated with uploading and storing your data.

Note that to complete an analysis, you will need to connect a Google Billing Account to your Terra account. See How to set up billing in Terra (GCP) for more information.

3.1. Make sure data file objects conform to AnVIL naming requirements

You may provide AnVIL with data object files such as image files and genomic and other omics data (VCFs, CRAMs, BAMs, IDATs, or FASTQs). These can be uploaded to the staging workspace in Step 4: Stage Data or stored in external Google Cloud Buckets or a different Terra workspace.

Object file name formatting requirementsObject file names may only contain numbers, letters, colons (:), dashes (-), and underscores (_).

Special characters cannot be used in any field or file name. If your files contain special characters (i.e., “%” or “*”) , you must remove/replace them before ingestion.

3.2. Referencing data object files (like .cram or .bam files) in GCS

You do not need to move all of the data files for your dataset into the staging workspace in order for them to be referenced in TDR. They simply need to exist in a Google Cloud Storage (GCS) bucket where the TDR dataset can be granted read access (see guidance for this below).

How you reference the files in your load file TSVs depends on where they will be stored

  • Option 1: Files will be stored in the staging workspace Bucket (you’ll upload them in step 2)
  • Option 2: Files are/will be stored outside the staging workspace Bucket (in another workspace or external GCS Bucket)
  • Option 1: Data file objects will be uploaded to the staging workspace Bucket

    If your data object files are not already stored in the cloud (external GCS or a different Terra workspace) and you will use the Data Uploader (recommended) to upload them to the staging workspace (Step 4), you will reference the file name only in your TSV file. See the blue-shaded cells in the example below.

    Example biosample.tsv (what you submit to Data Uploader)

    biosample_id

    crai cram donor_id
    AC101-2025 AC101.crai AC101.cram AC101
    NA12878_2024 NA12878.crai NA12878.cram NA12878
  • Option 2: Data file objects are/will be stored outside the staging workspace Bucket

    Note that in this case, you will need to include the full path (i.e., gs://<bucket name>/<file name>) in the TSV before uploading to your staging workspace.

    Example biosample.tsv (data files in external GCS Buckets)

    biosample_id

    crai cram donor_id
    AC101-2025 gs://your-bucket-id/AC101.crai gs://your-bucket-id/AC101.cram AC101
    NA1287_2024 gs://your-bucket-id/NA1287.crai gs://your-bucket-id/NA1287.cram NA1287

3.3. Generate table load files (TSV format)

In this step you'll create spreadsheet-like files (TSV format) for each table in your data model. You can format existing tabular data into a TSV file and submit it that way (for example, for phenotypic or demographic donor data), or you can create new spreadsheets/TSV files to put your data into.

Templates and examples spreadsheets

How to use the templatesClick view raw to download the template or example spreadsheet in .xlsx format. The template includes separate sheets for biosample, donor, file, condition, activity and project tables. The first three (in bold) are required and the second three are recommended. You can also include additional tables to hold data unique to your dataset. You can modify the spreadsheet in the editor of your choice and save each table as tab-separated values (.tsv) or tab-delimited text (.txt) to upload to the staging workspace. 

A video walkthrough of generating a load file (TSV format) from a template is available below. 

Formatting requirementsUnallowed characters
Your spreadsheets may only contain numbers, letters, colons (:), dashes (-) and underscores (_). No special characters (&, $, %, #, etc.) are allowed in any fields of the load file. If your files contain special characters (i.e., “%” or “*”) , you must remove/replace them before ingestion.

Each table must start with a column titled [tablename]_id
Each row in the table must have a unique foreign ID key. These keys will be used to associate data in different tables (i.e., biosample_id, donor_id and file_id). IDs should not contain ^ as a character.

Multiple values, where allowed, should be separated by '^'
For example, an associated field with 3 values would look like: value1^value2^value3.

How large can your tables (TSVs) be?

Terra can generally handle tables of up to 100, 000 rows without any impact on performance.

Please contact us if you have tables with more than 100,000+ rows. AnVIL users should use your support person, if known. Otherwise contact us through the contact us form in Terra (Main menu > Support).

Example: Donor table in a spreadsheet editor

donor_id biosample_id donor_type age hdl height ldl
HG000096 bio12389021 human 76 89.34 179 124.81
HG000097 bio70538276 human 64 62.25 159 120.32

Associating Data in Different Tables

The key, or ID, column is used to associate data (link tables). For example, the biosample is associated with its donor by the donor_id column in the biosample table (blue shade) or the biosample_id in the donor table (greed shade).

biosample_id anatomical_site apriori_cell_type biosample_type disease donor_id
bio123890 blood abnormal cell Blood leukemia HG000097

Where possible, try to include data in the donor, biosample, or file tables If that’s not an option, the data can be submitted as additional, separate tables.

Any data beyond these minimal required tables must always be linked to either the donor_id, biosample_id, family_id, or file_id - depending on what the data element describes. For example, to link data in an additional table to a donor, make sure to include a donor_id column.

Addressing repeated elements

Please bring any repeating data elements (i.e., multiple values for a given data element for an individual) to the attention of the AnVIL team to ensure proper modeling and submission.

Examples of repeating elements

  • An individual in a data set (“donor”) has a measurement (e.g., blood pressure, lab test, BMI) taken at multiple time points.
  • An individual in a data set (“donor”) is affected by multiple disease/phenotype/conditions included in the study (e.g., an individual in a diabetes study has both diabetes and diabetes retinopathy; both are being tracked in the study).

3.4. Save as "Tab-Delimited Text" or "Tab-Separated Values"

Save each table TSV as a separate file in tab-delimited text or tab-separated values format. Your spreadsheet editor may give you a warning about losing data, but we assure you, it's fine!

Table names

In general, AnVIL will ignore the name you give the TSV file and will use the key name in the first column header (the part in front of the _id) for the table name when you upload tables to the staging workspace.

TSV versus TXT File Extensions

Depending on what spreadsheet editor you use, when you save in the proper format your spreadsheet may have either a ".tsv" or a ".txt" extension. AnVIL will accept either one.

Was this article helpful?

0 out of 0 found this helpful

Comments

0 comments

Please sign in to leave a comment.