Step 3 - Prepare data for submission to AnVIL

Use the data model (step 2) as an outline for generating TSVs for all the tables in your data model.

Summary

Before loading data into your staging workspace (step 4), you’ll organize all required data and metadata in a format compatible with AnVIL. In this step, you'll prepare for data ingestion by generating a TSV for each table in your data model using the spreadsheet editor of your choice and saving as tab separated values (TSV) format.

PrerequisitesHas your data been approved?
This step assumes that the study has been approved by the AnVIL review board to be hosted in AnVIL through dbGaP. See Step 1 - Register Study/Obtain Approvals for more details and step-by-step instructions.

Have you set up your Data Model?
See Step 2: Set up a data model for help creating your Data Dictionary, which defines all the tables in your data model. For a template Data Dictionary with all required and suggested tables, click here. To download the AnVILDataSubmissionFindabilitySubsetSchema.template.xlsx file, click on the three-dot icon at the top right and then click Download.

Submission data overview

AnVIL accepts two types of data

Data object files (omics and images)
Examples of object files are genomic and other omics files, and images. These will be stored in Google Buckets (GCS), which can include the staging workspace Bucket, a different Terra workspace Bucket, or external GCS. Note that in addition to the data files, AnVIL requires minimal metadata for all object files, some of which is generated by the AnVIL (i.e., full path to the files in GCS).
Tabular data (hold clinical data including phenotypes and object files metadata)
- Biosample, Donor and Files tables from your data model
- Submitted in TSV or CSF format (see requirements below)

Most studies are submitting both.

To prepare data for submission, you will

Make sure all object files (omics data/images) conform to AnVIL’s naming requirements
Determine how to reference data object files in GCS
Generate a spreadsheet-like file (TSV format) for each table in your data model (from Step 2: Set up a data model.

Before you start: Register for a Terra Account

Before your data is ingested to AnVIL (TDR), you’ll organize and store all tables and object files from your data dictionary in an AnVIL data-deposit workspace on Terra. You will need a Terra Account to stage data in the data deposit workspace.

If you do not already have a Terra account, you’ll find step-by-step instructions in How to register on Terra (Google SSO.

Registering for a Terra account is free and the AnVIL pays all costs associated with uploading and storing your data.

Note that to complete an analysis, you will need to connect a Google Billing Account to your Terra account. See How to set up billing in Terra (GCP) for more information.

3.1. Make sure data file objects conform to AnVIL naming requirements

You may provide AnVIL with data object files such as image files and genomic and other omics data (VCFs, CRAMs, BAMs, IDATs, or FASTQs). These can be uploaded to the staging workspace in Step 4: Stage Data or stored in external Google Cloud Buckets or a different Terra workspace.

Object file name formatting requirementsObject file names may only contain numbers, letters, colons (:), dashes (-), and underscores (_).

Special characters cannot be used in any field or file name. If your files contain special characters (i.e., “%” or “*”) , you must remove/replace them before ingestion.

3.2. Referencing data object files (like .cram or .bam files) in GCS

Where will your data be stored?You do not need to move all of the data files for your dataset into the staging workspace in order for them to be referenced in TDR. They simply need to exist in a Google Cloud Storage (GCS) bucket where the TDR dataset can be granted read access (see guidance for this below).

Note that AnVIL covers the storage costs for approved AnVIL datasets stored in the AnVIL data deposit workspace.

How you reference the files in your load file TSVs depends on where they will be stored

Option 1: Files will be stored in the staging workspace Bucket (recommended). In this case, cloud storage is paid by AnVIL. You’ll upload them in Step 4 - Stage data in submission workspace.
Option 2: Files are/will be stored outside the staging workspace Bucket (in another workspace or external GCS Bucket). Note that you will pay the cloud storage costs.

Option 1: Data file objects will be uploaded to the staging workspace Bucket

If your data object files are not already stored in the cloud (external GCS or a different Terra workspace) and you will use the Data Uploader (recommended) to upload them to the staging workspace, you will reference the file name only in your TSV file. See the blue-shaded cells in the example below.

Example biosample.tsv (what you submit to Data Uploader)

biosample_id
crai cram donor_id

AC101-2025 AC101.crai AC101.cram AC101

NA12878_2024 NA12878.crai NA12878.cram NA12878

Option 2: Data file objects are/will be stored outside the staging workspace Bucket

Note that in this case, you will need to include the full path (i.e., gs://<bucket name>/<file name>) in the TSV before uploading to your staging workspace.

Example biosample.tsv (data files in external GCS Buckets)

biosample_id	crai	cram	donor_id
AC101-2025	gs://your-bucket-id/AC101.crai	gs://your-bucket-id/AC101.cram	AC101
NA1287_2024	gs://your-bucket-id/NA1287.crai	gs://your-bucket-id/NA1287.cram	NA1287

3.3. Generate table load files (TSV format)

In this step you'll create spreadsheet-like files (TSV format) for each table in your data model. You can format existing tabular data into a TSV file and submit it that way (for example, for phenotypic or demographic donor data), or you can create new spreadsheets/TSV files to put your data into.

Templates and examples spreadsheets

Empty data template spreadsheet (click View raw to download template)
Example data spreadsheet (click View raw to download template)

How to use the templatesClick view raw to download the template or example spreadsheet in .xlsx format. The template includes separate sheets for biosample, donor, file, condition, activity and project tables. The first three (in bold) are required and the second three are recommended. You can also include additional tables to hold data unique to your dataset. You can modify the spreadsheet in the editor of your choice and save each table as tab-separated values (.tsv) or tab-delimited text (.txt) to upload to the staging workspace.

A video walkthrough of generating a load file (TSV format) from a template is available below.

Formatting requirementsUnallowed characters
Your spreadsheets may only contain numbers, letters, colons (:), dashes (-) and underscores (_). No special characters (&, $, %, #, etc.) are allowed in any fields of the load file. If your files contain special characters (i.e., “%” or “*”) , you must remove/replace them before ingestion.

Each table must start with a column titled [tablename]_id
Each row in the table must have a unique foreign ID key. These keys will be used to associate data in different tables (i.e., biosample_id, donor_id and file_id). IDs should not contain ^ as a character.

Multiple values, where allowed, should be in json formatting
For example, an associated field with 3 values would look like: [value1,value2,value3].

How large can your tables (TSVs) be?

Terra can generally handle tables of up to 100, 000 rows without any impact on performance.

Please contact us if you have tables with more than 100,000+ rows. AnVIL users should use your support person, if known. Otherwise contact us through the contact us form in Terra (Main menu > Support).

Example: Donor table in a spreadsheet editor

donor_id	biosample_id	donor_type	age	hdl	height	ldl
HG000096	bio12389021	human	76	89.34	179	124.81
HG000097	bio70538276	human	64	62.25	159	120.32

Associating Data in Different Tables

The key, or ID, column is used to associate data (link tables). For example, the biosample is associated with its donor by the donor_id column in the biosample table (blue shade) or the biosample_id in the donor table (greed shade).

biosample_id	anatomical_site	apriori_cell_type	biosample_type	disease	donor_id
bio123890	blood	abnormal cell	Blood	leukemia	HG000097

Where possible, try to include data in the donor, biosample, or file tables If that’s not an option, the data can be submitted as additional, separate tables.

Any data beyond these minimal required tables must always be linked to either the donor_id, biosample_id, family_id, or file_id - depending on what the data element describes. For example, to link data in an additional table to a donor, make sure to include a donor_id column.

Addressing repeated elements

Please bring any repeating data elements (i.e., multiple values for a given data element for an individual) to the attention of the AnVIL team to ensure proper modeling and submission.

Examples of repeating elements

An individual in a data set (“donor”) has a measurement (e.g., blood pressure, lab test, BMI) taken at multiple time points.
An individual in a data set (“donor”) is affected by multiple disease/phenotype/conditions included in the study (e.g., an individual in a diabetes study has both diabetes and diabetes retinopathy; both are being tracked in the study).

Save as "Tab-Delimited Text" or "Tab-Separated Values"

Save each table TSV as a separate file in tab-delimited text or tab-separated values format. Your spreadsheet editor may give you a warning about losing data, but we assure you, it's fine!

Table names

In general, AnVIL will ignore the name you give the TSV file and will use the key name in the first column header (the part in front of the _id) for the table name when you upload tables to the staging workspace.

TSV versus TXT File Extensions

Depending on what spreadsheet editor you use, when you save in the proper format your spreadsheet may have either a ".tsv" or a ".txt" extension. AnVIL will accept either one.