In this step, you'll prepare for data ingestion by generating TSVs for all tables in your data model and organizing all required data and metadata in a format compatible with AnVIL.
Prerequisites
Has your data been approved?
This step assumes that the study has been approved by the AnVIL review board to be hosted in AnVIL through dbGaP. See Step 1 - Register Study/Obtain Approvals for more details and step-by-step instructions.
Have you set up your Data Model?
See Step 2: Set up a data model for help creating your Data Dictionary, which defines all the tables in your data model.
Submission data overview
AnVIL accepts two types of data
Data object files (omics and images)
Examples of object files are genomic and other omics files, and images. Note that in addition to the data files, AnVIL requires minimal metadata for all object files, some of which is generated by AnVIL (e.g., the full path to the files in AnVIL cloud storage).
Tabular data (holds clinical data, including phenotypes and object file metadata)
- Biosample, Donor and Files tables from your data model
- Submitted in TSV or CSV format (see requirements below)
Most studies submit both.
To prepare data for submission, you will:
- Make sure all object files (omics data/images) conform to AnVIL’s naming requirements
- Generate a spreadsheet-like file (TSV format) for each table in your data dictionary (from Step 2: Set up a data model).
Step 1: Register for a Terra Account
Before your data is ingested to AnVIL (TDR), you’ll organize and store all tables and object files from your data dictionary in an AnVIL data-deposit workspace on Terra. You will need a Terra Account to stage data in the data deposit workspace.
If you do not already have a Terra account, you’ll find step-by-step instructions in How to register on Terra (Google SSO).
Registering for a Terra account is free, and AnVIL pays all costs associated with uploading and storing your data.
Note that to complete an analysis, you will need to connect a Google Billing Account to your Terra account. See How to set up billing in Terra (GCP) for more information.
Step 2: Make sure object files conform to AnVIL naming requirements
You may provide AnVIL with object files such as image files and genomic and other omics data (VCFs, CRAMs, BAMs, IDATs, or FASTQs). You’ll stage these in a deposit workspace for ingestion in Step 4: Stage Data.
Object file name formatting requirements
Unallowed characters
Object file names may only contain numbers, letters, colons (:), dashes (-), and underscores (_). Special characters cannot be used in any field or file name. If your file names contain special characters (e.g., “%” or “*”), you must remove or replace them before ingestion.
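As a rough illustration, a check like the one below could flag problem characters before you stage files. This is a sketch, not an AnVIL tool: the function names find_disallowed and sanitize are hypothetical, and the code assumes that the period separating a file extension is acceptable, which the rule above does not state explicitly.

```python
import re

# Characters the naming rule allows: letters, numbers, colons, dashes, underscores.
# Assumption (not stated in the rule): periods are tolerated so that
# extensions such as ".cram" pass the check.
ALLOWED = re.compile(r"[A-Za-z0-9:_\-.]")


def find_disallowed(name: str) -> list[str]:
    """Return the distinct disallowed characters in a name, in order of first appearance."""
    bad = []
    for ch in name:
        if not ALLOWED.fullmatch(ch) and ch not in bad:
            bad.append(ch)
    return bad


def sanitize(name: str, replacement: str = "_") -> str:
    """Replace every disallowed character with the replacement character."""
    return "".join(ch if ALLOWED.fullmatch(ch) else replacement for ch in name)
```

For example, find_disallowed("sample%01*.cram") reports "%" and "*", and sanitize turns the same name into "sample_01_.cram". Replacing characters can make two different names collide, so it is worth checking the renamed files for duplicates.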
Functional Equivalence (FE)
To maximize the value of AnVIL-hosted data and minimize batch effects in cross-project analyses (Regier et al., 2018), the CCDG and TOPMed consortia have defined a functional equivalence (FE) standard for alignment and processing of whole-genome sequencing (WGS) data. AnVIL strongly encourages submitting FE-compliant genome and exome sequencing data aligned to GRCh38. (See the CCDG pipeline standard.)
FE is important for downstream joint calling across datasets, but is difficult to prove. There is no easy way for AnVIL to validate, or for the submitter to prove, that submitted data were aligned and mapped on an FE pipeline.
If you are unsure of whether or not your data is functionally equivalent, the AnVIL ingestion team may reach out to you to review your dataset prior to submission.
Step 3: Generate table load files (TSV format)
You'll create spreadsheet-like files (TSV or TXT format) for each table in your data model (see the AnVIL Data Dictionary for an example). You can format your data into a TSV file and submit it that way, or you can create a new spreadsheet to put your data into the schema format.
A video walkthrough of generating a load file (TSV format) from a template is available below.
Formatting requirements
Unallowed characters
Your spreadsheets may only contain numbers, letters, colons (:), dashes (-) and underscores (_). No special characters (&, $, %, #, etc.) are allowed in any fields of the load file. If your files contain special characters (e.g., “%” or “*”), you must remove or replace them before ingestion.
Each table must start with a column titled [tablename]_id
Each row in the table must have a unique foreign ID key. These keys will be used to associate data in different tables (e.g., biosample_id, donor_id and file_id). IDs should not contain ^ as a character.
Multiple values, where allowed, should be separated by '^'
For example, an associated field with 3 values would look like: value1^value2^value3.
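The ID-column rules above can be checked with a few lines of Python before submission. The sketch below is illustrative only and is not AnVIL's own validation; the function names check_id_column and split_values are hypothetical, and the check covers only the first column (title, uniqueness, and the '^' restriction).

```python
import csv
import io


def check_id_column(tsv_text: str) -> list[str]:
    """Return a list of problems found in the first (ID) column of a TSV load file."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return ["file is empty"]
    header, data = rows[0], rows[1:]
    problems = []
    if not header or not header[0].endswith("_id"):
        problems.append("first column must be titled [tablename]_id")
    seen = set()
    # Data rows start on line 2 of the file (line 1 is the header).
    for line_no, row in enumerate(data, start=2):
        key = row[0] if row else ""
        if "^" in key:
            problems.append(f"line {line_no}: ID contains '^'")
        if key in seen:
            problems.append(f"line {line_no}: duplicate ID {key}")
        seen.add(key)
    return problems


def split_values(field: str) -> list[str]:
    """Split a multi-value field on the '^' separator."""
    return field.split("^")
```

Calling split_values("value1^value2^value3") returns the three values as a list, which is also why '^' cannot appear inside an ID: it would be indistinguishable from a separator.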
Example: Donor table in a spreadsheet editor
donor_id | age | hdl | height | ldl | population |
HG000096 | 76 | 89.34 | 179 | 124.81 | GBR |
HG000097 | 64 | 62.25 | 159 | 120.32 | GBR |
Associating Data in Different Tables
The key, or ID, column is used to associate data (link tables). For example, the biosample is associated with its donor by the donor_id column in the biosample table.
biosample_id | anatomical_site | apriori_cell_type | biosample_type | disease | donor_id |
bio123890 | blood | abnormal cell | Blood | leukemia | HG000097 |
Where possible, try to include data in the donor, biosample, or file tables. If that’s not an option, the data can be submitted as additional, separate tables.
Any data beyond these minimal required tables must always be linked to either the donor_id, biosample_id, family_id, or file_id - depending on what the data element describes. For example, to link data in an additional table to a donor, make sure to include a donor_id column.
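One way to confirm that tables are linked correctly is to look for rows whose key has no match in the table it points to. The following is a minimal sketch using Python's standard csv module; the helper names read_tsv and unlinked_rows are hypothetical, and the second biosample row (bio999999, HG000555) is made up to show a failed link.

```python
import csv
import io


def read_tsv(text: str) -> list[dict]:
    """Parse TSV text into a list of row dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def unlinked_rows(child_rows, parent_rows, key):
    """Return child rows whose `key` value does not appear in the parent table."""
    parent_ids = {row[key] for row in parent_rows}
    return [row for row in child_rows if row.get(key) not in parent_ids]


donors = read_tsv("donor_id\tage\nHG000096\t76\nHG000097\t64\n")
biosamples = read_tsv(
    "biosample_id\tbiosample_type\tdonor_id\n"
    "bio123890\tBlood\tHG000097\n"
    "bio999999\tBlood\tHG000555\n"  # hypothetical row with no matching donor
)

# Biosamples that reference a donor_id missing from the donor table.
orphans = unlinked_rows(biosamples, donors, "donor_id")
```

Here orphans contains only the made-up bio999999 row. The same function works for any additional table, as long as it carries the relevant donor_id, biosample_id, family_id, or file_id column.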
Addressing repeated elements
Please bring any repeating data elements (i.e., multiple values for a given data element for an individual) to the attention of the AnVIL team to ensure proper modeling and submission.
Examples of repeating elements
- An individual in a data set (“donor”) has a measurement (e.g., blood pressure, lab test, BMI) taken at multiple time points.
- An individual in a data set (“donor”) is affected by multiple diseases/phenotypes/conditions included in the study (e.g., an individual in a diabetes study has both diabetes and diabetic retinopathy; both are being tracked in the study).
Data Indexing
Note that AnVIL will generate a globally unique ID (GUID) for each object file in AnVIL cloud storage and add it to the TSV (crai_path and cram_path in the table below) when it ingests the data files.
Your TSVs should include columns for object file metadata (crai_path and cram_path), but AnVIL will provide the drs://dg part.
file_id | biosample_id | crai_path | cram_path | donor_id |
bio123890 | bio123890 | drs://dg.4503:dg.4503/a60 | drs://dg.4503:dg.4503/af9 | HG000097 |
The File table (above) organizes object files associated with samples from the biosample table.
Why use GUIDs
- Allows easy access to data across AnVIL tools, without creating additional copies or transferring across environments.
- Facilitates interoperability with other data commons due to their extensibility.
- Enables tracking of live data being processed in workflow pipelines, and data backup to cold storage.
Step 4: Save as "Tab-Delimited Text" or "Tab-Separated Values"
Your spreadsheet editor may give you a warning about losing data in this format, but we assure you, it's fine!
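If your tables are generated by a script rather than a spreadsheet editor, the same tab-delimited output can be produced with Python's standard csv module. This is a sketch under that assumption; the helper name to_tsv and the output file name mentioned in the comment are hypothetical, and the rows are the donor example from Step 3.

```python
import csv
import io

header = ["donor_id", "age", "hdl", "height", "ldl", "population"]
rows = [
    ["HG000096", 76, 89.34, 179, 124.81, "GBR"],
    ["HG000097", 64, 62.25, 159, 120.32, "GBR"],
]


def to_tsv(header, rows) -> str:
    """Serialize a header and data rows as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


tsv_text = to_tsv(header, rows)
# To save to disk, write tsv_text to a file opened with
# open("donor.tsv", "w", newline="") so line endings are not altered.
```

Opening the resulting file in a spreadsheet editor is a quick way to confirm that each value landed in the expected column.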
Table names
In general, AnVIL will ignore the name you give the TSV file and will use the key name in the first column header (the part in front of the _id) for the table name when a dataset is ingested.
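To preview the table name that this rule would produce for a given file, you can take the first column header and drop the trailing _id. The snippet below only mirrors the rule as described here and is not AnVIL's ingestion code; the function name table_name_from_header is hypothetical.

```python
def table_name_from_header(header_line: str) -> str:
    """Derive the table name from the first column header of a TSV header line."""
    first_column = header_line.rstrip("\r\n").split("\t")[0]
    if not first_column.endswith("_id"):
        raise ValueError("first column must be titled [tablename]_id")
    # Drop the trailing "_id" to get the table name.
    return first_column[: -len("_id")]
```

For a header line beginning with biosample_id, the function returns "biosample", regardless of what the TSV file itself is called.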
TSV versus TXT File Extensions
Depending on what spreadsheet editor you use, when you save in the proper format your spreadsheet may have either a ".tsv" or a ".txt" extension. AnVIL will accept either one.