Once you have prepared your omics object files and generated TSV files for each table in your data model, follow the directions below to deposit the data object files and all TSV files into AnVIL-owned data deposit workspaces. Note that AnVIL on Azure instructions are in a second tab.
This demo video walks through the steps to set up your data deposit workspace. For an overview article of the Data Submitter's process, see AnVIL (GCP): Data Submitters' guide.
This demo video walks through the steps to set up your data deposit workspace. For an overview article of the Data Submitter's process, see AnVIL on Azure: Data Submitters' guide.
Step 1: Log into Terra/AnVIL
1.1. Go to https://anvil.terra.bio/#workspaces and log in with your Terra ID. If you have logged into Terra before, you will use the same login ID.
Logging in: Google or Microsoft SSO?
The Terra login is a Single Sign-On (SSO) that stands in for your Terra ID and doesn’t determine which cloud platform you work in. You can use your Google ID (or G-Suite email) or Microsoft ID to log into Terra. If you already have a Terra ID, you should use that.
For more details, see How to use Terra on Azure with a Google login. If you don’t already have a Terra account, see How to register on Terra (Google SSO) or How to set up/register in Terra on Azure (Microsoft SSO).
1.2. The AnVIL Data Ingestion team will provide you with your submission workspace. Once logged into Terra, search for your data deposit workspace under Your Workspaces and click the link to open it.
Step 2: Set up cloud storage
To facilitate ingestion into TDR, the workspace cloud storage needs to have a particular directory structure. Follow the directions below to set up the proper (GCP or Azure) storage hierarchy.
2.1. Click the file icon in the bottom left column of the Data page to expose the workspace cloud storage directory.
2.2. Click the New Folder icon at the top to create directory folders.
Top-level directory (one dataset/consent group)
- data_files folder
Contains the data file objects, organized in whatever structure you prefer. TDR has no preference, though a common structure is one directory per sample.
- tabular_data folder
Houses the tabular metadata or phenotypic tables. This folder should ONLY contain TSVs that will be made into data tables. Previously, these were the files that submitters would upload into the workspace data tables. Phenotypic data that isn’t in a flat, tabular form and shouldn’t end up in a data table should not be included in this folder.
Example Cloud storage directory structure (single dataset/consent group)
Note that AnVIL deposit workspaces have a notebooks directory by default.
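If you stage files locally before uploading, one approach is to mirror the single-dataset workspace layout on your own machine. A minimal sketch (the `deposit` staging root is a hypothetical local directory; only the `data_files` and `tabular_data` names come from this guide):

```shell
# Hypothetical local staging root mirroring the workspace folder names from this guide.
mkdir -p deposit/data_files deposit/tabular_data

# List the directories that were created:
# deposit, deposit/data_files, deposit/tabular_data
find deposit -type d | sort
```

Staging locally in the same shape as the workspace lets you later copy the whole tree in one recursive command.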
Top-level and sub-directories (multiple datasets/consent groups)
Dataset 1 (phsID 1, Consent Code 1)
- data_files
- tabular_data
Dataset 2 (phsID 1, Consent Code 2)
- data_files
- tabular_data
Example Cloud storage directory structure (multiple datasets/consent groups)
2.1. Click the file icon in the right sidebar to expose the workspace cloud storage directory.
2.2. Click the New Folder icon at the top to create directory folders.
Top-level directory (one dataset/consent group)
- data_files folder
Contains the data file objects, organized in whatever structure you prefer. TDR has no preference, though a common structure is one directory per sample.
- tabular_data folder
Houses the tabular metadata or phenotypic tables. This folder should ONLY contain TSVs that will be made into data tables. Previously, these were the files that submitters would upload into the workspace data tables. Phenotypic data that isn’t in a flat, tabular form and shouldn’t end up in a data table should not be included in this folder.
Example Cloud storage directory structure (single dataset/consent group)
Example Cloud storage directory structure (two datasets/consent groups)
Top-level and sub-directories (multiple datasets/consent groups)
Dataset 1 (phsID 1, Consent Code 1)
- data_files
- tabular_data
Dataset 2 (phsID 1, Consent Code 2)
- data_files
- tabular_data
Step 3: Upload TSV files
Next you will upload your tabular data (TSVs) to the appropriate directory in the deposit workspace. This includes clinical and phenotypic data as well as metadata - all the tables from your data model.
TSV files to copy to tabular_data
- All clinical and phenotypic data
- All TSV load files in the Data Model
- biosample.tsv
- donor.tsv
- file.tsv
- diagnosis.tsv (optional)
- family.tsv (optional)
- project.tsv (optional)
- Any additional optional tables
Tools for uploading files
If your files are smaller than 5 GB (per file), you can upload them directly in AnVIL/Terra. See How to move data between local storage and workspace Bucket (small numbers of small files: Upload with Terra’s File Manager).
For large numbers of files or large files, you can use the gcloud storage command line in a terminal. See How to move data between local and workspace storage (Option 2: gcloud storage) for step-by-step instructions.
Upload all TSV files to the tabular_data folder or a sub-folder.
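As a sketch, the gcloud storage upload of TSVs might look like the following. The bucket name is hypothetical (copy the real gs:// URL from your workspace Dashboard), and the command is echoed rather than executed so you can review it first:

```shell
# Hypothetical workspace bucket; copy the real gs:// URL from your workspace Dashboard.
BUCKET="gs://fc-secure-00000000-0000-0000-0000-000000000000"

# Build the copy command; echoed for review rather than executed.
CMD="gcloud storage cp ./*.tsv ${BUCKET}/tabular_data/"
echo "$CMD"
```

Once the bucket URL is correct, run the command itself (without the echo) from a terminal where you have authenticated with gcloud.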
Tools for uploading files
If your files are smaller than 5 GB (per file), you can upload them directly in Terra. See How to bring your own data to Terra on Azure (for TSVs, you’ll usually be able to use Option 1: Upload with Terra’s File Manager).
For large numbers of files or large files, use the Microsoft Azure Storage Explorer app or AzCopy in a terminal. See Bring your own data to Terra on Azure for step-by-step instructions.
Upload TSV files to the tabular_data folder or a sub-folder.
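On Azure, the equivalent AzCopy sketch might look like the following. The destination URL is hypothetical (Terra on Azure provides the real container URL and SAS token), and the command is echoed rather than executed so you can review it first:

```shell
# Hypothetical destination URL; Terra on Azure provides the real container URL + SAS token.
DEST="https://example.blob.core.windows.net/sc-example-workspace/tabular_data"

# Build the AzCopy command; echoed for review rather than executed.
CMD="azcopy copy ./*.tsv ${DEST}/"
echo "$CMD"
```

Run the command itself (without the echo) once you have the real URL with a valid SAS token appended.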
Step 4: Upload object data files
Last, you'll upload unstructured data files (omics files, images, etc.) to the data_files folder or sub-folder.
Large object files staged in data_files
- Omics data
- Images
Data Indexing
Note that AnVIL will generate a globally unique ID (GUID) for each object file (staged in the data_files folder) and add it to the file.tsv (crai_path and cram_path in the example below) when the data files are ingested.
Your TSVs should include columns for object file metadata (crai_path and cram_path); AnVIL will provide the drs://dg part.
file_id | biosample_id | crai_path | cram_path | donor_id
bio123890 | bio123890 | drs://dg.4503:dg.4503/a60 | drs://dg.4503:dg.4503/af9 | HG000097
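Written out as an actual tab-separated file.tsv, the example row above can be produced like this (the IDs and DRS values are the illustrative values from the table, not real data):

```shell
# Write the header and the example row as real tab-separated values.
printf 'file_id\tbiosample_id\tcrai_path\tcram_path\tdonor_id\n' > file.tsv
printf 'bio123890\tbio123890\tdrs://dg.4503:dg.4503/a60\tdrs://dg.4503:dg.4503/af9\tHG000097\n' >> file.tsv

# Confirm every line has 5 tab-separated fields (prints "5" for each line).
awk -F'\t' '{ print NF }' file.tsv
```

TSV load files must use literal tab characters between columns; spreadsheet exports sometimes substitute spaces, which this field-count check will catch.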
In AnVIL (GCP), use the gcloud storage command line in a terminal to copy object files to the data_files directory.
See How to move data between local and workspace storage (Option 2: gcloud storage) for step-by-step instructions.
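A hedged sketch of the recursive copy (the bucket name is hypothetical; the command is echoed rather than executed so you can check the paths first):

```shell
# Hypothetical workspace bucket; copy the real gs:// URL from your workspace Dashboard.
BUCKET="gs://fc-secure-00000000-0000-0000-0000-000000000000"

# --recursive preserves your local directory structure under data_files/.
CMD="gcloud storage cp --recursive ./data_files/ ${BUCKET}/data_files/"
echo "$CMD"
```

Using a single recursive copy of a locally staged data_files tree keeps the uploaded structure identical to what you organized in Step 2.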
In AnVIL on Azure, use the Microsoft Azure Storage Explorer App or AzCopy in a terminal to copy object files to the data_files directory.
For step-by-step instructions, see How to bring your own data to Terra on Azure (Option 2: Microsoft Azure Storage Explorer app or Option 3: AzCopy in a terminal).
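The AzCopy version of the recursive copy might look like this sketch (the destination URL is hypothetical; Terra on Azure supplies the real container URL and SAS token, and the command is echoed rather than executed):

```shell
# Hypothetical destination URL; Terra on Azure provides the real container URL + SAS token.
DEST="https://example.blob.core.windows.net/sc-example-workspace/data_files"

# --recursive copies the whole local data_files tree; echoed for review.
CMD="azcopy copy ./data_files ${DEST} --recursive"
echo "$CMD"
```

As with the GCP example, one recursive copy of a locally staged tree keeps the uploaded structure identical to what you organized in Step 2.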
What happens next
Data Release
- Once your staging upload is complete, reach out to the AnVIL Support team, who will send you a submission checklist that serves as the final sign-off for release.
- Once you provide release approval, the AnVIL Support team will coordinate the release with dbGaP, add appropriate access lists (e.g., dbGaP access lists), ingest the data into TDR, and release the dataset to general researchers.
- Upon release, the AnVIL Support team will coordinate the deletion of any deposit workspaces.
Additional resources
Data model resources
- Defining your dataset schema (the tables that hold your data)
- Step 2: Set Up a Data Model
- Managing data with tables
- Overview: Entity types and the standard genomic model
Data transfer resources
Analysis Resources
- How to set up and run a workflow
- How to customize and launch JupyterLab
- Intro to Terra on Azure Quickstart
If you have additional questions, please reach out to the AnVIL ingestion team at anvil-data@broadinstitute.org.